Netdev List

* Re: [PATCH net] net/sched: act_pedit: limit negative offset
From: David Miller @ 2016-11-28  5:49 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: amir, netdev, jhs, ogerlitz, hadarh, jiri
In-Reply-To: <CAM_iQpX1at1w=3OfRFchn3STTDJGXj640tGetxXToR=EyddpEw@mail.gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Sun, 27 Nov 2016 21:39:33 -0800

> On Sun, Nov 27, 2016 at 7:58 AM, Amir Vadai <amir@vadai.me> wrote:
>> Should not allow setting a negative offset that goes below the skb head.
> ...
>> diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
>> index b54d56d4959b..e79e8a88f2d2 100644
>> --- a/net/sched/act_pedit.c
>> +++ b/net/sched/act_pedit.c
>> @@ -154,8 +154,11 @@ static int tcf_pedit(struct sk_buff *skb, const struct tc_action *a,
>>                         }
>>
>>                         ptr = skb_header_pointer(skb, off + offset, 4, &_data);
>> -                       if (!ptr)
>> +                       if ((unsigned char *)ptr < skb->head) {
> 
> 
> ptr returned could be &_data, which is on stack, so why this comparison
> makes sense for this case?

Indeed, this will definitely do the wrong thing when the on-stack area
is passed back in ptr.
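
To make the objection concrete: skb_header_pointer() returns either a
pointer into the skb's linear data or the caller-supplied buffer after
copying into it, so the result cannot be meaningfully compared against
skb->head. The following is only a userspace sketch of that contract;
`struct pkt` and `pkt_header_pointer()` are hypothetical stand-ins, not
kernel APIs, and the real sk_buff layout is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: a linear area plus one fragment. */
struct pkt {
	const unsigned char *linear;	/* models the linear data area */
	size_t linear_len;		/* models skb_headlen() */
	const unsigned char *frag;	/* models non-linear (paged) data */
	size_t frag_len;
};

/* Models the contract of skb_header_pointer(): return a pointer into the
 * linear area when the requested range lies fully inside it, otherwise
 * copy the bytes into the caller's buffer and return that buffer.
 * Returns NULL when the range is outside the packet.
 */
static const void *pkt_header_pointer(const struct pkt *p, size_t off,
				      size_t len, void *buf)
{
	size_t total = p->linear_len + p->frag_len;
	unsigned char *out = buf;
	size_t i;

	assert(buf != NULL);
	if (len > total || off > total - len)
		return NULL;
	if (off + len <= p->linear_len)
		return p->linear + off;

	for (i = 0; i < len; i++) {
		size_t pos = off + i;

		out[i] = pos < p->linear_len ? p->linear[pos]
					     : p->frag[pos - p->linear_len];
	}
	return buf;	/* typically an on-stack buffer in the caller */
}
```

In the copy case the returned address is the caller's buffer, which has
no ordering relationship with the packet's head, so only the NULL check
says anything about bounds.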

^ permalink raw reply

* Re: [PATCH] geneve: fix ip_hdr_len reserved for geneve6 tunnel.
From: Pravin Shelar @ 2016-11-28  5:57 UTC (permalink / raw)
  To: Haishuang Yan
  Cc: David S. Miller, Hannes Frederic Sowa, Alexander Duyck,
	Pravin B Shelar, Jiri Benc, Linux Kernel Network Developers,
	linux-kernel
In-Reply-To: <1480310818-78456-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

On Sun, Nov 27, 2016 at 9:26 PM, Haishuang Yan
<yanhaishuang@cmss.chinamobile.com> wrote:
> It should reserve sizeof(ipv6hdr) for geneve in an ipv6 tunnel.
>
> Fixes: c3ef5aa5e5 ('geneve: Merge ipv4 and ipv6 geneve_build_skb()')
>
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>

Thanks for the fix.

Acked-by: Pravin B Shelar <pshelar@ovn.org>

^ permalink raw reply

* RE: [net,v2] neigh: fix the loop index error in neigh dump
From: 张胜举 @ 2016-11-28  6:28 UTC (permalink / raw)
  To: 'David Ahern', netdev
In-Reply-To: <7e3623c4-a55e-5809-ca9a-2dbd59cda871@cumulusnetworks.com>

> -----Original Message-----
> From: David Ahern [mailto:dsa@cumulusnetworks.com]
> Sent: Monday, November 28, 2016 1:07 PM
> To: 张胜举 <zhangshengju@cmss.chinamobile.com>;
> netdev@vger.kernel.org
> Subject: Re: [net,v2] neigh: fix the loop index error in neigh dump
> 
> On 11/27/16 9:50 PM, 张胜举 wrote:
> > No, when dump request must be processed by multiple 'recv/recvmsg'
> > system calls, idx stores which dev/neigh the previous call have
> > processed, so that next call will scan from the right place.
> 
> I have tested multiple calls and I do not see redundant information or
> missing information.
> 
> >
> > So no matter whether the dev/neigh is filtered, the idx should be
> > increased anyway.
> 
> No, it does not. Again, idx is the index in the list of devices/ of
> interest. It is NOT a device index nor is it the absolute index in the
> list. It is a relative index. The filter is the same across recvmsg
> calls so the idx count is absolutely fine.
> 
> Produce a test case that fails.
David, I see your point, and I agree that this will not produce
redundant or missing link information.

But it does cause the filtered-out devices to be scanned multiple times.

For example, assume that a netlink message can hold the info of only two
devices, and that eth2-eth5 are filtered out.

In the first loop, idx will point to eth2, but the code has already
scanned up to eth6.
eth0->eth1->eth2(out)->eth3(out)-> eth4(out)->eth5(out)->eth6->eth7
                         ^
In the next loop, the code will start scanning from eth2 and go up to
eth8, but eth2-eth5 were already scanned by the previous loop. After this
loop, idx will point to eth4.
eth0->eth1->eth2(out)->eth3(out)->eth4(out)->eth5(out)->eth6->eth7->eth8
                                                                  ^
So the same devices are scanned multiple times.

Almost all other dump functions treat idx as the absolute index in the
list, and do not have the above problem.

We don't treat this as a bugfix, but I think we had better stay in line
with the other dump functions.
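
The difference between the two idx conventions can be sketched in
userspace. This is a simplified model, not the kernel's neigh dump code:
`dump_once()`, `dump_all()` and `struct dump_stats` are hypothetical
names, and "scanning" is approximated by counting how often the filter
predicate runs. Both conventions emit the same entries, which matches
David's observation; they differ only in repeated filter work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dump_stats {
	int emitted;		/* entries written across all calls */
	int filter_checks;	/* how often the filter predicate ran */
};

/* One resumable dump call: emit up to cap entries starting at s_idx.
 * Returns the idx to resume from, or -1 when the list is exhausted.
 */
static int dump_once(const bool *filtered, size_t n, int s_idx, int cap,
		     bool absolute_idx, struct dump_stats *st)
{
	int idx = 0, room = cap;
	size_t i;

	for (i = 0; i < n; i++) {
		if (absolute_idx) {
			/* idx counts every entry, so entries handled by
			 * earlier calls are skipped before filtering.
			 */
			if (idx < s_idx) {
				idx++;
				continue;
			}
			st->filter_checks++;
			if (filtered[i]) {
				idx++;
				continue;
			}
		} else {
			/* idx counts only entries that pass the filter,
			 * so the filter must run before idx is compared.
			 */
			st->filter_checks++;
			if (filtered[i])
				continue;
			if (idx < s_idx) {
				idx++;
				continue;
			}
		}
		if (room == 0)
			return idx;	/* message full, resume here */
		st->emitted++;
		room--;
		idx++;
	}
	return -1;
}

/* Repeat dump_once() until the whole list has been dumped. */
static struct dump_stats dump_all(const bool *filtered, size_t n, int cap,
				  bool absolute_idx)
{
	struct dump_stats st = { 0, 0 };
	int s_idx = 0;

	assert(cap > 0);
	while (s_idx >= 0)
		s_idx = dump_once(filtered, n, s_idx, cap, absolute_idx, &st);
	return st;
}
```

With nine devices, eth2-eth5 filtered and room for two entries per
message, both variants emit the same five devices, while the relative
variant re-runs the filter on the filtered-out devices in every call.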

^ permalink raw reply

* [patch net] net: dsa: fix unbalanced dsa_switch_tree reference counting
From: Nikita Yushchenko @ 2016-11-28  6:48 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Chris Healy, Andrew Lunn, linux-kernel, Nikita Yushchenko

_dsa_register_switch() gets a dsa_switch_tree object either via
dsa_get_dst() or via dsa_add_dst(). The former path does not increase the
kref of the returned object (so the caller does not own a reference),
while the latter path creates a new object (so the caller does own
a reference).

The rest of _dsa_register_switch() assumes that it owns a reference, and
calls dsa_put_dst().

This causes a use-after-free if the first switch in the tree initialized
successfully but the second failed to initialize. In particular, the freed
dsa_switch_tree object is left referenced by the switch that was
initialized, and a later access to the sysfs attributes of that switch
causes an OOPS.

To fix this, add a kref_get() call to dsa_get_dst().

Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 net/dsa/dsa2.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index f8a7d9aab437..5fff951a0a49 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -28,8 +28,10 @@ static struct dsa_switch_tree *dsa_get_dst(u32 tree)
 	struct dsa_switch_tree *dst;

 	list_for_each_entry(dst, &dsa_switch_trees, list)
-		if (dst->tree == tree)
+		if (dst->tree == tree) {
+			kref_get(&dst->refcount);
 			return dst;
+		}
 	return NULL;
 }

-- 
2.1.4
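
The ownership rule the fix restores can be sketched in userspace: a
lookup that returns an existing object must hand the caller its own
reference, so that a later put on the error path does not drop the
reference held by someone else. The names below (`struct tree`,
`tree_get()`, `tree_put()`) are hypothetical and a plain counter stands
in for struct kref; this is an illustration, not the DSA code.

```c
#include <assert.h>
#include <stddef.h>

struct tree {
	unsigned int id;
	int refcount;	/* stand-in for struct kref */
	int freed;	/* set instead of freeing, so the state stays visible */
};

/* Drop one reference; the object is released when the count hits zero. */
static void tree_put(struct tree *t)
{
	assert(t->refcount > 0);
	if (--t->refcount == 0)
		t->freed = 1;
}

/* Look up a tree by id and take a reference for the caller, mirroring
 * what the patched dsa_get_dst() does with kref_get().
 */
static struct tree *tree_get(struct tree *trees, size_t n, unsigned int id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (trees[i].id == id && !trees[i].freed) {
			trees[i].refcount++;
			return &trees[i];
		}
	}
	return NULL;
}
```

If a second user looks the tree up, fails, and calls tree_put(), the
count returns to the first user's single reference and the object stays
alive.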

^ permalink raw reply related

* Re: Crash due to mutex genl_lock called from RCU context
From: Cong Wang @ 2016-11-28  6:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Subash Abhinov Kasiviswanathan, Thomas Graf,
	Linux Kernel Network Developers, Herbert Xu
In-Reply-To: <1480263824.18162.44.camel@edumazet-glaptop3.roam.corp.google.com>

On Sun, Nov 27, 2016 at 8:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2016-11-26 at 22:28 -0800, Cong Wang wrote:
>> On Sat, Nov 26, 2016 at 6:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Are you telling me inet_release() is called when we close() the first
>> > file descriptor ?
>> >
>> > fd1 = socket()
>> > fd2 = dup(fd1);
>> > close(fd2) -> release() ???
>>
>> Sorry, I didn't express myself clearly, I meant your change,
>> if exclude the SOCK_RCU_FREE part, basically reverts this commit:
>>
>> commit 3f660d66dfbc13ea4b61d3865851b348444c24b4
>> Author: Herbert Xu <herbert@gondor.apana.org.au>
>> Date:   Thu May 3 03:17:14 2007 -0700
>>
>>     [NETLINK]: Kill CB only when socket is unused
>>
>> IOW, ->release() is called when the last sock fd ref is gone, but ->destructor()
>> is called when the last sock ref is gone. They are very different.
>
> Hmm...
>
>
>> I am confused, what Subash reported is a kernel warning which can
>> surely be fixed by removing genl lock (if it is correct, I need to double
>> check), so why for net-next?
>
> Because Subash pointed to a buggy commit.
>
> We want to fix all issues brought by this commit, not only the immediate
> problem about mutex.
>
> I have no idea if we can safely remove the mutex from genl_lock_done() :

I meant removing it only for the destructor case, we definitely can't remove
it for the dump case.

>
> The genl_lock() is not only protecting the socket itself, it might
> protect global data as well, or protect some kind of lock ordering among
> multiple mutexes.
>
> Have you checked all genl users, down to linux-4.0 , point where commit
> 21e4902aea80ef35a was added ?
>

I just took a deeper look: some users call rhashtable_destroy() in ->done(),
so even removing that genl lock is not enough. Perhaps we should just
move it to a work struct, like what Daniel does for the tcf_proto, but that is
ugly... I don't know if RCU provides any API to execute the callback in process
context.

^ permalink raw reply

* Re: [PATCH net] net, sched: respect rcu grace period on cls destruction
From: Cong Wang @ 2016-11-28  6:57 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David Miller, John Fastabend, Roi Dayan, ast,
	Hannes Frederic Sowa, Jiri Pirko, Linux Kernel Network Developers,
	Paul E. McKenney
In-Reply-To: <0d6d89f885033f1739e97f7f3372ae6e1db72892.1480204343.git.daniel@iogearbox.net>

On Sat, Nov 26, 2016 at 4:18 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> Roi reported a crash in flower where tp->root was NULL in ->classify()
> callbacks. Reason is that in ->destroy() tp->root is set to NULL via
> RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
> this doesn't respect RCU grace period for them, and as a result, still
> outstanding readers from tc_classify() will try to blindly dereference
> a NULL tp->root.
>
> The tp->root object is strictly private to the classifier implementation
> and holds internal data the core such as tc_ctl_tfilter() doesn't know
> about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
> is only checked for NULL in ->get() callback, but nowhere else. This is
> misleading and seemed to be copied from old classifier code that was not
> cleaned up properly. For example, d3fa76ee6b4a ("[NET_SCHED]: cls_basic:
> fix NULL pointer dereference") moved tp->root initialization into ->init()
> routine, where before it was part of ->change(), so ->get() had to deal
> with tp->root being NULL back then, so that was indeed a valid case, after
> d3fa76ee6b4a, not really anymore. We used to set tp->root to NULL long
> ago in ->destroy(), see 47a1a1d4be29 ("pkt_sched: remove unnecessary xchg()
> in packet classifiers"); but the NULLifying was reintroduced with the
> RCUification, but it's not correct for every classifier implementation.
>
> In the cases that are fixed here with one exception of cls_cgroup, tp->root
> object is allocated and initialized inside ->init() callback, which is always
> performed at a point in time after we allocate a new tp, which means tp and
> thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
> Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
> handler, same for the tp which is kfree_rcu()'ed right when we return
> from ->destroy() in tcf_destroy(). This means, the head object's lifetime
> for such classifiers is always tied to the tp lifetime. The RCU callback
> invocation for the two kfree_rcu() could be out of order, but that's fine
> since both are independent.
>
> Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
> means that 1) we don't need a useless NULL check in fast-path and, 2) that
> outstanding readers of that tp in tc_classify() can still execute under
> respect with RCU grace period as it is actually expected.
>
> Things that haven't been touched here: cls_fw and cls_route. They each
> handle tp->root being NULL in ->classify() path for historic reasons, so
> their ->destroy() implementation can stay as is. If someone actually
> cares, they could get cleaned up at some point to avoid the test in fast
> path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
> !head should anyone actually be using/testing it, so it at least aligns with
> cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
> destruction (to a sleepable context) after RCU grace period as concurrent
> readers might still access it. (Note that in this case we need to hold module
> reference to keep work callback address intact, since we only wait on module
> unload for all call_rcu()s to finish.)
>
> This fixes one race to bring RCU grace period guarantees back. Next step
> as worked on by Cong however is to fix 1e052be69d04 ("net_sched: destroy
> proto tp when all filters are gone") to get the order of unlinking the tp
> in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
> RCU_INIT_POINTER() before tcf_destroy() and let the notification for
> removal be done through the prior ->delete() callback. Both are independent
> issues. Once we have that right, we can then clean tp->root up for a number
> of classifiers by not making them RCU pointers, which requires a new callback
> (->uninit) that is triggered from tp's RCU callback, where we just kfree()
> tp->root from there.

Looks good to my eyes,

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

The ugly part is the work struct. I am not an RCU expert, so I don't know if we
have any API to execute an RCU callback in process context. Paul?

Thanks.

^ permalink raw reply

* [PATCH] vxlan: fix a potential issue when create a new vxlan fdb entry.
From: Haishuang Yan @ 2016-11-28  7:02 UTC (permalink / raw)
  To: David S. Miller, Jiri Benc, Hannes Frederic Sowa, Pravin B Shelar
  Cc: netdev, linux-kernel, Haishuang Yan

vxlan_fdb_append() may return an error, so add the proper check;
otherwise it will cause a memory leak.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
---
 drivers/net/vxlan.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 21e92be..3b7b237 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -611,6 +611,7 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
 	struct vxlan_rdst *rd = NULL;
 	struct vxlan_fdb *f;
 	int notify = 0;
+	int rc = 0;
 
 	f = __vxlan_find_mac(vxlan, mac);
 	if (f) {
@@ -641,8 +642,7 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
 		if ((flags & NLM_F_APPEND) &&
 		    (is_multicast_ether_addr(f->eth_addr) ||
 		     is_zero_ether_addr(f->eth_addr))) {
-			int rc = vxlan_fdb_append(f, ip, port, vni, ifindex,
-						  &rd);
+			rc = vxlan_fdb_append(f, ip, port, vni, ifindex, &rd);
 
 			if (rc < 0)
 				return rc;
@@ -673,7 +673,11 @@ static int vxlan_fdb_create(struct vxlan_dev *vxlan,
 		INIT_LIST_HEAD(&f->remotes);
 		memcpy(f->eth_addr, mac, ETH_ALEN);
 
-		vxlan_fdb_append(f, ip, port, vni, ifindex, &rd);
+		rc = vxlan_fdb_append(f, ip, port, vni, ifindex, &rd);
+		if (rc < 0) {
+			kfree(f);
+			return rc;
+		}
 
 		++vxlan->addrcnt;
 		hlist_add_head_rcu(&f->hlist,
-- 
1.8.3.1
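
The error-handling pattern in the second hunk (free the freshly
allocated object when a later step fails, then propagate the error) can
be sketched in userspace as follows. All names here (`struct entry`,
`entry_append()`, `entry_create()`, `live_entries`) are hypothetical,
and the `fail_append` flag exists only to force the error path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct entry {
	int nremotes;
};

static int live_entries;	/* allocation counter, makes a leak observable */

/* Hypothetical append step; fail_append forces the error path. */
static int entry_append(struct entry *e, int fail_append)
{
	if (fail_append)
		return -ENOMEM;
	e->nremotes++;
	return 0;
}

/* Allocate an entry and attach its first remote. On failure nothing
 * stays allocated and *out is left untouched.
 */
static int entry_create(struct entry **out, int fail_append)
{
	struct entry *e;
	int rc;

	assert(out != NULL);
	e = calloc(1, sizeof(*e));
	if (!e)
		return -ENOMEM;
	live_entries++;

	rc = entry_append(e, fail_append);
	if (rc < 0) {
		/* Undo the allocation before propagating the error. */
		free(e);
		live_entries--;
		return rc;
	}

	*out = e;
	return 0;
}
```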

^ permalink raw reply related

* Re: [PATCH] net: fec: turn on device when extracting statistics
From: Nikita Yushchenko @ 2016-11-28  7:06 UTC (permalink / raw)
  To: David Miller
  Cc: fugang.duan, troy.kisky, andrew, eric, tremyfr, johannes, netdev,
	linux-kernel, cphealy, Fabio Estevam
In-Reply-To: <20161127.202945.1759992980026862076.davem@davemloft.net>



On 28.11.2016 04:29, David Miller wrote:
> From: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
> Date: Fri, 25 Nov 2016 13:02:00 +0300
> 
>> +	int i, ret;
>> +
>> +	ret = pm_runtime_get_sync(&fep->pdev->dev);
>> +	if (IS_ERR_VALUE(ret)) {
>> +		memset(data, 0, sizeof(*data) * ARRAY_SIZE(fec_stats));
>> +		return;
>> +	}
> 
> This really isn't the way to do this.
> 
> When the device is suspended and the clocks are going to be stopped,
> you must fetch the statistic values into a software copy and provide
> those if the device is suspended when statistics are requested.

OK, I can do that, although I can't see what's wrong with waking the device
here. Requesting stats on a down device isn't something widely used, so
keeping the handling of that as local as possible looks better to me.
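
The approach David describes (keep a software copy of the counters and
serve it while the device is suspended) might look roughly like this in
a userspace model. `struct dev`, `dev_suspend()` and `dev_get_stats()`
are hypothetical names, and the `hw_regs` array merely stands in for
hardware counters that are only readable while the device is powered;
none of this is taken from the fec driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NSTATS 4

struct dev {
	bool suspended;
	uint64_t hw_regs[NSTATS];	/* stand-in for hardware counters */
	uint64_t saved[NSTATS];		/* software copy taken at suspend */
};

/* Snapshot the counters while the hardware can still be read. */
static void dev_suspend(struct dev *d)
{
	memcpy(d->saved, d->hw_regs, sizeof(d->saved));
	d->suspended = true;
}

/* Report statistics without waking the device: use the snapshot while
 * suspended, the live counters otherwise.
 */
static void dev_get_stats(const struct dev *d, uint64_t *data)
{
	assert(data != NULL);
	memcpy(data, d->suspended ? d->saved : d->hw_regs, sizeof(d->saved));
}
```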

^ permalink raw reply

* [PATCH net-next v2 0/6] tcp: sender chronographs instrumentation
From: Yuchung Cheng @ 2016-11-28  7:07 UTC (permalink / raw)
  To: davem, soheil, francisyyan; +Cc: netdev, ncardwell, edumazet, Yuchung Cheng

This patch set provides instrumentation on TCP sender limitations.
While developing the BBR congestion control, we noticed that TCP
sending process is often limited by factors unrelated to congestion
control: insufficient sender buffer and/or insufficient receive
window/buffer to saturate the network bandwidth. Unfortunately these
limits are not visible to the users and often the poor performance
is attributed to the congestion control of choice.

This patch set aims to help users get a high-level understanding of
what limits the sending process, similar to the TCP_INFO design.
It is not meant to replace detailed kernel tracing and instrumentation
facilities.

In addition, this patch set provides a new option to the timestamping
work to instrument these limits per application data unit. For example,
one can use SO_TIMESTAMPING and this patch set to measure how
long a particular HTTP response is limited by a small receive window.

Patch set was initially written by Francis Yan then polished
by Yuchung Cheng, with lots of help from Eric Dumazet and Soheil
Hassas Yeganeh.

Francis Yan (6):
  tcp: instrument tcp sender limits chronographs
  tcp: instrument how long TCP is busy sending
  tcp: instrument how long TCP is limited by receive window
  tcp: instrument how long TCP is limited by insufficient send buffer
  tcp: export sender limits chronographs to TCP_INFO
  tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING

 Documentation/networking/timestamping.txt | 10 +++++
 arch/alpha/include/uapi/asm/socket.h      |  2 +
 arch/frv/include/uapi/asm/socket.h        |  2 +
 arch/ia64/include/uapi/asm/socket.h       |  2 +
 arch/m32r/include/uapi/asm/socket.h       |  2 +
 arch/mips/include/uapi/asm/socket.h       |  2 +
 arch/mn10300/include/uapi/asm/socket.h    |  2 +
 arch/parisc/include/uapi/asm/socket.h     |  2 +
 arch/powerpc/include/uapi/asm/socket.h    |  2 +
 arch/s390/include/uapi/asm/socket.h       |  2 +
 arch/sparc/include/uapi/asm/socket.h      |  2 +
 arch/xtensa/include/uapi/asm/socket.h     |  2 +
 include/linux/tcp.h                       |  9 ++++-
 include/net/tcp.h                         | 20 +++++++++-
 include/uapi/asm-generic/socket.h         |  2 +
 include/uapi/linux/net_tstamp.h           |  3 +-
 include/uapi/linux/tcp.h                  | 12 ++++++
 net/core/skbuff.c                         | 14 +++++--
 net/core/sock.c                           |  7 ++++
 net/ipv4/tcp.c                            | 50 ++++++++++++++++++++++-
 net/ipv4/tcp_input.c                      |  8 +++-
 net/ipv4/tcp_output.c                     | 66 ++++++++++++++++++++++++++++++-
 net/socket.c                              |  7 +++-
 23 files changed, 217 insertions(+), 13 deletions(-)

-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply

* [PATCH net-next v2 1/6] tcp: instrument tcp sender limits chronographs
From: Yuchung Cheng @ 2016-11-28  7:07 UTC (permalink / raw)
  To: davem, soheil, francisyyan; +Cc: netdev, ncardwell, edumazet, Yuchung Cheng
In-Reply-To: <1480316838-154141-1-git-send-email-ycheng@google.com>

From: Francis Yan <francisyyan@gmail.com>

This patch implements the skeleton of the TCP chronograph
instrumentation on sender side limits:

	1) idle (unspec)
	2) busy sending data other than 3-4 below
	3) rwnd-limited
	4) sndbuf-limited

The limits are enumerated in 'tcp_chrono'. Since a connection in
theory can idle forever, we do not track the actual length of this
uninteresting idle period. For the rest we track how long the sender
spends in each limit. At any point during the lifetime of a
connection, the sender must be in one of the four states.

If there are multiple conditions worthy of tracking in a chronograph,
then the highest priority enum takes precedence over
the other conditions: if something "more interesting"
starts happening, we stop the previous chrono and start a new one.

The time unit is the jiffy (u32) in order to save space in tcp_sock.
This implies the application must sample the stats at least once every
49 days with a 1ms jiffy.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/linux/tcp.h   |  7 +++++--
 include/net/tcp.h     | 14 ++++++++++++++
 net/ipv4/tcp_output.c | 30 ++++++++++++++++++++++++++++++
 3 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 32a7c7e..d5d3bd8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -211,8 +211,11 @@ struct tcp_sock {
 		u8 reord;    /* reordering detected */
 	} rack;
 	u16	advmss;		/* Advertised MSS			*/
-	u8	rate_app_limited:1,  /* rate_{delivered,interval_us} limited? */
-		unused:7;
+	u32	chrono_start;	/* Start time in jiffies of a TCP chrono */
+	u32	chrono_stat[3];	/* Time in jiffies for chrono_stat stats */
+	u8	chrono_type:2,	/* current chronograph type */
+		rate_app_limited:1,  /* rate_{delivered,interval_us} limited? */
+		unused:5;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de8073..e5ff408 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1516,6 +1516,20 @@ struct tcp_fastopen_context {
 	struct rcu_head		rcu;
 };
 
+/* Latencies incurred by various limits for a sender. They are
+ * chronograph-like stats that are mutually exclusive.
+ */
+enum tcp_chrono {
+	TCP_CHRONO_UNSPEC,
+	TCP_CHRONO_BUSY, /* Actively sending data (non-empty write queue) */
+	TCP_CHRONO_RWND_LIMITED, /* Stalled by insufficient receive window */
+	TCP_CHRONO_SNDBUF_LIMITED, /* Stalled by insufficient send buffer */
+	__TCP_CHRONO_MAX,
+};
+
+void tcp_chrono_start(struct sock *sk, const enum tcp_chrono type);
+void tcp_chrono_stop(struct sock *sk, const enum tcp_chrono type);
+
 /* write queue abstraction */
 static inline void tcp_write_queue_purge(struct sock *sk)
 {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 19105b4..34f7517 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2081,6 +2081,36 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
 	return false;
 }
 
+static void tcp_chrono_set(struct tcp_sock *tp, const enum tcp_chrono new)
+{
+	const u32 now = tcp_time_stamp;
+
+	if (tp->chrono_type > TCP_CHRONO_UNSPEC)
+		tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
+	tp->chrono_start = now;
+	tp->chrono_type = new;
+}
+
+void tcp_chrono_start(struct sock *sk, const enum tcp_chrono type)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	/* If there are multiple conditions worthy of tracking in a
+	 * chronograph then the highest priority enum takes precedence over
+	 * the other conditions. So that if something "more interesting"
+	 * starts happening, stop the previous chrono and start a new one.
+	 */
+	if (type > tp->chrono_type)
+		tcp_chrono_set(tp, type);
+}
+
+void tcp_chrono_stop(struct sock *sk, const enum tcp_chrono type)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	tcp_chrono_set(tp, TCP_CHRONO_UNSPEC);
+}
+
 /* This routine writes packets to the network.  It advances the
  * send_head.  This happens as incoming acks open up the remote
  * window for us.
-- 
2.8.0.rc3.226.g39d4020
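
For readers who want to experiment with the accounting outside the
kernel, here is a userspace sketch of the chronograph logic from this
patch. It assumes the caller passes the current time explicitly instead
of reading tcp_time_stamp, and the `chrono_*` and `CHRONO_*` names are
local to the sketch. Unsigned 32-bit subtraction keeps an interval
correct across a single counter wrap, which is why sampling must happen
within 2^32 jiffies (about 49.7 days at 1 ms per jiffy).

```c
#include <assert.h>
#include <stdint.h>

enum chrono {
	CHRONO_UNSPEC,
	CHRONO_BUSY,
	CHRONO_RWND_LIMITED,
	CHRONO_SNDBUF_LIMITED,
	CHRONO_MAX,
};

struct chrono_state {
	uint32_t start;			/* start time of the current chrono */
	uint32_t stat[CHRONO_MAX - 1];	/* accumulated time per limit */
	enum chrono type;		/* current chronograph type */
};

/* Close the running chrono (if any) and switch to new_type. */
static void chrono_set(struct chrono_state *c, enum chrono new_type,
		       uint32_t now)
{
	if (c->type > CHRONO_UNSPEC)
		c->stat[c->type - 1] += now - c->start;	/* wraps mod 2^32 */
	c->start = now;
	c->type = new_type;
}

/* Start a chrono only if it outranks the one currently running. */
static void chrono_start(struct chrono_state *c, enum chrono type,
			 uint32_t now)
{
	assert(type < CHRONO_MAX);
	if (type > c->type)
		chrono_set(c, type, now);
}
```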

^ permalink raw reply related

* [PATCH net-next v2 2/6] tcp: instrument how long TCP is busy sending
From: Yuchung Cheng @ 2016-11-28  7:07 UTC (permalink / raw)
  To: davem, soheil, francisyyan; +Cc: netdev, ncardwell, edumazet, Yuchung Cheng
In-Reply-To: <1480316838-154141-1-git-send-email-ycheng@google.com>

From: Francis Yan <francisyyan@gmail.com>

This patch measures TCP busy time, which is defined as the period
of time when sender has data (or FIN) to send. The time starts when
data is buffered and stops when the write queue is flushed by ACKs
or error events.

Note the busy time does not include SYN time, unless data is
included in SYN (i.e. Fast Open). It does include FIN time even
if the FIN carries no payload. Excluding pure FIN is possible but
would incur one additional test in the fast path, which may not
be worth it.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h     |  6 +++++-
 net/ipv4/tcp_input.c  |  3 +++
 net/ipv4/tcp_output.c | 19 ++++++++++++++++---
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e5ff408..3e097e3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1535,6 +1535,7 @@ static inline void tcp_write_queue_purge(struct sock *sk)
 {
 	struct sk_buff *skb;
 
+	tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
 	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
 		sk_wmem_free_skb(sk, skb);
 	sk_mem_reclaim(sk);
@@ -1593,8 +1594,10 @@ static inline void tcp_advance_send_head(struct sock *sk, const struct sk_buff *
 
 static inline void tcp_check_send_head(struct sock *sk, struct sk_buff *skb_unlinked)
 {
-	if (sk->sk_send_head == skb_unlinked)
+	if (sk->sk_send_head == skb_unlinked) {
 		sk->sk_send_head = NULL;
+		tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
+	}
 	if (tcp_sk(sk)->highest_sack == skb_unlinked)
 		tcp_sk(sk)->highest_sack = NULL;
 }
@@ -1616,6 +1619,7 @@ static inline void tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb
 	/* Queue it, remembering where we must start sending. */
 	if (sk->sk_send_head == NULL) {
 		sk->sk_send_head = skb;
+		tcp_chrono_start(sk, TCP_CHRONO_BUSY);
 
 		if (tcp_sk(sk)->highest_sack == NULL)
 			tcp_sk(sk)->highest_sack = skb;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 22e6a20..a5d1727 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3178,6 +3178,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			tp->lost_skb_hint = NULL;
 	}
 
+	if (!skb)
+		tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
+
 	if (likely(between(tp->snd_up, prior_snd_una, tp->snd_una)))
 		tp->snd_up = tp->snd_una;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 34f7517..e8ea584 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2096,8 +2096,8 @@ void tcp_chrono_start(struct sock *sk, const enum tcp_chrono type)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	/* If there are multiple conditions worthy of tracking in a
-	 * chronograph then the highest priority enum takes precedence over
-	 * the other conditions. So that if something "more interesting"
+	 * chronograph then the highest priority enum takes precedence
+	 * over the other conditions. So that if something "more interesting"
 	 * starts happening, stop the previous chrono and start a new one.
 	 */
 	if (type > tp->chrono_type)
@@ -2108,7 +2108,18 @@ void tcp_chrono_stop(struct sock *sk, const enum tcp_chrono type)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	tcp_chrono_set(tp, TCP_CHRONO_UNSPEC);
+
+	/* There are multiple conditions worthy of tracking in a
+	 * chronograph, so that the highest priority enum takes
+	 * precedence over the other conditions (see tcp_chrono_start).
+	 * If a condition stops, we only stop chrono tracking if
+	 * it's the "most interesting" or current chrono we are
+	 * tracking and starts busy chrono if we have pending data.
+	 */
+	if (tcp_write_queue_empty(sk))
+		tcp_chrono_set(tp, TCP_CHRONO_UNSPEC);
+	else if (type == tp->chrono_type)
+		tcp_chrono_set(tp, TCP_CHRONO_BUSY);
 }
 
 /* This routine writes packets to the network.  It advances the
@@ -3328,6 +3339,8 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 	fo->copied = space;
 
 	tcp_connect_queue_skb(sk, syn_data);
+	if (syn_data->len)
+		tcp_chrono_start(sk, TCP_CHRONO_BUSY);
 
 	err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
 
-- 
2.8.0.rc3.226.g39d4020
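
The revised stop semantics in this patch can be modelled in userspace as
below. This is a sketch under stated assumptions: the time is passed in
by the caller, a `write_queue_empty` flag stands in for
tcp_write_queue_empty(sk), and all names are local to the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum chrono {
	CHRONO_UNSPEC,
	CHRONO_BUSY,
	CHRONO_RWND_LIMITED,
	CHRONO_SNDBUF_LIMITED,
	CHRONO_MAX,
};

struct chrono_state {
	uint32_t start;
	uint32_t stat[CHRONO_MAX - 1];
	enum chrono type;
	bool write_queue_empty;	/* stand-in for tcp_write_queue_empty(sk) */
};

/* Close the running chrono (if any) and switch to new_type. */
static void chrono_set(struct chrono_state *c, enum chrono new_type,
		       uint32_t now)
{
	if (c->type > CHRONO_UNSPEC)
		c->stat[c->type - 1] += now - c->start;
	c->start = now;
	c->type = new_type;
}

/* Stop tracking a condition. With nothing left to send the sender goes
 * idle; otherwise it falls back to "busy", but only when the condition
 * being stopped is the one currently tracked.
 */
static void chrono_stop(struct chrono_state *c, enum chrono type,
			uint32_t now)
{
	assert(type < CHRONO_MAX);
	if (c->write_queue_empty)
		chrono_set(c, CHRONO_UNSPEC, now);
	else if (type == c->type)
		chrono_set(c, CHRONO_BUSY, now);
}
```

Stopping a condition that is not the current one leaves the running
chrono untouched as long as data is still queued.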

^ permalink raw reply related

* [PATCH net-next v2 3/6] tcp: instrument how long TCP is limited by receive window
From: Yuchung Cheng @ 2016-11-28  7:07 UTC (permalink / raw)
  To: davem, soheil, francisyyan; +Cc: netdev, ncardwell, edumazet, Yuchung Cheng
In-Reply-To: <1480316838-154141-1-git-send-email-ycheng@google.com>

From: Francis Yan <francisyyan@gmail.com>

This patch measures the total time when the TCP stops sending because
the receiver's advertised window is not large enough. Note that
once the limit is lifted we are likely in the busy status if we
have data pending.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_output.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e8ea584..b74444c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2144,7 +2144,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	unsigned int tso_segs, sent_pkts;
 	int cwnd_quota;
 	int result;
-	bool is_cwnd_limited = false;
+	bool is_cwnd_limited = false, is_rwnd_limited = false;
 	u32 max_segs;
 
 	sent_pkts = 0;
@@ -2181,8 +2181,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
+		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
+			is_rwnd_limited = true;
 			break;
+		}
 
 		if (tso_segs == 1) {
 			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
@@ -2227,6 +2229,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 	}
 
+	if (is_rwnd_limited)
+		tcp_chrono_start(sk, TCP_CHRONO_RWND_LIMITED);
+	else
+		tcp_chrono_stop(sk, TCP_CHRONO_RWND_LIMITED);
+
 	if (likely(sent_pkts)) {
 		if (tcp_in_cwnd_reduction(sk))
 			tp->prr_out += sent_pkts;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next v2 4/6] tcp: instrument how long TCP is limited by insufficient send buffer
From: Yuchung Cheng @ 2016-11-28  7:07 UTC (permalink / raw)
  To: davem, soheil, francisyyan; +Cc: netdev, ncardwell, edumazet, Yuchung Cheng
In-Reply-To: <1480316838-154141-1-git-send-email-ycheng@google.com>

From: Francis Yan <francisyyan@gmail.com>

This patch measures the amount of time when TCP runs out of new data
to send to the network due to insufficient send buffer, while TCP
is still busy delivering (i.e. write queue is not empty). The goal
is to indicate either the send buffer autotuning or user SO_SNDBUF
setting has resulted network under-utilization.

The measurement starts conservatively by checking various conditions
to minimize false claims (i.e. under-estimation is more likely).
The measurement stops when the SOCK_NOSPACE flag is cleared, but it
does not account for the time elapsed until the next application write.
Also, the measurement only starts if the sender is still busy sending
data, so that the time accounted as limited is part of the total busy
time.
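
As a rough illustration of the start conditions described above, here is
a minimal user-space sketch. It is not kernel code: struct conn_model,
enum model_state and sndbuf_limited_should_start() are hypothetical
names, and the fields only stand in for the socket state the patch
inspects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified connection states (not the kernel's enum). */
enum model_state { MODEL_ESTABLISHED, MODEL_CLOSE_WAIT, MODEL_OTHER };

/* Hypothetical snapshot of the state the start check looks at. */
struct conn_model {
	bool just_sent;      /* some data was just transmitted */
	bool cwnd_limited;   /* sender is limited by the congestion window */
	bool has_send_head;  /* unsent data is still queued */
	bool nospace;        /* stands in for the SOCK_NOSPACE flag */
	enum model_state state;
};

/* Returns true when the sndbuf-limited measurement would start. */
static bool sndbuf_limited_should_start(const struct conn_model *c)
{
	assert(c != NULL);

	if (!c->just_sent)       /* 1) just sent some data */
		return false;
	if (c->cwnd_limited)     /* 2) not cwnd limited */
		return false;
	if (c->has_send_head)    /* 3) no more data to send */
		return false;
	if (!c->nospace)         /* 4) application hit the buffer limit */
		return false;

	/* Only count while the connection can still carry data. */
	return c->state == MODEL_ESTABLISHED ||
	       c->state == MODEL_CLOSE_WAIT;
}
```

Every condition must hold at once, which is why the measurement tends
to under-estimate rather than over-estimate.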

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp.c        | 10 ++++++++--
 net/ipv4/tcp_input.c  |  5 ++++-
 net/ipv4/tcp_output.c | 12 ++++++++++++
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 913f9bb..259ffb5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -996,8 +996,11 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		goto out;
 out_err:
 	/* make sure we wake any epoll edge trigger waiter */
-	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 && err == -EAGAIN))
+	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
+		     err == -EAGAIN)) {
 		sk->sk_write_space(sk);
+		tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
+	}
 	return sk_stream_error(sk, flags, err);
 }
 
@@ -1331,8 +1334,11 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 out_err:
 	err = sk_stream_error(sk, flags, err);
 	/* make sure we wake any epoll edge trigger waiter */
-	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 && err == -EAGAIN))
+	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
+		     err == -EAGAIN)) {
 		sk->sk_write_space(sk);
+		tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
+	}
 	release_sock(sk);
 	return err;
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a5d1727..56fe736 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5059,8 +5059,11 @@ static void tcp_check_space(struct sock *sk)
 		/* pairs with tcp_poll() */
 		smp_mb__after_atomic();
 		if (sk->sk_socket &&
-		    test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
+		    test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
 			tcp_new_space(sk);
+			if (!test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
+				tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
+		}
 	}
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b74444c..d3545d0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1514,6 +1514,18 @@ static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
 		if (sysctl_tcp_slow_start_after_idle &&
 		    (s32)(tcp_time_stamp - tp->snd_cwnd_stamp) >= inet_csk(sk)->icsk_rto)
 			tcp_cwnd_application_limited(sk);
+
+		/* The following conditions together indicate the starvation
+		 * is caused by insufficient sender buffer:
+		 * 1) just sent some data (see tcp_write_xmit)
+		 * 2) not cwnd limited (this else condition)
+		 * 3) no more data to send (null tcp_send_head )
+		 * 4) application is hitting buffer limit (SOCK_NOSPACE)
+		 */
+		if (!tcp_send_head(sk) && sk->sk_socket &&
+		    test_bit(SOCK_NOSPACE, &sk->sk_socket->flags) &&
+		    (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
+			tcp_chrono_start(sk, TCP_CHRONO_SNDBUF_LIMITED);
 	}
 }
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next v2 5/6] tcp: export sender limits chronographs to TCP_INFO
From: Yuchung Cheng @ 2016-11-28  7:07 UTC (permalink / raw)
  To: davem, soheil, francisyyan; +Cc: netdev, ncardwell, edumazet, Yuchung Cheng
In-Reply-To: <1480316838-154141-1-git-send-email-ycheng@google.com>

From: Francis Yan <francisyyan@gmail.com>

This patch exports all the sender chronograph measurements collected
in the previous patches to the TCP_INFO interface. Note that the busy
time exported includes all the other sending limits (rwnd-limited,
sndbuf-limited). Internally the time unit is jiffies, but externally
the measurements are in microseconds for future extensions.
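
The export arithmetic can be sketched in user space as follows. This is
only a model under stated assumptions: MODEL_HZ is an assumed tick rate
(the kernel's HZ depends on the build), the struct layouts and the
"now" parameter are hypothetical stand-ins, and all model_* names are
invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_HZ           1000u     /* assumed tick rate */
#define MODEL_USEC_PER_SEC 1000000u

enum model_chrono {
	MODEL_CHRONO_UNSPEC,
	MODEL_CHRONO_BUSY,
	MODEL_CHRONO_RWND_LIMITED,
	MODEL_CHRONO_SNDBUF_LIMITED,
	MODEL_CHRONO_MAX
};

/* Hypothetical per-connection state, in ticks. */
struct model_sock {
	uint32_t chrono_stat[3];        /* accumulated ticks, index = type - 1 */
	uint32_t chrono_start;          /* tick at which the running type began */
	enum model_chrono chrono_type;  /* type currently being measured */
};

/* Hypothetical exported values, in microseconds. */
struct model_info {
	uint64_t busy_time;
	uint64_t rwnd_limited;
	uint64_t sndbuf_limited;
};

static void model_chrono_export(const struct model_sock *tp, uint32_t now,
				struct model_info *info)
{
	uint64_t stats[MODEL_CHRONO_MAX] = { 0 }, total = 0;
	int i;

	assert(tp && info);
	for (i = MODEL_CHRONO_BUSY; i < MODEL_CHRONO_MAX; ++i) {
		stats[i] = tp->chrono_stat[i - 1];
		/* Add the still-running interval for the active type. */
		if (i == (int)tp->chrono_type)
			stats[i] += (uint32_t)(now - tp->chrono_start);
		/* Convert ticks to microseconds. */
		stats[i] *= MODEL_USEC_PER_SEC / MODEL_HZ;
		total += stats[i];
	}

	/* Busy time is the sum over all types, limited ones included. */
	info->busy_time = total;
	info->rwnd_limited = stats[MODEL_CHRONO_RWND_LIMITED];
	info->sndbuf_limited = stats[MODEL_CHRONO_SNDBUF_LIMITED];
}
```

With 100, 20 and 30 accumulated ticks and a rwnd-limited interval that
has been running for 5 ticks, the model reports 155000 usec busy,
25000 usec rwnd-limited and 30000 usec sndbuf-limited.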

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/uapi/linux/tcp.h |  4 ++++
 net/ipv4/tcp.c           | 20 ++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 73ac0db..2863b66 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -214,6 +214,10 @@ struct tcp_info {
 	__u32	tcpi_data_segs_out;	/* RFC4898 tcpEStatsDataSegsOut */
 
 	__u64   tcpi_delivery_rate;
+
+	__u64	tcpi_busy_time;      /* Time (usec) busy sending data */
+	__u64	tcpi_rwnd_limited;   /* Time (usec) limited by receive window */
+	__u64	tcpi_sndbuf_limited; /* Time (usec) limited by send buffer */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 259ffb5..cdde20f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2708,6 +2708,25 @@ int compat_tcp_setsockopt(struct sock *sk, int level, int optname,
 EXPORT_SYMBOL(compat_tcp_setsockopt);
 #endif
 
+static void tcp_get_info_chrono_stats(const struct tcp_sock *tp,
+				      struct tcp_info *info)
+{
+	u64 stats[__TCP_CHRONO_MAX], total = 0;
+	enum tcp_chrono i;
+
+	for (i = TCP_CHRONO_BUSY; i < __TCP_CHRONO_MAX; ++i) {
+		stats[i] = tp->chrono_stat[i - 1];
+		if (i == tp->chrono_type)
+			stats[i] += tcp_time_stamp - tp->chrono_start;
+		stats[i] *= USEC_PER_SEC / HZ;
+		total += stats[i];
+	}
+
+	info->tcpi_busy_time = total;
+	info->tcpi_rwnd_limited = stats[TCP_CHRONO_RWND_LIMITED];
+	info->tcpi_sndbuf_limited = stats[TCP_CHRONO_SNDBUF_LIMITED];
+}
+
 /* Return information about state of tcp endpoint in API format. */
 void tcp_get_info(struct sock *sk, struct tcp_info *info)
 {
@@ -2800,6 +2819,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 	info->tcpi_bytes_acked = tp->bytes_acked;
 	info->tcpi_bytes_received = tp->bytes_received;
 	info->tcpi_notsent_bytes = max_t(int, 0, tp->write_seq - tp->snd_nxt);
+	tcp_get_info_chrono_stats(tp, info);
 
 	unlock_sock_fast(sk, slow);
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next v2 6/6] tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING
From: Yuchung Cheng @ 2016-11-28  7:07 UTC (permalink / raw)
  To: davem, soheil, francisyyan; +Cc: netdev, ncardwell, edumazet, Yuchung Cheng
In-Reply-To: <1480316838-154141-1-git-send-email-ycheng@google.com>

From: Francis Yan <francisyyan@gmail.com>

This patch exports the sender chronograph stats via the socket
SO_TIMESTAMPING channel. Currently we can instrument how long a
particular application unit of data was queued in TCP by tracking
SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
these sender chronograph stats exported simultaneously along with
these timestamps allows further breaking down the various sender
limitations. For example, a video server can tell if a particular
chunk of video on a connection takes a long time to deliver because
TCP was experiencing a small receive window. Before this patch, it is
not possible to tell without packet traces.

To request these stats, the user needs to set the
SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
while requesting other SOF_TIMESTAMPING TX timestamps. When the
timestamps are available in the error queue, the stats are returned
in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
as a list of TLVs (struct nlattr) of types TCP_NLA_BUSY,
TCP_NLA_RWND_LIMITED and TCP_NLA_SNDBUF_LIMITED. The unit is
microseconds.
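
For readers on the receiving side, the following sketch shows one way
the TLV list in such a control message payload could be walked. It is
an assumption-laden model, not the kernel's or any library's API: the
header struct mirrors the usual netlink attribute layout (16-bit
length, 16-bit type, 4-byte alignment), the attribute numbers follow
the enum in this patch, and every model_* name is hypothetical.
Attributes of unknown type, such as padding, are skipped.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the netlink attribute header (length, then type). */
struct model_nlattr {
	uint16_t nla_len;   /* header + payload, without trailing padding */
	uint16_t nla_type;
};

#define MODEL_NLA_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)
#define MODEL_NLA_HDRLEN     MODEL_NLA_ALIGN(sizeof(struct model_nlattr))

/* Attribute numbering as in the patch's enum. */
enum {
	MODEL_NLA_PAD,
	MODEL_NLA_BUSY,
	MODEL_NLA_RWND_LIMITED,
	MODEL_NLA_SNDBUF_LIMITED
};

struct model_stats {
	uint64_t busy_us;
	uint64_t rwnd_limited_us;
	uint64_t sndbuf_limited_us;
};

/* Test helper: append one u64 attribute, return the new offset. */
static size_t model_put_u64(unsigned char *buf, size_t off,
			    uint16_t type, uint64_t val)
{
	struct model_nlattr hdr;

	hdr.nla_len = (uint16_t)(MODEL_NLA_HDRLEN + sizeof(val));
	hdr.nla_type = type;
	memcpy(buf + off, &hdr, sizeof(hdr));
	memcpy(buf + off + MODEL_NLA_HDRLEN, &val, sizeof(val));
	return off + MODEL_NLA_ALIGN(hdr.nla_len);
}

/* Walk the TLV list; returns 0 on success, -1 on a malformed attribute. */
static int model_parse_opt_stats(const unsigned char *buf, size_t len,
				 struct model_stats *out)
{
	size_t off = 0;

	memset(out, 0, sizeof(*out));
	while (off < len) {
		struct model_nlattr hdr;
		size_t step;
		uint64_t val;

		if (len - off < MODEL_NLA_HDRLEN)
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.nla_len < MODEL_NLA_HDRLEN || hdr.nla_len > len - off)
			return -1;

		/* Only 8-byte payloads carry one of the stats. */
		if (hdr.nla_len - MODEL_NLA_HDRLEN == sizeof(val)) {
			memcpy(&val, buf + off + MODEL_NLA_HDRLEN, sizeof(val));
			if (hdr.nla_type == MODEL_NLA_BUSY)
				out->busy_us = val;
			else if (hdr.nla_type == MODEL_NLA_RWND_LIMITED)
				out->rwnd_limited_us = val;
			else if (hdr.nla_type == MODEL_NLA_SNDBUF_LIMITED)
				out->sndbuf_limited_us = val;
		}

		/* Advance past the attribute and its alignment padding. */
		step = MODEL_NLA_ALIGN(hdr.nla_len);
		off = (step > len - off) ? len : off + step;
	}
	return 0;
}
```

In a real receiver the buffer and length would come from the data of
the SCM_TIMESTAMPING_OPT_STATS control message read from the error
queue; model_put_u64() exists only to build test input.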

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
ChangeLog since v1:
 - fix build break if CONFIG_INET is not defined

 Documentation/networking/timestamping.txt | 10 ++++++++++
 arch/alpha/include/uapi/asm/socket.h      |  2 ++
 arch/frv/include/uapi/asm/socket.h        |  2 ++
 arch/ia64/include/uapi/asm/socket.h       |  2 ++
 arch/m32r/include/uapi/asm/socket.h       |  2 ++
 arch/mips/include/uapi/asm/socket.h       |  2 ++
 arch/mn10300/include/uapi/asm/socket.h    |  2 ++
 arch/parisc/include/uapi/asm/socket.h     |  2 ++
 arch/powerpc/include/uapi/asm/socket.h    |  2 ++
 arch/s390/include/uapi/asm/socket.h       |  2 ++
 arch/sparc/include/uapi/asm/socket.h      |  2 ++
 arch/xtensa/include/uapi/asm/socket.h     |  2 ++
 include/linux/tcp.h                       |  2 ++
 include/uapi/asm-generic/socket.h         |  2 ++
 include/uapi/linux/net_tstamp.h           |  3 ++-
 include/uapi/linux/tcp.h                  |  8 ++++++++
 net/core/skbuff.c                         | 14 +++++++++++---
 net/core/sock.c                           |  7 +++++++
 net/ipv4/tcp.c                            | 20 ++++++++++++++++++++
 net/socket.c                              |  7 ++++++-
 20 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index 671cccf..96f5069 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -182,6 +182,16 @@ SOF_TIMESTAMPING_OPT_TSONLY:
   the timestamp even if sysctl net.core.tstamp_allow_data is 0.
   This option disables SOF_TIMESTAMPING_OPT_CMSG.
 
+SOF_TIMESTAMPING_OPT_STATS:
+
+  Optional stats that are obtained along with the transmit timestamps.
+  It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the
+  transmit timestamp is available, the stats are available in a
+  separate control message of type SCM_TIMESTAMPING_OPT_STATS, as a
+  list of TLVs (struct nlattr) of types. These stats allow the
+  application to associate various transport layer stats with
+  the transmit timestamps, such as how long a certain block of
+  data was limited by peer's receiver window.
 
 New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
 disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 9e46d6e..afc901b 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -97,4 +97,6 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index afbc98f0..81e0353 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -90,5 +90,7 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 0018fad..57feb0c 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -99,4 +99,6 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 5fe42fc..5853f8e9 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -90,4 +90,6 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 2027240a..566ecdc 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -108,4 +108,6 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index 5129f23..0e12527 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -90,4 +90,6 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 9c935d7..7a109b7 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -89,4 +89,6 @@
 
 #define SO_CNX_ADVICE		0x402E
 
+#define SCM_TIMESTAMPING_OPT_STATS	0x402F
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index 1672e33..44583a5 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -97,4 +97,6 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif	/* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 41b51c2..b24a64c 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -96,4 +96,6 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 31aede3..a25dc32 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -86,6 +86,8 @@
 
 #define SO_CNX_ADVICE		0x0037
 
+#define SCM_TIMESTAMPING_OPT_STATS	0x0038
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 81435d9..9fdbe1f 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d5d3bd8..00e0ee8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -428,4 +428,6 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp)
 	tp->saved_syn = NULL;
 }
 
+struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk);
+
 #endif	/* _LINUX_TCP_H */
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 67d632f..2c748dd 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -92,4 +92,6 @@
 
 #define SO_CNX_ADVICE		53
 
+#define SCM_TIMESTAMPING_OPT_STATS	54
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 264e515..464dcca 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -25,8 +25,9 @@ enum {
 	SOF_TIMESTAMPING_TX_ACK = (1<<9),
 	SOF_TIMESTAMPING_OPT_CMSG = (1<<10),
 	SOF_TIMESTAMPING_OPT_TSONLY = (1<<11),
+	SOF_TIMESTAMPING_OPT_STATS = (1<<12),
 
-	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_TSONLY,
+	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_STATS,
 	SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
 				 SOF_TIMESTAMPING_LAST
 };
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 2863b66..c53de26 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -220,6 +220,14 @@ struct tcp_info {
 	__u64	tcpi_sndbuf_limited; /* Time (usec) limited by send buffer */
 };
 
+/* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
+enum {
+	TCP_NLA_PAD,
+	TCP_NLA_BUSY,		/* Time (usec) busy sending data */
+	TCP_NLA_RWND_LIMITED,	/* Time (usec) limited by receive window */
+	TCP_NLA_SNDBUF_LIMITED,	/* Time (usec) limited by send buffer */
+};
+
 /* for TCP_MD5SIG socket option */
 #define TCP_MD5SIG_MAXKEYLEN	80
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d1d1a5a..ea6fa95 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3839,10 +3839,18 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (!skb_may_tx_timestamp(sk, tsonly))
 		return;
 
-	if (tsonly)
-		skb = alloc_skb(0, GFP_ATOMIC);
-	else
+	if (tsonly) {
+#ifdef CONFIG_INET
+		if ((sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS) &&
+		    sk->sk_protocol == IPPROTO_TCP &&
+		    sk->sk_type == SOCK_STREAM)
+			skb = tcp_get_timestamping_opt_stats(sk);
+		else
+#endif
+			skb = alloc_skb(0, GFP_ATOMIC);
+	} else {
 		skb = skb_clone(orig_skb, GFP_ATOMIC);
+	}
 	if (!skb)
 		return;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 14e6145..d8c7f8c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -854,6 +854,13 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 				sk->sk_tskey = 0;
 			}
 		}
+
+		if (val & SOF_TIMESTAMPING_OPT_STATS &&
+		    !(val & SOF_TIMESTAMPING_OPT_TSONLY)) {
+			ret = -EINVAL;
+			break;
+		}
+
 		sk->sk_tsflags = val;
 		if (val & SOF_TIMESTAMPING_RX_SOFTWARE)
 			sock_enable_timestamp(sk,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cdde20f..1149b48 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2841,6 +2841,26 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
 
+struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *stats;
+	struct tcp_info info;
+
+	stats = alloc_skb(3 * nla_total_size_64bit(sizeof(u64)), GFP_ATOMIC);
+	if (!stats)
+		return NULL;
+
+	tcp_get_info_chrono_stats(tp, &info);
+	nla_put_u64_64bit(stats, TCP_NLA_BUSY,
+			  info.tcpi_busy_time, TCP_NLA_PAD);
+	nla_put_u64_64bit(stats, TCP_NLA_RWND_LIMITED,
+			  info.tcpi_rwnd_limited, TCP_NLA_PAD);
+	nla_put_u64_64bit(stats, TCP_NLA_SNDBUF_LIMITED,
+			  info.tcpi_sndbuf_limited, TCP_NLA_PAD);
+	return stats;
+}
+
 static int do_tcp_getsockopt(struct sock *sk, int level,
 		int optname, char __user *optval, int __user *optlen)
 {
diff --git a/net/socket.c b/net/socket.c
index e2584c5..e631894 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -693,9 +693,14 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	    (sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
 	    ktime_to_timespec_cond(shhwtstamps->hwtstamp, tss.ts + 2))
 		empty = 0;
-	if (!empty)
+	if (!empty) {
 		put_cmsg(msg, SOL_SOCKET,
 			 SCM_TIMESTAMPING, sizeof(tss), &tss);
+
+		if (skb->len && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_STATS))
+			put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING_OPT_STATS,
+				 skb->len, skb->data);
+	}
 }
 EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH] net: arc_emac: add dependencies on associated arches and compile test
From: Peter Robinson @ 2016-11-28  7:12 UTC (permalink / raw)
  To: Xing Zheng, Alexander Kochetkov, Philippe Reynes, David S. Miller,
	netdev
  Cc: Peter Robinson

Add dependencies on the architectures that support these devices and
add compile test to ensure ongoing code build coverage.

Signed-off-by: Peter Robinson <pbrobinson@gmail.com>
---
 drivers/net/ethernet/arc/Kconfig | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/arc/Kconfig b/drivers/net/ethernet/arc/Kconfig
index 6890451..e743ddf 100644
--- a/drivers/net/ethernet/arc/Kconfig
+++ b/drivers/net/ethernet/arc/Kconfig
@@ -17,13 +17,14 @@ if NET_VENDOR_ARC
 
 config ARC_EMAC_CORE
 	tristate
+	depends on ARC || ARCH_ROCKCHIP || COMPILE_TEST
 	select MII
 	select PHYLIB
 
 config ARC_EMAC
 	tristate "ARC EMAC support"
 	select ARC_EMAC_CORE
-	depends on OF_IRQ && OF_NET && HAS_DMA
+	depends on OF_IRQ && OF_NET && HAS_DMA && (ARC || COMPILE_TEST)
 	---help---
 	  On some legacy ARC (Synopsys) FPGA boards such as ARCAngel4/ML50x
 	  non-standard on-chip ethernet device ARC EMAC 10/100 is used.
@@ -32,7 +33,7 @@ config ARC_EMAC
 config EMAC_ROCKCHIP
 	tristate "Rockchip EMAC support"
 	select ARC_EMAC_CORE
-	depends on OF_IRQ && OF_NET && REGULATOR && HAS_DMA
+	depends on OF_IRQ && OF_NET && REGULATOR && HAS_DMA && (ARCH_ROCKCHIP || COMPILE_TEST)
 	---help---
 	  Support for Rockchip RK3036/RK3066/RK3188 EMAC ethernet controllers.
 	  This selects Rockchip SoC glue layer support for the
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH net-next v3 0/4] Documentation: net: phy: Improve documentation
From: Jerome Brunet @ 2016-11-28  7:31 UTC (permalink / raw)
  To: Florian Fainelli, netdev
  Cc: davem, andrew, sf84, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur
In-Reply-To: <20161128024515.13070-1-f.fainelli@gmail.com>

On Sun, 2016-11-27 at 18:45 -0800, Florian Fainelli wrote:
> Hi all,
> 
> This patch series addresses discussions and feedback that was
> recently received
> on the mailing-list in the area of: flow control/pause frames,
> interpretation of
> phy_interface_t and finally add some links to useful standards
> documents.
> 
> Changes in v3:
> 
> - add Timur's feedback into patch 3
> 
> Changes in v2:
> 
> - clarify a few things in the RGMII section, add a paragraph about
> common issues
>   with RGMII delay mismatches
> 

Thanks a lot Florian. This is really helping, especially the part about
RGMII delays.

Reviewed-by: Jerome Brunet <jbrunet@baylibre.com>

> Florian Fainelli (4):
>   Documentation: net: phy: remove description of function pointers
>   Documentation: net: phy: Add a paragraph about pause frames/flow
>     control
>   Documentation: net: phy: Add blurb about RGMII
>   Documentation: net: phy: Add links to several standards documents
> 
>  Documentation/networking/phy.txt | 140
> +++++++++++++++++++++++++++++----------
>  1 file changed, 105 insertions(+), 35 deletions(-)
> 

^ permalink raw reply

* Re: [PATCH net] net/sched: act_pedit: limit negative offset
From: Amir Vadai @ 2016-11-28  7:51 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, netdev, jhs, ogerlitz, hadarh, jiri
In-Reply-To: <20161128.004936.2064564176474656911.davem@davemloft.net>

On Mon, Nov 28, 2016 at 12:49:36AM -0500, David Miller wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Sun, 27 Nov 2016 21:39:33 -0800
> 
> > On Sun, Nov 27, 2016 at 7:58 AM, Amir Vadai <amir@vadai.me> wrote:
> >> Should not allow setting a negative offset that goes below the skb head.
> > ...
> >> diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
> >> index b54d56d4959b..e79e8a88f2d2 100644
> >> --- a/net/sched/act_pedit.c
> >> +++ b/net/sched/act_pedit.c
> >> @@ -154,8 +154,11 @@ static int tcf_pedit(struct sk_buff *skb, const struct tc_action *a,
> >>                         }
> >>
> >>                         ptr = skb_header_pointer(skb, off + offset, 4, &_data);
> >> -                       if (!ptr)
> >> +                       if ((unsigned char *)ptr < skb->head) {
> > 
> > 
> > ptr returned could be &_data, which is on stack, so why this comparison
> > makes sense for this case?
> 
> Indeed, this will definitely do the wrong thing when the on-stack area
> passed back to ptr.
yes - my bad. will correct it and send v1

^ permalink raw reply

* [PATCH 1/1] net: macb: ensure ordering write to re-enable RX smoothly
From: Zumeng Chen @ 2016-11-28  7:57 UTC (permalink / raw)
  To: nicolas.ferre; +Cc: davem, netdev, linux-kernel

When the hardware issue described by the inline comments happens, the
register write pattern looks like the following:

  <write ~MACB_BIT(RE)>
  + wmb();
  <write MACB_BIT(RE)>

A memory barrier may be needed between these two write operations, so add
wmb() to ensure the RE bit in NCR flips from 0 to 1.

Signed-off-by: Zumeng Chen <zumeng.chen@windriver.com>
---
 drivers/net/ethernet/cadence/macb.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index 533653b..2f9c5b2 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -1156,6 +1156,7 @@ static irqreturn_t macb_interrupt(int irq, void *dev_id)
 		if (status & MACB_BIT(RXUBR)) {
 			ctrl = macb_readl(bp, NCR);
 			macb_writel(bp, NCR, ctrl & ~MACB_BIT(RE));
+			wmb();
 			macb_writel(bp, NCR, ctrl | MACB_BIT(RE));
 
 			if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
-- 
2.4.11

^ permalink raw reply related

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Add 88E6176 device tree support
From: Uwe Kleine-König @ 2016-11-28  8:09 UTC (permalink / raw)
  To: Andrew Lunn, Rob Herring, Frank Rowand
  Cc: Andreas Färber, netdev, linux-arm-kernel, Michal Hrusecki,
	Tomas Hlavacek, Bed??icha Ko??atu, Vivien Didelot,
	Florian Fainelli, linux-kernel, devicetree
In-Reply-To: <20161127231009.GA17704@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 2004 bytes --]

Hello Andrew,

On Mon, Nov 28, 2016 at 12:10:09AM +0100, Andrew Lunn wrote:
> > Try to see it from my perspective: I see that some vf610 device I don't
> > have (found via `git grep marvell,mv88e6` or so) uses
> > "marvell,mv88e6085". I then assume it has that device on board. How
> > would I know it doesn't? Same for the other boards you mention.
> > 
> > Unfortunately some of your replies are slightly cryptic. Had you simply
> > replied 'please just use "marvell,mv88e6085" instead', it would've been
> > much more clear what you want. (Same for extending the subject instead
> > of just pointing to some FAQ.)
> 
> By reading the FAQ you have learnt more than me saying put the correct
> tree in the subject line. By asking you to explain why you need a
> compatible string, i'm trying to make you think, look at the code and
> understand it. In the future, you might think and understand the code
> before posting a patch, and then we all save time.

I agree with Andreas though that it gives a schoolteacher impression.
Something like:

	Please fix the subject. Check the FAQ for the details, which btw
	is worth reading completely.

is IMHO better in this regard, and once you have found the problem there
you don't need to ask back whether that is what was meant.

> > So are you okay with patch 1/2 documenting the compatible? Then we could
> > drop 2/2 and use "marvell,mv88e6176", "marvell,mv88e6085" instead of
> > just the latter. Or would you rather drop both and keep the actual chip
> > a comment?
> 
> A comment only please.

I still wonder (and didn't get an answer back when I asked about this)
why a comment is preferred here. For other devices I know it's usual and
requested by the maintainers to use:

	compatible = "exact name", "earlier device to match driver";

This is more robust, documents the situation more formally and makes
it easier to grep. The price to pay is only a few bytes in the dtb,
which IMO is ok.

Best regards
Uwe

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 1/5] net: mvneta: Use cacheable memory to store the rx buffer virtual address
From: Jisheng Zhang @ 2016-11-28  8:35 UTC (permalink / raw)
  To: Gregory CLEMENT
  Cc: David S. Miller, linux-kernel, netdev, Arnd Bergmann,
	Jason Cooper, Andrew Lunn, Sebastian Hesselbarth,
	Thomas Petazzoni, linux-arm-kernel, Nadav Haklai, Marcin Wojtas,
	Dmitri Epshtein, Yelena Krivosheev
In-Reply-To: <7e6004f918d3fcde9ae71e7893d26b19086236a3.1480087510.git-series.gregory.clement@free-electrons.com>

Hi Gregory,

On Fri, 25 Nov 2016 16:30:14 +0100 Gregory CLEMENT wrote:

> Until now the virtual address of the received buffer was stored in the
> cookie field of the rx descriptor. However, this field is only 32 bits
> wide, which prevents using the driver on a 64-bit architecture.
> 
> With this patch the virtual address is stored in an array not shared with
> the hardware (no more need to use the DMA API). Thanks to this, the array
> can be kept in cacheable memory, unlike the rx descriptor members.
> 
> The change is done in the swbm path only because the hwbm uses the cookie
> field; this also means that currently the hwbm is not usable on 64-bit.
> 
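
The side-array scheme described in the quoted commit message can be
pictured with a small user-space sketch: the descriptor's position in
the ring (a pointer difference) indexes a parallel array of full-width
pointers. The model_* types and helpers below are hypothetical and
only mirror the shape of the driver's structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: only 32-bit fields, as seen by the hardware. */
struct model_rx_desc {
	uint32_t status;
	uint32_t buf_phys_addr;
};

/* Hypothetical queue: descriptor ring plus a parallel pointer array. */
struct model_rx_queue {
	struct model_rx_desc *descs;
	void **buf_virt_addr;   /* one entry per descriptor, CPU-only */
	int size;
};

/* Remember the buffer's virtual address for a given descriptor. */
static void model_desc_set_virt(struct model_rx_queue *rxq,
				struct model_rx_desc *desc, void *virt)
{
	ptrdiff_t i = desc - rxq->descs;   /* index of desc in the ring */

	assert(i >= 0 && i < rxq->size);
	rxq->buf_virt_addr[i] = virt;
}

/* Look the virtual address up again from the descriptor pointer. */
static void *model_desc_get_virt(const struct model_rx_queue *rxq,
				 const struct model_rx_desc *desc)
{
	ptrdiff_t i = desc - rxq->descs;

	assert(i >= 0 && i < rxq->size);
	return rxq->buf_virt_addr[i];
}
```

Because the array holds native pointers, nothing is truncated on a
64-bit build, which is the limitation of the 32-bit cookie field.
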
> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 96 ++++++++++++++++++++++++----
>  1 file changed, 84 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 87274d4ab102..b6849f88cab7 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -561,6 +561,9 @@ struct mvneta_rx_queue {
>  	u32 pkts_coal;
>  	u32 time_coal;
>  
> +	/* Virtual address of the RX buffer */
> +	void  **buf_virt_addr;

can we store buf_phys_addr in cacheable memory as well?

> +
>  	/* Virtual address of the RX DMA descriptors array */
>  	struct mvneta_rx_desc *descs;
>  
> @@ -1573,10 +1576,14 @@ static void mvneta_tx_done_pkts_coal_set(struct mvneta_port *pp,
>  
>  /* Handle rx descriptor fill by setting buf_cookie and buf_phys_addr */
>  static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
> -				u32 phys_addr, u32 cookie)
> +				u32 phys_addr, void *virt_addr,
> +				struct mvneta_rx_queue *rxq)
>  {
> -	rx_desc->buf_cookie = cookie;
> +	int i;
> +
>  	rx_desc->buf_phys_addr = phys_addr;
> +	i = rx_desc - rxq->descs;
> +	rxq->buf_virt_addr[i] = virt_addr;
>  }
>  
>  /* Decrement sent descriptors counter */
> @@ -1781,7 +1788,8 @@ EXPORT_SYMBOL_GPL(mvneta_frag_free);
>  
>  /* Refill processing for SW buffer management */
>  static int mvneta_rx_refill(struct mvneta_port *pp,
> -			    struct mvneta_rx_desc *rx_desc)
> +			    struct mvneta_rx_desc *rx_desc,
> +			    struct mvneta_rx_queue *rxq)
>  
>  {
>  	dma_addr_t phys_addr;
> @@ -1799,7 +1807,7 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
>  		return -ENOMEM;
>  	}
>  
> -	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
> +	mvneta_rx_desc_fill(rx_desc, phys_addr, data, rxq);
>  	return 0;
>  }
>  
> @@ -1861,7 +1869,12 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
>  
>  	for (i = 0; i < rxq->size; i++) {
>  		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
> -		void *data = (void *)rx_desc->buf_cookie;
> +		void *data;
> +
> +		if (!pp->bm_priv)
> +			data = rxq->buf_virt_addr[i];
> +		else
> +			data = (void *)(uintptr_t)rx_desc->buf_cookie;
>  
>  		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
>  				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
> @@ -1894,12 +1907,13 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
>  		unsigned char *data;
>  		dma_addr_t phys_addr;
>  		u32 rx_status, frag_size;
> -		int rx_bytes, err;
> +		int rx_bytes, err, index;
>  
>  		rx_done++;
>  		rx_status = rx_desc->status;
>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
> -		data = (unsigned char *)rx_desc->buf_cookie;
> +		index = rx_desc - rxq->descs;
> +		data = (unsigned char *)rxq->buf_virt_addr[index];
>  		phys_addr = rx_desc->buf_phys_addr;
>  
>  		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
> @@ -1938,7 +1952,7 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
>  		}
>  
>  		/* Refill processing */
> -		err = mvneta_rx_refill(pp, rx_desc);
> +		err = mvneta_rx_refill(pp, rx_desc, rxq);
>  		if (err) {
>  			netdev_err(dev, "Linux processing - Can't refill\n");
>  			rxq->missed++;
> @@ -2020,7 +2034,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>  		rx_done++;
>  		rx_status = rx_desc->status;
>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
> -		data = (unsigned char *)rx_desc->buf_cookie;
> +		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
>  		phys_addr = rx_desc->buf_phys_addr;
>  		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
>  		bm_pool = &pp->bm_priv->bm_pools[pool_id];
> @@ -2708,6 +2722,57 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
>  	return rx_done;
>  }
>  
> +/* Refill processing for HW buffer management */
> +static int mvneta_rx_hwbm_refill(struct mvneta_port *pp,
> +				 struct mvneta_rx_desc *rx_desc)
> +
> +{
> +	dma_addr_t phys_addr;
> +	void *data;
> +
> +	data = mvneta_frag_alloc(pp->frag_size);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	phys_addr = dma_map_single(pp->dev->dev.parent, data,
> +				   MVNETA_RX_BUF_SIZE(pp->pkt_size),
> +				   DMA_FROM_DEVICE);
> +	if (unlikely(dma_mapping_error(pp->dev->dev.parent, phys_addr))) {
> +		mvneta_frag_free(pp->frag_size, data);
> +		return -ENOMEM;
> +	}
> +
> +	phys_addr += pp->rx_offset_correction;
> +	rx_desc->buf_phys_addr = phys_addr;
> +	rx_desc->buf_cookie = (uintptr_t)data;
> +
> +	return 0;
> +}
> +
> +/* Handle rxq fill: allocates rxq skbs; called when initializing a port */
> +static int mvneta_rxq_bm_fill(struct mvneta_port *pp,
> +			      struct mvneta_rx_queue *rxq,
> +			      int num)
> +{
> +	int i;
> +
> +	for (i = 0; i < num; i++) {
> +		memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));
> +		if (mvneta_rx_hwbm_refill(pp, rxq->descs + i) != 0) {
> +			netdev_err(pp->dev, "%s:rxq %d, %d of %d buffs  filled\n",
> +				   __func__, rxq->id, i, num);
> +			break;
> +		}
> +	}
> +
> +	/* Add this number of RX descriptors as non occupied (ready to
> +	 * get packets)
> +	 */
> +	mvneta_rxq_non_occup_desc_add(pp, rxq, i);
> +
> +	return i;
> +}
> +
>  /* Handle rxq fill: allocates rxq skbs; called when initializing a port */
>  static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
>  			   int num)
> @@ -2716,7 +2781,7 @@ static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
>  
>  	for (i = 0; i < num; i++) {
>  		memset(rxq->descs + i, 0, sizeof(struct mvneta_rx_desc));
> -		if (mvneta_rx_refill(pp, rxq->descs + i) != 0) {
> +		if (mvneta_rx_refill(pp, rxq->descs + i, rxq) != 0) {
>  			netdev_err(pp->dev, "%s:rxq %d, %d of %d buffs  filled\n",
>  				__func__, rxq->id, i, num);
>  			break;
> @@ -2784,14 +2849,21 @@ static int mvneta_rxq_init(struct mvneta_port *pp,
>  		mvneta_rxq_buf_size_set(pp, rxq,
>  					MVNETA_RX_BUF_SIZE(pp->pkt_size));
>  		mvneta_rxq_bm_disable(pp, rxq);
> +
> +		rxq->buf_virt_addr = devm_kmalloc(pp->dev->dev.parent,
> +						  rxq->size * sizeof(void *),
> +						  GFP_KERNEL);

I would suggest allocating this buffer during probe. devm_kmalloc()
memory is only released when the driver detaches, so allocating it here
leaks memory whenever we change the MTU or close and then reopen the
interface in a loop, e.g.:

while true
do
	ifconfig eth0 up
	ifconfig eth0 down
done
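
To illustrate the point, here is a toy userspace model of the problem.
It is a sketch only, not driver code: every toy_* name is made up for
this example, and the list below merely imitates how device-managed
allocations stay attached to the device until it is detached.

```c
#include <stddef.h>
#include <stdlib.h>

/* One device-managed allocation, linked to its owning device. */
struct toy_devres {
	struct toy_devres *next;
	void *data;
};

/* Hypothetical stand-in for a struct device with a devres list. */
struct toy_device {
	struct toy_devres *devres_head;
	size_t devres_count;	/* live managed allocations */
};

/* Hypothetical stand-in for an RX queue with a per-descriptor array. */
struct toy_rxq {
	void **buf_virt_addr;
	size_t size;
};

/* Imitates devm_kmalloc(): the memory stays on the device's list. */
static void *toy_devm_kmalloc(struct toy_device *dev, size_t size)
{
	struct toy_devres *node = malloc(sizeof(*node));

	if (!node)
		return NULL;

	node->data = malloc(size);
	if (!node->data) {
		free(node);
		return NULL;
	}

	node->next = dev->devres_head;
	dev->devres_head = node;
	dev->devres_count++;
	return node->data;
}

/* Imitates driver detach: the only point where managed memory is freed. */
static void toy_device_detach(struct toy_device *dev)
{
	while (dev->devres_head) {
		struct toy_devres *node = dev->devres_head;

		dev->devres_head = node->next;
		free(node->data);
		free(node);
	}
	dev->devres_count = 0;
}

/* Allocation in the open path: runs on every "ifconfig eth0 up". */
static int toy_rxq_init(struct toy_device *dev, struct toy_rxq *rxq)
{
	rxq->buf_virt_addr = toy_devm_kmalloc(dev,
					      rxq->size * sizeof(void *));
	return rxq->buf_virt_addr ? 0 : -1;
}

/* Allocation at probe time: runs once, later opens reuse the array. */
static int toy_probe(struct toy_device *dev, struct toy_rxq *rxq)
{
	if (rxq->buf_virt_addr)
		return 0;
	return toy_rxq_init(dev, rxq);
}
```

Calling toy_rxq_init() once per up/down cycle makes devres_count grow
by one each time, and nothing is given back until toy_device_detach().
With toy_probe() the count stays at one no matter how often the
interface is reopened.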

Thanks,
Jisheng

> +		if (!rxq->buf_virt_addr)
> +			return -ENOMEM;
> +
> +		mvneta_rxq_fill(pp, rxq, rxq->size);
>  	} else {
>  		mvneta_rxq_bm_enable(pp, rxq);
>  		mvneta_rxq_long_pool_set(pp, rxq);
>  		mvneta_rxq_short_pool_set(pp, rxq);
> +		mvneta_rxq_bm_fill(pp, rxq, rxq->size);
>  	}
>  
> -	mvneta_rxq_fill(pp, rxq, rxq->size);
> -
>  	return 0;
>  }
>  

^ permalink raw reply

* [PATCH v4] cpsw: ethtool: add support for getting/setting EEE registers
From: yegorslists @ 2016-11-28  8:41 UTC (permalink / raw)
  To: netdev; +Cc: linux-omap, grygorii.strashko, mugunthanvnm, davem,
	Yegor Yefremov

From: Yegor Yefremov <yegorslists@googlemail.com>

Add the ability to query and set Energy Efficient Ethernet parameters
via ethtool for applicable devices.

This patch doesn't activate full EEE support in cpsw driver, but it
enables reading and writing EEE advertising settings. This way one
can disable advertising EEE for certain speeds.

Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
Acked-by: Rami Rosen <roszenrami@gmail.com>
---
Changes:
	v4: respin against net-next (David Miller)
	v3: explain what features will be available with this patch (Florian Fainelli)
	v2: make routines static (Rami Rosen)

 drivers/net/ethernet/ti/cpsw.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index da40ea5..df87bff 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2237,6 +2237,30 @@ static int cpsw_set_channels(struct net_device *ndev,
 	return ret;
 }
 
+static int cpsw_get_eee(struct net_device *ndev, struct ethtool_eee *edata)
+{
+	struct cpsw_priv *priv = netdev_priv(ndev);
+	struct cpsw_common *cpsw = priv->cpsw;
+	int slave_no = cpsw_slave_index(cpsw, priv);
+
+	if (cpsw->slaves[slave_no].phy)
+		return phy_ethtool_get_eee(cpsw->slaves[slave_no].phy, edata);
+	else
+		return -EOPNOTSUPP;
+}
+
+static int cpsw_set_eee(struct net_device *ndev, struct ethtool_eee *edata)
+{
+	struct cpsw_priv *priv = netdev_priv(ndev);
+	struct cpsw_common *cpsw = priv->cpsw;
+	int slave_no = cpsw_slave_index(cpsw, priv);
+
+	if (cpsw->slaves[slave_no].phy)
+		return phy_ethtool_set_eee(cpsw->slaves[slave_no].phy, edata);
+	else
+		return -EOPNOTSUPP;
+}
+
 static const struct ethtool_ops cpsw_ethtool_ops = {
 	.get_drvinfo	= cpsw_get_drvinfo,
 	.get_msglevel	= cpsw_get_msglevel,
@@ -2260,6 +2284,8 @@ static const struct ethtool_ops cpsw_ethtool_ops = {
 	.set_channels	= cpsw_set_channels,
 	.get_link_ksettings	= cpsw_get_link_ksettings,
 	.set_link_ksettings	= cpsw_set_link_ksettings,
+	.get_eee	= cpsw_get_eee,
+	.set_eee	= cpsw_set_eee,
 };
 
 static void cpsw_slave_init(struct cpsw_slave *slave, struct cpsw_common *cpsw,
-- 
2.1.4

^ permalink raw reply related
