Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: bridge/brctl/ip
From: Nikolay Aleksandrov @ 2016-04-02 22:50 UTC (permalink / raw)
  To: Bert Vermeulen, netdev, Stephen Hemminger
In-Reply-To: <57001CFF.20508@biot.com>

On 04/02/2016 09:26 PM, Bert Vermeulen wrote:
> Hi all,
> 
> I'm wondering about the current userspace toolset to control bridging in
> the Linux kernel. As far as I can determine, functionality is a bit
> scattered right now between the iproute2 (ip, bridge) and bridge-utils
> (brctl) tools:
> 
> - creating/deleting bridges: ip or brctl
> - adding/deleting ports to/from bridge: brctl only

ip link set dev ethX master bridgeY
ip link set dev ethX nomaster

> - showing bridge fdb: brctl (in-kernel fdb), bridge (hardware offloaded
>   fdb) (!)

bridge fdb show - shows all fdb entries, offloaded or not.

> ...and no doubt a few other things.
> 
> Also the brctl tool seems not to be getting updates, whereas the
> iproute2 tools are of course updated regularly. Is brctl considered
> obsolete?

iproute2 supports (almost, user-space stp?) everything now, there have been many recent
additions to the options that can be manipulated.
$ ip link set dev bridge0 type bridge help
Usage: ... bridge [ forward_delay FORWARD_DELAY ]
                  [ hello_time HELLO_TIME ]
                  [ max_age MAX_AGE ]
                  [ ageing_time AGEING_TIME ]
                  [ stp_state STP_STATE ]
                  [ priority PRIORITY ]
                  [ group_fwd_mask MASK ]
                  [ group_address ADDRESS ]
                  [ vlan_filtering VLAN_FILTERING ]
                  [ vlan_protocol VLAN_PROTOCOL ]
                  [ vlan_default_pvid VLAN_DEFAULT_PVID ]
                  [ mcast_snooping MULTICAST_SNOOPING ]
                  [ mcast_router MULTICAST_ROUTER ]
                  [ mcast_query_use_ifaddr MCAST_QUERY_USE_IFADDR ]
                  [ mcast_querier MULTICAST_QUERIER ]
                  [ mcast_hash_elasticity HASH_ELASTICITY ]
                  [ mcast_hash_max HASH_MAX ]
                  [ mcast_last_member_count LAST_MEMBER_COUNT ]
                  [ mcast_startup_query_count STARTUP_QUERY_COUNT ]
                  [ mcast_last_member_interval LAST_MEMBER_INTERVAL ]
                  [ mcast_membership_interval MEMBERSHIP_INTERVAL ]
                  [ mcast_querier_interval QUERIER_INTERVAL ]
                  [ mcast_query_interval QUERY_INTERVAL ]
                  [ mcast_query_response_interval QUERY_RESPONSE_INTERVAL ]
                  [ mcast_startup_query_interval STARTUP_QUERY_INTERVAL ]
                  [ nf_call_iptables NF_CALL_IPTABLES ]
                  [ nf_call_ip6tables NF_CALL_IP6TABLES ]
                  [ nf_call_arptables NF_CALL_ARPTABLES ]

Where: VLAN_PROTOCOL := { 802.1Q | 802.1ad }


> 
> If that is the case, would patches to add the missing functionality into
> the bridge tool be welcome? I'm thinking primarily of creating/deleting
> bridges, and adding/deleting ports in bridges.
> 
> 

^ permalink raw reply

* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
From: Tom Herbert @ 2016-04-02 22:57 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer, Lorenzo Colitti
In-Reply-To: <1459622491.18188.6.camel@sipsolutions.net>

On Sat, Apr 2, 2016 at 2:41 PM, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
>> This patch set introduces new infrastructure for programmatically
>> processing packets in the earliest stages of rx, as part of an effort
>> others are calling Express Data Path (XDP) [1]. Start this effort by
>> introducing a new bpf program type for early packet filtering, before
>> even
>> an skb has been allocated.
>>
>> With this, hope to enable line rate filtering, with this initial
>> implementation providing drop/allow action only.
>
> Since this is handed to the driver in some way, I assume the API would
> also allow offloading the program to the NIC itself, and as such be
> useful for what Android wants to do to save power in wireless?
>
Conceptually, yes. There is some ongoing work to offload BPF and one
goal is that BPF programs (like for XDP) could be portable between
userspace, kernel (maybe even other OSes), and devices.

I am curious though, how do you think this would specifically help
Android with power? Seems like the receiver still needs to be powered
to receive packets to filter them anyway...

Thanks,
Tom

> johannes

^ permalink raw reply

* Re: [PATCH v2 net-next] tcp: remove cwnd moderation after recovery
From: David Miller @ 2016-04-03  0:14 UTC (permalink / raw)
  To: ycheng; +Cc: netdev, mattmathis, ncardwell, soheil
In-Reply-To: <1459374860-32643-1-git-send-email-ycheng@google.com>

From: Yuchung Cheng <ycheng@google.com>
Date: Wed, 30 Mar 2016 14:54:20 -0700

> For non-SACK connections, cwnd is lowered to inflight plus 3 packets
> when the recovery ends. This is an optional feature in the NewReno
> RFC 2582 to reduce the potential burst when cwnd is "re-opened"
> after recovery and inflight is low.
> 
> This feature is questionably effective because of PRR: when
> the recovery ends (i.e., snd_una == high_seq) NewReno holds the
> CA_Recovery state for another round trip to prevent false fast
> retransmits. But if the inflight is low, PRR will overwrite the
> moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
> receiver responds bogus ACKs (i.e., acking future data) to speed up
> transfer after recovery, it can only induce a burst up to a window
> worth of data packets by acking up to SND.NXT. A restart from (short)
> idle or receiving streched ACKs can both cause such bursts as well.
> 
> On the other hand, if the recovery ends because the sender
> detects the losses were spurious (e.g., reordering). This feature
> unconditionally lowers a reverted cwnd even though nothing
> was lost.
> 
> By principle loss recovery module should not update cwnd. Further
> pacing is much more effective to reduce burst. Hence this patch
> removes the cwnd moderation feature.
> 
> v2 changes: revised commit message on bogus ACKs and burst, and
>             missing signature
> 
> Signed-off-by: Matt Mathis <mattmathis@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address
From: David Miller @ 2016-04-03  0:16 UTC (permalink / raw)
  To: yanhaishuang; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel
In-Reply-To: <1459419698-4466-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Thu, 31 Mar 2016 18:21:38 +0800

> Since nla_get_in_addr and nla_put_in_addr were implemented,
> so use them appropriately.
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH v2 net-next] net: hns: add support of pause frame ctrl for HNS V2
From: David Miller @ 2016-04-03  0:17 UTC (permalink / raw)
  To: Yisen.Zhuang
  Cc: salil.mehta, liguozhu, huangdaode, arnd, andriy.shevchenko,
	andrew, geliangtang, ivecera, lisheng011, fengguang.wu,
	charles.chenxin, haifeng.wei, netdev, linux-kernel,
	linux-arm-kernel, linuxarm
In-Reply-To: <1459429209-230671-1-git-send-email-Yisen.Zhuang@huawei.com>

From: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Date: Thu, 31 Mar 2016 21:00:09 +0800

> From: Lisheng <lisheng011@huawei.com>
> 
> The patch adds support of pause ctrl for HNS V2, and this feature is lost
> by HNS V1:
>        1) service ports can disable rx pause frame,
>        2) debug ports can open tx/rx pause frame.
> 
> And this patch updates the REGs about the pause ctrl when updated
> status function called by upper layer routine.
> 
> Signed-off-by: Lisheng <lisheng011@huawei.com>
> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

Applied.

^ permalink raw reply

* Re: [RESEND PATCH net-next 00/13] Enhance stmmac driver to support GMAC4.x IP
From: David Miller @ 2016-04-03  0:23 UTC (permalink / raw)
  To: alexandre.torgue; +Cc: netdev, peppe.cavallaro, lars.persson
In-Reply-To: <1459503457-22569-1-git-send-email-alexandre.torgue@st.com>

From: Alexandre TORGUE <alexandre.torgue@st.com>
Date: Fri, 1 Apr 2016 11:37:24 +0200

> This is a subset of patch to enhance current stmmac driver to support
> new GMAC4.x chips. New set of callbacks is defined to support this new
> family: descriptors, dma, core.

Series applied, thanks.

^ permalink raw reply

* Homeland Security
From: Jeh Charles Johnsonr @ 2016-04-03  0:30 UTC (permalink / raw)
  To: DHS

Tel:+14073924312 text messages only

I am Mr. Jeh Charles Johnson. The secretary of the US Department of Homeland Security Washington DC, an international summit was held in West Africa and i was opporturned to be one of the officials, after the meeting I found out that a huge sum of 4.5 million dollars which is in your name was abandoned here in West Africa probably due to malpractices.

I would like you to kindly reconfirm your full address, Full name, Phone number, and nearest Airport so that I shall bring your parcel to our Head Office in Washington DC and you will come and claim or we can send it to your designated address I wait for your urgent and positive respond. You can reach me on this email as well: mr.jehjohnson12@gmail.com. I will be looking forward to your positive response. Thank you and God bless you.`

Best regards
Jeh Charles Johnson
mr.jehjohnson12@gmail.com.

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

^ permalink raw reply

* Homeland Security
From: Jeh Charles Johnsonr @ 2016-04-03  0:37 UTC (permalink / raw)
  To: DHS

Tel:+14073924312 text messages only

I am Mr. Jeh Charles Johnson. The secretary of the US Department of Homeland Security Washington DC, an international summit was held in West Africa and i was opporturned to be one of the officials, after the meeting I found out that a huge sum of 4.5 million dollars which is in your name was abandoned here in West Africa probably due to malpractices.

I would like you to kindly reconfirm your full address, Full name, Phone number, and nearest Airport so that I shall bring your parcel to our Head Office in Washington DC and you will come and claim or we can send it to your designated address I wait for your urgent and positive respond. You can reach me on this email as well: mr.jehjohnson12@gmail.com. I will be looking forward to your positive response. Thank you and God bless you.`

Best regards
Jeh Charles Johnson
mr.jehjohnson12@gmail.com.

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

^ permalink raw reply

* Re: [PATCH v2 net-next 0/8] add TX timestamping via cmsg
From: David Miller @ 2016-04-03  1:19 UTC (permalink / raw)
  To: soheil.kdev; +Cc: netdev, willemb, edumazet, ycheng, ncardwell, kafai, soheil
In-Reply-To: <1459523080-29329-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Date: Fri,  1 Apr 2016 11:04:32 -0400

> From: Soheil Hassas Yeganeh <soheil@google.com>
> 
> This patch series aim at enabling TX timestamping via cmsg.
> 
> Currently, to occasionally sample TX timestamping on a socket,
> applications need to call setsockopt twice: first for enabling
> timestamps and then for disabling them. This is an unnecessary
> overhead. With cmsg, in contrast, applications can sample TX
> timestamps per sendmsg().
> 
> This patch series adds the code for processing SO_TIMESTAMPING
> for cmsg's of the SOL_SOCKET level, and adds the glue code for
> TCP, UDP, and RAW for both IPv4 and IPv6. This implementation
> supports overriding timestamp generation flags (i.e.,
> SOF_TIMESTAMPING_TX_*) but not timestamp reporting flags.
> Applications must still enable timestamp reporting via
> setsockopt to receive timestamps.
> 
> This series does not change existing timestamping behavior for
> applications that are using socket options.
> 
> I will follow up with another patch to enable timestamping for
> active TFO (client-side TCP Fast Open) and also setting packet
> mark via cmsgs.
 ...
> Changes in v2:
> 	- Replace u32 with __u32 in the documentation.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH v2 net-next 0/8] add TX timestamping via cmsg
From: David Miller @ 2016-04-03  1:27 UTC (permalink / raw)
  To: soheil.kdev; +Cc: netdev, willemb, edumazet, ycheng, ncardwell, kafai, soheil
In-Reply-To: <20160402.211942.598210499165268433.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Sat, 02 Apr 2016 21:19:42 -0400 (EDT)

> Series applied, thanks.

I had to revert, this breaks the build:

net/l2tp/l2tp_ip6.c: In function ‘l2tp_ip6_sendmsg’:
net/l2tp/l2tp_ip6.c:565:9: error: too few arguments to function ‘ip6_datagram_send_ctl’
   err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
         ^
In file included from net/l2tp/l2tp_ip6.c:33:0:
include/net/transp_v6.h:43:5: note: declared here
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg,
     ^
net/l2tp/l2tp_ip6.c:625:8: error: too few arguments to function ‘ip6_append_data’
  err = ip6_append_data(sk, ip_generic_getfrag, msg,
        ^
In file included from include/net/inetpeer.h:15:0,
                 from include/net/route.h:28,
                 from include/net/ip.h:31,
                 from net/l2tp/l2tp_ip6.c:23:
include/net/ipv6.h:865:5: note: declared here
 int ip6_append_data(struct sock *sk,
     ^

^ permalink raw reply

* Re: [PATCH v2 net-next 0/8] add TX timestamping via cmsg
From: Soheil Hassas Yeganeh @ 2016-04-03  1:36 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20160402.212729.2129485115379390591.davem@davemloft.net>

On Sat, Apr 2, 2016 at 9:27 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sat, 02 Apr 2016 21:19:42 -0400 (EDT)
>
>> Series applied, thanks.
>
> I had to revert, this breaks the build:
>
> net/l2tp/l2tp_ip6.c: In function ‘l2tp_ip6_sendmsg’:
> net/l2tp/l2tp_ip6.c:565:9: error: too few arguments to function ‘ip6_datagram_send_ctl’
>    err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
>          ^
> In file included from net/l2tp/l2tp_ip6.c:33:0:
> include/net/transp_v6.h:43:5: note: declared here
>  int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg,
>      ^
> net/l2tp/l2tp_ip6.c:625:8: error: too few arguments to function ‘ip6_append_data’
>   err = ip6_append_data(sk, ip_generic_getfrag, msg,
>         ^
> In file included from include/net/inetpeer.h:15:0,
>                  from include/net/route.h:28,
>                  from include/net/ip.h:31,
>                  from net/l2tp/l2tp_ip6.c:23:
> include/net/ipv6.h:865:5: note: declared here
>  int ip6_append_data(struct sock *sk,
>      ^

I'm really sorry about this. CONFIG_L2TP was no enabled in my config.
I'll fix the patch, and will mail v3.

Thanks,
Soheil

^ permalink raw reply

* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
From: Lorenzo Colitti @ 2016-04-03  2:28 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Johannes Berg, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, gerlitz,
	Daniel Borkmann, john fastabend, Jesper Dangaard Brouer
In-Reply-To: <CALx6S35kK9eNk8r-fZmP-mOSB=Tb8udbaQeSUjmHWJagh+=i=A@mail.gmail.com>

On Sun, Apr 3, 2016 at 7:57 AM, Tom Herbert <tom@herbertland.com> wrote:
> I am curious though, how do you think this would specifically help
> Android with power? Seems like the receiver still needs to be powered
> to receive packets to filter them anyway...

The receiver is powered up, but its wake/sleep cycles are much shorter
than the main CPU's. On a phone, leaving the CPU asleep with wifi on
might consume ~5mA average, but getting the CPU out of suspend might
average ~200mA for ~300ms as the system comes out of sleep,
initializes other hardware, wakes up userspace processes whose
timeouts have fired, freezes, and suspends again. Receiving one such
superfluous packet every 3 seconds (e.g., on networks that send
identical IPv6 RAs once every 3 seconds) works out to ~25mA, which is
5x the cost of idle. Pushing down filters to the hardware so it can
drop the packet without waking up the CPU thus saves a lot of idle
power.

That said, getting BPF to the driver is part of the picture. On the
chipsets we're targeting for APF, we're only seeing 2k-4k of memory
(that's 256-512 BPF instructions) available for filtering code, which
means that BPF might be too large.

^ permalink raw reply

* Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
From: Martin KaFai Lau @ 2016-04-03  2:33 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Eric Dumazet, Wei Wang, Kernel Team
In-Reply-To: <CAM_iQpVeOKtTJ=oc9z6=SEaHDBTfpaPLKa-43TH1b_A3+TdWwA@mail.gmail.com>

On Fri, Apr 01, 2016 at 04:13:41PM -0700, Cong Wang wrote:
> On Fri, Apr 1, 2016 at 3:56 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> > +       bh_lock_sock(sk);
> > +       if (!sock_owned_by_user(sk))
> > +               ip6_datagram_dst_update(sk, false);
> > +       bh_unlock_sock(sk);
>
>
> My discussion with Eric shows that we probably don't need to hold
> this sock lock here, and you are Cc'ed in that thread, so
>
> 1) why do you still take the lock here?
> 2) why didn't you involve in our discussion if you disagree?
It is because I agree with the last thread discussion that updating
sk->sk_dst_cache does not need a sk lock.  I also don't see
a lock is need for other operations in that thread.

I am thinking another case that needs a lock, so I start
another RFC thread.  A quick recall for this commit message:
>> It is done under '!sock_owned_by_user(sk)' condition because
>> the user may make another ip6_datagram_connect() while
>> dst lookup and update are happening.
If that could not happen, then the lock is not needed.

One thing to note is that this patch uses the addresses from the sk
instead of iph when updating sk->sk_dst_cache.  It is basically the
same logic that the __ip6_datagram_connect() is doing, so some
refactoring works in the first two patches.

AFAIK, a UDP socket can become connected after sending out some
datagrams in un-connected state.  or It can be connected
multiple times to different destinations.  I did some quick
tests but I could be wrong.

I am thinking if there could be a chance that the skb->data, which
has the original outgoing iph, is not related to the current
connected address.  If it is possible, we have to specifically
use the addresses in the sk instead of skb->data (i.e. iph) when
updating the sk->sk_dst_cache.

If we need to use the sk addresses (and other info) to find out a
new dst for a connected udp socket, it is better not doing it while
the userland is connecting to somewhere else.

If the above case is impossible, we can keep using the info from iph to
do the dst update for a connected-udp sk without taking the lock.

>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index ed44663..f7e6a6d 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -1417,8 +1417,19 @@ EXPORT_SYMBOL_GPL(ip6_update_pmtu);
>>
>>  void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
>>  {
>> +	struct dst_entry *dst;
>> +
>>  	ip6_update_pmtu(skb, sock_net(sk), mtu,
>>  			sk->sk_bound_dev_if, sk->sk_mark);
iph's addresses are used to update the pmtu.  It is fine
because it does not update the sk->sk_dst_cache.

>>> +
>> +	dst = __sk_dst_get(sk);
>> +	if (!dst || dst->ops->check(dst, inet6_sk(sk)->dst_cookie))
>> +		return;
>> +
>> +	bh_lock_sock(sk);
>> +	if (!sock_owned_by_user(sk))
sk is not connecting to another address.  Find a new dst
for the connected address.
>> +		ip6_datagram_dst_update(sk, false);
>> +	bh_unlock_sock(sk);
>>  }

^ permalink raw reply

* [PATCH v3 net-next 0/8] add TX timestamping via cmsg
From: Soheil Hassas Yeganeh @ 2016-04-03  3:08 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

This patch series aim at enabling TX timestamping via cmsg.

Currently, to occasionally sample TX timestamping on a socket,
applications need to call setsockopt twice: first for enabling
timestamps and then for disabling them. This is an unnecessary
overhead. With cmsg, in contrast, applications can sample TX
timestamps per sendmsg().

This patch series adds the code for processing SO_TIMESTAMPING
for cmsg's of the SOL_SOCKET level, and adds the glue code for
TCP, UDP, and RAW for both IPv4 and IPv6. This implementation
supports overriding timestamp generation flags (i.e.,
SOF_TIMESTAMPING_TX_*) but not timestamp reporting flags.
Applications must still enable timestamp reporting via
setsockopt to receive timestamps.

This series does not change existing timestamping behavior for
applications that are using socket options.

I will follow up with another patch to enable timestamping for
active TFO (client-side TCP Fast Open) and also setting packet
mark via cmsgs.

Thanks!

Changes in v2:
        - Replace u32 with __u32 in the documentation.

Changes in v3:
	- Fix the broken build for L2TP (due to changes
	  in IPv6).

Soheil Hassas Yeganeh (7):
  tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO
  tcp: use one bit in TCP_SKB_CB to mark ACK timestamps
  sock: accept SO_TIMESTAMPING flags in socket cmsg
  ipv4: process socket-level control messages in IPv4
  ipv6: process socket-level control messages in IPv6
  sock: enable timestamping using control messages
  sock: document timestamping via cmsg in Documentation

Willem de Bruijn (1):
  sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop

 Documentation/networking/timestamping.txt | 48 ++++++++++++++++++++++++++++--
 drivers/net/tun.c                         |  3 +-
 include/net/ip.h                          |  3 +-
 include/net/ipv6.h                        |  6 ++--
 include/net/sock.h                        | 13 +++++---
 include/net/tcp.h                         |  3 +-
 include/net/transp_v6.h                   |  3 +-
 include/uapi/linux/net_tstamp.h           | 10 +++++++
 net/can/raw.c                             |  2 +-
 net/core/sock.c                           | 49 +++++++++++++++++++++++--------
 net/ipv4/ip_sockglue.c                    |  9 +++++-
 net/ipv4/ping.c                           |  7 +++--
 net/ipv4/raw.c                            | 13 ++++----
 net/ipv4/tcp.c                            | 22 ++++++++++----
 net/ipv4/tcp_input.c                      |  2 +-
 net/ipv4/udp.c                            | 10 +++----
 net/ipv6/datagram.c                       |  9 +++++-
 net/ipv6/icmp.c                           |  6 ++--
 net/ipv6/ip6_flowlabel.c                  |  3 +-
 net/ipv6/ip6_output.c                     | 15 ++++++----
 net/ipv6/ipv6_sockglue.c                  |  3 +-
 net/ipv6/ping.c                           |  3 +-
 net/ipv6/raw.c                            |  7 +++--
 net/ipv6/udp.c                            | 10 +++++--
 net/l2tp/l2tp_ip6.c                       | 10 ++++---
 net/packet/af_packet.c                    | 30 +++++++++++++++----
 net/socket.c                              | 10 +++----
 27 files changed, 231 insertions(+), 78 deletions(-)

-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply

* [PATCH v3 net-next 1/8] sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop
From: Soheil Hassas Yeganeh @ 2016-04-03  3:08 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459652893-14207-1-git-send-email-soheil.kdev@gmail.com>

From: Willem de Bruijn <willemb@google.com>

To process cmsg's of the SOL_SOCKET level in addition to
cmsgs of another level, protocols can call sock_cmsg_send().
This causes a double walk on the cmsghdr list, one for SOL_SOCKET
and one for the other level.

Extract the inner demultiplex logic from the loop that walks the list,
to allow having this called directly from a walker in the protocol
specific code.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/sock.h |  2 ++
 net/core/sock.c    | 33 ++++++++++++++++++++++-----------
 2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e0..03772d4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1420,6 +1420,8 @@ struct sockcm_cookie {
 	u32 mark;
 };
 
+int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+		     struct sockcm_cookie *sockc);
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 		   struct sockcm_cookie *sockc);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9ae..66976f8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1866,27 +1866,38 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 }
 EXPORT_SYMBOL(sock_alloc_send_skb);
 
+int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+		     struct sockcm_cookie *sockc)
+{
+	switch (cmsg->cmsg_type) {
+	case SO_MARK:
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+			return -EPERM;
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
+			return -EINVAL;
+		sockc->mark = *(u32 *)CMSG_DATA(cmsg);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(__sock_cmsg_send);
+
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 		   struct sockcm_cookie *sockc)
 {
 	struct cmsghdr *cmsg;
+	int ret;
 
 	for_each_cmsghdr(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
 			return -EINVAL;
 		if (cmsg->cmsg_level != SOL_SOCKET)
 			continue;
-		switch (cmsg->cmsg_type) {
-		case SO_MARK:
-			if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
-				return -EPERM;
-			if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
-				return -EINVAL;
-			sockc->mark = *(u32 *)CMSG_DATA(cmsg);
-			break;
-		default:
-			return -EINVAL;
-		}
+		ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
+		if (ret)
+			return ret;
 	}
 	return 0;
 }
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 net-next 2/8] tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO
From: Soheil Hassas Yeganeh @ 2016-04-03  3:08 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459652893-14207-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
to associate timestamps with send calls. For TCP connections,
tp->snd_una is used as the starting point to calculate
relative IDs.

This socket option will fail if set before the handshake on a
passive TCP fast open connection with data in SYN or SYN/ACK,
since setsockopt requires the connection to be in the
ESTABLISHED state.

To address these, instead of limiting the option to the
ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
long as the connection is not in LISTEN or CLOSE states.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 net/core/sock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 66976f8..0a64fe2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -832,7 +832,8 @@ set_rcvbuf:
 		    !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
 			if (sk->sk_protocol == IPPROTO_TCP &&
 			    sk->sk_type == SOCK_STREAM) {
-				if (sk->sk_state != TCP_ESTABLISHED) {
+				if ((1 << sk->sk_state) &
+				    (TCPF_CLOSE | TCPF_LISTEN)) {
 					ret = -EINVAL;
 					break;
 				}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 net-next 4/8] sock: accept SO_TIMESTAMPING flags in socket cmsg
From: Soheil Hassas Yeganeh @ 2016-04-03  3:08 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459652893-14207-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
as a basis to accept timestamping requests per write.

This implementation only accepts TX recording flags (i.e.,
SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
control messages. Users need to set reporting flags (e.g.,
SOF_TIMESTAMPING_OPT_ID) per socket via socket options.

This commit adds a tsflags field in sockcm_cookie which is
set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_*
bits in sockcm_cookie.tsflags allowing the control message
to override the recording behavior per write, yet maintaining
the value of other flags.

This patch implements validating the control message and setting
tsflags in struct sockcm_cookie. Next commits in this series will
actually implement timestamping per write for different protocols.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
---
 include/net/sock.h              |  1 +
 include/uapi/linux/net_tstamp.h | 10 ++++++++++
 net/core/sock.c                 | 13 +++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 03772d4..af012da 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1418,6 +1418,7 @@ void sk_send_sigurg(struct sock *sk);
 
 struct sockcm_cookie {
 	u32 mark;
+	u16 tsflags;
 };
 
 int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 6d1abea..264e515 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -31,6 +31,16 @@ enum {
 				 SOF_TIMESTAMPING_LAST
 };
 
+/*
+ * SO_TIMESTAMPING flags are either for recording a packet timestamp or for
+ * reporting the timestamp to user space.
+ * Recording flags can be set both via socket options and control messages.
+ */
+#define SOF_TIMESTAMPING_TX_RECORD_MASK	(SOF_TIMESTAMPING_TX_HARDWARE | \
+					 SOF_TIMESTAMPING_TX_SOFTWARE | \
+					 SOF_TIMESTAMPING_TX_SCHED | \
+					 SOF_TIMESTAMPING_TX_ACK)
+
 /**
  * struct hwtstamp_config - %SIOCGHWTSTAMP and %SIOCSHWTSTAMP parameter
  *
diff --git a/net/core/sock.c b/net/core/sock.c
index 0a64fe2..315f5e5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1870,6 +1870,8 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
 int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 		     struct sockcm_cookie *sockc)
 {
+	u32 tsflags;
+
 	switch (cmsg->cmsg_type) {
 	case SO_MARK:
 		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
@@ -1878,6 +1880,17 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 			return -EINVAL;
 		sockc->mark = *(u32 *)CMSG_DATA(cmsg);
 		break;
+	case SO_TIMESTAMPING:
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
+			return -EINVAL;
+
+		tsflags = *(u32 *)CMSG_DATA(cmsg);
+		if (tsflags & ~SOF_TIMESTAMPING_TX_RECORD_MASK)
+			return -EINVAL;
+
+		sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
+		sockc->tsflags |= tsflags;
+		break;
 	default:
 		return -EINVAL;
 	}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 net-next 3/8] tcp: use one bit in TCP_SKB_CB to mark ACK timestamps
From: Soheil Hassas Yeganeh @ 2016-04-03  3:08 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459652893-14207-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips socket that do not have
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on an implicit assumption that the
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.

To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamps marks without
a cache-line miss.

To enable per-packet marking without a cache line miss, use
one bit in TCP_SKB_CB to mark a whether a SKB might need a
ack tx timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h    | 3 ++-
 net/ipv4/tcp.c       | 2 ++
 net/ipv4/tcp_input.c | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f..f3a80ec 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -754,7 +754,8 @@ struct tcp_skb_cb {
 				TCPCB_REPAIRED)
 
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
-	/* 1 byte hole */
+	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
+			unused:7;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct inet_skb_parm	h4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 08b8b96..ce3c9eb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -432,10 +432,12 @@ static void tcp_tx_timestamp(struct sock *sk, struct sk_buff *skb)
 {
 	if (sk->sk_tsflags) {
 		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
 		sock_tx_timestamp(sk, &shinfo->tx_flags);
 		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+		tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
 	}
 }
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f7..2d5fee4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3093,7 +3093,7 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
 	const struct skb_shared_info *shinfo;
 
 	/* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
-	if (likely(!(sk->sk_tsflags & SOF_TIMESTAMPING_TX_ACK)))
+	if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
 		return;
 
 	shinfo = skb_shinfo(skb);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 net-next 5/8] ipv4: process socket-level control messages in IPv4
From: Soheil Hassas Yeganeh @ 2016-04-03  3:08 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459652893-14207-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
---
 include/net/ip.h       | 3 ++-
 net/ipv4/ip_sockglue.c | 9 ++++++++-
 net/ipv4/ping.c        | 2 +-
 net/ipv4/raw.c         | 2 +-
 net/ipv4/udp.c         | 3 +--
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index fad74d3..93725e5 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -56,6 +56,7 @@ static inline unsigned int ip_hdrlen(const struct sk_buff *skb)
 }
 
 struct ipcm_cookie {
+	struct sockcm_cookie	sockc;
 	__be32			addr;
 	int			oif;
 	struct ip_options_rcu	*opt;
@@ -550,7 +551,7 @@ int ip_options_rcv_srr(struct sk_buff *skb);
 
 void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb);
 void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb, int offset);
-int ip_cmsg_send(struct net *net, struct msghdr *msg,
+int ip_cmsg_send(struct sock *sk, struct msghdr *msg,
 		 struct ipcm_cookie *ipc, bool allow_ipv6);
 int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
 		  unsigned int optlen);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 035ad64..1b7c077 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -219,11 +219,12 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(ip_cmsg_recv_offset);
 
-int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc,
+int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 		 bool allow_ipv6)
 {
 	int err, val;
 	struct cmsghdr *cmsg;
+	struct net *net = sock_net(sk);
 
 	for_each_cmsghdr(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
@@ -244,6 +245,12 @@ int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc,
 			continue;
 		}
 #endif
+		if (cmsg->cmsg_level == SOL_SOCKET) {
+			if (__sock_cmsg_send(sk, msg, cmsg, &ipc->sockc))
+				return -EINVAL;
+			continue;
+		}
+
 		if (cmsg->cmsg_level != SOL_IP)
 			continue;
 		switch (cmsg->cmsg_type) {
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index cf9700b..670639b 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -747,7 +747,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	sock_tx_timestamp(sk, &ipc.tx_flags);
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(sock_net(sk), msg, &ipc, false);
+		err = ip_cmsg_send(sk, msg, &ipc, false);
 		if (unlikely(err)) {
 			kfree(ipc.opt);
 			return err;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8d22de7..088ce66 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -548,7 +548,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipc.oif = sk->sk_bound_dev_if;
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(net, msg, &ipc, false);
+		err = ip_cmsg_send(sk, msg, &ipc, false);
 		if (unlikely(err)) {
 			kfree(ipc.opt);
 			goto out;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 08eed5e..bccb4e1 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1034,8 +1034,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	sock_tx_timestamp(sk, &ipc.tx_flags);
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(sock_net(sk), msg, &ipc,
-				   sk->sk_family == AF_INET6);
+		err = ip_cmsg_send(sk, msg, &ipc, sk->sk_family == AF_INET6);
 		if (unlikely(err)) {
 			kfree(ipc.opt);
 			return err;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 net-next 6/8] ipv6: process socket-level control messages in IPv6
From: Soheil Hassas Yeganeh @ 2016-04-03  3:08 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459652893-14207-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Process socket-level control messages by invoking
__sock_cmsg_send in ip6_datagram_send_ctl for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip6_datagram_send_ctl is called for
udp and raw, we also process socket-level control messages.

This is a bit uglier than IPv4, since IPv6 does not have
something like ipcm_cookie. Perhaps we can later create
a control message cookie for IPv6?

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv6 control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
---
 include/net/transp_v6.h  | 3 ++-
 net/ipv6/datagram.c      | 9 ++++++++-
 net/ipv6/ip6_flowlabel.c | 3 ++-
 net/ipv6/ipv6_sockglue.c | 3 ++-
 net/ipv6/raw.c           | 6 +++++-
 net/ipv6/udp.c           | 5 ++++-
 net/l2tp/l2tp_ip6.c      | 8 +++++---
 7 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h
index b927413..2b1c345 100644
--- a/include/net/transp_v6.h
+++ b/include/net/transp_v6.h
@@ -42,7 +42,8 @@ void ip6_datagram_recv_specific_ctl(struct sock *sk, struct msghdr *msg,
 
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg,
 			  struct flowi6 *fl6, struct ipv6_txoptions *opt,
-			  int *hlimit, int *tclass, int *dontfrag);
+			  int *hlimit, int *tclass, int *dontfrag,
+			  struct sockcm_cookie *sockc);
 
 void ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
 			     __u16 srcp, __u16 destp, int bucket);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 4281621..a73d701 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -685,7 +685,8 @@ EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 			  struct msghdr *msg, struct flowi6 *fl6,
 			  struct ipv6_txoptions *opt,
-			  int *hlimit, int *tclass, int *dontfrag)
+			  int *hlimit, int *tclass, int *dontfrag,
+			  struct sockcm_cookie *sockc)
 {
 	struct in6_pktinfo *src_info;
 	struct cmsghdr *cmsg;
@@ -702,6 +703,12 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 			goto exit_f;
 		}
 
+		if (cmsg->cmsg_level == SOL_SOCKET) {
+			if (__sock_cmsg_send(sk, msg, cmsg, sockc))
+				return -EINVAL;
+			continue;
+		}
+
 		if (cmsg->cmsg_level != SOL_IPV6)
 			continue;
 
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index dc2db4f..35d3ddc 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -372,6 +372,7 @@ fl_create(struct net *net, struct sock *sk, struct in6_flowlabel_req *freq,
 	if (olen > 0) {
 		struct msghdr msg;
 		struct flowi6 flowi6;
+		struct sockcm_cookie sockc_junk;
 		int junk;
 
 		err = -ENOMEM;
@@ -390,7 +391,7 @@ fl_create(struct net *net, struct sock *sk, struct in6_flowlabel_req *freq,
 		memset(&flowi6, 0, sizeof(flowi6));
 
 		err = ip6_datagram_send_ctl(net, sk, &msg, &flowi6, fl->opt,
-					    &junk, &junk, &junk);
+					    &junk, &junk, &junk, &sockc_junk);
 		if (err)
 			goto done;
 		err = -EINVAL;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 4449ad1..a5557d2 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -471,6 +471,7 @@ sticky_done:
 		struct ipv6_txoptions *opt = NULL;
 		struct msghdr msg;
 		struct flowi6 fl6;
+		struct sockcm_cookie sockc_junk;
 		int junk;
 
 		memset(&fl6, 0, sizeof(fl6));
@@ -503,7 +504,7 @@ sticky_done:
 		msg.msg_control = (void *)(opt+1);
 
 		retv = ip6_datagram_send_ctl(net, sk, &msg, &fl6, opt, &junk,
-					     &junk, &junk);
+					     &junk, &junk, &sockc_junk);
 		if (retv)
 			goto done;
 update:
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index fa59dd7..f175ec0 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -745,6 +745,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct dst_entry *dst = NULL;
 	struct raw6_frag_vec rfv;
 	struct flowi6 fl6;
+	struct sockcm_cookie sockc;
 	int addr_len = msg->msg_namelen;
 	int hlimit = -1;
 	int tclass = -1;
@@ -821,13 +822,16 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (fl6.flowi6_oif == 0)
 		fl6.flowi6_oif = sk->sk_bound_dev_if;
 
+	sockc.tsflags = 0;
+
 	if (msg->msg_controllen) {
 		opt = &opt_space;
 		memset(opt, 0, sizeof(struct ipv6_txoptions));
 		opt->tot_len = sizeof(struct ipv6_txoptions);
 
 		err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
-					    &hlimit, &tclass, &dontfrag);
+					    &hlimit, &tclass, &dontfrag,
+					    &sockc);
 		if (err < 0) {
 			fl6_sock_release(flowlabel);
 			return err;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index fd25e44..8371a51 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1128,6 +1128,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	int connected = 0;
 	int is_udplite = IS_UDPLITE(sk);
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
+	struct sockcm_cookie sockc;
 
 	/* destination address check */
 	if (sin6) {
@@ -1247,6 +1248,7 @@ do_udp_sendmsg:
 		fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
 
 	fl6.flowi6_mark = sk->sk_mark;
+	sockc.tsflags = 0;
 
 	if (msg->msg_controllen) {
 		opt = &opt_space;
@@ -1254,7 +1256,8 @@ do_udp_sendmsg:
 		opt->tot_len = sizeof(*opt);
 
 		err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
-					    &hlimit, &tclass, &dontfrag);
+					    &hlimit, &tclass, &dontfrag,
+					    &sockc);
 		if (err < 0) {
 			fl6_sock_release(flowlabel);
 			return err;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 6b54ff3..4f29a4a 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -492,6 +492,7 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct ip6_flowlabel *flowlabel = NULL;
 	struct dst_entry *dst = NULL;
 	struct flowi6 fl6;
+	struct sockcm_cookie sockc_unused = {0};
 	int addr_len = msg->msg_namelen;
 	int hlimit = -1;
 	int tclass = -1;
@@ -562,9 +563,10 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		memset(opt, 0, sizeof(struct ipv6_txoptions));
 		opt->tot_len = sizeof(struct ipv6_txoptions);
 
-		err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
-					    &hlimit, &tclass, &dontfrag);
-		if (err < 0) {
+                err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
+                                            &hlimit, &tclass, &dontfrag,
+                                            &sockc_unused);
+                if (err < 0) {
 			fl6_sock_release(flowlabel);
 			return err;
 		}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 net-next 7/8] sock: enable timestamping using control messages
From: Soheil Hassas Yeganeh @ 2016-04-03  3:08 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459652893-14207-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/tun.c      |  3 ++-
 include/net/ipv6.h     |  6 ++++--
 include/net/sock.h     | 10 ++++++----
 net/can/raw.c          |  2 +-
 net/ipv4/ping.c        |  5 +++--
 net/ipv4/raw.c         | 11 ++++++-----
 net/ipv4/tcp.c         | 20 +++++++++++++++-----
 net/ipv4/udp.c         |  7 ++++---
 net/ipv6/icmp.c        |  6 ++++--
 net/ipv6/ip6_output.c  | 15 +++++++++------
 net/ipv6/ping.c        |  3 ++-
 net/ipv6/raw.c         |  5 ++---
 net/ipv6/udp.c         |  7 ++++---
 net/l2tp/l2tp_ip6.c    |  2 +-
 net/packet/af_packet.c | 30 +++++++++++++++++++++++++-----
 net/socket.c           | 10 +++++-----
 16 files changed, 93 insertions(+), 49 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950..6d2fcd0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -860,7 +860,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 
 	if (skb->sk && sk_fullsock(skb->sk)) {
-		sock_tx_timestamp(skb->sk, &skb_shinfo(skb)->tx_flags);
+		sock_tx_timestamp(skb->sk, skb->sk->sk_tsflags,
+				  &skb_shinfo(skb)->tx_flags);
 		sw_tx_timestamp(skb);
 	}
 
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index d0aeb97..55ee1eb 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -867,7 +867,8 @@ int ip6_append_data(struct sock *sk,
 				int odd, struct sk_buff *skb),
 		    void *from, int length, int transhdrlen, int hlimit,
 		    int tclass, struct ipv6_txoptions *opt, struct flowi6 *fl6,
-		    struct rt6_info *rt, unsigned int flags, int dontfrag);
+		    struct rt6_info *rt, unsigned int flags, int dontfrag,
+		    const struct sockcm_cookie *sockc);
 
 int ip6_push_pending_frames(struct sock *sk);
 
@@ -884,7 +885,8 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 			     void *from, int length, int transhdrlen,
 			     int hlimit, int tclass, struct ipv6_txoptions *opt,
 			     struct flowi6 *fl6, struct rt6_info *rt,
-			     unsigned int flags, int dontfrag);
+			     unsigned int flags, int dontfrag,
+			     const struct sockcm_cookie *sockc);
 
 static inline struct sk_buff *ip6_finish_skb(struct sock *sk)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index af012da..e91b87f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2057,19 +2057,21 @@ static inline void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
 		sk->sk_stamp = skb->tstamp;
 }
 
-void __sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags);
+void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags);
 
 /**
  * sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
  * @sk:		socket sending this packet
+ * @tsflags:	timestamping flags to use
  * @tx_flags:	completed with instructions for time stamping
  *
  * Note : callers should take care of initial *tx_flags value (usually 0)
  */
-static inline void sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags)
+static inline void sock_tx_timestamp(const struct sock *sk, __u16 tsflags,
+				     __u8 *tx_flags)
 {
-	if (unlikely(sk->sk_tsflags))
-		__sock_tx_timestamp(sk, tx_flags);
+	if (unlikely(tsflags))
+		__sock_tx_timestamp(tsflags, tx_flags);
 	if (unlikely(sock_flag(sk, SOCK_WIFI_STATUS)))
 		*tx_flags |= SKBTX_WIFI_STATUS;
 }
diff --git a/net/can/raw.c b/net/can/raw.c
index 2e67b14..972c187 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -755,7 +755,7 @@ static int raw_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
 	if (err < 0)
 		goto free_skb;
 
-	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(sk, sk->sk_tsflags, &skb_shinfo(skb)->tx_flags);
 
 	skb->dev = dev;
 	skb->sk  = sk;
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 670639b..66ddcb6 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -737,6 +737,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		/* no remote port */
 	}
 
+	ipc.sockc.tsflags = sk->sk_tsflags;
 	ipc.addr = inet->inet_saddr;
 	ipc.opt = NULL;
 	ipc.oif = sk->sk_bound_dev_if;
@@ -744,8 +745,6 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipc.ttl = 0;
 	ipc.tos = -1;
 
-	sock_tx_timestamp(sk, &ipc.tx_flags);
-
 	if (msg->msg_controllen) {
 		err = ip_cmsg_send(sk, msg, &ipc, false);
 		if (unlikely(err)) {
@@ -768,6 +767,8 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		rcu_read_unlock();
 	}
 
+	sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
+
 	saddr = ipc.addr;
 	ipc.addr = faddr = daddr;
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 088ce66..438f50c 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -339,8 +339,8 @@ int raw_rcv(struct sock *sk, struct sk_buff *skb)
 
 static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 			   struct msghdr *msg, size_t length,
-			   struct rtable **rtp,
-			   unsigned int flags)
+			   struct rtable **rtp, unsigned int flags,
+			   const struct sockcm_cookie *sockc)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct net *net = sock_net(sk);
@@ -379,7 +379,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb->ip_summed = CHECKSUM_NONE;
 
-	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
 
 	skb->transport_header = skb->network_header;
 	err = -EFAULT;
@@ -540,6 +540,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		daddr = inet->inet_daddr;
 	}
 
+	ipc.sockc.tsflags = sk->sk_tsflags;
 	ipc.addr = inet->inet_saddr;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
@@ -638,10 +639,10 @@ back_from_confirm:
 
 	if (inet->hdrincl)
 		err = raw_send_hdrinc(sk, &fl4, msg, len,
-				      &rt, msg->msg_flags);
+				      &rt, msg->msg_flags, &ipc.sockc);
 
 	 else {
-		sock_tx_timestamp(sk, &ipc.tx_flags);
+		sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
 
 		if (!ipc.addr)
 			ipc.addr = fl4.daddr;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ce3c9eb..4d73858 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -428,13 +428,13 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
-static void tcp_tx_timestamp(struct sock *sk, struct sk_buff *skb)
+static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb)
 {
-	if (sk->sk_tsflags) {
+	if (sk->sk_tsflags || tsflags) {
 		struct skb_shared_info *shinfo = skb_shinfo(skb);
 		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
-		sock_tx_timestamp(sk, &shinfo->tx_flags);
+		sock_tx_timestamp(sk, tsflags, &shinfo->tx_flags);
 		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
 		tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
@@ -959,7 +959,7 @@ new_segment:
 		offset += copy;
 		size -= copy;
 		if (!size) {
-			tcp_tx_timestamp(sk, skb);
+			tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
 			goto out;
 		}
 
@@ -1079,6 +1079,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
+	struct sockcm_cookie sockc;
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	bool sg;
@@ -1121,6 +1122,15 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 		/* 'common' sending to sendq */
 	}
 
+	sockc.tsflags = sk->sk_tsflags;
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(sk, msg, &sockc);
+		if (unlikely(err)) {
+			err = -EINVAL;
+			goto out_err;
+		}
+	}
+
 	/* This should be in poll */
 	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
 
@@ -1239,7 +1249,7 @@ new_segment:
 
 		copied += copy;
 		if (!msg_data_left(msg)) {
-			tcp_tx_timestamp(sk, skb);
+			tcp_tx_timestamp(sk, sockc.tsflags, skb);
 			goto out;
 		}
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index bccb4e1..45ff590 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1027,12 +1027,11 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		 */
 		connected = 1;
 	}
-	ipc.addr = inet->inet_saddr;
 
+	ipc.sockc.tsflags = sk->sk_tsflags;
+	ipc.addr = inet->inet_saddr;
 	ipc.oif = sk->sk_bound_dev_if;
 
-	sock_tx_timestamp(sk, &ipc.tx_flags);
-
 	if (msg->msg_controllen) {
 		err = ip_cmsg_send(sk, msg, &ipc, sk->sk_family == AF_INET6);
 		if (unlikely(err)) {
@@ -1059,6 +1058,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	saddr = ipc.addr;
 	ipc.addr = faddr = daddr;
 
+	sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
+
 	if (ipc.opt && ipc.opt->opt.srr) {
 		if (!daddr)
 			return -EINVAL;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 0a37ddc..6b573eb 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -400,6 +400,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info)
 	struct icmp6hdr tmp_hdr;
 	struct flowi6 fl6;
 	struct icmpv6_msg msg;
+	struct sockcm_cookie sockc_unused = {0};
 	int iif = 0;
 	int addr_type = 0;
 	int len;
@@ -527,7 +528,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info)
 			      len + sizeof(struct icmp6hdr),
 			      sizeof(struct icmp6hdr), hlimit,
 			      np->tclass, NULL, &fl6, (struct rt6_info *)dst,
-			      MSG_DONTWAIT, np->dontfrag);
+			      MSG_DONTWAIT, np->dontfrag, &sockc_unused);
 	if (err) {
 		ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTERRORS);
 		ip6_flush_pending_frames(sk);
@@ -566,6 +567,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 	int hlimit;
 	u8 tclass;
 	u32 mark = IP6_REPLY_MARK(net, skb->mark);
+	struct sockcm_cookie sockc_unused = {0};
 
 	saddr = &ipv6_hdr(skb)->daddr;
 
@@ -617,7 +619,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 	err = ip6_append_data(sk, icmpv6_getfrag, &msg, skb->len + sizeof(struct icmp6hdr),
 				sizeof(struct icmp6hdr), hlimit, tclass, NULL, &fl6,
 				(struct rt6_info *)dst, MSG_DONTWAIT,
-				np->dontfrag);
+				np->dontfrag, &sockc_unused);
 
 	if (err) {
 		ICMP6_INC_STATS_BH(net, idev, ICMP6_MIB_OUTERRORS);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 9428345..612f3d1 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1258,7 +1258,8 @@ static int __ip6_append_data(struct sock *sk,
 			     int getfrag(void *from, char *to, int offset,
 					 int len, int odd, struct sk_buff *skb),
 			     void *from, int length, int transhdrlen,
-			     unsigned int flags, int dontfrag)
+			     unsigned int flags, int dontfrag,
+			     const struct sockcm_cookie *sockc)
 {
 	struct sk_buff *skb, *skb_prev = NULL;
 	unsigned int maxfraglen, fragheaderlen, mtu, orig_mtu;
@@ -1329,7 +1330,7 @@ emsgsize:
 		csummode = CHECKSUM_PARTIAL;
 
 	if (sk->sk_type == SOCK_DGRAM || sk->sk_type == SOCK_RAW) {
-		sock_tx_timestamp(sk, &tx_flags);
+		sock_tx_timestamp(sk, sockc->tsflags, &tx_flags);
 		if (tx_flags & SKBTX_ANY_SW_TSTAMP &&
 		    sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)
 			tskey = sk->sk_tskey++;
@@ -1565,7 +1566,8 @@ int ip6_append_data(struct sock *sk,
 				int odd, struct sk_buff *skb),
 		    void *from, int length, int transhdrlen, int hlimit,
 		    int tclass, struct ipv6_txoptions *opt, struct flowi6 *fl6,
-		    struct rt6_info *rt, unsigned int flags, int dontfrag)
+		    struct rt6_info *rt, unsigned int flags, int dontfrag,
+		    const struct sockcm_cookie *sockc)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct ipv6_pinfo *np = inet6_sk(sk);
@@ -1593,7 +1595,8 @@ int ip6_append_data(struct sock *sk,
 
 	return __ip6_append_data(sk, fl6, &sk->sk_write_queue, &inet->cork.base,
 				 &np->cork, sk_page_frag(sk), getfrag,
-				 from, length, transhdrlen, flags, dontfrag);
+				 from, length, transhdrlen, flags, dontfrag,
+				 sockc);
 }
 EXPORT_SYMBOL_GPL(ip6_append_data);
 
@@ -1752,7 +1755,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 			     int hlimit, int tclass,
 			     struct ipv6_txoptions *opt, struct flowi6 *fl6,
 			     struct rt6_info *rt, unsigned int flags,
-			     int dontfrag)
+			     int dontfrag, const struct sockcm_cookie *sockc)
 {
 	struct inet_cork_full cork;
 	struct inet6_cork v6_cork;
@@ -1779,7 +1782,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 	err = __ip6_append_data(sk, fl6, &queue, &cork.base, &v6_cork,
 				&current->task_frag, getfrag, from,
 				length + exthdrlen, transhdrlen + exthdrlen,
-				flags, dontfrag);
+				flags, dontfrag, sockc);
 	if (err) {
 		__ip6_flush_pending_frames(sk, &queue, &cork, &v6_cork);
 		return ERR_PTR(err);
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index c382db7..da1cff7 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -62,6 +62,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct dst_entry *dst;
 	struct rt6_info *rt;
 	struct pingfakehdr pfh;
+	struct sockcm_cookie junk = {0};
 
 	pr_debug("ping_v6_sendmsg(sk=%p,sk->num=%u)\n", inet, inet->inet_num);
 
@@ -144,7 +145,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	err = ip6_append_data(sk, ping_getfrag, &pfh, len,
 			      0, hlimit,
 			      np->tclass, NULL, &fl6, rt,
-			      MSG_DONTWAIT, np->dontfrag);
+			      MSG_DONTWAIT, np->dontfrag, &junk);
 
 	if (err) {
 		ICMP6_INC_STATS(sock_net(sk), rt->rt6i_idev,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index f175ec0..b07ce21 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -822,8 +822,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (fl6.flowi6_oif == 0)
 		fl6.flowi6_oif = sk->sk_bound_dev_if;
 
-	sockc.tsflags = 0;
-
+	sockc.tsflags = sk->sk_tsflags;
 	if (msg->msg_controllen) {
 		opt = &opt_space;
 		memset(opt, 0, sizeof(struct ipv6_txoptions));
@@ -901,7 +900,7 @@ back_from_confirm:
 		lock_sock(sk);
 		err = ip6_append_data(sk, raw6_getfrag, &rfv,
 			len, 0, hlimit, tclass, opt, &fl6, (struct rt6_info *)dst,
-			msg->msg_flags, dontfrag);
+			msg->msg_flags, dontfrag, &sockc);
 
 		if (err)
 			ip6_flush_pending_frames(sk);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8371a51..fef21c4 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1248,7 +1248,7 @@ do_udp_sendmsg:
 		fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
 
 	fl6.flowi6_mark = sk->sk_mark;
-	sockc.tsflags = 0;
+	sockc.tsflags = sk->sk_tsflags;
 
 	if (msg->msg_controllen) {
 		opt = &opt_space;
@@ -1324,7 +1324,7 @@ back_from_confirm:
 		skb = ip6_make_skb(sk, getfrag, msg, ulen,
 				   sizeof(struct udphdr), hlimit, tclass, opt,
 				   &fl6, (struct rt6_info *)dst,
-				   msg->msg_flags, dontfrag);
+				   msg->msg_flags, dontfrag, &sockc);
 		err = PTR_ERR(skb);
 		if (!IS_ERR_OR_NULL(skb))
 			err = udp_v6_send_skb(skb, &fl6);
@@ -1351,7 +1351,8 @@ do_append_data:
 	err = ip6_append_data(sk, getfrag, msg, ulen,
 		sizeof(struct udphdr), hlimit, tclass, opt, &fl6,
 		(struct rt6_info *)dst,
-		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags, dontfrag);
+		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags, dontfrag,
+		&sockc);
 	if (err)
 		udp_v6_flush_pending_frames(sk);
 	else if (!corkreq)
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 4f29a4a..1a38f20 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -627,7 +627,7 @@ back_from_confirm:
 	err = ip6_append_data(sk, ip_generic_getfrag, msg,
 			      ulen, transhdrlen, hlimit, tclass, opt,
 			      &fl6, (struct rt6_info *)dst,
-			      msg->msg_flags, dontfrag);
+			      msg->msg_flags, dontfrag, &sockc_unused);
 	if (err)
 		ip6_flush_pending_frames(sk);
 	else if (!(msg->msg_flags & MSG_MORE))
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1ecfa71..0007e23 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1837,6 +1837,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
 	DECLARE_SOCKADDR(struct sockaddr_pkt *, saddr, msg->msg_name);
 	struct sk_buff *skb = NULL;
 	struct net_device *dev;
+	struct sockcm_cookie sockc;
 	__be16 proto = 0;
 	int err;
 	int extra_len = 0;
@@ -1925,12 +1926,21 @@ retry:
 		goto out_unlock;
 	}
 
+	sockc.tsflags = 0;
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(sk, msg, &sockc);
+		if (unlikely(err)) {
+			err = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
 	skb->protocol = proto;
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
 
-	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
 	if (unlikely(extra_len == 4))
 		skb->no_fcs = 1;
@@ -2486,7 +2496,8 @@ static int packet_snd_vnet_gso(struct sk_buff *skb,
 
 static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 		void *frame, struct net_device *dev, void *data, int tp_len,
-		__be16 proto, unsigned char *addr, int hlen, int copylen)
+		__be16 proto, unsigned char *addr, int hlen, int copylen,
+		const struct sockcm_cookie *sockc)
 {
 	union tpacket_uhdr ph;
 	int to_write, offset, len, nr_frags, len_max;
@@ -2500,7 +2511,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	skb->dev = dev;
 	skb->priority = po->sk.sk_priority;
 	skb->mark = po->sk.sk_mark;
-	sock_tx_timestamp(&po->sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
 	skb_shinfo(skb)->destructor_arg = ph.raw;
 
 	skb_reserve(skb, hlen);
@@ -2624,6 +2635,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	struct sk_buff *skb;
 	struct net_device *dev;
 	struct virtio_net_hdr *vnet_hdr = NULL;
+	struct sockcm_cookie sockc;
 	__be16 proto;
 	int err, reserve = 0;
 	void *ph;
@@ -2655,6 +2667,13 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		dev = dev_get_by_index(sock_net(&po->sk), saddr->sll_ifindex);
 	}
 
+	sockc.tsflags = 0;
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(&po->sk, msg, &sockc);
+		if (unlikely(err))
+			goto out;
+	}
+
 	err = -ENXIO;
 	if (unlikely(dev == NULL))
 		goto out;
@@ -2712,7 +2731,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			goto out_status;
 		}
 		tp_len = tpacket_fill_skb(po, skb, ph, dev, data, tp_len, proto,
-					  addr, hlen, copylen);
+					  addr, hlen, copylen, &sockc);
 		if (likely(tp_len >= 0) &&
 		    tp_len > dev->mtu + reserve &&
 		    !po->has_vnet_hdr &&
@@ -2851,6 +2870,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	if (unlikely(!(dev->flags & IFF_UP)))
 		goto out_unlock;
 
+	sockc.tsflags = 0;
 	sockc.mark = sk->sk_mark;
 	if (msg->msg_controllen) {
 		err = sock_cmsg_send(sk, msg, &sockc);
@@ -2908,7 +2928,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 		goto out_free;
 	}
 
-	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
 	if (!vnet_hdr.gso_type && (len > dev->mtu + reserve + extra_len) &&
 	    !packet_extra_vlan_len_allowed(dev, skb)) {
diff --git a/net/socket.c b/net/socket.c
index 5f77a8e..979d314 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -587,20 +587,20 @@ void sock_release(struct socket *sock)
 }
 EXPORT_SYMBOL(sock_release);
 
-void __sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags)
+void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags)
 {
 	u8 flags = *tx_flags;
 
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_TX_HARDWARE)
+	if (tsflags & SOF_TIMESTAMPING_TX_HARDWARE)
 		flags |= SKBTX_HW_TSTAMP;
 
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_TX_SOFTWARE)
+	if (tsflags & SOF_TIMESTAMPING_TX_SOFTWARE)
 		flags |= SKBTX_SW_TSTAMP;
 
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_TX_SCHED)
+	if (tsflags & SOF_TIMESTAMPING_TX_SCHED)
 		flags |= SKBTX_SCHED_TSTAMP;
 
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_TX_ACK)
+	if (tsflags & SOF_TIMESTAMPING_TX_ACK)
 		flags |= SKBTX_ACK_TSTAMP;
 
 	*tx_flags = flags;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
From: Brenden Blanco @ 2016-04-03  5:41 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer
In-Reply-To: <CALx6S37y8+kFh=04cceSpLWZMHkanwWREgoVKc7Edmyhe3qvzg@mail.gmail.com>

On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
> Very nice! Do you think this hook will be sufficient to implement a
> fast forward patch also?
That is the goal, but more work needs to be done of course. It won't be
possible with just a single pseudo skb, the driver will need a fast way to get
batches of pseudo skbs (per core?) through from rx to tx. In mlx4 for
instance, either the skb needs to be much more complete to be handled from the
start of mlx4_en_xmit(), or that function would need to be split so that the
fast tx could start midway through.

Or, skb allocation just gets much faster. Then it should be pretty
straightforward.
> 
> Tom

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Brenden Blanco @ 2016-04-03  6:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	john.fastabend
In-Reply-To: <20160402102331.5aa3b3c2@redhat.com>

On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
[...]
> 
> I think you need to DMA sync RX-page before you can safely access
> packet data in page (on all arch's).
> 
Thanks, I will give that a try in the next spin.
> > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > +						 frags[0].page_offset);
> > +			if (mlx4_call_bpf(prog, ethh, length)) {
> 
> AFAIK length here covers all the frags[n].page, thus potentially
> causing the BPF program to access memory out of bound (crash).
> 
> Having several page fragments is AFAIK an optimization for jumbo-frames
> on PowerPC (which is a bit annoying for you use-case ;-)).
> 
Yeah, this needs some more work. I can think of some options:
1. limit pseudo skb.len to first frag's length only, and signal to
program that the packet is incomplete
2. for nfrags>1 skip bpf processing, but this could be functionally
incorrect for some use cases
3. run the program for each frag
4. reject ndo_bpf_set when frags are possible (large mtu?)

My preference is to go with 1, thoughts?
> 
[...]

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Brenden Blanco @ 2016-04-03  6:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	john.fastabend, brouer
In-Reply-To: <1459562911.6473.299.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
[...]
> 
> 
> 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
> frame. 
> 
> Still you pass a single fragment but total 'length' here : BPF program
> can read past the end of this first fragment and panic the box.
> 
> Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
> mean.
Sure, I will do some reading. Jesper also raised the issue after you
did. Please let me know what you think of the proposals.
> 
> 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> sharing.
> 
>    This is probably the right time to add a rx_dropped field in struct
> mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> higher speed links.
> 
This sounds reasonable! Will look into it for the next spin.

^ permalink raw reply

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Brenden Blanco @ 2016-04-03  6:38 UTC (permalink / raw)
  To: Johannes Berg
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	john.fastabend, brouer
In-Reply-To: <1459622455.18188.5.camel@sipsolutions.net>

On Sat, Apr 02, 2016 at 08:40:55PM +0200, Johannes Berg wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> 
> > +static int mlx4_bpf_set(struct net_device *dev, int fd)
> > +{
> [...]
> > +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> > +			bpf_prog_put(prog);
> > +			return -EINVAL;
> > +		}
> > +	}
> 
> Why wouldn't this check be done in the generic code that calls
> ndo_bpf_set()?
Having a common check makes sense. The tricky thing is that the type can
only be checked after taking the reference, and I wanted to keep the
scope of the prog brief in the case of errors. I would have to move the
bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
ndo instead. Would that API look fine to you?

A possible extension of this is just to keep the bpf_prog * in the
netdev itself and expose a feature flag from the driver rather than an
ndo. But that would mean another 8 bytes in the netdev.
> 
> johannes

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox