Netdev List
 help / color / mirror / Atom feed
* Re: [net PATCH v2] gro: Allow tunnel stacking in the case of FOU/GUE
From: David Miller @ 2016-03-30 20:03 UTC (permalink / raw)
  To: aduyck; +Cc: jesse, netdev, alexander.duyck, tom
In-Reply-To: <20160329215226.12478.3696.stgit@localhost.localdomain>

From: Alexander Duyck <aduyck@mirantis.com>
Date: Tue, 29 Mar 2016 14:55:22 -0700

> This patch should fix the issues seen with a recent fix to prevent
> tunnel-in-tunnel frames from being generated with GRO.  The fix itself is
> correct for now as long as we do not add any devices that support
> NETIF_F_GSO_GRE_CSUM.  When such a device is added it could have the
> potential to mess things up due to the fact that the outer transport header
> points to the outer UDP header and not the GRE header as would be expected.
> 
> Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of encapsulation.")
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
> 
> v2: Dropped switch statements per suggestion of Tom Herbert.

Applied, thanks Alex.

^ permalink raw reply

* Re: [net PATCH] i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet
From: Sowmini Varadhan @ 2016-03-30 20:09 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Alexander Duyck, Alexander Duyck, Netdev, intel-wired-lan,
	Jeff Kirsher
In-Reply-To: <20160330124135.000054c6@unknown>

On (03/30/16 12:41), Jesse Brandeburg wrote:
> This gets "Even Uglier", I've turned off all offloads at my receiver,
> enabled calling skb_linearize on *all* frames, which works fine for
> scp, but the receiver shows > MSS sized frames on the wire for
> rds-stress traffic.

fwiw, I was not able to reproduce this tx issue with iperf/netperf
etc either (I tried various perumtations of bidir, req-resp etc). One
difference between the rds-stress invocation and the other callers
is that this comes down via tcp_sendpage(), I dont know if there
is something in the sendpage path that i40e does not anticipate. 

Other drivers (ixgbe etc) work fine, so my hunch would be that this
is specific to i40e (and not a skb_linearize bug) but I could be
wrong.

--Sowmini

^ permalink raw reply

* Re: [net PATCH] i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet
From: Eric Dumazet @ 2016-03-30 20:15 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Jesse Brandeburg, Alexander Duyck, Alexander Duyck, Netdev,
	intel-wired-lan, Jeff Kirsher
In-Reply-To: <20160330200959.GI27540@oracle.com>

On Wed, 2016-03-30 at 16:09 -0400, Sowmini Varadhan wrote:
> On (03/30/16 12:41), Jesse Brandeburg wrote:
> > This gets "Even Uglier", I've turned off all offloads at my receiver,
> > enabled calling skb_linearize on *all* frames, which works fine for
> > scp, but the receiver shows > MSS sized frames on the wire for
> > rds-stress traffic.
> 
> fwiw, I was not able to reproduce this tx issue with iperf/netperf
> etc either (I tried various perumtations of bidir, req-resp etc). One
> difference between the rds-stress invocation and the other callers
> is that this comes down via tcp_sendpage(), I dont know if there
> is something in the sendpage path that i40e does not anticipate. 

You might try netperf -t TCP_SENDFILE -- -m 150

to let netperf use sendfile() on small frags.

^ permalink raw reply

* Re: [RFC] Add netdev all_adj_list refcnt propagation to fix panic
From: David Miller @ 2016-03-30 20:15 UTC (permalink / raw)
  To: acollins; +Cc: netdev, mschiffer
In-Reply-To: <20160330.160150.1290720758360796805.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Wed, 30 Mar 2016 16:01:50 -0400 (EDT)

> Veaceslav, please look into this.

Of course, his email now bounces.... :-/

^ permalink raw reply

* Re: [PATCH net-next] tcp: remove cwnd moderation after recovery
From: Yuchung Cheng @ 2016-03-30 20:20 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, netdev, Matt Mathis, Neal Cardwell,
	Soheil Hassas Yeganeh
In-Reply-To: <20160329173545.5fe0ae8b@xeon-e3>

On Tue, Mar 29, 2016 at 5:35 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 29 Mar 2016 17:15:52 -0700
> Yuchung Cheng <ycheng@google.com> wrote:
>
>> For non-SACK connections, cwnd is lowered to inflight plus 3 packets
>> when the recovery ends. This is an optional feature in the NewReno
>> RFC 2582 to reduce the potential burst when cwnd is "re-opened"
>> after recovery and inflight is low.
>>
>> This feature is questionably effective because of PRR: when
>> the recovery ends (i.e., snd_una == high_seq) NewReno holds the
>> CA_Recovery state for another round trip to prevent false fast
>> retransmits. But if the inflight is low, PRR will overwrite the
>> moderated cwnd in tcp_cwnd_reduction() later.
>>
>> On the other hand, if the recovery ends because the sender
>> detects the losses were spurious (e.g., reordering). This feature
>> unconditionally lowers a reverted cwnd even though nothing
>> was lost.
>>
>> By principle loss recovery module should not update cwnd. Further
>> pacing is much more effective to reduce burst. Hence this patch
>> removes the cwnd moderation feature.
>>
>> Signed-off-by: Matt Mathis <mattmathis@google.com>
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
>> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
>
> I have a concern that this might break Linux builtin protection
> against hostile receiver sending bogus ACK's.  Remember Linux is
> different than NewReno. You are changing something that has existed for
> a long long time.

I suppose the bogus ACKs are ACKs acking future but in-flight data
packets. The most bogus ACK would ack at most SND.NXT otherwise it
will be filtered early in tcp_ack(). Then cwnd_moderation() will pull
cwnd down to inflight + maxburst = 0 + 3 = 3. But immediately
afterward tcp_cwnd_reduction() will over-write cwnd = inflight +
sndcnt =~ 0 + newly_acked_sacked.

w/o pacing, such a bogus ACK can at most induce a burst of a window
(minus 3 dupacks). Restart after (short) idle or receiving stretched
ACKs can induce the same degree of behavior. More importantly the cwnd
moderation is already a NOP b/c cwnd is over-written by the PRR,
regardless of this patch. If the receiver's intention is to speed up
recovery, he is risking reliability w/ future ACKs. if the sole
intention is network abuse then he unlikely would bother to trigger
recovery and congestion control reactions in the first place.

I will update the commit message regarding your concern. Thanks.

^ permalink raw reply

* Re: [PATCH] sctp: flush if we can't fit another DATA chunk
From: Neil Horman @ 2016-03-30 20:32 UTC (permalink / raw)
  To: David Miller; +Cc: marcelo.leitner, netdev, vyasevich, linux-sctp
In-Reply-To: <20160330.154622.1322506750240048120.davem@davemloft.net>

On Wed, Mar 30, 2016 at 03:46:22PM -0400, David Miller wrote:
> From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> Date: Tue, 29 Mar 2016 10:41:25 -0300
> 
> > There is no point in delaying the packet if we can't fit a single byte
> > of data on it anymore. So lets just reduce the threshold by the amount
> > that a data chunk with 4 bytes (rounding) would use.
> > 
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  net/sctp/output.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/sctp/output.c b/net/sctp/output.c
> > index 97745351d58c2fb32b9f9b57d61831d7724d83b2..c518569123ce42a8f21f80754756306c39875013 100644
> > --- a/net/sctp/output.c
> > +++ b/net/sctp/output.c
> > @@ -705,7 +705,8 @@ static sctp_xmit_t sctp_packet_can_append_data(struct sctp_packet *packet,
> >  	/* Check whether this chunk and all the rest of pending data will fit
> >  	 * or delay in hopes of bundling a full sized packet.
> >  	 */
> > -	if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead)
> > +	if (chunk->skb->len + q->out_qlen >
> > +		maxsize - packet->overhead - sizeof(sctp_data_chunk_t) - 4)
> 
> There is no maxsize in this function.
> 
> You must generate and test your patches against my networking tree.
> 
> Neil, how were you able to see where 'maxsize' is and how it's even
> calculated before determining that this change is correct?
> 
Shit, sorry, dave, I trusted that Marcello built and tested the patch, and
reviewed it based on the validity of the math (which making assumptions for what
maxsize was, should be correct).  No exuse, I screwed up.
Neil

> Please don't ACK patches you really didn't verify in any way at all,
> thanks.  It's better to have no reviews than bad reviews, because ACKs
> are supposed to give me a reason to be more confident in the change.
> 
> Marcelo, I'm ignoring the rest of your SCTP changes, you have to get
> your act together.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [net PATCH] i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet
From: Sowmini Varadhan @ 2016-03-30 20:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesse Brandeburg, Alexander Duyck, Alexander Duyck, Netdev,
	intel-wired-lan, Jeff Kirsher
In-Reply-To: <1459368904.6473.207.camel@edumazet-glaptop3.roam.corp.google.com>

On (03/30/16 13:15), Eric Dumazet wrote:
> You might try netperf -t TCP_SENDFILE -- -m 150
> 
> to let netperf use sendfile() on small frags.

that still did not reproduce it but let me try beating on
that approach with more permutations.

BTW, another data-point that may help debug this: even with i40e,
if you use the "-o" option to the rds-stress invocation, there are
no problems: the "-o" option  enforces uni-directional data
transfer, so one side is pure-Tx, other side is pure-Rx. 

It is only when both sides simultaneously do both Tx and Rx
on the tcp socket that you see the issue. I dont know if
that provides any clues.

--Sowmini

^ permalink raw reply

* Re: [RFC] Add netdev all_adj_list refcnt propagation to fix panic
From: Andrew Collins @ 2016-03-30 20:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, mschiffer, vfalico, vfalico
In-Reply-To: <20160330.160150.1290720758360796805.davem@davemloft.net>

> From: Andrew Collins <acollins@cradlepoint.com>
> Date: Tue, 29 Mar 2016 11:25:03 -0600
>
>> This is an RFC patch to fix a relatively easily reproducible kernel
>> panic related to the all_adj_list handling for netdevs in recent kernels.
>>
>> This is more to generate discussion than anything else.  I don't
>> particularly like this approach, I'm hoping someone has a better idea.
>>
>> The following sequence of commands will reproduce the issue:
>>
>> ip link add link eth0 name eth0.100 type vlan id 100
>> ip link add link eth0 name eth0.200 type vlan id 200
>> ip link add name testbr type bridge
>> ip link set eth0.100 master testbr
>> ip link set eth0.200 master testbr
>> ip link add link testbr mac0 type macvlan
>> ip link delete dev testbr
>>
>> This creates an upper/lower tree of (excuse the poor ASCII art):
>>
>>              /---eth0.100-eth0
>> mac0-testbr-
>>              \---eth0.200-eth0
>>
>> When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from
>> the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
>> reference to eth0 is added, so this results in a panic.
>>
>> This change adds reference count propagation so things are handled properly.
>>
>> Matthias Schiffer reported a similar crash in batman-adv:
>>
>> https://github.com/freifunk-gluon/gluon/issues/680
>> https://www.open-mesh.org/issues/247
>>
>> which this patch also seems to resolve.
>
> Veaceslav, please look into this.
>
> Thanks.
>
> !SIG:56fc30b5184771297483788!
>

+vfalico's new address picked up from git logs

^ permalink raw reply

* RE: possible bug in latest network tree
From: Light, John J @ 2016-03-30 21:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org

David,

I see a recent change in inet_connection_sock.c that uses sysctl_tcp_synack_retries suspiciously.

Previously, sysctl_tcp_synack_retries was a global variable, and now it has been moved into the fragment (really netns_ipv4).

In reqsk_timer_handler (inet_connection_sock.c) near line 563 it is used as the default max_retries value in case icsk->icsk_syn_retries is not set.

Previously, other transports (dccp?) would use the TCP global variable as a default.  Now that the defaults come from the fragment, I suspect the fragment variable is not set for other transports.  (I can't find where it's set for dccp.)

Of course, this will only show in non-TCP transport protocols when the icsk retries value is not set, so it's a rare case.  Perhaps it's an unreachable case, since I don't know all the kernel paths.

Maybe the problem is that the default shouldn't be to a TCP value, but should be a 'transport' value.

This code is somewhat convoluted, so I am not sure of my analysis, but I wanted you to consider it.

John Light
Intel OTC comms

^ permalink raw reply

* Re: [PATCHv2 net] team: team should sync the port's uc/mc addrs when add a port
From: David Miller @ 2016-03-30 21:07 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, jiri, marcelo.leitner, xiyou.wangcong
In-Reply-To: <e7eed00cdf5b0bf2d7eda4364f0fd28e95028db8.1459306719.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Wed, 30 Mar 2016 10:58:39 +0800

> There is an issue when we use mavtap over team:
> When we replug nic links from team0, the real nics's mc list will not
> include the maddr for macvtap any more. then we can't receive pkts to
> macvtap device, as they are filterred by mc list of nic.
> 
> In Bonding Driver, it syncs the uc/mc addrs in bond_enslave().
> 
> We will fix this issue on team by adding the port's uc/mc addrs sync in
> team_port_add.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: possible bug in latest network tree
From: Eric Dumazet @ 2016-03-30 21:17 UTC (permalink / raw)
  To: Light, John J; +Cc: David S. Miller, netdev@vger.kernel.org
In-Reply-To: <660C7E53FDDFB64C8741117E0D659AAAB8463B9F@ORSMSX116.amr.corp.intel.com>

On Wed, 2016-03-30 at 21:05 +0000, Light, John J wrote:
> David,
> 
> I see a recent change in inet_connection_sock.c that uses
> sysctl_tcp_synack_retries suspiciously.
> 
> Previously, sysctl_tcp_synack_retries was a global variable, and now
> it has been moved into the fragment (really netns_ipv4).

You mean the network namespace ?
> 
> In reqsk_timer_handler (inet_connection_sock.c) near line 563 it is
> used as the default max_retries value in case icsk->icsk_syn_retries
> is not set.
> 
> Previously, other transports (dccp?) would use the TCP global variable
> as a default.  Now that the defaults come from the fragment, I suspect
> the fragment variable is not set for other transports.  (I can't find
> where it's set for dccp.)
> 
> Of course, this will only show in non-TCP transport protocols when the
> icsk retries value is not set, so it's a rare case.  Perhaps it's an
> unreachable case, since I don't know all the kernel paths.
> 
> Maybe the problem is that the default shouldn't be to a TCP value, but
> should be a 'transport' value.
> 
> This code is somewhat convoluted, so I am not sure of my analysis, but
> I wanted you to consider it.
> 

No idea why you ask David instead of patch author ?

I don't see any problem here.

Do we really want to spend time on this minor issue ?

^ permalink raw reply

* Re: [net PATCH] i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet
From: Alexander Duyck @ 2016-03-30 21:20 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Sowmini Varadhan, Alexander Duyck, Netdev, intel-wired-lan,
	Jeff Kirsher
In-Reply-To: <20160330124135.000054c6@unknown>

On Wed, Mar 30, 2016 at 12:41 PM, Jesse Brandeburg
<jesse.brandeburg@intel.com> wrote:
> On Wed, 30 Mar 2016 10:35:55 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Wed, Mar 30, 2016 at 10:20 AM, Sowmini Varadhan
>> <sowmini.varadhan@oracle.com> wrote:
>> > On (03/30/16 10:12), Alexander Duyck wrote:
>> >> Yeah.  The patch was sort of a knee-jerk reaction to being told that
>> >> the patch referenced caused a regression.  From what I can tell that
>> >> is not the case as I am also seeing the Tx hangs when I run the test
>> >> with the frames being linearized.
>> >
>> > I'm not sure how important of a subtlety this is, but the actual
>> > console log after the patch is the following:
>> >
>> >  i40e 0000:82:00.0: TX driver issue detected, PF reset issued
>> >  i40e 0000:82:00.0 eth2: adding 68:05:ca:30:dd:18 vid=0
>> >  i40e 0000:82:00.0: TX driver issue detected, PF reset issued
>> >  i40e 0000:82:00.0 eth2: adding 68:05:ca:30:dd:18 vid=0
>> >  i40e 0000:82:00.0: TX driver issue detected, PF reset issued
>> >
>> > Comparing with what I'd pasted in the sourceforge thread earlier,
>> > I see that it does not say "Hung Tx queue etc."  any more, though
>> > it still resets.
>> >
>> > Not sure if that changed info is significant?
>>
>> It might be.  Right now I am chasing down the Tx driver issue as that
>> I what I am reproducing in my environment as well.
>
> This gets "Even Uglier", I've turned off all offloads at my receiver,
> enabled calling skb_linearize on *all* frames, which works fine for
> scp, but the receiver shows > MSS sized frames on the wire for
> rds-stress traffic.

Are you sure it isn't just GRO reassembling frames on the receive
side.  I know that one always trips me up when I am using the Rx path
to validate Tx checksums.

> This implies to me we have some issue with skb_linearize, possibly in
> how the stack linearizes the data, or how the driver interprets the
> linearized packets (which should always work)
>
> Wheee......

With the descriptor dump code you have you should be able to verify
what the layout is after the descriptor is linearized.  I would think
in most cases you would end up with at most something like 4 to maybe
5 descriptors for a 64K frame.

- Alex

^ permalink raw reply

* Re: [PATCH 0/5] wireless: ti: Convert specialized logging macros to kernel style
From: Joe Perches @ 2016-03-30 21:36 UTC (permalink / raw)
  To: Kalle Valo; +Cc: linux-kernel, linux-wireless, netdev
In-Reply-To: <87shz85gkq.fsf@purkki.adurom.net>

On Wed, 2016-03-30 at 14:51 +0300, Kalle Valo wrote:
> Joe Perches <joe@perches.com> writes:
> > Using the normal kernel logging mechanisms makes this code
> > a bit more like other wireless drivers.
> Personally I don't see the point but I don't have any strong opinions. A
> bigger problem is that TI drivers are not really in active development
> and that's I'm not thrilled to take big patches like this for dormant
> drivers.

Not very dormant.

35 patches in the last year, most of them adding functionality.

^ permalink raw reply

* [PATCH v2 net-next] tcp: remove cwnd moderation after recovery
From: Yuchung Cheng @ 2016-03-30 21:54 UTC (permalink / raw)
  To: davem
  Cc: netdev, Yuchung Cheng, Matt Mathis, Neal Cardwell,
	Soheil Hassas Yeganeh

For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.

This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
receiver responds bogus ACKs (i.e., acking future data) to speed up
transfer after recovery, it can only induce a burst up to a window
worth of data packets by acking up to SND.NXT. A restart from (short)
idle or receiving streched ACKs can both cause such bursts as well.

On the other hand, if the recovery ends because the sender
detects the losses were spurious (e.g., reordering). This feature
unconditionally lowers a reverted cwnd even though nothing
was lost.

By principle loss recovery module should not update cwnd. Further
pacing is much more effective to reduce burst. Hence this patch
removes the cwnd moderation feature.

v2 changes: revised commit message on bogus ACKs and burst, and
            missing signature

Signed-off-by: Matt Mathis <mattmathis@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h    | 11 -----------
 net/ipv4/tcp_input.c | 11 -----------
 2 files changed, 22 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f..f8bb4a4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1039,17 +1039,6 @@ static inline __u32 tcp_max_tso_deferred_mss(const struct tcp_sock *tp)
 	return 3;
 }
 
-/* Slow start with delack produces 3 packets of burst, so that
- * it is safe "de facto".  This will be the default - same as
- * the default reordering threshold - but if reordering increases,
- * we must be able to allow cwnd to burst at least this much in order
- * to not pull it back when holes are filled.
- */
-static __inline__ __u32 tcp_max_burst(const struct tcp_sock *tp)
-{
-	return tp->reordering;
-}
-
 /* Returns end sequence number of the receiver's advertised window */
 static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f7..f87b84a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2252,16 +2252,6 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
 	}
 }
 
-/* CWND moderation, preventing bursts due to too big ACKs
- * in dubious situations.
- */
-static inline void tcp_moderate_cwnd(struct tcp_sock *tp)
-{
-	tp->snd_cwnd = min(tp->snd_cwnd,
-			   tcp_packets_in_flight(tp) + tcp_max_burst(tp));
-	tp->snd_cwnd_stamp = tcp_time_stamp;
-}
-
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
 	return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2410,7 +2400,6 @@ static bool tcp_try_undo_recovery(struct sock *sk)
 		/* Hold old state until something *above* high_seq
 		 * is ACKed. For Reno it is MUST to prevent false
 		 * fast retransmits (RFC2582). SACK TCP is safe. */
-		tcp_moderate_cwnd(tp);
 		if (!tcp_any_retrans_done(sk))
 			tp->retrans_stamp = 0;
 		return true;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next 0/8] add TX timestamping via cmsg
From: Soheil Hassas Yeganeh @ 2016-03-30 22:37 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

This patch series aim at enabling TX timestamping via cmsg.

Currently, to occasionally sample TX timestamping on a socket,
applications need to call setsockopt twice: first for enabling
timestamps and then for disabling them. This is an unnecessary
overhead. With cmsg, in contrast, applications can sample TX
timestamps per sendmsg().

This patch series adds the code for processing SO_TIMESTAMPING
for cmsg's of the SOL_SOCKET level, and adds the glue code for
TCP, UDP, and RAW for both IPv4 and IPv6. This implementation
supports overriding timestamp generation flags (i.e.,
SOF_TIMESTAMPING_TX_*) but not timestamp reporting flags.
Applications must still enable timestamp reporting via
setsockopt to receive timestamps.

This series does not change existing timestamping behavior for
applications that are using socket options.

I will follow up with another patch to enable timestamping for
active TFO (client-side TCP Fast Open) and also setting packet
mark via cmsgs.

Thanks!

Soheil Hassas Yeganeh (7):
  tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO
  tcp: use one bit in TCP_SKB_CB to mark ACK timestamps
  sock: accept SO_TIMESTAMPING flags in socket cmsg
  ipv4: process socket-level control messages in IPv4
  ipv6: process socket-level control messages in IPv6
  sock: enable timestamping using control messages
  sock: document timestamping via cmsg in Documentation

Willem de Bruijn (1):
  sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop

 Documentation/networking/timestamping.txt | 48 ++++++++++++++++++++++++++++--
 drivers/net/tun.c                         |  3 +-
 include/net/ip.h                          |  3 +-
 include/net/ipv6.h                        |  6 ++--
 include/net/sock.h                        | 13 +++++---
 include/net/tcp.h                         |  3 +-
 include/net/transp_v6.h                   |  3 +-
 include/uapi/linux/net_tstamp.h           | 10 +++++++
 net/can/raw.c                             |  2 +-
 net/core/sock.c                           | 49 +++++++++++++++++++++++--------
 net/ipv4/ip_sockglue.c                    |  9 +++++-
 net/ipv4/ping.c                           |  7 +++--
 net/ipv4/raw.c                            | 13 ++++----
 net/ipv4/tcp.c                            | 22 ++++++++++----
 net/ipv4/tcp_input.c                      |  2 +-
 net/ipv4/udp.c                            | 10 +++----
 net/ipv6/datagram.c                       |  9 +++++-
 net/ipv6/icmp.c                           |  6 ++--
 net/ipv6/ip6_flowlabel.c                  |  3 +-
 net/ipv6/ip6_output.c                     | 15 ++++++----
 net/ipv6/ipv6_sockglue.c                  |  3 +-
 net/ipv6/ping.c                           |  3 +-
 net/ipv6/raw.c                            |  7 +++--
 net/ipv6/udp.c                            | 10 +++++--
 net/packet/af_packet.c                    | 30 +++++++++++++++----
 net/socket.c                              | 10 +++----
 26 files changed, 225 insertions(+), 74 deletions(-)

-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply

* [PATCH net-next 1/8] sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop
From: Soheil Hassas Yeganeh @ 2016-03-30 22:37 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-1-git-send-email-soheil.kdev@gmail.com>

From: Willem de Bruijn <willemb@google.com>

To process cmsg's of the SOL_SOCKET level in addition to
cmsgs of another level, protocols can call sock_cmsg_send().
This causes a double walk on the cmsghdr list, one for SOL_SOCKET
and one for the other level.

Extract the inner demultiplex logic from the loop that walks the list,
to allow having this called directly from a walker in the protocol
specific code.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/sock.h |  2 ++
 net/core/sock.c    | 33 ++++++++++++++++++++++-----------
 2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e0..03772d4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1420,6 +1420,8 @@ struct sockcm_cookie {
 	u32 mark;
 };
 
+int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+		     struct sockcm_cookie *sockc);
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 		   struct sockcm_cookie *sockc);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9ae..66976f8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1866,27 +1866,38 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 }
 EXPORT_SYMBOL(sock_alloc_send_skb);
 
+int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+		     struct sockcm_cookie *sockc)
+{
+	switch (cmsg->cmsg_type) {
+	case SO_MARK:
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+			return -EPERM;
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
+			return -EINVAL;
+		sockc->mark = *(u32 *)CMSG_DATA(cmsg);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(__sock_cmsg_send);
+
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
 		   struct sockcm_cookie *sockc)
 {
 	struct cmsghdr *cmsg;
+	int ret;
 
 	for_each_cmsghdr(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
 			return -EINVAL;
 		if (cmsg->cmsg_level != SOL_SOCKET)
 			continue;
-		switch (cmsg->cmsg_type) {
-		case SO_MARK:
-			if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
-				return -EPERM;
-			if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
-				return -EINVAL;
-			sockc->mark = *(u32 *)CMSG_DATA(cmsg);
-			break;
-		default:
-			return -EINVAL;
-		}
+		ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
+		if (ret)
+			return ret;
 	}
 	return 0;
 }
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next 3/8] tcp: use one bit in TCP_SKB_CB to mark ACK timestamps
From: Soheil Hassas Yeganeh @ 2016-03-30 22:37 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips socket that do not have
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on an implicit assumption that the
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.

To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamps marks without
a cache-line miss.

To enable per-packet marking without a cache line miss, use
one bit in TCP_SKB_CB to mark a whether a SKB might need a
ack tx timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h    | 3 ++-
 net/ipv4/tcp.c       | 2 ++
 net/ipv4/tcp_input.c | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f..f3a80ec 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -754,7 +754,8 @@ struct tcp_skb_cb {
 				TCPCB_REPAIRED)
 
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
-	/* 1 byte hole */
+	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
+			unused:7;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct inet_skb_parm	h4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 08b8b96..ce3c9eb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -432,10 +432,12 @@ static void tcp_tx_timestamp(struct sock *sk, struct sk_buff *skb)
 {
 	if (sk->sk_tsflags) {
 		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
 		sock_tx_timestamp(sk, &shinfo->tx_flags);
 		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+		tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
 	}
 }
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f7..2d5fee4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3093,7 +3093,7 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
 	const struct skb_shared_info *shinfo;
 
 	/* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
-	if (likely(!(sk->sk_tsflags & SOF_TIMESTAMPING_TX_ACK)))
+	if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
 		return;
 
 	shinfo = skb_shinfo(skb);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next 2/8] tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO
From: Soheil Hassas Yeganeh @ 2016-03-30 22:37 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
to associate timestamps with send calls. For TCP connections,
tp->snd_una is used as the starting point to calculate
relative IDs.

This socket option will fail if set before the handshake on a
passive TCP fast open connection with data in SYN or SYN/ACK,
since setsockopt requires the connection to be in the
ESTABLISHED state.

To address these, instead of limiting the option to the
ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
long as the connection is not in LISTEN or CLOSE states.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/core/sock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 66976f8..0a64fe2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -832,7 +832,8 @@ set_rcvbuf:
 		    !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
 			if (sk->sk_protocol == IPPROTO_TCP &&
 			    sk->sk_type == SOCK_STREAM) {
-				if (sk->sk_state != TCP_ESTABLISHED) {
+				if ((1 << sk->sk_state) &
+				    (TCPF_CLOSE | TCPF_LISTEN)) {
 					ret = -EINVAL;
 					break;
 				}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next 4/8] sock: accept SO_TIMESTAMPING flags in socket cmsg
From: Soheil Hassas Yeganeh @ 2016-03-30 22:37 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
as a basis to accept timestamping requests per write.

This implementation only accepts TX recording flags (i.e.,
SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
control messages. Users need to set reporting flags (e.g.,
SOF_TIMESTAMPING_OPT_ID) per socket via socket options.

This commit adds a tsflags field in sockcm_cookie which is
set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_*
bits in sockcm_cookie.tsflags allowing the control message
to override the recording behavior per write, yet maintaining
the value of other flags.

This patch implements validating the control message and setting
tsflags in struct sockcm_cookie. Next commits in this series will
actually implement timestamping per write for different protocols.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/sock.h              |  1 +
 include/uapi/linux/net_tstamp.h | 10 ++++++++++
 net/core/sock.c                 | 13 +++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 03772d4..af012da 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1418,6 +1418,7 @@ void sk_send_sigurg(struct sock *sk);
 
 struct sockcm_cookie {
 	u32 mark;
+	u16 tsflags;
 };
 
 int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 6d1abea..264e515 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -31,6 +31,16 @@ enum {
 				 SOF_TIMESTAMPING_LAST
 };
 
+/*
+ * SO_TIMESTAMPING flags are either for recording a packet timestamp or for
+ * reporting the timestamp to user space.
+ * Recording flags can be set both via socket options and control messages.
+ */
+#define SOF_TIMESTAMPING_TX_RECORD_MASK	(SOF_TIMESTAMPING_TX_HARDWARE | \
+					 SOF_TIMESTAMPING_TX_SOFTWARE | \
+					 SOF_TIMESTAMPING_TX_SCHED | \
+					 SOF_TIMESTAMPING_TX_ACK)
+
 /**
  * struct hwtstamp_config - %SIOCGHWTSTAMP and %SIOCSHWTSTAMP parameter
  *
diff --git a/net/core/sock.c b/net/core/sock.c
index 0a64fe2..315f5e5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1870,6 +1870,8 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
 int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 		     struct sockcm_cookie *sockc)
 {
+	u32 tsflags;
+
 	switch (cmsg->cmsg_type) {
 	case SO_MARK:
 		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
@@ -1878,6 +1880,17 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 			return -EINVAL;
 		sockc->mark = *(u32 *)CMSG_DATA(cmsg);
 		break;
+	case SO_TIMESTAMPING:
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
+			return -EINVAL;
+
+		tsflags = *(u32 *)CMSG_DATA(cmsg);
+		if (tsflags & ~SOF_TIMESTAMPING_TX_RECORD_MASK)
+			return -EINVAL;
+
+		sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
+		sockc->tsflags |= tsflags;
+		break;
 	default:
 		return -EINVAL;
 	}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next 6/8] ipv6: process socket-level control messages in IPv6
From: Soheil Hassas Yeganeh @ 2016-03-30 22:37 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Process socket-level control messages by invoking
__sock_cmsg_send in ip6_datagram_send_ctl for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip6_datagram_send_ctl is called for
udp and raw, we also process socket-level control messages.

This is a bit uglier than IPv4, since IPv6 does not have
something like ipcm_cookie. Perhaps we can later create
a control message cookie for IPv6?

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv6 control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/transp_v6.h  | 3 ++-
 net/ipv6/datagram.c      | 9 ++++++++-
 net/ipv6/ip6_flowlabel.c | 3 ++-
 net/ipv6/ipv6_sockglue.c | 3 ++-
 net/ipv6/raw.c           | 6 +++++-
 net/ipv6/udp.c           | 5 ++++-
 6 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h
index b927413..2b1c345 100644
--- a/include/net/transp_v6.h
+++ b/include/net/transp_v6.h
@@ -42,7 +42,8 @@ void ip6_datagram_recv_specific_ctl(struct sock *sk, struct msghdr *msg,
 
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg,
 			  struct flowi6 *fl6, struct ipv6_txoptions *opt,
-			  int *hlimit, int *tclass, int *dontfrag);
+			  int *hlimit, int *tclass, int *dontfrag,
+			  struct sockcm_cookie *sockc);
 
 void ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
 			     __u16 srcp, __u16 destp, int bucket);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 4281621..a73d701 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -685,7 +685,8 @@ EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 			  struct msghdr *msg, struct flowi6 *fl6,
 			  struct ipv6_txoptions *opt,
-			  int *hlimit, int *tclass, int *dontfrag)
+			  int *hlimit, int *tclass, int *dontfrag,
+			  struct sockcm_cookie *sockc)
 {
 	struct in6_pktinfo *src_info;
 	struct cmsghdr *cmsg;
@@ -702,6 +703,12 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 			goto exit_f;
 		}
 
+		if (cmsg->cmsg_level == SOL_SOCKET) {
+			if (__sock_cmsg_send(sk, msg, cmsg, sockc))
+				return -EINVAL;
+			continue;
+		}
+
 		if (cmsg->cmsg_level != SOL_IPV6)
 			continue;
 
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index dc2db4f..35d3ddc 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -372,6 +372,7 @@ fl_create(struct net *net, struct sock *sk, struct in6_flowlabel_req *freq,
 	if (olen > 0) {
 		struct msghdr msg;
 		struct flowi6 flowi6;
+		struct sockcm_cookie sockc_junk;
 		int junk;
 
 		err = -ENOMEM;
@@ -390,7 +391,7 @@ fl_create(struct net *net, struct sock *sk, struct in6_flowlabel_req *freq,
 		memset(&flowi6, 0, sizeof(flowi6));
 
 		err = ip6_datagram_send_ctl(net, sk, &msg, &flowi6, fl->opt,
-					    &junk, &junk, &junk);
+					    &junk, &junk, &junk, &sockc_junk);
 		if (err)
 			goto done;
 		err = -EINVAL;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 4449ad1..a5557d2 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -471,6 +471,7 @@ sticky_done:
 		struct ipv6_txoptions *opt = NULL;
 		struct msghdr msg;
 		struct flowi6 fl6;
+		struct sockcm_cookie sockc_junk;
 		int junk;
 
 		memset(&fl6, 0, sizeof(fl6));
@@ -503,7 +504,7 @@ sticky_done:
 		msg.msg_control = (void *)(opt+1);
 
 		retv = ip6_datagram_send_ctl(net, sk, &msg, &fl6, opt, &junk,
-					     &junk, &junk);
+					     &junk, &junk, &sockc_junk);
 		if (retv)
 			goto done;
 update:
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index fa59dd7..f175ec0 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -745,6 +745,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct dst_entry *dst = NULL;
 	struct raw6_frag_vec rfv;
 	struct flowi6 fl6;
+	struct sockcm_cookie sockc;
 	int addr_len = msg->msg_namelen;
 	int hlimit = -1;
 	int tclass = -1;
@@ -821,13 +822,16 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (fl6.flowi6_oif == 0)
 		fl6.flowi6_oif = sk->sk_bound_dev_if;
 
+	sockc.tsflags = 0;
+
 	if (msg->msg_controllen) {
 		opt = &opt_space;
 		memset(opt, 0, sizeof(struct ipv6_txoptions));
 		opt->tot_len = sizeof(struct ipv6_txoptions);
 
 		err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
-					    &hlimit, &tclass, &dontfrag);
+					    &hlimit, &tclass, &dontfrag,
+					    &sockc);
 		if (err < 0) {
 			fl6_sock_release(flowlabel);
 			return err;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index fd25e44..8371a51 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1128,6 +1128,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	int connected = 0;
 	int is_udplite = IS_UDPLITE(sk);
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
+	struct sockcm_cookie sockc;
 
 	/* destination address check */
 	if (sin6) {
@@ -1247,6 +1248,7 @@ do_udp_sendmsg:
 		fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
 
 	fl6.flowi6_mark = sk->sk_mark;
+	sockc.tsflags = 0;
 
 	if (msg->msg_controllen) {
 		opt = &opt_space;
@@ -1254,7 +1256,8 @@ do_udp_sendmsg:
 		opt->tot_len = sizeof(*opt);
 
 		err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
-					    &hlimit, &tclass, &dontfrag);
+					    &hlimit, &tclass, &dontfrag,
+					    &sockc);
 		if (err < 0) {
 			fl6_sock_release(flowlabel);
 			return err;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next 5/8] ipv4: process socket-level control messages in IPv4
From: Soheil Hassas Yeganeh @ 2016-03-30 22:37 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/ip.h       | 3 ++-
 net/ipv4/ip_sockglue.c | 9 ++++++++-
 net/ipv4/ping.c        | 2 +-
 net/ipv4/raw.c         | 2 +-
 net/ipv4/udp.c         | 3 +--
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index fad74d3..93725e5 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -56,6 +56,7 @@ static inline unsigned int ip_hdrlen(const struct sk_buff *skb)
 }
 
 struct ipcm_cookie {
+	struct sockcm_cookie	sockc;
 	__be32			addr;
 	int			oif;
 	struct ip_options_rcu	*opt;
@@ -550,7 +551,7 @@ int ip_options_rcv_srr(struct sk_buff *skb);
 
 void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb);
 void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb, int offset);
-int ip_cmsg_send(struct net *net, struct msghdr *msg,
+int ip_cmsg_send(struct sock *sk, struct msghdr *msg,
 		 struct ipcm_cookie *ipc, bool allow_ipv6);
 int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
 		  unsigned int optlen);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 035ad64..1b7c077 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -219,11 +219,12 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(ip_cmsg_recv_offset);
 
-int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc,
+int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 		 bool allow_ipv6)
 {
 	int err, val;
 	struct cmsghdr *cmsg;
+	struct net *net = sock_net(sk);
 
 	for_each_cmsghdr(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
@@ -244,6 +245,12 @@ int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc,
 			continue;
 		}
 #endif
+		if (cmsg->cmsg_level == SOL_SOCKET) {
+			if (__sock_cmsg_send(sk, msg, cmsg, &ipc->sockc))
+				return -EINVAL;
+			continue;
+		}
+
 		if (cmsg->cmsg_level != SOL_IP)
 			continue;
 		switch (cmsg->cmsg_type) {
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index cf9700b..670639b 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -747,7 +747,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	sock_tx_timestamp(sk, &ipc.tx_flags);
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(sock_net(sk), msg, &ipc, false);
+		err = ip_cmsg_send(sk, msg, &ipc, false);
 		if (unlikely(err)) {
 			kfree(ipc.opt);
 			return err;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8d22de7..088ce66 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -548,7 +548,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipc.oif = sk->sk_bound_dev_if;
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(net, msg, &ipc, false);
+		err = ip_cmsg_send(sk, msg, &ipc, false);
 		if (unlikely(err)) {
 			kfree(ipc.opt);
 			goto out;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 08eed5e..bccb4e1 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1034,8 +1034,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	sock_tx_timestamp(sk, &ipc.tx_flags);
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(sock_net(sk), msg, &ipc,
-				   sk->sk_family == AF_INET6);
+		err = ip_cmsg_send(sk, msg, &ipc, sk->sk_family == AF_INET6);
 		if (unlikely(err)) {
 			kfree(ipc.opt);
 			return err;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net-next 7/8] sock: enable timestamping using control messages
From: Soheil Hassas Yeganeh @ 2016-03-30 22:37 UTC (permalink / raw)
  To: davem, netdev
  Cc: willemb, edumazet, ycheng, ncardwell, kafai,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-1-git-send-email-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 drivers/net/tun.c      |  3 ++-
 include/net/ipv6.h     |  6 ++++--
 include/net/sock.h     | 10 ++++++----
 net/can/raw.c          |  2 +-
 net/ipv4/ping.c        |  5 +++--
 net/ipv4/raw.c         | 11 ++++++-----
 net/ipv4/tcp.c         | 20 +++++++++++++++-----
 net/ipv4/udp.c         |  7 ++++---
 net/ipv6/icmp.c        |  6 ++++--
 net/ipv6/ip6_output.c  | 15 +++++++++------
 net/ipv6/ping.c        |  3 ++-
 net/ipv6/raw.c         |  5 ++---
 net/ipv6/udp.c         |  7 ++++---
 net/packet/af_packet.c | 30 +++++++++++++++++++++++++-----
 net/socket.c           | 10 +++++-----
 15 files changed, 92 insertions(+), 48 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950..6d2fcd0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -860,7 +860,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 
 	if (skb->sk && sk_fullsock(skb->sk)) {
-		sock_tx_timestamp(skb->sk, &skb_shinfo(skb)->tx_flags);
+		sock_tx_timestamp(skb->sk, skb->sk->sk_tsflags,
+				  &skb_shinfo(skb)->tx_flags);
 		sw_tx_timestamp(skb);
 	}
 
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index d0aeb97..55ee1eb 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -867,7 +867,8 @@ int ip6_append_data(struct sock *sk,
 				int odd, struct sk_buff *skb),
 		    void *from, int length, int transhdrlen, int hlimit,
 		    int tclass, struct ipv6_txoptions *opt, struct flowi6 *fl6,
-		    struct rt6_info *rt, unsigned int flags, int dontfrag);
+		    struct rt6_info *rt, unsigned int flags, int dontfrag,
+		    const struct sockcm_cookie *sockc);
 
 int ip6_push_pending_frames(struct sock *sk);
 
@@ -884,7 +885,8 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 			     void *from, int length, int transhdrlen,
 			     int hlimit, int tclass, struct ipv6_txoptions *opt,
 			     struct flowi6 *fl6, struct rt6_info *rt,
-			     unsigned int flags, int dontfrag);
+			     unsigned int flags, int dontfrag,
+			     const struct sockcm_cookie *sockc);
 
 static inline struct sk_buff *ip6_finish_skb(struct sock *sk)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index af012da..e91b87f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2057,19 +2057,21 @@ static inline void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
 		sk->sk_stamp = skb->tstamp;
 }
 
-void __sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags);
+void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags);
 
 /**
  * sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
  * @sk:		socket sending this packet
+ * @tsflags:	timestamping flags to use
  * @tx_flags:	completed with instructions for time stamping
  *
  * Note : callers should take care of initial *tx_flags value (usually 0)
  */
-static inline void sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags)
+static inline void sock_tx_timestamp(const struct sock *sk, __u16 tsflags,
+				     __u8 *tx_flags)
 {
-	if (unlikely(sk->sk_tsflags))
-		__sock_tx_timestamp(sk, tx_flags);
+	if (unlikely(tsflags))
+		__sock_tx_timestamp(tsflags, tx_flags);
 	if (unlikely(sock_flag(sk, SOCK_WIFI_STATUS)))
 		*tx_flags |= SKBTX_WIFI_STATUS;
 }
diff --git a/net/can/raw.c b/net/can/raw.c
index 2e67b14..972c187 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -755,7 +755,7 @@ static int raw_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
 	if (err < 0)
 		goto free_skb;
 
-	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(sk, sk->sk_tsflags, &skb_shinfo(skb)->tx_flags);
 
 	skb->dev = dev;
 	skb->sk  = sk;
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 670639b..66ddcb6 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -737,6 +737,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		/* no remote port */
 	}
 
+	ipc.sockc.tsflags = sk->sk_tsflags;
 	ipc.addr = inet->inet_saddr;
 	ipc.opt = NULL;
 	ipc.oif = sk->sk_bound_dev_if;
@@ -744,8 +745,6 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipc.ttl = 0;
 	ipc.tos = -1;
 
-	sock_tx_timestamp(sk, &ipc.tx_flags);
-
 	if (msg->msg_controllen) {
 		err = ip_cmsg_send(sk, msg, &ipc, false);
 		if (unlikely(err)) {
@@ -768,6 +767,8 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		rcu_read_unlock();
 	}
 
+	sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
+
 	saddr = ipc.addr;
 	ipc.addr = faddr = daddr;
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 088ce66..438f50c 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -339,8 +339,8 @@ int raw_rcv(struct sock *sk, struct sk_buff *skb)
 
 static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 			   struct msghdr *msg, size_t length,
-			   struct rtable **rtp,
-			   unsigned int flags)
+			   struct rtable **rtp, unsigned int flags,
+			   const struct sockcm_cookie *sockc)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct net *net = sock_net(sk);
@@ -379,7 +379,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb->ip_summed = CHECKSUM_NONE;
 
-	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
 
 	skb->transport_header = skb->network_header;
 	err = -EFAULT;
@@ -540,6 +540,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		daddr = inet->inet_daddr;
 	}
 
+	ipc.sockc.tsflags = sk->sk_tsflags;
 	ipc.addr = inet->inet_saddr;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
@@ -638,10 +639,10 @@ back_from_confirm:
 
 	if (inet->hdrincl)
 		err = raw_send_hdrinc(sk, &fl4, msg, len,
-				      &rt, msg->msg_flags);
+				      &rt, msg->msg_flags, &ipc.sockc);
 
 	 else {
-		sock_tx_timestamp(sk, &ipc.tx_flags);
+		sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
 
 		if (!ipc.addr)
 			ipc.addr = fl4.daddr;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ce3c9eb..4d73858 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -428,13 +428,13 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
-static void tcp_tx_timestamp(struct sock *sk, struct sk_buff *skb)
+static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb)
 {
-	if (sk->sk_tsflags) {
+	if (sk->sk_tsflags || tsflags) {
 		struct skb_shared_info *shinfo = skb_shinfo(skb);
 		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
-		sock_tx_timestamp(sk, &shinfo->tx_flags);
+		sock_tx_timestamp(sk, tsflags, &shinfo->tx_flags);
 		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
 		tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
@@ -959,7 +959,7 @@ new_segment:
 		offset += copy;
 		size -= copy;
 		if (!size) {
-			tcp_tx_timestamp(sk, skb);
+			tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
 			goto out;
 		}
 
@@ -1079,6 +1079,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
+	struct sockcm_cookie sockc;
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	bool sg;
@@ -1121,6 +1122,15 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 		/* 'common' sending to sendq */
 	}
 
+	sockc.tsflags = sk->sk_tsflags;
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(sk, msg, &sockc);
+		if (unlikely(err)) {
+			err = -EINVAL;
+			goto out_err;
+		}
+	}
+
 	/* This should be in poll */
 	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
 
@@ -1239,7 +1249,7 @@ new_segment:
 
 		copied += copy;
 		if (!msg_data_left(msg)) {
-			tcp_tx_timestamp(sk, skb);
+			tcp_tx_timestamp(sk, sockc.tsflags, skb);
 			goto out;
 		}
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index bccb4e1..45ff590 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1027,12 +1027,11 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		 */
 		connected = 1;
 	}
-	ipc.addr = inet->inet_saddr;
 
+	ipc.sockc.tsflags = sk->sk_tsflags;
+	ipc.addr = inet->inet_saddr;
 	ipc.oif = sk->sk_bound_dev_if;
 
-	sock_tx_timestamp(sk, &ipc.tx_flags);
-
 	if (msg->msg_controllen) {
 		err = ip_cmsg_send(sk, msg, &ipc, sk->sk_family == AF_INET6);
 		if (unlikely(err)) {
@@ -1059,6 +1058,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	saddr = ipc.addr;
 	ipc.addr = faddr = daddr;
 
+	sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
+
 	if (ipc.opt && ipc.opt->opt.srr) {
 		if (!daddr)
 			return -EINVAL;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 0a37ddc..6b573eb 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -400,6 +400,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info)
 	struct icmp6hdr tmp_hdr;
 	struct flowi6 fl6;
 	struct icmpv6_msg msg;
+	struct sockcm_cookie sockc_unused = {0};
 	int iif = 0;
 	int addr_type = 0;
 	int len;
@@ -527,7 +528,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info)
 			      len + sizeof(struct icmp6hdr),
 			      sizeof(struct icmp6hdr), hlimit,
 			      np->tclass, NULL, &fl6, (struct rt6_info *)dst,
-			      MSG_DONTWAIT, np->dontfrag);
+			      MSG_DONTWAIT, np->dontfrag, &sockc_unused);
 	if (err) {
 		ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTERRORS);
 		ip6_flush_pending_frames(sk);
@@ -566,6 +567,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 	int hlimit;
 	u8 tclass;
 	u32 mark = IP6_REPLY_MARK(net, skb->mark);
+	struct sockcm_cookie sockc_unused = {0};
 
 	saddr = &ipv6_hdr(skb)->daddr;
 
@@ -617,7 +619,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 	err = ip6_append_data(sk, icmpv6_getfrag, &msg, skb->len + sizeof(struct icmp6hdr),
 				sizeof(struct icmp6hdr), hlimit, tclass, NULL, &fl6,
 				(struct rt6_info *)dst, MSG_DONTWAIT,
-				np->dontfrag);
+				np->dontfrag, &sockc_unused);
 
 	if (err) {
 		ICMP6_INC_STATS_BH(net, idev, ICMP6_MIB_OUTERRORS);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 9428345..612f3d1 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1258,7 +1258,8 @@ static int __ip6_append_data(struct sock *sk,
 			     int getfrag(void *from, char *to, int offset,
 					 int len, int odd, struct sk_buff *skb),
 			     void *from, int length, int transhdrlen,
-			     unsigned int flags, int dontfrag)
+			     unsigned int flags, int dontfrag,
+			     const struct sockcm_cookie *sockc)
 {
 	struct sk_buff *skb, *skb_prev = NULL;
 	unsigned int maxfraglen, fragheaderlen, mtu, orig_mtu;
@@ -1329,7 +1330,7 @@ emsgsize:
 		csummode = CHECKSUM_PARTIAL;
 
 	if (sk->sk_type == SOCK_DGRAM || sk->sk_type == SOCK_RAW) {
-		sock_tx_timestamp(sk, &tx_flags);
+		sock_tx_timestamp(sk, sockc->tsflags, &tx_flags);
 		if (tx_flags & SKBTX_ANY_SW_TSTAMP &&
 		    sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)
 			tskey = sk->sk_tskey++;
@@ -1565,7 +1566,8 @@ int ip6_append_data(struct sock *sk,
 				int odd, struct sk_buff *skb),
 		    void *from, int length, int transhdrlen, int hlimit,
 		    int tclass, struct ipv6_txoptions *opt, struct flowi6 *fl6,
-		    struct rt6_info *rt, unsigned int flags, int dontfrag)
+		    struct rt6_info *rt, unsigned int flags, int dontfrag,
+		    const struct sockcm_cookie *sockc)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct ipv6_pinfo *np = inet6_sk(sk);
@@ -1593,7 +1595,8 @@ int ip6_append_data(struct sock *sk,
 
 	return __ip6_append_data(sk, fl6, &sk->sk_write_queue, &inet->cork.base,
 				 &np->cork, sk_page_frag(sk), getfrag,
-				 from, length, transhdrlen, flags, dontfrag);
+				 from, length, transhdrlen, flags, dontfrag,
+				 sockc);
 }
 EXPORT_SYMBOL_GPL(ip6_append_data);
 
@@ -1752,7 +1755,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 			     int hlimit, int tclass,
 			     struct ipv6_txoptions *opt, struct flowi6 *fl6,
 			     struct rt6_info *rt, unsigned int flags,
-			     int dontfrag)
+			     int dontfrag, const struct sockcm_cookie *sockc)
 {
 	struct inet_cork_full cork;
 	struct inet6_cork v6_cork;
@@ -1779,7 +1782,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 	err = __ip6_append_data(sk, fl6, &queue, &cork.base, &v6_cork,
 				&current->task_frag, getfrag, from,
 				length + exthdrlen, transhdrlen + exthdrlen,
-				flags, dontfrag);
+				flags, dontfrag, sockc);
 	if (err) {
 		__ip6_flush_pending_frames(sk, &queue, &cork, &v6_cork);
 		return ERR_PTR(err);
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index c382db7..da1cff7 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -62,6 +62,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct dst_entry *dst;
 	struct rt6_info *rt;
 	struct pingfakehdr pfh;
+	struct sockcm_cookie junk = {0};
 
 	pr_debug("ping_v6_sendmsg(sk=%p,sk->num=%u)\n", inet, inet->inet_num);
 
@@ -144,7 +145,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	err = ip6_append_data(sk, ping_getfrag, &pfh, len,
 			      0, hlimit,
 			      np->tclass, NULL, &fl6, rt,
-			      MSG_DONTWAIT, np->dontfrag);
+			      MSG_DONTWAIT, np->dontfrag, &junk);
 
 	if (err) {
 		ICMP6_INC_STATS(sock_net(sk), rt->rt6i_idev,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index f175ec0..b07ce21 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -822,8 +822,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (fl6.flowi6_oif == 0)
 		fl6.flowi6_oif = sk->sk_bound_dev_if;
 
-	sockc.tsflags = 0;
-
+	sockc.tsflags = sk->sk_tsflags;
 	if (msg->msg_controllen) {
 		opt = &opt_space;
 		memset(opt, 0, sizeof(struct ipv6_txoptions));
@@ -901,7 +900,7 @@ back_from_confirm:
 		lock_sock(sk);
 		err = ip6_append_data(sk, raw6_getfrag, &rfv,
 			len, 0, hlimit, tclass, opt, &fl6, (struct rt6_info *)dst,
-			msg->msg_flags, dontfrag);
+			msg->msg_flags, dontfrag, &sockc);
 
 		if (err)
 			ip6_flush_pending_frames(sk);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8371a51..fef21c4 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1248,7 +1248,7 @@ do_udp_sendmsg:
 		fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
 
 	fl6.flowi6_mark = sk->sk_mark;
-	sockc.tsflags = 0;
+	sockc.tsflags = sk->sk_tsflags;
 
 	if (msg->msg_controllen) {
 		opt = &opt_space;
@@ -1324,7 +1324,7 @@ back_from_confirm:
 		skb = ip6_make_skb(sk, getfrag, msg, ulen,
 				   sizeof(struct udphdr), hlimit, tclass, opt,
 				   &fl6, (struct rt6_info *)dst,
-				   msg->msg_flags, dontfrag);
+				   msg->msg_flags, dontfrag, &sockc);
 		err = PTR_ERR(skb);
 		if (!IS_ERR_OR_NULL(skb))
 			err = udp_v6_send_skb(skb, &fl6);
@@ -1351,7 +1351,8 @@ do_append_data:
 	err = ip6_append_data(sk, getfrag, msg, ulen,
 		sizeof(struct udphdr), hlimit, tclass, opt, &fl6,
 		(struct rt6_info *)dst,
-		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags, dontfrag);
+		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags, dontfrag,
+		&sockc);
 	if (err)
 		udp_v6_flush_pending_frames(sk);
 	else if (!corkreq)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1ecfa71..0007e23 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1837,6 +1837,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
 	DECLARE_SOCKADDR(struct sockaddr_pkt *, saddr, msg->msg_name);
 	struct sk_buff *skb = NULL;
 	struct net_device *dev;
+	struct sockcm_cookie sockc;
 	__be16 proto = 0;
 	int err;
 	int extra_len = 0;
@@ -1925,12 +1926,21 @@ retry:
 		goto out_unlock;
 	}
 
+	sockc.tsflags = 0;
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(sk, msg, &sockc);
+		if (unlikely(err)) {
+			err = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
 	skb->protocol = proto;
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
 
-	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
 	if (unlikely(extra_len == 4))
 		skb->no_fcs = 1;
@@ -2486,7 +2496,8 @@ static int packet_snd_vnet_gso(struct sk_buff *skb,
 
 static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 		void *frame, struct net_device *dev, void *data, int tp_len,
-		__be16 proto, unsigned char *addr, int hlen, int copylen)
+		__be16 proto, unsigned char *addr, int hlen, int copylen,
+		const struct sockcm_cookie *sockc)
 {
 	union tpacket_uhdr ph;
 	int to_write, offset, len, nr_frags, len_max;
@@ -2500,7 +2511,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	skb->dev = dev;
 	skb->priority = po->sk.sk_priority;
 	skb->mark = po->sk.sk_mark;
-	sock_tx_timestamp(&po->sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
 	skb_shinfo(skb)->destructor_arg = ph.raw;
 
 	skb_reserve(skb, hlen);
@@ -2624,6 +2635,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	struct sk_buff *skb;
 	struct net_device *dev;
 	struct virtio_net_hdr *vnet_hdr = NULL;
+	struct sockcm_cookie sockc;
 	__be16 proto;
 	int err, reserve = 0;
 	void *ph;
@@ -2655,6 +2667,13 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		dev = dev_get_by_index(sock_net(&po->sk), saddr->sll_ifindex);
 	}
 
+	sockc.tsflags = 0;
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(&po->sk, msg, &sockc);
+		if (unlikely(err))
+			goto out;
+	}
+
 	err = -ENXIO;
 	if (unlikely(dev == NULL))
 		goto out;
@@ -2712,7 +2731,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			goto out_status;
 		}
 		tp_len = tpacket_fill_skb(po, skb, ph, dev, data, tp_len, proto,
-					  addr, hlen, copylen);
+					  addr, hlen, copylen, &sockc);
 		if (likely(tp_len >= 0) &&
 		    tp_len > dev->mtu + reserve &&
 		    !po->has_vnet_hdr &&
@@ -2851,6 +2870,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	if (unlikely(!(dev->flags & IFF_UP)))
 		goto out_unlock;
 
+	sockc.tsflags = 0;
 	sockc.mark = sk->sk_mark;
 	if (msg->msg_controllen) {
 		err = sock_cmsg_send(sk, msg, &sockc);
@@ -2908,7 +2928,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 		goto out_free;
 	}
 
-	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
+	sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
 	if (!vnet_hdr.gso_type && (len > dev->mtu + reserve + extra_len) &&
 	    !packet_extra_vlan_len_allowed(dev, skb)) {
diff --git a/net/socket.c b/net/socket.c
index 5f77a8e..979d314 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -587,20 +587,20 @@ void sock_release(struct socket *sock)
 }
 EXPORT_SYMBOL(sock_release);
 
-void __sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags)
+void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags)
 {
 	u8 flags = *tx_flags;
 
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_TX_HARDWARE)
+	if (tsflags & SOF_TIMESTAMPING_TX_HARDWARE)
 		flags |= SKBTX_HW_TSTAMP;
 
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_TX_SOFTWARE)
+	if (tsflags & SOF_TIMESTAMPING_TX_SOFTWARE)
 		flags |= SKBTX_SW_TSTAMP;
 
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_TX_SCHED)
+	if (tsflags & SOF_TIMESTAMPING_TX_SCHED)
 		flags |= SKBTX_SCHED_TSTAMP;
 
-	if (sk->sk_tsflags & SOF_TIMESTAMPING_TX_ACK)
+	if (tsflags & SOF_TIMESTAMPING_TX_ACK)
 		flags |= SKBTX_ACK_TSTAMP;
 
 	*tx_flags = flags;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [net 11/11] ixgbe: Fix cls_u32 offload support for L4 ports
From: Jeff Kirsher @ 2016-03-30 23:00 UTC (permalink / raw)
  To: davem; +Cc: Sridhar Samudrala, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <1459378855-139837-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sridhar Samudrala <sridhar.samudrala@intel.com>

Fix support for 16 bit source/dest port matches in ixgbe model.
u32 uses a single 32-bit key value for both source and destination ports
starting at offset 0. So replace the 2 functions with a single function
that takes this key value/mask to program both source and dest ports.

Verified with the following filter:

 #tc qdisc add dev p4p1 ingress
 #tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
	handle 1: u32 divisor 1
 #tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
	handle 800:0:10 u32 ht 800: link 1: \
	offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff
 #tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
	handle 1:0:10 u32 ht 1: \
	match tcp src 1024 ffff match tcp dst 80 ffff action drop
 #tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
	handle 1:0:11 u32 ht 1: \
	match tcp src 1025 ffff action drop
 #tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
	handle 1:0:12 u32 ht 1: \
	match tcp dst 81 ffff action drop

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_model.h | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h
index 61f7290..74c53ad 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h
@@ -64,28 +64,20 @@ static struct ixgbe_mat_field ixgbe_ipv4_fields[] = {
 	{ .val = NULL } /* terminal node */
 };
 
-static inline int ixgbe_mat_prgm_sport(struct ixgbe_fdir_filter *input,
+static inline int ixgbe_mat_prgm_ports(struct ixgbe_fdir_filter *input,
 				       union ixgbe_atr_input *mask,
 				       u32 val, u32 m)
 {
 	input->filter.formatted.src_port = val & 0xffff;
 	mask->formatted.src_port = m & 0xffff;
-	return 0;
-};
+	input->filter.formatted.dst_port = val >> 16;
+	mask->formatted.dst_port = m >> 16;
 
-static inline int ixgbe_mat_prgm_dport(struct ixgbe_fdir_filter *input,
-				       union ixgbe_atr_input *mask,
-				       u32 val, u32 m)
-{
-	input->filter.formatted.dst_port = val & 0xffff;
-	mask->formatted.dst_port = m & 0xffff;
 	return 0;
 };
 
 static struct ixgbe_mat_field ixgbe_tcp_fields[] = {
-	{.off = 0, .val = ixgbe_mat_prgm_sport,
-	 .type = IXGBE_ATR_FLOW_TYPE_TCPV4},
-	{.off = 2, .val = ixgbe_mat_prgm_dport,
+	{.off = 0, .val = ixgbe_mat_prgm_ports,
 	 .type = IXGBE_ATR_FLOW_TYPE_TCPV4},
 	{ .val = NULL } /* terminal node */
 };
-- 
2.5.5

^ permalink raw reply related

* [net 10/11] ixgbe: Fix cls_u32 offload support for fields with masks
From: Jeff Kirsher @ 2016-03-30 23:00 UTC (permalink / raw)
  To: davem; +Cc: Sridhar Samudrala, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <1459378855-139837-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Sridhar Samudrala <sridhar.samudrala@intel.com>

Remove the incorrect check for mask in ixgbe_configure_clsu32 and
drop the 'mask' field that is not required in struct ixgbe_mat_field

Verified with the following filters:

 #tc qdisc add dev p4p1 ingress
 #tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
	handle 800:0:1 u32 ht 800: \
	match ip dst 10.0.0.1/8 match ip src 10.0.0.2/8 action drop
 #tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
	handle 800:0:2 u32 ht 800: \
	match ip dst 11.0.0.1/16 match ip src 11.0.0.2/16 action drop
 #tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
	handle 800:0:3 u32 ht 800: \
	match ip dst 12.0.0.1/24 match ip src 12.0.0.2/24 action drop
 #tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
	handle 800:0:4 u32 ht 800: \
	match ip dst 13.0.0.1/32 match ip src 13.0.0.2/32 action drop

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 3 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_model.h | 9 ++++-----
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ca9c543..7df3fe2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8331,8 +8331,7 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter *adapter,
 		int j;
 
 		for (j = 0; field_ptr[j].val; j++) {
-			if (field_ptr[j].off == off &&
-			    field_ptr[j].mask == m) {
+			if (field_ptr[j].off == off) {
 				field_ptr[j].val(input, &mask, val, m);
 				input->filter.formatted.flow_type |=
 					field_ptr[j].type;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h
index ce48872..61f7290 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_model.h
@@ -32,7 +32,6 @@
 
 struct ixgbe_mat_field {
 	unsigned int off;
-	unsigned int mask;
 	int (*val)(struct ixgbe_fdir_filter *input,
 		   union ixgbe_atr_input *mask,
 		   u32 val, u32 m);
@@ -58,9 +57,9 @@ static inline int ixgbe_mat_prgm_dip(struct ixgbe_fdir_filter *input,
 }
 
 static struct ixgbe_mat_field ixgbe_ipv4_fields[] = {
-	{ .off = 12, .mask = -1, .val = ixgbe_mat_prgm_sip,
+	{ .off = 12, .val = ixgbe_mat_prgm_sip,
 	  .type = IXGBE_ATR_FLOW_TYPE_IPV4},
-	{ .off = 16, .mask = -1, .val = ixgbe_mat_prgm_dip,
+	{ .off = 16, .val = ixgbe_mat_prgm_dip,
 	  .type = IXGBE_ATR_FLOW_TYPE_IPV4},
 	{ .val = NULL } /* terminal node */
 };
@@ -84,9 +83,9 @@ static inline int ixgbe_mat_prgm_dport(struct ixgbe_fdir_filter *input,
 };
 
 static struct ixgbe_mat_field ixgbe_tcp_fields[] = {
-	{.off = 0, .mask = 0xffff, .val = ixgbe_mat_prgm_sport,
+	{.off = 0, .val = ixgbe_mat_prgm_sport,
 	 .type = IXGBE_ATR_FLOW_TYPE_TCPV4},
-	{.off = 2, .mask = 0xffff, .val = ixgbe_mat_prgm_dport,
+	{.off = 2, .val = ixgbe_mat_prgm_dport,
 	 .type = IXGBE_ATR_FLOW_TYPE_TCPV4},
 	{ .val = NULL } /* terminal node */
 };
-- 
2.5.5

^ permalink raw reply related

* [net 08/11] ixgbe: make __ixgbe_setup_tc static
From: Jeff Kirsher @ 2016-03-30 23:00 UTC (permalink / raw)
  To: davem; +Cc: Emil Tantilov, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1459378855-139837-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Emil Tantilov <emil.s.tantilov@intel.com>

This function is only used in ixgbe_main.c
Resolves a "missing prototype" warning when building the driver with W=1

Reported-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 267a507..f5736a3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8376,8 +8376,8 @@ err_out:
 	return -EINVAL;
 }
 
-int __ixgbe_setup_tc(struct net_device *dev, u32 handle, __be16 proto,
-		     struct tc_to_netdev *tc)
+static int __ixgbe_setup_tc(struct net_device *dev, u32 handle, __be16 proto,
+			    struct tc_to_netdev *tc)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
 
-- 
2.5.5

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox