Netdev List
 help / color / mirror / Atom feed
* [PATCHv1 net-next] xen-netback: remove unconditional pull_skb_tail in guest Tx path
From: David Vrabel @ 2014-11-03 17:23 UTC (permalink / raw)
  To: netdev; +Cc: David Vrabel, xen-devel, Ian Campbell, Wei Liu, Malcolm Crossley

From: Malcolm Crossley <malcolm.crossley@citrix.com>

Unconditionally pulling 128 bytes into the linear buffer is not
required. Netback has already grant copied up-to 128 bytes from the
first slot of a packet into the linear buffer. The first slot normally
contain all the IPv4/IPv6 and TCP/UDP headers.

The unconditional pull would often copy frag data unnecessarily.  This
is a performance problem when running on a version of Xen where grant
unmap avoids TLB flushes for pages which are not accessed.  TLB
flushes can now be avoided for > 99% of unmaps (it was 0% before).

Grant unmap TLB flush avoidance will be available in a future version
of Xen (probably 4.6).

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 drivers/net/xen-netback/netback.c |   26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 730252c..14e18bb 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -82,6 +82,16 @@ MODULE_PARM_DESC(max_queues,
 static unsigned int fatal_skb_slots = FATAL_SKB_SLOTS_DEFAULT;
 module_param(fatal_skb_slots, uint, 0444);
 
+/* The amount to copy out of the first guest Tx slot into the skb's
+ * linear area.  If the first slot has more data, it will be mapped
+ * and put into the first frag.
+ *
+ * This is sized to avoid pulling headers from the frags for most
+ * TCP/IP packets.
+ */
+#define XEN_NETBACK_TX_COPY_LEN 128
+
+
 static void xenvif_idx_release(struct xenvif_queue *queue, u16 pending_idx,
 			       u8 status);
 
@@ -125,13 +135,6 @@ static inline struct xenvif_queue *ubuf_to_queue(const struct ubuf_info *ubuf)
 			    pending_tx_info[0]);
 }
 
-/* This is a miniumum size for the linear area to avoid lots of
- * calls to __pskb_pull_tail() as we set up checksum offsets. The
- * value 128 was chosen as it covers all IPv4 and most likely
- * IPv6 headers.
- */
-#define PKT_PROT_LEN 128
-
 static u16 frag_get_pending_idx(skb_frag_t *frag)
 {
 	return (u16)frag->page_offset;
@@ -1446,9 +1449,9 @@ static void xenvif_tx_build_gops(struct xenvif_queue *queue,
 		index = pending_index(queue->pending_cons);
 		pending_idx = queue->pending_ring[index];
 
-		data_len = (txreq.size > PKT_PROT_LEN &&
+		data_len = (txreq.size > XEN_NETBACK_TX_COPY_LEN &&
 			    ret < XEN_NETBK_LEGACY_SLOTS_MAX) ?
-			PKT_PROT_LEN : txreq.size;
+			XEN_NETBACK_TX_COPY_LEN : txreq.size;
 
 		skb = xenvif_alloc_skb(data_len);
 		if (unlikely(skb == NULL)) {
@@ -1653,11 +1656,6 @@ static int xenvif_tx_submit(struct xenvif_queue *queue)
 			}
 		}
 
-		if (skb_is_nonlinear(skb) && skb_headlen(skb) < PKT_PROT_LEN) {
-			int target = min_t(int, skb->len, PKT_PROT_LEN);
-			__pskb_pull_tail(skb, target - skb_headlen(skb));
-		}
-
 		skb->dev      = queue->vif->dev;
 		skb->protocol = eth_type_trans(skb, skb->dev);
 		skb_reset_network_header(skb);
-- 
1.7.10.4

^ permalink raw reply related

* Re: [stable request <= 3.11] net/mlx4_en: Fix BlueFlame race
From: Cong Wang @ 2014-11-03 17:22 UTC (permalink / raw)
  To: David Miller
  Cc: Ben Hutchings, Vinson Lee, Amir Vadai, Or Gerlitz,
	Jack Morgenstein, Eugenia Emantayev, Matan Barak, netdev
In-Reply-To: <20141101.134142.888127846627939710.davem@davemloft.net>

On Sat, Nov 1, 2014 at 10:41 AM, David Miller <davem@davemloft.net> wrote:
>
> There is no documented way nor do I wish to state anything so strictly.
> I want maximum flexibility for such a time consuming task.
>
> I tend to go back 3 or 4 releases at most, and it really depends upon
> the difficulty of the backports and my own time constraints.

You should really offload to developers, otherwise too much work for you. :)

Thanks!

^ permalink raw reply

* Re: [PATCH v2] PPC: bpf_jit_comp: add SKF_AD_PKTTYPE instruction
From: Alexei Starovoitov @ 2014-11-03 17:21 UTC (permalink / raw)
  To: David Miller, felix
  Cc: Denis Kirjanov, netdev@vger.kernel.org, linuxppc-dev,
	Michael Ellerman, Matt Evans
In-Reply-To: <20141103.120633.183870310578884991.davem@davemloft.net>

On Mon, Nov 3, 2014 at 9:06 AM, David Miller <davem@davemloft.net> wrote:
> From: Denis Kirjanov <kda@linux-powerpc.org>
> Date: Thu, 30 Oct 2014 09:12:15 +0300
>
>> Add BPF extension SKF_AD_PKTTYPE to ppc JIT to load
>> skb->pkt_type field.
>>
>> Before:
>> [   88.262622] test_bpf: #11 LD_IND_NET 86 97 99 PASS
>> [   88.265740] test_bpf: #12 LD_PKTTYPE 109 107 PASS
>>
>> After:
>> [   80.605964] test_bpf: #11 LD_IND_NET 44 40 39 PASS
>> [   80.607370] test_bpf: #12 LD_PKTTYPE 9 9 PASS
>>
>> CC: Alexei Starovoitov<alexei.starovoitov@gmail.com>
>> CC: Michael Ellerman<mpe@ellerman.id.au>
>> Cc: Matt Evans <matt@ozlabs.org>
>> Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
>>
>> v2: Added test rusults
>
> So, can I apply this now?

I think this question is more towards ppc folks,
since both Daniel and myself said before that it looks ok.
Philippe just tested the previous version of this patch on ppc64le...
I'm guessing that Matt (original author of bpf jit for ppc) is not replying,
because he has no objections.
Either way the addition is tiny and contained, so can go in now.

^ permalink raw reply

* Re: [PATCH bluetooth-next] netdevice: add ieee802154_ptr to net_device
From: David Miller @ 2014-11-03 17:20 UTC (permalink / raw)
  To: alex.aring-Re5JQEeQqe8AvxtiuMwx3w
  Cc: linux-wpan-u79uwXL29TY76Z2rM5mHXA, kernel-bIcnvbaLZ9MEGnE8C9+IrQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1414907094-9623-1-git-send-email-alex.aring-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

From: Alexander Aring <alex.aring-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Sun,  2 Nov 2014 06:44:54 +0100

> This patch adds an ieee802154_ptr to the net_device structure.
> Furthermore the 802.15.4 subsystem will introduce a nl802154 framework
> which is similar like the nl80211 framework and a wpan_dev structure.
> The wpan_dev structure will hold additional net_device attributes like
> address options which are 802.15.4 specific. In the upcoming nl802154
> implementation we will introduce a NL802154_FLAG_NEED_WPAN_DEV like
> NL80211_FLAG_NEED_WDEV. For this flag an ieee802154_ptr in net_device is
> needed. Additional we can access the wpan_dev attributes in upper layers
> like IEEE 802.15.4 6LoWPAN easily. Current solution is a complicated
> callback interface and getting these values over subif data structure
> in mac802154.
> 
> Signed-off-by: Alexander Aring <alex.aring-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

No objections, and you can merge this via the bluetooth tree:

Acked-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 3/7] gue: Add infrastructure for flags and options
From: David Miller @ 2014-11-03 17:18 UTC (permalink / raw)
  To: therbert; +Cc: netdev
In-Reply-To: <1414882683-25484-4-git-send-email-therbert@google.com>

From: Tom Herbert <therbert@google.com>
Date: Sat,  1 Nov 2014 15:57:59 -0700

> @@ -20,7 +20,16 @@ static size_t fou_encap_hlen(struct ip_tunnel_encap *e)
>  
>  static size_t gue_encap_hlen(struct ip_tunnel_encap *e)
>  {
> -	return sizeof(struct udphdr) + sizeof(struct guehdr);
> +	size_t len;
> +	bool need_priv = false;
> +
> +	len = sizeof(struct udphdr) + sizeof(struct guehdr);
> +
> +	/* Add in lengths flags */
> +
> +	len += need_priv ? GUE_LEN_PRIV : 0;

Add this need_priv logic in patch #6, not here.

^ permalink raw reply

* Re: [PATCH v2] PPC: bpf_jit_comp: add SKF_AD_PKTTYPE instruction
From: David Miller @ 2014-11-03 17:06 UTC (permalink / raw)
  To: kda; +Cc: netdev, linuxppc-dev, alexei.starovoitov, mpe, matt
In-Reply-To: <1414649535-3956-1-git-send-email-kda@linux-powerpc.org>

From: Denis Kirjanov <kda@linux-powerpc.org>
Date: Thu, 30 Oct 2014 09:12:15 +0300

> Add BPF extension SKF_AD_PKTTYPE to ppc JIT to load
> skb->pkt_type field.
> 
> Before:
> [   88.262622] test_bpf: #11 LD_IND_NET 86 97 99 PASS
> [   88.265740] test_bpf: #12 LD_PKTTYPE 109 107 PASS
> 
> After:
> [   80.605964] test_bpf: #11 LD_IND_NET 44 40 39 PASS
> [   80.607370] test_bpf: #12 LD_PKTTYPE 9 9 PASS
> 
> CC: Alexei Starovoitov<alexei.starovoitov@gmail.com>
> CC: Michael Ellerman<mpe@ellerman.id.au>
> Cc: Matt Evans <matt@ozlabs.org>
> Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
> 
> v2: Added test rusults

So, can I apply this now?

^ permalink raw reply

* Re: [PATCH 1/7] can: m_can: fix possible sleep in napi poll
From: Marc Kleine-Budde @ 2014-11-03 17:02 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <1414579527-31100-1-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 529 bytes --]

Hey Dong Aisheng,

the state of the patches are:

1: can/master + stable
2: can/mster
3: can/mster
4: can/master + stable
5: waiting for Oliver's opinion due to the user space change
6: waiting for your input
7: waiting for your input

Marc
-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 4/7] can: m_can: add a bit delay after setting CCCR_INIT bit
From: Marc Kleine-Budde @ 2014-11-03 16:52 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <1414579527-31100-4-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 825 bytes --]

On 10/29/2014 11:45 AM, Dong Aisheng wrote:
> The spec mentions there may be a delay until the value written to
> INIT can be read back due to the synchronization mechanism between the
> two clock domains. But it does not indicate the exact clock cycles needed.
> The 5us delay is a test value and seems ok.
> 
> Without the delay, CCCR.CCE bit may fail to be set and then the
> initialization fail sometimes when do repeatly up and down.
> 
> Signed-off-by: Dong Aisheng <b29396@freescale.com>

Applied to can/master.

Thanks,
Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 3/7] can: m_can: add .ndo_change_mtu function
From: Marc Kleine-Budde @ 2014-11-03 16:49 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <1414579527-31100-3-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 462 bytes --]

On 10/29/2014 11:45 AM, Dong Aisheng wrote:
> Use common can_change_mtu function.
> 
> Signed-off-by: Dong Aisheng <b29396@freescale.com>

Applied to can/master.

Thanks,
Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 7/7] can: m_can: workaround for transmit data less than 4 bytes
From: Marc Kleine-Budde @ 2014-11-03 16:48 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <1414579527-31100-7-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 2685 bytes --]

On 10/29/2014 11:45 AM, Dong Aisheng wrote:
> We meet an IC issue that we have to write the full 8 bytes (whatever
> value for the second word) in Message RAM to avoid bit error for transmit
> data less than 4 bytes.

Is this a SoC or a m_can problem? Are all versions of the SoC/m_can
affected? Is there a m_can version register somewhere?

> Without the workaround, we can easily see the following errors:
> root@imx6qdlsolo:~# ip link set can0 up type can bitrate 1000000
> [   66.882520] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
> root@imx6qdlsolo:~# cansend can0 123#112233
> [   66.935640] m_can 20e8000.can can0: Bit Error Uncorrected
> 
> Signed-off-by: Dong Aisheng <b29396@freescale.com>
> ---
>  drivers/net/can/m_can/m_can.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
> index 219e0e3..f2d9ebe 100644
> --- a/drivers/net/can/m_can/m_can.c
> +++ b/drivers/net/can/m_can/m_can.c
> @@ -1058,10 +1058,19 @@ static netdev_tx_t m_can_start_xmit(struct sk_buff *skb,
>  	m_can_fifo_write(priv, 0, M_CAN_FIFO_ID, id);
>  	m_can_fifo_write(priv, 0, M_CAN_FIFO_DLC, can_len2dlc(cf->len) << 16);
>  
> -	for (i = 0; i < cf->len; i += 4)
> +	for (i = 0; i < cf->len; i += 4) {
>  		m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(i / 4),
>  				 *(u32 *)(cf->data + i));
>  
> +		/* FIXME: we meet an IC issue that we have to write the full 8

FIXME usually indicates that the driver needs some work here. Just
describe your hardware bug, you might add a reference to an errata if
available, though.

> +		 * bytes (whatever value for the second word) in Message RAM to
> +		 * avoid bit error for transmit data less than 4 bytes
> +		 */
> +		if (cf->len <= 4)
> +			m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(i / 4 + 1),
> +					 0x0);

This workaround doesn't handle the dlc == 0 case, your error description
isn't completely if this is a problem, too.

It should be possible to change the for loop to go always to 8, or
simply unroll the loop:

/* errata description goes here */
m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0), *(u32 *)(cf->data + 0));
m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1), *(u32 *)(cf->data + 4));

> +	}
> +
>  	can_put_echo_skb(skb, dev, 0);
>  
>  	if (priv->can.ctrlmode & CAN_CTRLMODE_FD) {
> 

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: DMA allocations from CMA and fatal_signal_pending check
From: Michal Nazarewicz @ 2014-11-03 16:45 UTC (permalink / raw)
  To: Florian Fainelli, Joonsoo Kim
  Cc: linux-arm-kernel, Brian Norris, Gregory Fong, linux-kernel,
	linux-mm, lauraa, gioh.kim, aneesh.kumar, m.szyprowski, akpm,
	netdev@vger.kernel.org
In-Reply-To: <5453F80C.4090006@gmail.com>

On Fri, Oct 31 2014, Florian Fainelli wrote:
> I agree that the CMA allocation should not be allowed to succeed, but
> the dma_alloc_coherent() allocation should succeed. If we look at the
> sysport driver, there are kmalloc() calls to initialize private
> structures, those will succeed (except under high memory pressure), so
> by the same token, a driver expects DMA allocations to succeed (unless
> we are under high memory pressure)
>
> What are we trying to solve exactly with the fatal_signal_pending()
> check here? Are we just optimizing for the case where a process has
> allocated from a CMA region to allow this region to be returned to the
> pool of free pages when it gets killed? Could there be another mechanism
> used to reclaim those pages if we know the process is getting killed
> anyway?

We're guarding against situations where process may hang around
arbitrarily long time after receiving SIGKILL.  If user does “kill -9
$pid” the usual expectation is that the $pid process will die within
seconds and anything longer is perceived by user as a bug.

What problem are *you* trying to solve?  If user sent SIGKILL to
a process that imitated device initialisation, what is the point of
continuing initialising the device?  Just recover and return -EINTR.

> Well, not really. This driver is not an isolated case, there are tons of
> other networking drivers that do exactly the same thing, and we do
> expect these dma_alloc_* calls to succeed.

Again, why do you expect them to succeed?  The code must handle failures
correctly anyway so why do you wish to ignore fatal signal?

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +--<mpn@google.com>--<xmpp:mina86@jabber.org>--ooO--(_)--Ooo--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: TCP NewReno and single retransmit
From: Marcelo Ricardo Leitner @ 2014-11-03 16:38 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Neal Cardwell, netdev, Eric Dumazet
In-Reply-To: <CAK6E8=dYfJw16Q0D40QD7RLr=wq=y+5W59zHmZ24L49OPS9O5A@mail.gmail.com>

On 31-10-2014 01:51, Yuchung Cheng wrote:
> On Thu, Oct 30, 2014 at 7:24 PM, Marcelo Ricardo Leitner
> <mleitner@redhat.com> wrote:
>> On 30-10-2014 00:03, Neal Cardwell wrote:
>>>
>>> On Mon, Oct 27, 2014 at 2:49 PM, Marcelo Ricardo Leitner
>>> <mleitner@redhat.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We have a report from a customer saying that on a very calm connection,
>>>> like
>>>> having only a single data packet within some minutes, if this packet gets
>>>> to
>>>> be re-transmitted, retrans_stamp is only cleared when the next acked
>>>> packet
>>>> is received. But this may make we abort the connection too soon if this
>>>> next
>>>> packet also gets lost, because the reference for the initial loss is
>>>> still
>>>> for a big while ago..
>>>
>>> ...
>>>>
>>>> @@ -2382,31 +2382,32 @@ static inline bool tcp_may_undo(const struct
>>>> tcp_sock *tp)
>>>>    static bool tcp_try_undo_recovery(struct sock *sk)
>>>
>>> ...
>>>>
>>>>           if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
>>>>                   /* Hold old state until something *above* high_seq
>>>>                    * is ACKed. For Reno it is MUST to prevent false
>>>>                    * fast retransmits (RFC2582). SACK TCP is safe. */
> Or we can just remove this strange state-holding logic?
>
> I couldn't find such a "MUST" statement in RFC2582. RFC2582 section 3
> step 5 suggests exiting the recovery procedure when an ACK acknowledges
> the "recover" variable (== tp->high_seq - 1).
>
> Since we've called tcp_reset_reno_sack() before tcp_try_undo_recovery(),
> I couldn't see how false fast retransmits can be triggered without
> this state-holding.
>
> Any insights?

Nice one, me neither. Neal?

 From RFC2582, Section 5, Avoiding Multiple Fast Retransmits:

    Nevertheless, unnecessary Fast Retransmits can occur with Reno or
    NewReno TCP, particularly if a Retransmit Timeout occurs during Fast
    Recovery.  (This is illustrated for Reno on page 6 of [F98], and for
    NewReno on page 8 of [F98].)  With NewReno, the data sender remains
    in Fast Recovery until either a Retransmit Timeout, or *until all of
    the data outstanding when Fast Retransmit was entered has been
    acknowledged*.  Thus with NewReno, the problem of multiple Fast
    Retransmits from a single window of data can only occur after a
    Retransmit Timeout.

Bolding mark is mine. If I didn't miss anything, as that condition was met, we 
should be good to keep that cwnd reduction (required by section 3 step 5) and 
but get back to Open state right away.

Marcelo

>>>>                   tcp_moderate_cwnd(tp);
>>>> +               tp->retrans_stamp = 0;
>>>>                   return true;
>>>>           }
>>>>           tcp_set_ca_state(sk, TCP_CA_Open);
>>>>           return false;
>>>>    }
>>>>
>>>> We would still hold state, at least part of it.. WDYT?
>>>
>>>
>>> This approach sounds OK to me as long as we include a check of
>>> tcp_any_retrans_done(), as we do in the similar code paths (for
>>> motivation, see the comment above tcp_any_retrans_done()).
>>
>>
>> Yes, okay. I thought that this would be taken care of already by then but
>> reading the code again now after your comment, I can see what you're saying.
>> Thanks.
>>
>>> So it sounds fine to me if you change that one new line to the following
>>> 2:
>>>
>>> +  if (!tcp_any_retrans_done(sk))
>>> +    tp->retrans_stamp = 0;
>>
>>
>> Will do.
>>
>>> Nice catch!
>>
>>
>> A good part of it (including the diagram) was done by customer. :)
>> I'll post the patch as soon as we sync with them (credits).
>>
>> Marcelo
>>
>

^ permalink raw reply

* [PATCH -next v4 3/3] net: allow setting ecn via routing table
From: Florian Westphal @ 2014-11-03 16:35 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1415032503-4936-1-git-send-email-fw@strlen.de>

This patch allows to set ECN on a per-route basis in case the sysctl
tcp_ecn is not set to 1. In other words, when ECN is set for specific
routes, it provides a tcp_ecn=1 behaviour for that route while the rest
of the stack acts according to the global settings.

One can use 'ip route change dev $dev $net features ecn' to toggle this.

Having a more fine-grained per-route setting can be beneficial for various
reasons, for example, 1) within data centers, or 2) local ISPs may deploy
ECN support for their own video/streaming services [1], etc.

There was a recent measurement study/paper [2] which scanned the Alexa's
publicly available top million websites list from a vantage point in US,
Europe and Asia:

Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side
only ECN") ;)); the break in connectivity on-path was found is about
1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
more common in the negotiation phase (and mostly seen in the Alexa
middle band, ranks around 50k-150k): from 12-thousand hosts on which
there _may_ be ECN-linked connection failures, only 79 failed with RST
when _not_ failing with RST when ECN is not requested.

It's unclear though, how much equipment in the wild actually marks CE
when buffers start to fill up.

We thought about a fallback to non-ECN for retransmitted SYNs as another
global option (which could perhaps one day be made default), but as Eric
points out, there's much more work needed to detect broken middleboxes.

Two examples Eric mentioned are buggy firewalls that accept only a single
SYN per flow, and middleboxes that successfully let an ECN flow establish,
but later mark CE for all packets (so cwnd converges to 1).

 [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
 [2] http://ecn.ethz.ch/

Joint work with Daniel Borkmann.

Reference: http://thread.gmane.org/gmane.linux.network/335797
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Changes since v3:
 tcp_ecn_create_request() can have 'struct dst_entry *' can be 'const', spotted by Eric.
 Changes since v2:
  - alter new cookie_ecn_ok() ok function to also evaluate
  RTAX_FEATURE_ECN if timestamp-ecn-bit is set and the ecn sysctl
  is off.
  - extra blank line in tcp_output.c to appease checkpatch.pl

 include/net/tcp.h     |  2 +-
 net/ipv4/syncookies.c |  6 +++---
 net/ipv4/tcp_input.c  | 25 +++++++++++++++----------
 net/ipv4/tcp_output.c | 13 +++++++++++--
 net/ipv6/syncookies.c |  2 +-
 5 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 36c5084..f50f29faf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -493,7 +493,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
 __u32 cookie_init_timestamp(struct request_sock *req);
 bool cookie_timestamp_decode(struct tcp_options_received *opt);
 bool cookie_ecn_ok(const struct tcp_options_received *opt,
-		   const struct net *net);
+		   const struct net *net, const struct dst_entry *dst);
 
 /* From net/ipv6/syncookies.c */
 int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 6de7725..45fe60c 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -273,7 +273,7 @@ bool cookie_timestamp_decode(struct tcp_options_received *tcp_opt)
 EXPORT_SYMBOL(cookie_timestamp_decode);
 
 bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
-		   const struct net *net)
+		   const struct net *net, const struct dst_entry *dst)
 {
 	bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN;
 
@@ -283,7 +283,7 @@ bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
 	if (net->ipv4.sysctl_tcp_ecn)
 		return true;
 
-	return false;
+	return dst_feature(dst, RTAX_FEATURE_ECN);
 }
 EXPORT_SYMBOL(cookie_ecn_ok);
 
@@ -387,7 +387,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(&rt->dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale  = rcv_wscale;
-	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst);
 
 	ret = get_cookie_sock(sk, skb, req, &rt->dst);
 	/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4e4617e..196b438 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5876,20 +5876,22 @@ static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
  */
 static void tcp_ecn_create_request(struct request_sock *req,
 				   const struct sk_buff *skb,
-				   const struct sock *listen_sk)
+				   const struct sock *listen_sk,
+				   const struct dst_entry *dst)
 {
 	const struct tcphdr *th = tcp_hdr(skb);
 	const struct net *net = sock_net(listen_sk);
 	bool th_ecn = th->ece && th->cwr;
-	bool ect, need_ecn;
+	bool ect, need_ecn, ecn_ok;
 
 	if (!th_ecn)
 		return;
 
 	ect = !INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield);
 	need_ecn = tcp_ca_needs_ecn(listen_sk);
+	ecn_ok = net->ipv4.sysctl_tcp_ecn || dst_feature(dst, RTAX_FEATURE_ECN);
 
-	if (!ect && !need_ecn && net->ipv4.sysctl_tcp_ecn)
+	if (!ect && !need_ecn && ecn_ok)
 		inet_rsk(req)->ecn_ok = 1;
 	else if (ect && need_ecn)
 		inet_rsk(req)->ecn_ok = 1;
@@ -5954,13 +5956,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	if (security_inet_conn_request(sk, skb, req))
 		goto drop_and_free;
 
-	if (!want_cookie || tmp_opt.tstamp_ok)
-		tcp_ecn_create_request(req, skb, sk);
-
-	if (want_cookie) {
-		isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
-		req->cookie_ts = tmp_opt.tstamp_ok;
-	} else if (!isn) {
+	if (!want_cookie && !isn) {
 		/* VJ's idea. We save last timestamp seen
 		 * from the destination in peer table, when entering
 		 * state TIME-WAIT, and check against it before
@@ -6008,6 +6004,15 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 			goto drop_and_free;
 	}
 
+	tcp_ecn_create_request(req, skb, sk, dst);
+
+	if (want_cookie) {
+		isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
+		req->cookie_ts = tmp_opt.tstamp_ok;
+		if (!tmp_opt.tstamp_ok)
+			inet_rsk(req)->ecn_ok = 0;
+	}
+
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_openreq_init_rwin(req, sk, dst);
 	fastopen = !want_cookie &&
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a3d453b..0b88158 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -333,10 +333,19 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
+		       tcp_ca_needs_ecn(sk);
+
+	if (!use_ecn) {
+		const struct dst_entry *dst = __sk_dst_get(sk);
+
+		if (dst && dst_feature(dst, RTAX_FEATURE_ECN))
+			use_ecn = true;
+	}
 
 	tp->ecn_flags = 0;
-	if (sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
-	    tcp_ca_needs_ecn(sk)) {
+
+	if (use_ecn) {
 		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
 		tp->ecn_flags = TCP_ECN_OK;
 		if (tcp_ca_needs_ecn(sk))
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 52cc8cb..7337fc7 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -262,7 +262,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale = rcv_wscale;
-	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), dst);
 
 	ret = get_cookie_sock(sk, skb, req, dst);
 out:
-- 
2.0.4

^ permalink raw reply related

* [PATCH -next v4 2/3] syncookies: split cookie_check_timestamp() into two functions
From: Florian Westphal @ 2014-11-03 16:35 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1415032503-4936-1-git-send-email-fw@strlen.de>

The function cookie_check_timestamp(), both called from IPv4/6 context,
is being used to decode the echoed timestamp from the SYN/ACK into TCP
options used for follow-up communication with the peer.

We can remove ECN handling from that function, split it into a separate
one, and simply rename the original function into cookie_decode_options().
cookie_decode_options() just fills in tcp_option struct based on the
echoed timestamp received from the peer. Anything that fails in this
function will actually discard the request socket.

While this is the natural place for decoding options such as ECN which
commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue
that in particular for ECN handling, it can be checked at a later point
in time as the request sock would actually not need to be dropped from
this, but just ECN support turned off.

Therefore, we split this functionality into cookie_ecn_ok(), which tells
us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.

This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
won't be enough anymore at that point; if the timestamp indicates ECN
and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.

This would mean adding a route lookup to cookie_check_timestamp(), which
we definitely want to avoid. As we already do a route lookup at a later
point in cookie_{v4,v6}_check(), we can simply make use of that as well
for the new cookie_ecn_ok() function w/o any additional cost.

Joint work with Daniel Borkmann.

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 No changes since v3, preserved Erics Acked-by tag.

 include/net/tcp.h     |  9 ++++-----
 net/ipv4/syncookies.c | 31 +++++++++++++++++++++----------
 net/ipv6/syncookies.c |  5 ++---
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3a35b15..36c5084 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -490,17 +490,16 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
 			      u16 *mssp);
 __u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
 			      __u16 *mss);
-#endif
-
 __u32 cookie_init_timestamp(struct request_sock *req);
-bool cookie_check_timestamp(struct tcp_options_received *opt, struct net *net,
-			    bool *ecn_ok);
+bool cookie_timestamp_decode(struct tcp_options_received *opt);
+bool cookie_ecn_ok(const struct tcp_options_received *opt,
+		   const struct net *net);
 
 /* From net/ipv6/syncookies.c */
 int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
 		      u32 cookie);
 struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb);
-#ifdef CONFIG_SYN_COOKIES
+
 u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
 			      const struct tcphdr *th, u16 *mssp);
 __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index c3792c0..6de7725 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -241,10 +241,10 @@ static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
  * additional tcp options in the timestamp.
  * This extracts these options from the timestamp echo.
  *
- * return false if we decode an option that should not be.
+ * return false if we decode a tcp option that is disabled
+ * on the host.
  */
-bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
-			struct net *net, bool *ecn_ok)
+bool cookie_timestamp_decode(struct tcp_options_received *tcp_opt)
 {
 	/* echoed timestamp, lowest bits contain options */
 	u32 options = tcp_opt->rcv_tsecr;
@@ -258,9 +258,6 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
 		return false;
 
 	tcp_opt->sack_ok = (options & TS_OPT_SACK) ? TCP_SACK_SEEN : 0;
-	*ecn_ok = options & TS_OPT_ECN;
-	if (*ecn_ok && !net->ipv4.sysctl_tcp_ecn)
-		return false;
 
 	if (tcp_opt->sack_ok && !sysctl_tcp_sack)
 		return false;
@@ -273,7 +270,22 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
 
 	return sysctl_tcp_window_scaling != 0;
 }
-EXPORT_SYMBOL(cookie_check_timestamp);
+EXPORT_SYMBOL(cookie_timestamp_decode);
+
+bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
+		   const struct net *net)
+{
+	bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN;
+
+	if (!ecn_ok)
+		return false;
+
+	if (net->ipv4.sysctl_tcp_ecn)
+		return true;
+
+	return false;
+}
+EXPORT_SYMBOL(cookie_ecn_ok);
 
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 {
@@ -289,7 +301,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	int mss;
 	struct rtable *rt;
 	__u8 rcv_wscale;
-	bool ecn_ok = false;
 	struct flowi4 fl4;
 
 	if (!sysctl_tcp_syncookies || !th->ack || th->rst)
@@ -310,7 +321,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(skb, &tcp_opt, 0, NULL);
 
-	if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+	if (!cookie_timestamp_decode(&tcp_opt))
 		goto out;
 
 	ret = NULL;
@@ -328,7 +339,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	ireq->ir_loc_addr	= ip_hdr(skb)->daddr;
 	ireq->ir_rmt_addr	= ip_hdr(skb)->saddr;
 	ireq->ir_mark		= inet_request_mark(sk, skb);
-	ireq->ecn_ok		= ecn_ok;
 	ireq->snd_wscale	= tcp_opt.snd_wscale;
 	ireq->sack_ok		= tcp_opt.sack_ok;
 	ireq->wscale_ok		= tcp_opt.wscale_ok;
@@ -377,6 +387,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(&rt->dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale  = rcv_wscale;
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
 
 	ret = get_cookie_sock(sk, skb, req, &rt->dst);
 	/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index be291ba..52cc8cb 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -166,7 +166,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	int mss;
 	struct dst_entry *dst;
 	__u8 rcv_wscale;
-	bool ecn_ok = false;
 
 	if (!sysctl_tcp_syncookies || !th->ack || th->rst)
 		goto out;
@@ -186,7 +185,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(skb, &tcp_opt, 0, NULL);
 
-	if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+	if (!cookie_timestamp_decode(&tcp_opt))
 		goto out;
 
 	ret = NULL;
@@ -223,7 +222,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	req->expires = 0UL;
 	req->num_retrans = 0;
-	ireq->ecn_ok		= ecn_ok;
 	ireq->snd_wscale	= tcp_opt.snd_wscale;
 	ireq->sack_ok		= tcp_opt.sack_ok;
 	ireq->wscale_ok		= tcp_opt.wscale_ok;
@@ -264,6 +262,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale = rcv_wscale;
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
 
 	ret = get_cookie_sock(sk, skb, req, dst);
 out:
-- 
2.0.4

^ permalink raw reply related

* [PATCH -next v4 1/3] syncookies: avoid magic values and document which-bit-is-what-option
From: Florian Westphal @ 2014-11-03 16:35 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1415032503-4936-1-git-send-email-fw@strlen.de>

Was a bit more difficult to read than needed due to magic shifts;
add defines and document the used encoding scheme.

Joint work with Daniel Borkmann.

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 no changes since v3, added preserved Erics Acked-by tag.

 net/ipv4/syncookies.c | 50 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 4ac7bca..c3792c0 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -19,10 +19,6 @@
 #include <net/tcp.h>
 #include <net/route.h>
 
-/* Timestamps: lowest bits store TCP options */
-#define TSBITS 6
-#define TSMASK (((__u32)1 << TSBITS) - 1)
-
 extern int sysctl_tcp_syncookies;
 
 static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
@@ -30,6 +26,30 @@ static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
 #define COOKIEBITS 24	/* Upper bits store count */
 #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
 
+/* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK
+ * stores TCP options:
+ *
+ * MSB                               LSB
+ * | 31 ...   6 |  5  |  4   | 3 2 1 0 |
+ * |  Timestamp | ECN | SACK | WScale  |
+ *
+ * When we receive a valid cookie-ACK, we look at the echoed tsval (if
+ * any) to figure out which TCP options we should use for the rebuilt
+ * connection.
+ *
+ * A WScale setting of '0xf' (which is an invalid scaling value)
+ * means that original syn did not include the TCP window scaling option.
+ */
+#define TS_OPT_WSCALE_MASK	0xf
+#define TS_OPT_SACK		BIT(4)
+#define TS_OPT_ECN		BIT(5)
+/* There is no TS_OPT_TIMESTAMP:
+ * if ACK contains timestamp option, we already know it was
+ * requested/supported by the syn/synack exchange.
+ */
+#define TSBITS	6
+#define TSMASK	(((__u32)1 << TSBITS) - 1)
+
 static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS],
 		      ipv4_cookie_scratch);
 
@@ -67,9 +87,11 @@ __u32 cookie_init_timestamp(struct request_sock *req)
 
 	ireq = inet_rsk(req);
 
-	options = ireq->wscale_ok ? ireq->snd_wscale : 0xf;
-	options |= ireq->sack_ok << 4;
-	options |= ireq->ecn_ok << 5;
+	options = ireq->wscale_ok ? ireq->snd_wscale : TS_OPT_WSCALE_MASK;
+	if (ireq->sack_ok)
+		options |= TS_OPT_SACK;
+	if (ireq->ecn_ok)
+		options |= TS_OPT_ECN;
 
 	ts = ts_now & ~TSMASK;
 	ts |= options;
@@ -219,16 +241,13 @@ static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
  * additional tcp options in the timestamp.
  * This extracts these options from the timestamp echo.
  *
- * The lowest 4 bits store snd_wscale.
- * next 2 bits indicate SACK and ECN support.
- *
  * return false if we decode an option that should not be.
  */
 bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
 			struct net *net, bool *ecn_ok)
 {
 	/* echoed timestamp, lowest bits contain options */
-	u32 options = tcp_opt->rcv_tsecr & TSMASK;
+	u32 options = tcp_opt->rcv_tsecr;
 
 	if (!tcp_opt->saw_tstamp)  {
 		tcp_clear_options(tcp_opt);
@@ -238,19 +257,20 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
 	if (!sysctl_tcp_timestamps)
 		return false;
 
-	tcp_opt->sack_ok = (options & (1 << 4)) ? TCP_SACK_SEEN : 0;
-	*ecn_ok = (options >> 5) & 1;
+	tcp_opt->sack_ok = (options & TS_OPT_SACK) ? TCP_SACK_SEEN : 0;
+	*ecn_ok = options & TS_OPT_ECN;
 	if (*ecn_ok && !net->ipv4.sysctl_tcp_ecn)
 		return false;
 
 	if (tcp_opt->sack_ok && !sysctl_tcp_sack)
 		return false;
 
-	if ((options & 0xf) == 0xf)
+	if ((options & TS_OPT_WSCALE_MASK) == TS_OPT_WSCALE_MASK)
 		return true; /* no window scaling */
 
 	tcp_opt->wscale_ok = 1;
-	tcp_opt->snd_wscale = options & 0xf;
+	tcp_opt->snd_wscale = options & TS_OPT_WSCALE_MASK;
+
 	return sysctl_tcp_window_scaling != 0;
 }
 EXPORT_SYMBOL(cookie_check_timestamp);
-- 
2.0.4

^ permalink raw reply related

* [PATCH -next v4 0/3] net: allow setting ecn via routing table
From: Florian Westphal @ 2014-11-03 16:35 UTC (permalink / raw)
  To: netdev

Here is v4 of the patchset, its exactly the same as v3 except in patch3/3
where I added the missing 'const' qualifier to a function argument that
Eric spotted during review.

I preserved Erics Acks so that he doesn't have to resend them.

v3 cover letter:

When using syn cookies, then do not simply trust that the echoed timestamp
was not modified to make sure that ecn is not turned on magically when it
is disabled on the host.

The first two patches, which were not part of earlier series, prepare
the cookie code for the ecn route metrics change by allowing is to
more easily use the existing dst object for ecn validation.

The 3rd patch adds the ecn route metric feature support.
It is almost the same as in v2, except that we'll now also test the
dst_features when decoding a syn cookie timestamp that indicates ecn support.

These three patches then allow turning on explicit congestion notification
based on the destination network.

For example, assuming the default tcp_ecn sysctl '2', the following will
enable ecn (tcp_ecn=1 behaviour, i.e. request ecn to be enabled for a
tcp connection) for all connections to hosts inside the 192.168.2/24 network:

ip route change 192.168.2.0/24 dev eth0 features ecn

Having a more fine-grained per-route setting can be beneficial for
various reasons, for example 1) within data centers, or 2) local ISPs
may deploy ECN support for their own video/streaming services [1], etc.

Joint work with Daniel Borkmann, feature suggested by Hannes Frederic Sowa.

The patch to enable this in iproute2 will be posted shortly, it is currently
also available here:
http://git.breakpoint.cc/cgit/fw/iproute2.git/commit/?h=iproute_features&id=8843d2d8973fb81c78a7efe6d42e3a17d739003e

[1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15

^ permalink raw reply

* Re: [PATCH 5/7] can: clear ctrlmode when close candev
From: Marc Kleine-Budde @ 2014-11-03 16:28 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel,
	Oliver Hartkopp
In-Reply-To: <1414579527-31100-5-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 1547 bytes --]

On 10/29/2014 11:45 AM, Dong Aisheng wrote:
> Currently priv->ctrlmode is not cleared when close_candev, so next time
> the driver will still use this value to set controller even user
> does not set any ctrl mode.
> e.g.
> Step 1. ip link set can0 up type can0 bitrate 1000000 loopback on
> Controller will be in loopback mode
> Step 2. ip link set can0 down
> Step 3. ip link set can0 up type can0 bitrate 1000000
> Controller will still be set to loopback mode in driver due to saved
> priv->ctrlmode.
> 
> This patch clears priv->ctrlmode when the CAN interface is closed,
> and set it to correct mode according to next user setting.

Oliver, what do you think of this patch? It will introduce a subtle
change to the userspace.

Marc

> Signed-off-by: Dong Aisheng <b29396@freescale.com>
> ---
>  drivers/net/can/dev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
> index 02492d2..1fce485 100644
> --- a/drivers/net/can/dev.c
> +++ b/drivers/net/can/dev.c
> @@ -671,6 +671,7 @@ void close_candev(struct net_device *dev)
>  
>  	del_timer_sync(&priv->restart_timer);
>  	can_flush_echo_skb(dev);
> +	priv->ctrlmode = 0;
>  }
>  EXPORT_SYMBOL_GPL(close_candev);
>  
> 


-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 2/7] can: m_can: fix the incorrect error messages
From: Marc Kleine-Budde @ 2014-11-03 16:25 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <1414579527-31100-2-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 450 bytes --]

On 10/29/2014 11:45 AM, Dong Aisheng wrote:
> Fix a few error messages.
> 
> Signed-off-by: Dong Aisheng <b29396@freescale.com>

Applied to can/master.

Thanks,
Marc
-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 1/7] can: m_can: fix possible sleep in napi poll
From: Marc Kleine-Budde @ 2014-11-03 16:24 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <1414579527-31100-1-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 500 bytes --]

On 10/29/2014 11:45 AM, Dong Aisheng wrote:
> The m_can_get_berr_counter function can sleep and it may be called in napi poll function.
> Rework it to fix the following warning.

Applied to can/master.

Thanks, Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* [PATCH net-next] net: add rbnode to struct sk_buff
From: Eric Dumazet @ 2014-11-03 16:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Yaogong Wang

From: Eric Dumazet <edumazet@google.com>

Yaogong replaces TCP out of order receive queue by an RB tree.

As netem already does a private skb->{next/prev/tstamp} union
with a 'struct rb_node', lets do this in a cleaner way.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
---
 include/linux/skbuff.h |   20 +++++++++++++-------
 net/sched/sch_netem.c  |   27 +++++++--------------------
 2 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6c8b6f604e7609f2d320a7aa572a8ec6a870f4b8..5ad9675b6fe17b2819476c181a4707780a2e03f2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -20,6 +20,7 @@
 #include <linux/time.h>
 #include <linux/bug.h>
 #include <linux/cache.h>
+#include <linux/rbtree.h>
 
 #include <linux/atomic.h>
 #include <asm/types.h>
@@ -440,6 +441,7 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
  *	@next: Next buffer in list
  *	@prev: Previous buffer in list
  *	@tstamp: Time we arrived/left
+ *	@rbnode: RB tree node, alternative to next/prev for netem/tcp
  *	@sk: Socket we are owned by
  *	@dev: Device we arrived on/are leaving by
  *	@cb: Control buffer. Free for use by every layer. Put private vars here
@@ -504,15 +506,19 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
  */
 
 struct sk_buff {
-	/* These two members must be first. */
-	struct sk_buff		*next;
-	struct sk_buff		*prev;
-
 	union {
-		ktime_t		tstamp;
-		struct skb_mstamp skb_mstamp;
+		struct {
+			/* These two members must be first. */
+			struct sk_buff		*next;
+			struct sk_buff		*prev;
+
+			union {
+				ktime_t		tstamp;
+				struct skb_mstamp skb_mstamp;
+			};
+		};
+		struct rb_node	rbnode; /* used in netem & tcp stack */
 	};
-
 	struct sock		*sk;
 	struct net_device	*dev;
 
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index b34331967e020b6f1151b25be8f744e494a80ad6..179f1c8c0d8bba4aa00705e6c9ca41ef909f7d50 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -139,33 +139,20 @@ struct netem_sched_data {
 
 /* Time stamp put into socket buffer control block
  * Only valid when skbs are in our internal t(ime)fifo queue.
+ *
+ * As skb->rbnode uses same storage than skb->next, skb->prev and skb->tstamp,
+ * and skb->next & skb->prev are scratch space for a qdisc,
+ * we save skb->tstamp value in skb->cb[] before destroying it.
  */
 struct netem_skb_cb {
 	psched_time_t	time_to_send;
 	ktime_t		tstamp_save;
 };
 
-/* Because space in skb->cb[] is tight, netem overloads skb->next/prev/tstamp
- * to hold a rb_node structure.
- *
- * If struct sk_buff layout is changed, the following checks will complain.
- */
-static struct rb_node *netem_rb_node(struct sk_buff *skb)
-{
-	BUILD_BUG_ON(offsetof(struct sk_buff, next) != 0);
-	BUILD_BUG_ON(offsetof(struct sk_buff, prev) !=
-		     offsetof(struct sk_buff, next) + sizeof(skb->next));
-	BUILD_BUG_ON(offsetof(struct sk_buff, tstamp) !=
-		     offsetof(struct sk_buff, prev) + sizeof(skb->prev));
-	BUILD_BUG_ON(sizeof(struct rb_node) > sizeof(skb->next) +
-					      sizeof(skb->prev) +
-					      sizeof(skb->tstamp));
-	return (struct rb_node *)&skb->next;
-}
 
 static struct sk_buff *netem_rb_to_skb(struct rb_node *rb)
 {
-	return (struct sk_buff *)rb;
+	return container_of(rb, struct sk_buff, rbnode);
 }
 
 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
@@ -403,8 +390,8 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 		else
 			p = &parent->rb_left;
 	}
-	rb_link_node(netem_rb_node(nskb), parent, p);
-	rb_insert_color(netem_rb_node(nskb), &q->t_root);
+	rb_link_node(&nskb->rbnode, parent, p);
+	rb_insert_color(&nskb->rbnode, &q->t_root);
 	sch->q.qlen++;
 }
 

^ permalink raw reply related

* Re: [PATCH -next v3 3/3] net: allow setting ecn via routing table
From: Eric Dumazet @ 2014-11-03 16:11 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Daniel Borkmann
In-Reply-To: <1415019720-17106-4-git-send-email-fw@strlen.de>

On Mon, 2014-11-03 at 14:02 +0100, Florian Westphal wrote:
> This patch allows to set ECN on a per-route basis in case the sysctl
> tcp_ecn is not set to 1. In other words, when ECN is set for specific
> routes, it provides a tcp_ecn=1 behaviour for that route while the rest
> of the stack acts according to the global settings.
> 
> One can use 'ip route change dev $dev $net features ecn' to toggle this.
> 
> Having a more fine-grained per-route setting can be beneficial for various
> reasons, for example, 1) within data centers, or 2) local ISPs may deploy
> ECN support for their own video/streaming services [1], etc.
> 
> There was a recent measurement study/paper [2] which scanned the Alexa's
> publicly available top million websites list from a vantage point in US,
> Europe and Asia:
> 
> Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
> blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side
> only ECN") ;)); the break in connectivity on-path was found is about
> 1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
> more common in the negotiation phase (and mostly seen in the Alexa
> middle band, ranks around 50k-150k): from 12-thousand hosts on which
> there _may_ be ECN-linked connection failures, only 79 failed with RST
> when _not_ failing with RST when ECN is not requested.
> 
> It's unclear though, how much equipment in the wild actually marks CE
> when buffers start to fill up.
> 
> We thought about a fallback to non-ECN for retransmitted SYNs as another
> global option (which could perhaps one day be made default), but as Eric
> points out, there's much more work needed to detect broken middleboxes.
> 
> Two examples Eric mentioned are buggy firewalls that accept only a single
> SYN per flow, and middleboxes that successfully let an ECN flow establish,
> but later mark CE for all packets (so cwnd converges to 1).
> 
>  [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
>  [2] http://ecn.ethz.ch/
> 
> Joint work with Daniel Borkmann.
> 
> Reference: http://thread.gmane.org/gmane.linux.network/335797
> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  Changes since v2:
>   - alter new cookie_ecn_ok() ok function to also evaluate
>   RTAX_FEATURE_ECN if timestamp-ecn-bit is set and the ecn sysctl
>   is off.
>   - extra blank line in tcp_output.c to appease checkpatch.pl
> 
>  include/net/tcp.h     |  2 +-
>  net/ipv4/syncookies.c |  6 +++---
>  net/ipv4/tcp_input.c  | 25 +++++++++++++++----------
>  net/ipv4/tcp_output.c | 13 +++++++++++--
>  net/ipv6/syncookies.c |  2 +-
>  5 files changed, 31 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 36c5084..f50f29faf 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -493,7 +493,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
>  __u32 cookie_init_timestamp(struct request_sock *req);
>  bool cookie_timestamp_decode(struct tcp_options_received *opt);
>  bool cookie_ecn_ok(const struct tcp_options_received *opt,
> -		   const struct net *net);
> +		   const struct net *net, const struct dst_entry *dst);
>  
>  /* From net/ipv6/syncookies.c */
>  int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 6de7725..45fe60c 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -273,7 +273,7 @@ bool cookie_timestamp_decode(struct tcp_options_received *tcp_opt)
>  EXPORT_SYMBOL(cookie_timestamp_decode);
>  
>  bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
> -		   const struct net *net)
> +		   const struct net *net, const struct dst_entry *dst)
>  {
>  	bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN;
>  
> @@ -283,7 +283,7 @@ bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
>  	if (net->ipv4.sysctl_tcp_ecn)
>  		return true;
>  
> -	return false;
> +	return dst_feature(dst, RTAX_FEATURE_ECN);
>  }
>  EXPORT_SYMBOL(cookie_ecn_ok);
>  
> @@ -387,7 +387,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
>  				  dst_metric(&rt->dst, RTAX_INITRWND));
>  
>  	ireq->rcv_wscale  = rcv_wscale;
> -	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
> +	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst);
>  
>  	ret = get_cookie_sock(sk, skb, req, &rt->dst);
>  	/* ip_queue_xmit() depends on our flow being setup
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 4e4617e..9db942a 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5876,20 +5876,22 @@ static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
>   */
>  static void tcp_ecn_create_request(struct request_sock *req,
>  				   const struct sk_buff *skb,
> -				   const struct sock *listen_sk)
> +				   const struct sock *listen_sk,
> +				   struct dst_entry *dst)

Nit : This probably should be 'const struct dst_entry *dst'

Otherwise, patch looks fine, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH -next v3 2/3] syncookies: split cookie_check_timestamp() into two functions
From: Eric Dumazet @ 2014-11-03 16:07 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Daniel Borkmann
In-Reply-To: <1415019720-17106-3-git-send-email-fw@strlen.de>

On Mon, 2014-11-03 at 14:01 +0100, Florian Westphal wrote:
> The function cookie_check_timestamp(), both called from IPv4/6 context,
> is being used to decode the echoed timestamp from the SYN/ACK into TCP
> options used for follow-up communication with the peer.
> 
> We can remove ECN handling from that function, split it into a separate
> one, and simply rename the original function into cookie_decode_options().
> cookie_decode_options() just fills in tcp_option struct based on the
> echoed timestamp received from the peer. Anything that fails in this
> function will actually discard the request socket.
> 
> While this is the natural place for decoding options such as ECN which
> commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue
> that in particular for ECN handling, it can be checked at a later point
> in time as the request sock would actually not need to be dropped from
> this, but just ECN support turned off.
> 
> Therefore, we split this functionality into cookie_ecn_ok(), which tells
> |us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.
> 
> This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
> won't be enough anymore at that point; if the timestamp indicates ECN
> and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.
> 
> This would mean adding a route lookup to cookie_check_timestamp(), which
> we definitely want to avoid. As we already do a route lookup at a later
> point in cookie_{v4,v6}_check(), we can simply make use of that as well
> for the new cookie_ecn_ok() function w/o any additional cost.
> 
> Joint work with Daniel Borkmann.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH] PPC: bpf_jit_comp: add SKF_AD_PKTTYPE instruction
From: Philippe Bergheaud @ 2014-11-03 15:45 UTC (permalink / raw)
  To: Denis Kirjanov; +Cc: linuxppc-dev, netdev, Matt Evans, mpe
In-Reply-To: <CAOJe8K0t3G-bHm_24GjrTp9mmKnYZS1_bdGgrwQBLjC_s5is6w@mail.gmail.com>

Denis Kirjanov wrote:
> Any feedback from PPC folks?

I have reviewed the patch and it looks fine to me.
I have tested successfuly on ppc64le.
I could not test it on ppc64.

Philippe

> On 10/26/14, Denis Kirjanov <kda@linux-powerpc.org> wrote:
> 
>>Cc: Matt Evans <matt@ozlabs.org>
>>Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
>>---
>> arch/powerpc/include/asm/ppc-opcode.h | 1 +
>> arch/powerpc/net/bpf_jit.h            | 7 +++++++
>> arch/powerpc/net/bpf_jit_comp.c       | 5 +++++
>> 3 files changed, 13 insertions(+)
>>
>>diff --git a/arch/powerpc/include/asm/ppc-opcode.h
>>b/arch/powerpc/include/asm/ppc-opcode.h
>>index 6f85362..1a52877 100644
>>--- a/arch/powerpc/include/asm/ppc-opcode.h
>>+++ b/arch/powerpc/include/asm/ppc-opcode.h
>>@@ -204,6 +204,7 @@
>> #define PPC_INST_ERATSX_DOT		0x7c000127
>>
>> /* Misc instructions for BPF compiler */
>>+#define PPC_INST_LBZ			0x88000000
>> #define PPC_INST_LD			0xe8000000
>> #define PPC_INST_LHZ			0xa0000000
>> #define PPC_INST_LHBRX			0x7c00062c
>>diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
>>index 9aee27c..c406aa9 100644
>>--- a/arch/powerpc/net/bpf_jit.h
>>+++ b/arch/powerpc/net/bpf_jit.h
>>@@ -87,6 +87,9 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
>> #define PPC_STD(r, base, i)	EMIT(PPC_INST_STD | ___PPC_RS(r) |	      \
>> 				     ___PPC_RA(base) | ((i) & 0xfffc))
>>
>>+
>>+#define PPC_LBZ(r, base, i)	EMIT(PPC_INST_LBZ | ___PPC_RT(r) |	      \
>>+				     ___PPC_RA(base) | IMM_L(i))
>> #define PPC_LD(r, base, i)	EMIT(PPC_INST_LD | ___PPC_RT(r) |	      \
>> 				     ___PPC_RA(base) | IMM_L(i))
>> #define PPC_LWZ(r, base, i)	EMIT(PPC_INST_LWZ | ___PPC_RT(r) |	      \
>>@@ -96,6 +99,10 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
>> #define PPC_LHBRX(r, base, b)	EMIT(PPC_INST_LHBRX | ___PPC_RT(r) |	      \
>> 				     ___PPC_RA(base) | ___PPC_RB(b))
>> /* Convenience helpers for the above with 'far' offsets: */
>>+#define PPC_LBZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LBZ(r, base, i);
>>  \
>>+		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
>>+			PPC_LBZ(r, r, IMM_L(i)); } } while(0)
>>+
>> #define PPC_LD_OFFS(r, base, i) do { if ((i) < 32768) PPC_LD(r, base, i);
>>  \
>> 		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
>> 			PPC_LD(r, r, IMM_L(i)); } } while(0)
>>diff --git a/arch/powerpc/net/bpf_jit_comp.c
>>b/arch/powerpc/net/bpf_jit_comp.c
>>index cbae2df..d110e28 100644
>>--- a/arch/powerpc/net/bpf_jit_comp.c
>>+++ b/arch/powerpc/net/bpf_jit_comp.c
>>@@ -407,6 +407,11 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32
>>*image,
>> 			PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
>> 							  queue_mapping));
>> 			break;
>>+		case BPF_ANC | SKF_AD_PKTTYPE:
>>+			PPC_LBZ_OFFS(r_A, r_skb, PKT_TYPE_OFFSET());
>>+			PPC_ANDI(r_A, r_A, PKT_TYPE_MAX);
>>+			PPC_SRWI(r_A, r_A, 5);
>>+			break;
>> 		case BPF_ANC | SKF_AD_CPU:
>> #ifdef CONFIG_SMP
>> 			/*
>>--
>>2.1.0
>>
>>
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* Re: [PATCH -next v3 1/3] syncookies: avoid magic values and document which-bit-is-what-option
From: Eric Dumazet @ 2014-11-03 15:41 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Daniel Borkmann
In-Reply-To: <1415019720-17106-2-git-send-email-fw@strlen.de>

On Mon, 2014-11-03 at 14:01 +0100, Florian Westphal wrote:
> Was a bit more difficult to read than needed due to magic shifts;
> add defines and document the used encoding scheme.
> 
> Joint work with Daniel Borkmann.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH] ipv4: avoid divide 0 error in tcp_incr_quickack
From: Eric Dumazet @ 2014-11-03 15:30 UTC (permalink / raw)
  To: chenweilong, Yuchung Cheng, Neal Cardwell
  Cc: davem, kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel
In-Reply-To: <54571335.2060709@huawei.com>

On Mon, 2014-11-03 at 13:31 +0800, chenweilong wrote:

> Hi Eric,
> 
> I check the code and find that:
> 
> 1.In function "tcp_rcv_state_process",
> the "tcp_initialize_rcv_mss" is called at "step 5: check the ACK field" when the sk->sk_state is TCP_SYN_RECV
> and there is a "tcp_validate_incoming" just before it.
> So when we call "tcp_validate_incoming", the rcv_mss may not been initialized.
> 
> 2.In function "tcp_validate_incoming",
> the "Step 1: check sequence number", according to RFC793 page 69,
> If an incoming segment is not acceptable,an acknowledgment should be sent in reply (unless the RST
> bit is set, if so drop the segment and return).
> So we may call "tcp_send_dupack" while the rcv_mss hasn't been initialized.
> 
> 3.In function "tcp_send_dupack",
> when the condition is suitable, it'll enter quick ack mode. Notice it only check the seq !
> So I think add another state check should be OK.
> 
> Any suggestion ?
> 

You did find what immediate conditions for the crash (rcv_mss = 0, state
= TCP_SYN_RCV) were.

Your patch avoids the zero divide, but leaves other issues. rcv_mss = 0
here is a sign some logic is wrong in the stack.

Given this potential zero divide had been there for years, I believe we
should take the time for a more complete fix, instead of papering over
the immediate problem.

We have been working with Neal to reproduce the issue with packetdrill,
we'll post our results when we manage to get our first crash ;)

Thanks !

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox