Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] Revert "xen-netback: improve ring effeciency for guest RX"
From: David Miller @ 2013-10-08 19:11 UTC (permalink / raw)
  To: ian.campbell
  Cc: wei.liu2, netdev, xen-devel, annie.li, msw, xixiong, david.vrabel,
	paul.durrant
In-Reply-To: <1381227707.3804.65.camel@hastur.hellion.org.uk>

From: Ian Campbell <ian.campbell@citrix.com>
Date: Tue, 8 Oct 2013 11:21:47 +0100

> On Tue, 2013-10-08 at 10:54 +0100, Wei Liu wrote:
>> This reverts commit 4f0581d25827d5e864bcf07b05d73d0d12a20a5c.
>> 
>> The named changeset is causing problem. Let's aim to make this part less
>> fragile before trying to improve things.
>> 
>> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> Although I thought davem would just run git revert so I don't know if it
> is needed.

This works too, because it gives you guys an opportunity to add
some explanation to the commit message.

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] mac80211: port CCMP to cryptoapi's CCM driver
From: Johannes Berg @ 2013-10-08 19:08 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, patches-QSEj5FYQhm4dnm+yROfE0A,
	linville-2XuSBdqkA4R54TAoqtyWWQ
In-Reply-To: <1381231915-24232-1-git-send-email-ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>

I'm not too familiar with the aead API, so here's another question:

> +	sg_init_one(&pt, data, data_len);
> +	sg_init_one(&assoc, &aad[2], be16_to_cpup((__be16 *)aad));
> +	sg_init_table(ct, 2);
> +	sg_set_buf(&ct[0], cdata, data_len);
> +	sg_set_buf(&ct[1], mic, IEEE80211_CCMP_MIC_LEN);

Is it guaranteed to be allowed that the input and output are the same
buffer? It seems we rely on that for encrypt_one(), but is it true here
as well?

(Btw - why pass in data/cdata as separate pointers into the function?)

> @@ -343,7 +337,7 @@ static void ccmp_special_blocks(struct sk_buff *skb, u8 *pn, u8 *scratch,
>  		data_len -= IEEE80211_CCMP_MIC_LEN;
>  
>  	/* First block, b_0 */
> -	b_0[0] = 0x59; /* flags: Adata: 1, M: 011, L: 001 */
> +	b_0[0] = 0x1; /* set L := 1, M and Adata flags are implied */

Hmm. I don't think I understand, can you explain this to me?

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: tx checksum offload in rtl8168evl disabled in driver
From: jason.morgan @ 2013-10-08 18:44 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Hayes Wang, netdev
In-Reply-To: <20131005092252.GA23084@electric-eye.fr.zoreil.com>

> 
> (please don't top post)

Sorry, I was not aware of the correct netiquette
replies now inline.

> 
> jason.morgan@....> :
> > Ubuntu 12.04.3 LTS + Kernel 3.8.13-8 64bit
> > 
> > I've patched the driver to allow tx checksum offload for this chip and
> > found the following:
> >
> > MTU 9000 standard driver:
> > 517Mbps with 2k + header frames
> > 
> > MTU 9000 patched driver:
> > 770Mbps with 2k + header frames
> > 
> > 100% transfer without error (1e6 frames)
> 
> (Ok, so that's 20 ~ 30s worth of traffic)

Right now this is representative as the network has no other traffic and 
the CPU has no other function.
I will be producing hours of traffic soon in a more realistic network, see 
later....

> 
> > 48% increase in performance combined with a massive decrease in CPU
> > effort is not to be sniffed at.
> 
> *sniff* :o)

Drums.... BurrDumpf!

> 
> It depends on the CPU. You did not specify it and you did not give 
numbers
> for the decrease (did you use 'perf' btw ?). They would be welcome.

As I've patched the kernel driver and build a new kernel, I have no idea 
how to build a perf for my installed kernel.
It appears that there is not one pre-built.
Any help here?
When I learn how to build perf I can reply with CPU performance data.

> 
> > IMO tx offload should be more prevalent as the frames grow, to reduce 
> > CPU load.
> 
> I can't disagree.
> 
> > OK, so make the default OFF if there is a silicon error (that spans
> > mulitple chips?),
> 
> Yes, I want safe defaults for the kernel.

Safe defaults are of course a good thing.

Others have stated that this is not a silicon error, it's a hardware 
feature which it appears
the Linux default for which is non-optimal and there is no means (bar 
hacking) to optimise it.

> 
> I give the manufacturer's explanations a lot of credit when they're
> related to hardware (up to the point where the marketing or legal dept
> kicks in). If we want to balance these with experimental evidences, the
> latter must be really, really strong.

To me it appears the latency stats ( a marketing metric of the MAC ) 
somehow outweigh the load on the CPU
( which is factored out of MAC stats ) so perhaps your comment on the 
marketing dept rings true?

Are there any tests that you can suggest that would provide such evidence?

I am in the process of building a LAN of 18 machines with the intent of
saturating a 1G/10G Ethernet switch, so I have a good test platform - for 
one chip anyway.

> 
> > but why prevent it being turned on in the driver? 
> > even if there is a kernel message that this might cause problems.
> 
> Two points:
> - it's a hack: ethtool will return success. A kernel message is not a
> substitute for "Yes, I opt in for problems".
> - we can't tell when it's safe and when it isn't.

I would agree that generally ethtool can't be trusted, but it is clear 
that
it does work with this driver for this chip for this control so in that 
case surely it can be trusted?

The statement "we can't tell when it's safe and when it isn't." is true 
for almost any hardware interface
Usually we offer safe defaults and a means to trade off safety and 
performance, though that is not always the case, e.g.
It's much safer to run DDR3-1600 as DDR3-1066, but if that were the 
default people would complain.

> 
> -- 
> Ueimor
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] mac80211: port CCMP to cryptoapi's CCM driver
From: Johannes Berg @ 2013-10-08 18:27 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: David Laight,
	<linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	<netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	<patches-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>,
	<linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
In-Reply-To: <CAKv+Gu8D_7d=u1PGWuxoLEETHe8uJMby3K98uQWQn7tk=t_t_w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 2013-10-08 at 16:52 +0200, Ard Biesheuvel wrote:

> However, personally I don't think this should be necessary and in fact
> my patch removes a stack allocation of u8[48] (from
> ieee80211_crypto_ccmp_decrypt() and from ccmp_encrypt_skb() in wpa.c)
> so it does even out a bit.

I tend to agree.

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: IPv6 kernel warning
From: dormando @ 2013-10-08 18:24 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Michele Baldessari, Russell King - ARM Linux, netdev,
	Neal Cardwell, Nandita Dukkipati
In-Reply-To: <CAK6E8=d_O+HHTKb37zGKxpU8E0tTUL6m77g_o6a7ASEiJfXSkw@mail.gmail.com>

On Mon, 7 Oct 2013, Yuchung Cheng wrote:

> On Mon, Oct 7, 2013 at 12:56 PM, dormando <dormando@rydia.net> wrote:
> > On Mon, 7 Oct 2013, Yuchung Cheng wrote:
> >
> >> On Mon, Oct 7, 2013 at 11:13 AM, dormando <dormando@rydia.net> wrote:
> >> >
> >> > > >
> >> > > > there's been multiple reports about this one:
> >> > > > https://bugzilla.redhat.com/show_bug.cgi?id=989251
> >> > > > http://bugzilla.kernel.org/show_bug.cgi?id=60779
> >> > > >
> >> > > > Could you try Yuchung's debug patch?
> >> > > > http://www.spinics.net/lists/netdev/msg250193.html
> >> > > Yes it looks like the same bug. Please try that patch to help identify
> >> > > this elusive bug.
> >> > >
> >> >
> >> > Hi!
> >> >
> >> > We get this one a few times a day in production. Here's a warning with
> >> > your debug trace in the line immediately following:
> >> > (I censored a few things)
> >> >
> >> >  [125311.721950] ------------[ cut here ]------------
> >> >  [125311.721961] WARNING: at net/ipv4/tcp_input.c:2776 tcp_fastretrans_alert+0xb58/0xc80()
> >> >  [125311.721962] Modules linked in: bridge ip_vs macvlan coretemp crc32_pclmul ghash_clmulni_intel gpio_ich ipmi_watchdog microcode ipmi_devintf sb_edac lpc_ich edac_core mfd_core ipmi_si ipmi_msghandler iptable_nat nf_nat_ipv4 nf_nat ixgbe igb mdio i2c_algo_bit ptp pps_core
> >> >  [125311.721981] CPU: 11 PID: 0 Comm: swapper/11 Not tainted 3.10.13 #1
> >> >  [125311.721982] Hardware name: Supermicro XXXXXXXXXXX, BIOS 1.1 10/03/2012
> >> >  [125311.721984]  ffffffff81a82007 ffff88407fc63958 ffffffff816bb9cc ffff88407fc63998
> >> >  [125311.721986]  ffffffff8104b940 00ff8840ad904f82 ffff883b8a165b00 0000000000004120
> >> >  [125311.721989]  0000000000000001 0000000000000019 0000000000000000 ffff88407fc639a8
> >> >  [125311.721991] Call Trace:
> >> >  [125311.721992]  <IRQ>  [<ffffffff816bb9cc>] dump_stack+0x19/0x1d
> >> >  [125311.722002]  [<ffffffff8104b940>] warn_slowpath_common+0x70/0xa0
> >> >  [125311.722005]  [<ffffffff8104b98a>] warn_slowpath_null+0x1a/0x20
> >> >  [125311.722007]  [<ffffffff81616db8>] tcp_fastretrans_alert+0xb58/0xc80
> >> >  [125311.722011]  [<ffffffff8161891f>] tcp_ack+0x6df/0xe90
> >> >  [125311.722016]  [<ffffffff8164e0ca>] ? ipt_do_table+0x22a/0x680
> >> >  [125311.722018]  [<ffffffff816194b3>] ? tcp_validate_incoming+0x63/0x320
> >> >  [125311.722021]  [<ffffffff8161a55c>] tcp_rcv_established+0x2cc/0x810
> >> >  [125311.722023]  [<ffffffff81622c84>] tcp_v4_do_rcv+0x254/0x4f0
> >> >  [125311.722025]  [<ffffffff816245ac>] tcp_v4_rcv+0x5fc/0x750
> >> >  [125311.722027]  [<ffffffff815ffa00>] ? ip_rcv+0x350/0x350
> >> >  [125311.722032]  [<ffffffff815df3ad>] ? nf_hook_slow+0x7d/0x160
> >> >  [125311.722034]  [<ffffffff815ffa00>] ? ip_rcv+0x350/0x350
> >> >  [125311.722036]  [<ffffffff815fface>] ip_local_deliver_finish+0xce/0x250
> >> >  [125311.722037]  [<ffffffff815ffc9c>] ip_local_deliver+0x4c/0x80
> >> >  [125311.722039]  [<ffffffff815ff329>] ip_rcv_finish+0x119/0x360
> >> >  [125311.722040]  [<ffffffff815ff8e0>] ip_rcv+0x230/0x350
> >> >  [125311.722046]  [<ffffffff815b4067>] __netif_receive_skb_core+0x477/0x600
> >> >  [125311.722049]  [<ffffffff815b4217>] __netif_receive_skb+0x27/0x70
> >> >  [125311.722051]  [<ffffffff815b4354>] process_backlog+0xf4/0x1e0
> >> >  [125311.722053]  [<ffffffff815b4b45>] net_rx_action+0xf5/0x250
> >> >  [125311.722056]  [<ffffffff81053a5f>] __do_softirq+0xef/0x270
> >> >  [125311.722058]  [<ffffffff81053cb5>] irq_exit+0x95/0xa0
> >> >  [125311.722062]  [<ffffffff816c8f26>] do_IRQ+0x66/0xe0
> >> >  [125311.722065]  [<ffffffff816bf62a>] common_interrupt+0x6a/0x6a
> >> >  [125311.722065]  <EOI>  [<ffffffff8100abf1>] ? default_idle+0x21/0xc0
> >> >  [125311.722082]  [<ffffffff8100a54f>] arch_cpu_idle+0xf/0x20
> >> >  [125311.722086]  [<ffffffff8108f353>] cpu_startup_entry+0xb3/0x230
> >> >  [125311.722091]  [<ffffffff816b439e>] start_secondary+0x1dc/0x1e3
> >> >  [125311.722093] ---[ end trace e77cd5ba583fcbe9 ]---
> >> >  [125311.722096] 355.355.1.355:22496 F0x4120 S1 s7 IF25+17-1-24f0 ur57 rr3 rt0 um0 hs23120 nxt23120
> >> >
> >> > It's been happening with all 3.10 kernels, and the one above is .13 as
> >> > stated in the trace.
> >>
> >> Thanks! could you post the output of `sysctl -a |grep tcp`?
> >>
> >> I suspect tcp_process_tlp_ack() should not revert state to Open
> >> directly, but calling tcp_try_keep_open() instead, similar to all the
> >> undo processing in the tcp_fastretrans_alert(): after
> >> tcp_end_cwnd_reduction(), the process (E) falls back to check other
> >> stats before moving to CA_Open.
> >>
> >>
> >> index 9c62257..9012b42 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -3314,7 +3314,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack,
> >>                         tcp_init_cwnd_reduction(sk, true);
> >>                         tcp_set_ca_state(sk, TCP_CA_CWR);
> >>                         tcp_end_cwnd_reduction(sk);
> >> -                       tcp_set_ca_state(sk, TCP_CA_Open);
> >> +                       tcp_try_keep_open(sk);
> >>                         NET_INC_STATS_BH(sock_net(sk),
> >>                                          LINUX_MIB_TCPLOSSPROBERECOVERY);
> >>                 }
> >>
> >
> > Should I apply this and see if the warning stops?
> I'd like to hear what the authors of TLP think. In the mean time could
> you help us collect more evidence by disabling TLP with
> sysctl net.ipv4.tcp_early_retrans=2
> and see if the problem still occurs? (it should not).
>
> thanks

Box hasn't had a warning in the last 24ish hours. A neighboring machine
with the default tcp_early_retrans setting has had 5-6 in the same
timeframe.

Is this a harmful situation to the socket in any way, or is it just
informational weirdness?

^ permalink raw reply

* Re: IPv6 kernel warning
From: Yuchung Cheng @ 2013-10-08 17:56 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: dormando, Michele Baldessari, Russell King - ARM Linux, netdev,
	Nandita Dukkipati
In-Reply-To: <CADVnQynJVvFesFYp68N5kzCJJD3-NXPZUUMfkg8m=or4eaLzMQ@mail.gmail.com>

On Tue, Oct 8, 2013 at 7:05 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Mon, Oct 7, 2013 at 3:51 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> I suspect tcp_process_tlp_ack() should not revert state to Open
>> directly, but calling tcp_try_keep_open() instead, similar to all the
>> undo processing in the tcp_fastretrans_alert(): after
>> tcp_end_cwnd_reduction(), the process (E) falls back to check other
>> stats before moving to CA_Open.
>>
>>
>> index 9c62257..9012b42 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -3314,7 +3314,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack,
>>                         tcp_init_cwnd_reduction(sk, true);
>>                         tcp_set_ca_state(sk, TCP_CA_CWR);
>>                         tcp_end_cwnd_reduction(sk);
>> -                       tcp_set_ca_state(sk, TCP_CA_Open);
>> +                       tcp_try_keep_open(sk);
>>                         NET_INC_STATS_BH(sock_net(sk),
>>                                          LINUX_MIB_TCPLOSSPROBERECOVERY);
>>                 }
>
> Yes, nice catch! This looks good to me. My testing confirms that this
> definitely fixes a bug when this code fires and there are segments
> SACKed out. Since it will stay in CA_Disorder if there are outstanding
> retransmissions, I bet it will also fix the WARN_ON(tp->retrans_out !=
> 0) in state TCP_CA_Open that people are seeing.
Sounds good. Let me do more tests then I will submit a bug fix.

>
> neal

^ permalink raw reply

* [PATCH 1/3] net: bpf jit: ppc: optimize choose_load_func error path
From: Vladimir Murzin @ 2013-10-08 16:31 UTC (permalink / raw)
  To: netdev
  Cc: av1474, Vladimir Murzin, Jan Seiffert, Benjamin Herrenschmidt,
	Paul Mackerras, Daniel Borkmann, Matt Evans

Macro CHOOSE_LOAD_FUNC returns handler for "any offset" if checks for K
were not passed. At the same time handlers for "any offset" cases make
the same checks against r_addr at run-time, that will always lead to
bpf_error.

Run-time checks are still necessary for indirect load operations, but
error path for absolute and mesh loads are worth to optimize during bpf
compile time.

Signed-off-by: Vladimir Murzin <murzin.v@gmail.com>

Cc: Jan Seiffert <kaffeemonster@googlemail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Matt Evans <matt@ozlabs.org>

---
 arch/powerpc/net/bpf_jit_comp.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index bf56e33..754320a 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -132,7 +132,7 @@ static void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)
 }
 
 #define CHOOSE_LOAD_FUNC(K, func) \
-	((int)K < 0 ? ((int)K >= SKF_LL_OFF ? func##_negative_offset : func) : func##_positive_offset)
+	((int)K < 0 ? ((int)K >= SKF_LL_OFF ? func##_negative_offset : NULL) : func##_positive_offset)
 
 /* Assemble the body code between the prologue & epilogue. */
 static int bpf_jit_build_body(struct sk_filter *fp, u32 *image,
@@ -427,6 +427,11 @@ static int bpf_jit_build_body(struct sk_filter *fp, u32 *image,
 		case BPF_S_LD_B_ABS:
 			func = CHOOSE_LOAD_FUNC(K, sk_load_byte);
 		common_load:
+			if (!func) {
+				PPC_LI(r_ret, 0);
+				PPC_JMP(exit_addr);
+				break;
+			}
 			/* Load from [K]. */
 			ctx->seen |= SEEN_DATAREF;
 			PPC_LI64(r_scratch1, func);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net 6/6] sfc: Only bind to EF10 functions with the LinkCtrl and Trusted flags
From: Ben Hutchings @ 2013-10-08 17:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1381254091.1530.9.camel@bwh-desktop.uk.level5networks.com>

Although we do not yet enable multiple PFs per port, it is possible
that a board will be reconfigured to enable them while the driver has
not yet been updated to fully support this.

The most obvious problem is that multiple functions may try to set
conflicting link settings.  But we will also run into trouble if the
firmware doesn't consider us fully trusted.  So, abort probing unless
both the LinkCtrl and Trusted flags are set for this function.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/mcdi.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/mcdi.c b/drivers/net/ethernet/sfc/mcdi.c
index c082562..366c8e3 100644
--- a/drivers/net/ethernet/sfc/mcdi.c
+++ b/drivers/net/ethernet/sfc/mcdi.c
@@ -963,7 +963,7 @@ static int efx_mcdi_drv_attach(struct efx_nic *efx, bool driver_operating,
 			       bool *was_attached)
 {
 	MCDI_DECLARE_BUF(inbuf, MC_CMD_DRV_ATTACH_IN_LEN);
-	MCDI_DECLARE_BUF(outbuf, MC_CMD_DRV_ATTACH_OUT_LEN);
+	MCDI_DECLARE_BUF(outbuf, MC_CMD_DRV_ATTACH_EXT_OUT_LEN);
 	size_t outlen;
 	int rc;
 
@@ -981,6 +981,22 @@ static int efx_mcdi_drv_attach(struct efx_nic *efx, bool driver_operating,
 		goto fail;
 	}
 
+	/* We currently assume we have control of the external link
+	 * and are completely trusted by firmware.  Abort probing
+	 * if that's not true for this function.
+	 */
+	if (driver_operating &&
+	    outlen >= MC_CMD_DRV_ATTACH_EXT_OUT_LEN &&
+	    (MCDI_DWORD(outbuf, DRV_ATTACH_EXT_OUT_FUNC_FLAGS) &
+	     (1 << MC_CMD_DRV_ATTACH_EXT_OUT_FLAG_LINKCTRL |
+	      1 << MC_CMD_DRV_ATTACH_EXT_OUT_FLAG_TRUSTED)) !=
+	    (1 << MC_CMD_DRV_ATTACH_EXT_OUT_FLAG_LINKCTRL |
+	     1 << MC_CMD_DRV_ATTACH_EXT_OUT_FLAG_TRUSTED)) {
+		netif_err(efx, probe, efx->net_dev,
+			  "This driver version only supports one function per port\n");
+		return -ENODEV;
+	}
+
 	if (was_attached != NULL)
 		*was_attached = MCDI_DWORD(outbuf, DRV_ATTACH_OUT_OLD_STATE);
 	return 0;

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net 5/6] sfc: Add PM and RXDP drop counters to ethtool stats
From: Ben Hutchings @ 2013-10-08 17:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1381254091.1530.9.camel@bwh-desktop.uk.level5networks.com>

From: Edward Cree <ecree@solarflare.com>

Recognise the new Packet Memory and RX Data Path counters.

The following counters are added:
rx_pm_{trunc,discard}_bb_overflow - burst buffer overflowed.  This should not
 occur if BB correctly configured.
rx_pm_{trunc,discard}_vfifo_full - not enough space in packet memory.  May
 indicate RX performance problems.
rx_pm_{trunc,discard}_qbb - dropped by 802.1Qbb early discard mechanism.
 Since Qbb is not supported at present, this should not occur.
rx_pm_discard_mapping - 802.1p priority configured to be dropped.  This should
 not occur in normal operation.
rx_dp_q_disabled_packets - packet was to be delivered to a queue but queue is
 disabled.  May indicate misconfiguration by the driver.
rx_dp_di_dropped_packets - parser-dispatcher indicated that a packet should be
 dropped.
rx_dp_streaming_packets - packet was sent to the RXDP streaming bus, ie. a
 filter directed the packet to the MCPU.
rx_dp_emerg_{fetch,wait} - RX datapath had to wait for descriptors to be
 loaded.  Indicates performance problems but not drops.

These are only provided if the MC firmware has the
PM_AND_RXDP_COUNTERS capability.  Otherwise, mask them out.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c | 35 +++++++++++++++++++++++++++++++++++
 drivers/net/ethernet/sfc/nic.h  | 12 ++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index c00a5d6..21f9ad6 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -444,6 +444,18 @@ static const struct efx_hw_stat_desc efx_ef10_stat_desc[EF10_STAT_COUNT] = {
 	EF10_DMA_STAT(rx_align_error, RX_ALIGN_ERROR_PKTS),
 	EF10_DMA_STAT(rx_length_error, RX_LENGTH_ERROR_PKTS),
 	EF10_DMA_STAT(rx_nodesc_drops, RX_NODESC_DROPS),
+	EF10_DMA_STAT(rx_pm_trunc_bb_overflow, PM_TRUNC_BB_OVERFLOW),
+	EF10_DMA_STAT(rx_pm_discard_bb_overflow, PM_DISCARD_BB_OVERFLOW),
+	EF10_DMA_STAT(rx_pm_trunc_vfifo_full, PM_TRUNC_VFIFO_FULL),
+	EF10_DMA_STAT(rx_pm_discard_vfifo_full, PM_DISCARD_VFIFO_FULL),
+	EF10_DMA_STAT(rx_pm_trunc_qbb, PM_TRUNC_QBB),
+	EF10_DMA_STAT(rx_pm_discard_qbb, PM_DISCARD_QBB),
+	EF10_DMA_STAT(rx_pm_discard_mapping, PM_DISCARD_MAPPING),
+	EF10_DMA_STAT(rx_dp_q_disabled_packets, RXDP_Q_DISABLED_PKTS),
+	EF10_DMA_STAT(rx_dp_di_dropped_packets, RXDP_DI_DROPPED_PKTS),
+	EF10_DMA_STAT(rx_dp_streaming_packets, RXDP_STREAMING_PKTS),
+	EF10_DMA_STAT(rx_dp_emerg_fetch, RXDP_EMERGENCY_FETCH_CONDITIONS),
+	EF10_DMA_STAT(rx_dp_emerg_wait, RXDP_EMERGENCY_WAIT_CONDITIONS),
 };
 
 #define HUNT_COMMON_STAT_MASK ((1ULL << EF10_STAT_tx_bytes) |		\
@@ -498,15 +510,38 @@ static const struct efx_hw_stat_desc efx_ef10_stat_desc[EF10_STAT_COUNT] = {
 #define HUNT_40G_EXTRA_STAT_MASK ((1ULL << EF10_STAT_rx_align_error) |	\
 				  (1ULL << EF10_STAT_rx_length_error))
 
+/* These statistics are only provided if the firmware supports the
+ * capability PM_AND_RXDP_COUNTERS.
+ */
+#define HUNT_PM_AND_RXDP_STAT_MASK (					\
+	(1ULL << EF10_STAT_rx_pm_trunc_bb_overflow) |			\
+	(1ULL << EF10_STAT_rx_pm_discard_bb_overflow) |			\
+	(1ULL << EF10_STAT_rx_pm_trunc_vfifo_full) |			\
+	(1ULL << EF10_STAT_rx_pm_discard_vfifo_full) |			\
+	(1ULL << EF10_STAT_rx_pm_trunc_qbb) |				\
+	(1ULL << EF10_STAT_rx_pm_discard_qbb) |				\
+	(1ULL << EF10_STAT_rx_pm_discard_mapping) |			\
+	(1ULL << EF10_STAT_rx_dp_q_disabled_packets) |			\
+	(1ULL << EF10_STAT_rx_dp_di_dropped_packets) |			\
+	(1ULL << EF10_STAT_rx_dp_streaming_packets) |			\
+	(1ULL << EF10_STAT_rx_dp_emerg_fetch) |				\
+	(1ULL << EF10_STAT_rx_dp_emerg_wait))
+
 static u64 efx_ef10_raw_stat_mask(struct efx_nic *efx)
 {
 	u64 raw_mask = HUNT_COMMON_STAT_MASK;
 	u32 port_caps = efx_mcdi_phy_get_caps(efx);
+	struct efx_ef10_nic_data *nic_data = efx->nic_data;
 
 	if (port_caps & (1 << MC_CMD_PHY_CAP_40000FDX_LBN))
 		raw_mask |= HUNT_40G_EXTRA_STAT_MASK;
 	else
 		raw_mask |= HUNT_10G_ONLY_STAT_MASK;
+
+	if (nic_data->datapath_caps &
+	    (1 << MC_CMD_GET_CAPABILITIES_OUT_PM_AND_RXDP_COUNTERS_LBN))
+		raw_mask |= HUNT_PM_AND_RXDP_STAT_MASK;
+
 	return raw_mask;
 }
 
diff --git a/drivers/net/ethernet/sfc/nic.h b/drivers/net/ethernet/sfc/nic.h
index fda29d3..890bbbe 100644
--- a/drivers/net/ethernet/sfc/nic.h
+++ b/drivers/net/ethernet/sfc/nic.h
@@ -386,6 +386,18 @@ enum {
 	EF10_STAT_rx_align_error,
 	EF10_STAT_rx_length_error,
 	EF10_STAT_rx_nodesc_drops,
+	EF10_STAT_rx_pm_trunc_bb_overflow,
+	EF10_STAT_rx_pm_discard_bb_overflow,
+	EF10_STAT_rx_pm_trunc_vfifo_full,
+	EF10_STAT_rx_pm_discard_vfifo_full,
+	EF10_STAT_rx_pm_trunc_qbb,
+	EF10_STAT_rx_pm_discard_qbb,
+	EF10_STAT_rx_pm_discard_mapping,
+	EF10_STAT_rx_dp_q_disabled_packets,
+	EF10_STAT_rx_dp_di_dropped_packets,
+	EF10_STAT_rx_dp_streaming_packets,
+	EF10_STAT_rx_dp_emerg_fetch,
+	EF10_STAT_rx_dp_emerg_wait,
 	EF10_STAT_COUNT
 };
 


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net 4/6] sfc: Add definitions for new stats counters and capability flag
From: Ben Hutchings @ 2013-10-08 17:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1381254091.1530.9.camel@bwh-desktop.uk.level5networks.com>

From: Matthew Slattery <mslattery@solarflare.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/mcdi_pcol.h | 56 ++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sfc/mcdi_pcol.h b/drivers/net/ethernet/sfc/mcdi_pcol.h
index b5cf624..e0a63dd 100644
--- a/drivers/net/ethernet/sfc/mcdi_pcol.h
+++ b/drivers/net/ethernet/sfc/mcdi_pcol.h
@@ -2574,8 +2574,58 @@
 #define          MC_CMD_MAC_RX_LANES01_DISP_ERR  0x39 /* enum */
 #define          MC_CMD_MAC_RX_LANES23_DISP_ERR  0x3a /* enum */
 #define          MC_CMD_MAC_RX_MATCH_FAULT  0x3b /* enum */
-#define          MC_CMD_GMAC_DMABUF_START  0x40 /* enum */
-#define          MC_CMD_GMAC_DMABUF_END    0x5f /* enum */
+/* enum: PM trunc_bb_overflow counter. Valid for EF10 with PM_AND_RXDP_COUNTERS
+ * capability only.
+ */
+#define          MC_CMD_MAC_PM_TRUNC_BB_OVERFLOW  0x3c
+/* enum: PM discard_bb_overflow counter. Valid for EF10 with
+ * PM_AND_RXDP_COUNTERS capability only.
+ */
+#define          MC_CMD_MAC_PM_DISCARD_BB_OVERFLOW  0x3d
+/* enum: PM trunc_vfifo_full counter. Valid for EF10 with PM_AND_RXDP_COUNTERS
+ * capability only.
+ */
+#define          MC_CMD_MAC_PM_TRUNC_VFIFO_FULL  0x3e
+/* enum: PM discard_vfifo_full counter. Valid for EF10 with
+ * PM_AND_RXDP_COUNTERS capability only.
+ */
+#define          MC_CMD_MAC_PM_DISCARD_VFIFO_FULL  0x3f
+/* enum: PM trunc_qbb counter. Valid for EF10 with PM_AND_RXDP_COUNTERS
+ * capability only.
+ */
+#define          MC_CMD_MAC_PM_TRUNC_QBB  0x40
+/* enum: PM discard_qbb counter. Valid for EF10 with PM_AND_RXDP_COUNTERS
+ * capability only.
+ */
+#define          MC_CMD_MAC_PM_DISCARD_QBB  0x41
+/* enum: PM discard_mapping counter. Valid for EF10 with PM_AND_RXDP_COUNTERS
+ * capability only.
+ */
+#define          MC_CMD_MAC_PM_DISCARD_MAPPING  0x42
+/* enum: RXDP counter: Number of packets dropped due to the queue being
+ * disabled. Valid for EF10 with PM_AND_RXDP_COUNTERS capability only.
+ */
+#define          MC_CMD_MAC_RXDP_Q_DISABLED_PKTS  0x43
+/* enum: RXDP counter: Number of packets dropped by the DICPU. Valid for EF10
+ * with PM_AND_RXDP_COUNTERS capability only.
+ */
+#define          MC_CMD_MAC_RXDP_DI_DROPPED_PKTS  0x45
+/* enum: RXDP counter: Number of non-host packets. Valid for EF10 with
+ * PM_AND_RXDP_COUNTERS capability only.
+ */
+#define          MC_CMD_MAC_RXDP_STREAMING_PKTS  0x46
+/* enum: RXDP counter: Number of times an emergency descriptor fetch was
+ * performed. Valid for EF10 with PM_AND_RXDP_COUNTERS capability only.
+ */
+#define          MC_CMD_MAC_RXDP_EMERGENCY_FETCH_CONDITIONS  0x47
+/* enum: RXDP counter: Number of times the DPCPU waited for an existing
+ * descriptor fetch. Valid for EF10 with PM_AND_RXDP_COUNTERS capability only.
+ */
+#define          MC_CMD_MAC_RXDP_EMERGENCY_WAIT_CONDITIONS  0x48
+/* enum: Start of GMAC stats buffer space, for Siena only. */
+#define          MC_CMD_GMAC_DMABUF_START  0x40
+/* enum: End of GMAC stats buffer space, for Siena only. */
+#define          MC_CMD_GMAC_DMABUF_END    0x5f
 #define          MC_CMD_MAC_GENERATION_END 0x60 /* enum */
 #define          MC_CMD_MAC_NSTATS  0x61 /* enum */
 
@@ -5065,6 +5115,8 @@
 #define        MC_CMD_GET_CAPABILITIES_OUT_RX_BATCHING_WIDTH 1
 #define        MC_CMD_GET_CAPABILITIES_OUT_MCAST_FILTER_CHAINING_LBN 26
 #define        MC_CMD_GET_CAPABILITIES_OUT_MCAST_FILTER_CHAINING_WIDTH 1
+#define        MC_CMD_GET_CAPABILITIES_OUT_PM_AND_RXDP_COUNTERS_LBN 27
+#define        MC_CMD_GET_CAPABILITIES_OUT_PM_AND_RXDP_COUNTERS_WIDTH 1
 /* RxDPCPU firmware id. */
 #define       MC_CMD_GET_CAPABILITIES_OUT_RX_DPCPU_FW_ID_OFST 4
 #define       MC_CMD_GET_CAPABILITIES_OUT_RX_DPCPU_FW_ID_LEN 2


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net 3/6] sfc: Refactor EF10 stat mask code to allow for more conditional stats
From: Ben Hutchings @ 2013-10-08 17:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1381254091.1530.9.camel@bwh-desktop.uk.level5networks.com>

From: Edward Cree <ecree@solarflare.com>

Previously, efx_ef10_stat_mask returned a static const unsigned long[], which
meant that each possible mask had to be declared statically with
STAT_MASK_BITMAP.  Since adding a condition would double the size of the
decision tree, we now create the bitmask dynamically.

To do this, we have two functions efx_ef10_raw_stat_mask, which returns a u64,
and efx_ef10_get_stat_mask, which fills in an unsigned long * argument.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c | 49 +++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index a4fbb38..c00a5d6 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -498,44 +498,49 @@ static const struct efx_hw_stat_desc efx_ef10_stat_desc[EF10_STAT_COUNT] = {
 #define HUNT_40G_EXTRA_STAT_MASK ((1ULL << EF10_STAT_rx_align_error) |	\
 				  (1ULL << EF10_STAT_rx_length_error))
 
-#if BITS_PER_LONG == 64
-#define STAT_MASK_BITMAP(bits) (bits)
-#else
-#define STAT_MASK_BITMAP(bits) (bits) & 0xffffffff, (bits) >> 32
-#endif
-
-static const unsigned long *efx_ef10_stat_mask(struct efx_nic *efx)
+static u64 efx_ef10_raw_stat_mask(struct efx_nic *efx)
 {
-	static const unsigned long hunt_40g_stat_mask[] = {
-		STAT_MASK_BITMAP(HUNT_COMMON_STAT_MASK |
-				 HUNT_40G_EXTRA_STAT_MASK)
-	};
-	static const unsigned long hunt_10g_only_stat_mask[] = {
-		STAT_MASK_BITMAP(HUNT_COMMON_STAT_MASK |
-				 HUNT_10G_ONLY_STAT_MASK)
-	};
+	u64 raw_mask = HUNT_COMMON_STAT_MASK;
 	u32 port_caps = efx_mcdi_phy_get_caps(efx);
 
 	if (port_caps & (1 << MC_CMD_PHY_CAP_40000FDX_LBN))
-		return hunt_40g_stat_mask;
+		raw_mask |= HUNT_40G_EXTRA_STAT_MASK;
 	else
-		return hunt_10g_only_stat_mask;
+		raw_mask |= HUNT_10G_ONLY_STAT_MASK;
+	return raw_mask;
+}
+
+static void efx_ef10_get_stat_mask(struct efx_nic *efx, unsigned long *mask)
+{
+	u64 raw_mask = efx_ef10_raw_stat_mask(efx);
+
+#if BITS_PER_LONG == 64
+	mask[0] = raw_mask;
+#else
+	mask[0] = raw_mask & 0xffffffff;
+	mask[1] = raw_mask >> 32;
+#endif
 }
 
 static size_t efx_ef10_describe_stats(struct efx_nic *efx, u8 *names)
 {
+	DECLARE_BITMAP(mask, EF10_STAT_COUNT);
+
+	efx_ef10_get_stat_mask(efx, mask);
 	return efx_nic_describe_stats(efx_ef10_stat_desc, EF10_STAT_COUNT,
-				      efx_ef10_stat_mask(efx), names);
+				      mask, names);
 }
 
 static int efx_ef10_try_update_nic_stats(struct efx_nic *efx)
 {
 	struct efx_ef10_nic_data *nic_data = efx->nic_data;
-	const unsigned long *stats_mask = efx_ef10_stat_mask(efx);
+	DECLARE_BITMAP(mask, EF10_STAT_COUNT);
 	__le64 generation_start, generation_end;
 	u64 *stats = nic_data->stats;
 	__le64 *dma_stats;
 
+	efx_ef10_get_stat_mask(efx, mask);
+
 	dma_stats = efx->stats_buffer.addr;
 	nic_data = efx->nic_data;
 
@@ -543,7 +548,7 @@ static int efx_ef10_try_update_nic_stats(struct efx_nic *efx)
 	if (generation_end == EFX_MC_STATS_GENERATION_INVALID)
 		return 0;
 	rmb();
-	efx_nic_update_stats(efx_ef10_stat_desc, EF10_STAT_COUNT, stats_mask,
+	efx_nic_update_stats(efx_ef10_stat_desc, EF10_STAT_COUNT, mask,
 			     stats, efx->stats_buffer.addr, false);
 	rmb();
 	generation_start = dma_stats[MC_CMD_MAC_GENERATION_START];
@@ -564,12 +569,14 @@ static int efx_ef10_try_update_nic_stats(struct efx_nic *efx)
 static size_t efx_ef10_update_stats(struct efx_nic *efx, u64 *full_stats,
 				    struct rtnl_link_stats64 *core_stats)
 {
-	const unsigned long *mask = efx_ef10_stat_mask(efx);
+	DECLARE_BITMAP(mask, EF10_STAT_COUNT);
 	struct efx_ef10_nic_data *nic_data = efx->nic_data;
 	u64 *stats = nic_data->stats;
 	size_t stats_count = 0, index;
 	int retry;
 
+	efx_ef10_get_stat_mask(efx, mask);
+
 	/* If we're unlucky enough to read statistics during the DMA, wait
 	 * up to 10ms for it to finish (typically takes <500us)
 	 */


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net 2/6] sfc: Fix internal indices of ethtool stats for EF10
From: Ben Hutchings @ 2013-10-08 17:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1381254091.1530.9.camel@bwh-desktop.uk.level5networks.com>

From: Edward Cree <ecree@solarflare.com>

The indices in nic_data->stats need to match the EF10_STAT_whatever
enum values.  In efx_nic_update_stats, only mask; gaps are removed in
efx_ef10_update_stats.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/nic.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/sfc/nic.c b/drivers/net/ethernet/sfc/nic.c
index e7dbd2d..9826594 100644
--- a/drivers/net/ethernet/sfc/nic.c
+++ b/drivers/net/ethernet/sfc/nic.c
@@ -469,8 +469,7 @@ size_t efx_nic_describe_stats(const struct efx_hw_stat_desc *desc, size_t count,
  * @count: Length of the @desc array
  * @mask: Bitmask of which elements of @desc are enabled
  * @stats: Buffer to update with the converted statistics.  The length
- *	of this array must be at least the number of set bits in the
- *	first @count bits of @mask.
+ *	of this array must be at least @count.
  * @dma_buf: DMA buffer containing hardware statistics
  * @accumulate: If set, the converted values will be added rather than
  *	directly stored to the corresponding elements of @stats
@@ -503,11 +502,9 @@ void efx_nic_update_stats(const struct efx_hw_stat_desc *desc, size_t count,
 			}
 
 			if (accumulate)
-				*stats += val;
+				stats[index] += val;
 			else
-				*stats = val;
+				stats[index] = val;
 		}
-
-		++stats;
 	}
 }


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net 1/6] sfc: Add rmb() between reading stats and generation count to ensure consistency
From: Ben Hutchings @ 2013-10-08 17:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1381254091.1530.9.camel@bwh-desktop.uk.level5networks.com>

From: Jon Cooper <jcooper@solarflare.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 9f18ae9..a4fbb38 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -545,6 +545,7 @@ static int efx_ef10_try_update_nic_stats(struct efx_nic *efx)
 	rmb();
 	efx_nic_update_stats(efx_ef10_stat_desc, EF10_STAT_COUNT, stats_mask,
 			     stats, efx->stats_buffer.addr, false);
+	rmb();
 	generation_start = dma_stats[MC_CMD_MAC_GENERATION_START];
 	if (generation_end != generation_start)
 		return -EAGAIN;


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net-next 1/6] sfc: Add rmb() between reading stats and generation count to ensure consistency
From: Ben Hutchings @ 2013-10-08 17:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1381254091.1530.9.camel@bwh-desktop.uk.level5networks.com>

From: Jon Cooper <jcooper@solarflare.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 9f18ae9..a4fbb38 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -545,6 +545,7 @@ static int efx_ef10_try_update_nic_stats(struct efx_nic *efx)
 	rmb();
 	efx_nic_update_stats(efx_ef10_stat_desc, EF10_STAT_COUNT, stats_mask,
 			     stats, efx->stats_buffer.addr, false);
+	rmb();
 	generation_start = dma_stats[MC_CMD_MAC_GENERATION_START];
 	if (generation_end != generation_start)
 		return -EAGAIN;


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* Pull request: sfc 2013-10-08
From: Ben Hutchings @ 2013-10-08 17:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

[-- Attachment #1: Type: text/plain, Size: 1982 bytes --]

The following changes since commit 5e8a402f831dbe7ee831340a91439e46f0d38acd:

  tcp: do not forget FIN in tcp_shifted_skb() (2013-10-04 14:16:36 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc.git sfc-3.12

for you to fetch changes up to ecb1c9cc215cb5a4390b714d8b09de637f54fa3f:

  sfc: Only bind to EF10 functions with the LinkCtrl and Trusted flags (2013-10-07 20:11:19 +0100)

Some more fixes for EF10 support; hopefully the last lot:

1. Fixes for reading statistics, from Edward Cree and Jon Cooper.
2. Addition of ethtool statistics for packets dropped by the hardware
before they were associated with a specific function, from Edward Cree.
3. Only bind to functions that are in control of their associated port,
as the driver currently assumes this is the case.

Ben.

----------------------------------------------------------------
Ben Hutchings (1):
      sfc: Only bind to EF10 functions with the LinkCtrl and Trusted flags

Edward Cree (3):
      sfc: Fix internal indices of ethtool stats for EF10
      sfc: Refactor EF10 stat mask code to allow for more conditional stats
      sfc: Add PM and RXDP drop counters to ethtool stats

Jon Cooper (1):
      sfc: Add rmb() between reading stats and generation count to ensure consistency

Matthew Slattery (1):
      sfc: Add definitions for new stats counters and capability flag

 drivers/net/ethernet/sfc/ef10.c      | 87 +++++++++++++++++++++++++++---------
 drivers/net/ethernet/sfc/mcdi.c      | 18 +++++++-
 drivers/net/ethernet/sfc/mcdi_pcol.h | 56 ++++++++++++++++++++++-
 drivers/net/ethernet/sfc/nic.c       |  9 ++--
 drivers/net/ethernet/sfc/nic.h       | 12 +++++
 5 files changed, 151 insertions(+), 31 deletions(-)

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

* RE: [PATCH net] vti: get rid of nf mark rule in prerouting
From: Saurabh Mohan @ 2013-10-08 16:48 UTC (permalink / raw)
  To: Christophe Gouault, netdev@vger.kernel.org; +Cc: David S. Miller, Cong Wang
In-Reply-To: <1381245682-15523-1-git-send-email-christophe.gouault@6wind.com>

> -----Original Message-----
> From: Christophe Gouault [mailto:christophe.gouault@6wind.com]
> Sent: Tuesday, October 08, 2013 8:21 AM
> To: netdev@vger.kernel.org
> Cc: David S. Miller; Cong Wang; Saurabh Mohan; Christophe Gouault
> Subject: [PATCH net] vti: get rid of nf mark rule in prerouting
> 
> This patch fixes and improves the use of vti interfaces (while lightly changing
> the way of configuring them).
> 

Good fix. Thanks for doing this!

^ permalink raw reply

* Re: bug in passing file descriptors
From: Steve Rago @ 2013-10-08 16:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, David Miller, Network Development,
	Michael Kerrisk-manpages, Eric W. Biederman
In-Reply-To: <20131008164108.GO6882@two.firstfloor.org>

On 10/08/2013 12:41 PM, Andi Kleen wrote:
>> I just want the semantics to be consistent.  If you want Linux to
>> always require applications that call recvmsg to provide a buffer
>> size of CMSG_SPACE bytes long to retrieve control information, then
>> fail the system call when the buffer is smaller.  But if you do
>> this, you risk breaking applications that work with FreeBSD, Mac OS
>> X, Solaris, and probably a few others.
>
> The primary concern is to be binary compatible with Linux.
>
> But not being compatible between 32bit and 64bit Linux processes on the same
> host would seem like a serious problem to me.
>
>> Regardless, copying 20 bytes and telling me you copied 24 is misleading and wrong.
>
> The question is could it break existing Linux applications to change it?
> And would it help with the 32/64bit compatibility?
>
> If not some other way to fix the compat layer would need to be found.
>
> -Andi
>

I'm not sure if a 64-bit process and a 32-bit process exchange file descriptors on the same system has a problem.  It 
certainly looks like the compat code does the right thing.  I can test this tonight if you want.  The discrepancy arises 
because file descriptors are 4-byte quantities and treated differently (for example, when more than one is specified, 
each one isn't padded to an 8-byte boundary).

The way I discovered the problem is that I had an example program in APUE that was validating that msg_controllen had 
the correct value after calling recvmsg().  It worked on my 32-bit platform, but when I recompiled it and ran it on my 
64-bit platform, the test failed, because msg_controllen was larger than the size that was sent via sendmsg().

Steve

^ permalink raw reply

* Re: [PATCH net-next] net: gro: allow to build full sized skb
From: Eric Dumazet @ 2013-10-08 16:50 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, netdev
In-Reply-To: <52542F77.2050308@candelatech.com>

On Tue, 2013-10-08 at 09:14 -0700, Ben Greear wrote:

> On multi-homed boxes you could easily have paths that would never need linearize
> and other paths that always need it, for whatever reason.
> 
> Maybe a per-netdev check for needs linearize instead of something global
> would be better...and maybe let users over-ride the default behaviour
> regardless?

Sure. Note that this has yet to be discussed.

I mentioned this in the changelog for completeness.

I know people just disabling GRO in forwarding setups, probably because
we lack control of latencies / bursts.

^ permalink raw reply

* Re: bug in passing file descriptors
From: Andi Kleen @ 2013-10-08 16:41 UTC (permalink / raw)
  To: Steve Rago
  Cc: Andy Lutomirski, Andi Kleen, David Miller, Network Development,
	Michael Kerrisk-manpages, Eric W. Biederman
In-Reply-To: <52543065.7000209@nec-labs.com>

> I just want the semantics to be consistent.  If you want Linux to
> always require applications that call recvmsg to provide a buffer
> size of CMSG_SPACE bytes long to retrieve control information, then
> fail the system call when the buffer is smaller.  But if you do
> this, you risk breaking applications that work with FreeBSD, Mac OS
> X, Solaris, and probably a few others.

The primary concern is to be binary compatible with Linux.

But not being compatible between 32bit and 64bit Linux processes on the same
host would seem like a serious problem to me.

> Regardless, copying 20 bytes and telling me you copied 24 is misleading and wrong.

The question is could it break existing Linux applications to change it?
And would it help with the 32/64bit compatibility?

If not some other way to fix the compat layer would need to be found.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH net-next v2 2/5] xen-netback: Add support for IPv6 checksum offload from guest
From: Wei Liu @ 2013-10-08 16:19 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Wei Liu, xen-devel@lists.xen.org, netdev@vger.kernel.org,
	David Vrabel, Ian Campbell
In-Reply-To: <9AAE0902D5BC7E449B7C8E4E778ABCD012F345@AMSPEX01CL01.citrite.net>

On Tue, Oct 08, 2013 at 02:50:27PM +0100, Paul Durrant wrote:
[...]
> > > -#define PKT_PROT_LEN    (ETH_HLEN + \
> > > -			 VLAN_HLEN + \
> > > -			 sizeof(struct iphdr) + MAX_IPOPTLEN + \
> > > -			 sizeof(struct tcphdr) + MAX_TCP_OPTION_SPACE)
> > > +#define PKT_PROT_LEN 128
> > >
> > 
> > Where does 128 come from?
> > 
> 
> It's just an arbitrary power of 2 that was chosen because it seems to
> cover most likely v6 headers and all v4 headers.
> 

Hmm... How about using the value of MAX_TCP_HEADER? I guess that can
cover all V4 / V6 headers.

MAX_TCP_HEADER varies, depending on configuration. To make sure we can
accommodate all guests packet we might need to use its maximum value
which can be as big as 128 + 128 + 48.

> > >  		if (recalculate_partial_csum) {
> > >  			struct tcphdr *tcph = tcp_hdr(skb);
> > > +
> > > +			header_size = skb->network_header +
> > > +				off +
> > > +				sizeof(struct tcphdr) +
> > > +				MAX_TCP_OPTION_SPACE;
> > > +			maybe_pull_tail(skb, header_size);
> > > +
> > 
> > I presume this function is checksum_setup stripped down to handle IPv4
> > packet. What's the purpose of changing its behaviour? Why the pull_tail
> > here?
> > 
> 
> We have to make sure that the TCP header is in the linear area as we
> are about to write to the checksum field. In practice, the 128 byte
> pull should guarantee this but in case that is varied later I wanted
> to make sure things did not start to fail in an add way.
> 

If you already set the pull size to maximum possible value then this
will not be necessary anymore, right?

> > > +	while ((off <= sizeof(struct ipv6hdr) + ntohs(ipv6h->payload_len))
> > &&
> > > +	       !done) {
> > > +		/* We only access at most the first 2 bytes of any option
> > header
> > > +		 * to determine the next one.
> > > +		 */
> > > +		header_size = skb->network_header + off + 2;
> > > +		maybe_pull_tail(skb, header_size);
> > > +
> > 
> > Will this cause performance problem? Is it possible that you pull too
> > many times?
> > 
> 
> I guess it means we may get two pulls for the TCP/UDP headers rather
> than one so could push the pulls into the individual cases if you
> think it will affect performance that badly.

Hmm... Not sure I get what you mean here. The main problem I'm seeing is
that maybe_pull_tail is called in every loop.

I would like to see as few pulls as possible because __pskb_pull_tail
can be expensive and only expected to use in "exceptional cases" (quoted
from the comment above that function).

Is it possible to pull TCP_MAX_HEADER bytes once to eliminate all other
pulls in checksum_setup{,_ipv4,_ipv6}?

> 
[...]
> > 
> > Can you make the logic explicit here?
> > 
> >    case IPPROTO_TCP:
> >    case IPPROTO_UDP:
> >         done = true;
> > 	break;
> >    default:
> >         break;
> > 
> > Another minor suggestion is that change "done" to "found" because you're
> > trying to find the two type of headers.
> > 
> 
> Yes, I could code it that way if you prefer.
> 

Explicity is better than implicity IMHO. After this change could you
also move the default branch (netdev_err) in the following "switch" to
the first "switch".

Wei.

^ permalink raw reply

* Re: bug in passing file descriptors
From: Steve Rago @ 2013-10-08 16:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, David Miller, Network Development,
	Michael Kerrisk-manpages, Eric W. Biederman
In-Reply-To: <CALCETrVoUrcuzEx_TURoGE8_MJtZLOz-UDz6m69Vo6MCWG0zjg@mail.gmail.com>

On 10/08/2013 12:02 PM, Andy Lutomirski wrote:
> On Tue, Oct 8, 2013 at 7:32 AM, Steve Rago <sar@nec-labs.com> wrote:
>> On 10/07/2013 06:55 PM, Andi Kleen wrote:
>>>
>>> David Miller <davem@davemloft.net> writes:
>>>
>>>> From: Steve Rago <sar@nec-labs.com>
>>>> Date: Mon, 7 Oct 2013 16:29:15 -0400
>>>>
>>>>> On 10/07/2013 03:42 PM, David Miller wrote:
>>>>>>
>>>>>> There is no compatability issue.
>>>>>>
>>>>>> 32-bit tasks will always see the 4-byte align/length.
>>>>>> 64-bit tasks will always see the 8-byte align/length.
>>>>>>
>>>>>
>>>>> Really?  So when I compile my application on a 32-bit Linux box and
>>>>> then try to run it on a 64-bit Linux box, you're not going to overrun
>>>>> my buffer when CMSG_SPACE led me to allocate an insufficient amount of
>>>>> memory needed to account for padding on the 64-bit platform?
>>>>
>>>>
>>>> We have a compatability layer that gives 32-bit applications the
>>>> same behavior as if they had run on a 32-bit machine.
>>>>
>>>> Search around for the MSG_MSG_COMPAT flag and how that is used in
>>>> net/socket.c
>>>
>>>
>>> But it seems the compat layer doesn't handle this correctly,
>>> otherwise Steve's original test case would work.
>>>
>>> Must be a bug somewhere in the compat layer.
>>>
>>> -Andi
>>>
>>
>> I did some research last night and I think the problem stems from an
>> underspecified standard.  CMSG_LEN and CMSG_SPACE seem to have originated
>> with RFC 2292, which has since been obsoleted by RFC 3542.  The difference
>> is that CMSG_SPACE accounts for padding at the end, which is needed when you
>> stuff multiple cmsghdr objects in the same buffer.  CMSG_LEN is required to
>> be used to initialize the cmsg_len member of the structure.  When you only
>> have one cmsghdr object in your call to recvmsg, it is unclear whether you
>> need to have a buffer as large as CMSG_SPACE or CMSG_LEN.  Historically,
>> BSD-based platforms never had these macros and didn't return the ancillary
>> data if the space provided by the application wasn't big enough.  Linux,
>> *which has a bug*, won't copy more bytes than cmsg_len specifies when the
>> application uses CMSG_LEN instead of CMSG_SPACE, but then lies to the
>> application by overwriting msg_controllen.  Look in put_cmsg() in
>> net/core/scm.c: "cmlen" is calculated using CM_LEN, and then msg_controllen
>> is checked against it to make sure you don't overwrite the user's buffer.
>> However, in recvmsg(), msg_controllen is overwritten by "msg_sys.msg_control
>> - cmsg_ptr".  msg_control was incremented by CMSG_SPACE at the end of
>> scm_detach_fds().
>>
>> The Linux kernel code seems to copy the right amount of data, even in the
>> compat case, as far as I can tell.  The only bug I see is that
>> msg_controllen returns from recvmsg() with the wrong value.
>
> I don't see how the other behavior would be any less surprising.
> Suppose you had a control message with CMSG_LEN=12 and CMSG_SPACE=16.
> Then, with the semantics you want, one of them takes 12 bytes, two of
> them take 28 bytes, three take 44 bytes, etc.
>
> That seems considerably more surprising than having then take 16, 32,
> and 48 bytes respectively.
>
> --Andy
>

I just want the semantics to be consistent.  If you want Linux to always require applications that call recvmsg to 
provide a buffer size of CMSG_SPACE bytes long to retrieve control information, then fail the system call when the 
buffer is smaller.  But if you do this, you risk breaking applications that work with FreeBSD, Mac OS X, Solaris, and 
probably a few others.

No wonder the Single UNIX Specification didn't standardize these two macros.

Regardless, copying 20 bytes and telling me you copied 24 is misleading and wrong.

Steve

^ permalink raw reply

* Re: [PATCH net-next] net: gro: allow to build full sized skb
From: Ben Greear @ 2013-10-08 16:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1381248143.12191.53.camel@edumazet-glaptop.roam.corp.google.com>

On 10/08/2013 09:02 AM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb,
> typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags.
>
> It's relatively easy to extend the skb using frag_list to allow
> more frags to be appended into the last sk_buff.
>
> This still builds very efficient skbs, and allows reaching 45 MSS per
> skb.
>
> (45 MSS GRO packet uses one skb plus a frag_list containing 2 additional
> sk_buff)
>
> High speed TCP flows benefit from this extension by lowering TCP stack
> cpu usage (less packets stored in receive queue, less ACK packets
> processed)
>
> Forwarding setups could be hurt, as such skbs will need to be
> linearized, although its not a new problem, as GRO could already
> provide skbs with a frag_list.
>
> We could make the 65536 bytes threshold a tunable to mitigate this.
>
> (First time we need to linearize skb in skb_needs_linearize(), we could
> lower the tunable to ~16*1460 so that following skb_gro_receive() calls
> build smaller skbs)

On multi-homed boxes you could easily have paths that would never need linearize
and other paths that always need it, for whatever reason.

Maybe a per-netdev check for needs linearize instead of something global
would be better...and maybe let users over-ride the default behaviour
regardless?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: bug in passing file descriptors
From: Andy Lutomirski @ 2013-10-08 16:02 UTC (permalink / raw)
  To: Steve Rago
  Cc: Andi Kleen, David Miller, Network Development,
	Michael Kerrisk-manpages, Eric W. Biederman
In-Reply-To: <52541769.1000306@nec-labs.com>

On Tue, Oct 8, 2013 at 7:32 AM, Steve Rago <sar@nec-labs.com> wrote:
> On 10/07/2013 06:55 PM, Andi Kleen wrote:
>>
>> David Miller <davem@davemloft.net> writes:
>>
>>> From: Steve Rago <sar@nec-labs.com>
>>> Date: Mon, 7 Oct 2013 16:29:15 -0400
>>>
>>>> On 10/07/2013 03:42 PM, David Miller wrote:
>>>>>
>>>>> There is no compatability issue.
>>>>>
>>>>> 32-bit tasks will always see the 4-byte align/length.
>>>>> 64-bit tasks will always see the 8-byte align/length.
>>>>>
>>>>
>>>> Really?  So when I compile my application on a 32-bit Linux box and
>>>> then try to run it on a 64-bit Linux box, you're not going to overrun
>>>> my buffer when CMSG_SPACE led me to allocate an insufficient amount of
>>>> memory needed to account for padding on the 64-bit platform?
>>>
>>>
>>> We have a compatability layer that gives 32-bit applications the
>>> same behavior as if they had run on a 32-bit machine.
>>>
>>> Search around for the MSG_MSG_COMPAT flag and how that is used in
>>> net/socket.c
>>
>>
>> But it seems the compat layer doesn't handle this correctly,
>> otherwise Steve's original test case would work.
>>
>> Must be a bug somewhere in the compat layer.
>>
>> -Andi
>>
>
> I did some research last night and I think the problem stems from an
> underspecified standard.  CMSG_LEN and CMSG_SPACE seem to have originated
> with RFC 2292, which has since been obsoleted by RFC 3542.  The difference
> is that CMSG_SPACE accounts for padding at the end, which is needed when you
> stuff multiple cmsghdr objects in the same buffer.  CMSG_LEN is required to
> be used to initialize the cmsg_len member of the structure.  When you only
> have one cmsghdr object in your call to recvmsg, it is unclear whether you
> need to have a buffer as large as CMSG_SPACE or CMSG_LEN.  Historically,
> BSD-based platforms never had these macros and didn't return the ancillary
> data if the space provided by the application wasn't big enough.  Linux,
> *which has a bug*, won't copy more bytes than cmsg_len specifies when the
> application uses CMSG_LEN instead of CMSG_SPACE, but then lies to the
> application by overwriting msg_controllen.  Look in put_cmsg() in
> net/core/scm.c: "cmlen" is calculated using CM_LEN, and then msg_controllen
> is checked against it to make sure you don't overwrite the user's buffer.
> However, in recvmsg(), msg_controllen is overwritten by "msg_sys.msg_control
> - cmsg_ptr".  msg_control was incremented by CMSG_SPACE at the end of
> scm_detach_fds().
>
> The Linux kernel code seems to copy the right amount of data, even in the
> compat case, as far as I can tell.  The only bug I see is that
> msg_controllen returns from recvmsg() with the wrong value.

I don't see how the other behavior would be any less surprising.
Suppose you had a control message with CMSG_LEN=12 and CMSG_SPACE=16.
Then, with the semantics you want, one of them takes 12 bytes, two of
them take 28 bytes, three take 44 bytes, etc.

That seems considerably more surprising than having then take 16, 32,
and 48 bytes respectively.

--Andy

^ permalink raw reply

* [PATCH net-next] net: gro: allow to build full sized skb
From: Eric Dumazet @ 2013-10-08 16:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb,
typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags.

It's relatively easy to extend the skb using frag_list to allow
more frags to be appended into the last sk_buff.

This still builds very efficient skbs, and allows reaching 45 MSS per
skb.

(45 MSS GRO packet uses one skb plus a frag_list containing 2 additional
sk_buff)

High speed TCP flows benefit from this extension by lowering TCP stack
cpu usage (less packets stored in receive queue, less ACK packets
processed)

Forwarding setups could be hurt, as such skbs will need to be
linearized, although its not a new problem, as GRO could already
provide skbs with a frag_list.

We could make the 65536 bytes threshold a tunable to mitigate this.

(First time we need to linearize skb in skb_needs_linearize(), we could
lower the tunable to ~16*1460 so that following skb_gro_receive() calls
build smaller skbs)

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/skbuff.c |   43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d81cff1..8ead744 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2936,32 +2936,30 @@ EXPORT_SYMBOL_GPL(skb_segment);
 
 int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 {
-	struct sk_buff *p = *head;
-	struct sk_buff *nskb;
-	struct skb_shared_info *skbinfo = skb_shinfo(skb);
-	struct skb_shared_info *pinfo = skb_shinfo(p);
-	unsigned int headroom;
-	unsigned int len = skb_gro_len(skb);
+	struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
 	unsigned int offset = skb_gro_offset(skb);
 	unsigned int headlen = skb_headlen(skb);
+	struct sk_buff *nskb, *lp, *p = *head;
+	unsigned int len = skb_gro_len(skb);
 	unsigned int delta_truesize;
+	unsigned int headroom;
 
-	if (p->len + len >= 65536)
+	if (unlikely(p->len + len >= 65536))
 		return -E2BIG;
 
-	if (pinfo->frag_list)
-		goto merge;
-	else if (headlen <= offset) {
+	lp = NAPI_GRO_CB(p)->last ?: p;
+	pinfo = skb_shinfo(lp);
+
+	if (headlen <= offset) {
 		skb_frag_t *frag;
 		skb_frag_t *frag2;
 		int i = skbinfo->nr_frags;
 		int nr_frags = pinfo->nr_frags + i;
 
-		offset -= headlen;
-
 		if (nr_frags > MAX_SKB_FRAGS)
-			return -E2BIG;
+			goto merge;
 
+		offset -= headlen;
 		pinfo->nr_frags = nr_frags;
 		skbinfo->nr_frags = 0;
 
@@ -2992,7 +2990,7 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		unsigned int first_offset;
 
 		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
-			return -E2BIG;
+			goto merge;
 
 		first_offset = skb->data -
 			       (unsigned char *)page_address(page) +
@@ -3010,7 +3008,10 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		delta_truesize = skb->truesize - SKB_DATA_ALIGN(sizeof(struct sk_buff));
 		NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
 		goto done;
-	} else if (skb_gro_len(p) != pinfo->gso_size)
+	}
+	if (pinfo->frag_list)
+		goto merge;
+	if (skb_gro_len(p) != pinfo->gso_size)
 		return -E2BIG;
 
 	headroom = skb_headroom(p);
@@ -3062,16 +3063,24 @@ merge:
 
 	__skb_pull(skb, offset);
 
-	NAPI_GRO_CB(p)->last->next = skb;
+	if (!NAPI_GRO_CB(p)->last)
+		skb_shinfo(p)->frag_list = skb;
+	else
+		NAPI_GRO_CB(p)->last->next = skb;
 	NAPI_GRO_CB(p)->last = skb;
 	skb_header_release(skb);
+	lp = p;
 
 done:
 	NAPI_GRO_CB(p)->count++;
 	p->data_len += len;
 	p->truesize += delta_truesize;
 	p->len += len;
-
+	if (lp != p) {
+		lp->data_len += len;
+		lp->truesize += delta_truesize;
+		lp->len += len;
+	}
 	NAPI_GRO_CB(skb)->same_flow = 1;
 	return 0;
 }

^ permalink raw reply related

* [PATCH net] vti: get rid of nf mark rule in prerouting
From: Christophe Gouault @ 2013-10-08 15:21 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Cong Wang, Saurabh Mohan, Christophe Gouault

This patch fixes and improves the use of vti interfaces (while
lightly changing the way of configuring them).

Currently:

- it is necessary to identify and mark inbound IPsec
  packets destined to each vti interface, via netfilter rules in
  the mangle table at prerouting hook.

- the vti module cannot retrieve the right tunnel in input since
  commit b9959fd3: vti tunnels all have an i_key, but the tunnel lookup
  is done with flag TUNNEL_NO_KEY, so there no chance to retrieve them.

- the i_key is used by the outbound processing as a mark to lookup
  for the right SP and SA bundle.

This patch uses the o_key to store the vti mark (instead of i_key) and
enables:

- to avoid the need for previously marking the inbound skbuffs via a
  netfilter rule.
- to properly retrieve the right tunnel in input, only based on the IPsec
  packet outer addresses.
- to properly perform an inbound policy check (using the tunnel o_key
  as a mark).
- to properly perform an outbound SPD and SAD lookup (using the tunnel
  o_key as a mark).
- to keep the current mark of the skbuff. The skbuff mark is neither
  used nor changed by the vti interface. Only the vti interface o_key
  is used.

SAs have a wildcard mark.
SPs have a mark equal to the vti interface o_key.

The vti interface must be created as follows (i_key = 0, o_key = mark):

   ip link add vti1 mode vti local 1.1.1.1 remote 2.2.2.2 okey 1

The SPs attached to vti1 must be created as follows (mark = vti1 o_key):

   ip xfrm policy add dir out mark 1 tmpl src 1.1.1.1 dst 2.2.2.2 \
      proto esp mode tunnel
   ip xfrm policy add dir in  mark 1 tmpl src 2.2.2.2 dst 1.1.1.1 \
      proto esp mode tunnel

The SAs are created with the default wildcard mark. There is no
distinction between global vs. vti SAs. Just their addresses will
possibly link them to a vti interface:

   ip xfrm state add src 1.1.1.1 dst 2.2.2.2 proto esp spi 1000 mode tunnel \
                 enc "cbc(aes)" "azertyuiopqsdfgh"

   ip xfrm state add src 2.2.2.2 dst 1.1.1.1 proto esp spi 2000 mode tunnel \
                 enc "cbc(aes)" "sqbdhgqsdjqjsdfh"

To avoid matching "global" (not vti) SPs in vti interfaces, global SPs
should no use the default wildcard mark, but explicitly match mark 0.

To avoid a double SPD lookup in input and output (in global and vti SPDs),
the NOPOLICY and NOXFRM options should be set on the vti interfaces:

   echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_policy
   echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_xfrm

The outgoing traffic is steered to vti1 by a route via the vti interface:

   ip route add 192.168.0.0/16 dev vti1

The incoming IPsec traffic is steered to vti1 because its outer addresses
match the vti1 tunnel configuration.

Signed-off-by: Christophe Gouault <christophe.gouault@6wind.com>
---
This is is both a fix and enhancement patch. However, there are 2 ways
of fixing the inbound processing bug:
- either keep the current configuration model (ikey + netfilter rule)
  and change the tunnel lookup method. This patch would then be reverted
  by the enhancement (this sounds counterproductive).
- or directly change the configuration model (okey, no netfilter rule) and keep
  the current tunnel lookup method.

 net/ipv4/ip_vti.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index e805e7b..6e87f85 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -125,8 +125,17 @@ static int vti_rcv(struct sk_buff *skb)
 				  iph->saddr, iph->daddr, 0);
 	if (tunnel != NULL) {
 		struct pcpu_tstats *tstats;
+		u32 oldmark = skb->mark;
+		int ret;
 
-		if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
+
+		/* temporarily mark the skb with the tunnel o_key, to
+		 * only match policies with this mark.
+		 */
+		skb->mark = be32_to_cpu(tunnel->parms.o_key);
+		ret = xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb);
+		skb->mark = oldmark;
+		if (!ret)
 			return -1;
 
 		tstats = this_cpu_ptr(tunnel->dev->tstats);
@@ -135,7 +144,6 @@ static int vti_rcv(struct sk_buff *skb)
 		tstats->rx_bytes += skb->len;
 		u64_stats_update_end(&tstats->syncp);
 
-		skb->mark = 0;
 		secpath_reset(skb);
 		skb->dev = tunnel->dev;
 		return 1;
@@ -167,7 +175,7 @@ static netdev_tx_t vti_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	memset(&fl4, 0, sizeof(fl4));
 	flowi4_init_output(&fl4, tunnel->parms.link,
-			   be32_to_cpu(tunnel->parms.i_key), RT_TOS(tos),
+			   be32_to_cpu(tunnel->parms.o_key), RT_TOS(tos),
 			   RT_SCOPE_UNIVERSE,
 			   IPPROTO_IPIP, 0,
 			   dst, tiph->saddr, 0, 0);
-- 
1.7.10.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox