* Re: [PATCH] bpf: cgroup: annotate pointers in struct cgroup_bpf with __rcu
From: David Miller @ 2016-12-17 16:14 UTC (permalink / raw)
To: daniel; +Cc: ast, daniel, netdev
In-Reply-To: <20161215095321.10571-1-daniel@zonque.org>
From: Daniel Mack <daniel@zonque.org>
Date: Thu, 15 Dec 2016 10:53:21 +0100
> The member 'effective' in 'struct cgroup_bpf' is protected by RCU.
> Annotate it accordingly to squelch a sparse warning.
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
Applied.
^ permalink raw reply
* Re: [PATCH net-next 0/2] inet: Fixes for inet_csk_get_port and soreusport
From: David Miller @ 2016-12-17 16:13 UTC (permalink / raw)
To: tom; +Cc: netdev, kernel-team, jbacik, eric.dumazet, raigatgoog
In-Reply-To: <20161215005416.1561632-1-tom@herbertland.com>
From: Tom Herbert <tom@herbertland.com>
Date: Wed, 14 Dec 2016 16:54:14 -0800
> This patch set fixes a couple of issues I noticed while debugging our
> softlockup issue in inet_csk_get_port.
>
> - Don't allow jump into port scan in inet_csk_get_port if function
> was called with non-zero port number (looking up explicit port
> number).
> - When inet_csk_get_port is called with zero port number (ie. perform
> scan) an reuseport is set on the socket, don't match sockets that
> also have reuseport set. The intent from the user should be
> to get a new port number and then explictly bind other
> sockets to that number using soreuseport.
>
> Tested:
>
> Ran first patch on production workload with no ill effect.
>
> For second patch, ran a little listener application and first
> demonstrated that unbound sockets with soreuseport can indeed
> be bound to unrelated soreuseport sockets.
Series applied, thanks Tom.
^ permalink raw reply
* Re: [PATCH net] bpf, test_verifier: fix a test case error result on unprivileged
From: David Miller @ 2016-12-17 15:52 UTC (permalink / raw)
To: daniel; +Cc: ast, netdev
In-Reply-To: <639d61f73c907b704001ed2b115208998990eb38.1481762158.git.daniel@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 15 Dec 2016 01:39:10 +0100
> Running ./test_verifier as unprivileged lets 1 out of 98 tests fail:
>
> [...]
> #71 unpriv: check that printk is disallowed FAIL
> Unexpected error message!
> 0: (7a) *(u64 *)(r10 -8) = 0
> 1: (bf) r1 = r10
> 2: (07) r1 += -8
> 3: (b7) r2 = 8
> 4: (bf) r3 = r1
> 5: (85) call bpf_trace_printk#6
> unknown func bpf_trace_printk#6
> [...]
>
> The test case is correct, just that the error outcome changed with
> ebb676daa1a3 ("bpf: Print function name in addition to function id").
> Same as with e00c7b216f34 ("bpf: fix multiple issues in selftest suite
> and samples") issue 2), so just fix up the function name.
>
> Fixes: ebb676daa1a3 ("bpf: Print function name in addition to function id")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Applied.
^ permalink raw reply
* Re: [PATCH net] bpf: fix regression on verifier pruning wrt map lookups
From: David Miller @ 2016-12-17 15:51 UTC (permalink / raw)
To: daniel; +Cc: ast, jbacik, tgraf, netdev
In-Reply-To: <fb7031038adc3b04973c8484ba4e9d7aa250cb32.1481761253.git.daniel@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 15 Dec 2016 01:30:06 +0100
> Commit 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
> registers") introduced a regression where existing programs stopped
> loading due to reaching the verifier's maximum complexity limit,
> whereas prior to this commit they were loading just fine; the affected
> program has roughly 2k instructions.
>
> What was found is that state pruning couldn't be performed effectively
> anymore due to mismatches of the verifier's register state, in particular
> in the id tracking. It doesn't mean that 57a09bf0a416 is incorrect per
> se, but rather that verifier needs to perform a lot more work for the
> same program with regards to involved map lookups.
...
> Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Thomas Graf <tgraf@suug.ch>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
Applied.
^ permalink raw reply
* Re: [PATCH net] net: vrf: Drop conntrack data after pass through VRF device on Tx
From: David Miller @ 2016-12-17 15:48 UTC (permalink / raw)
To: dsa; +Cc: netdev
In-Reply-To: <1481754671-22519-1-git-send-email-dsa@cumulusnetworks.com>
From: David Ahern <dsa@cumulusnetworks.com>
Date: Wed, 14 Dec 2016 14:31:11 -0800
> Locally originated traffic in a VRF fails in the presence of a POSTROUTING
> rule. For example,
>
> $ iptables -t nat -A POSTROUTING -s 11.1.1.0/24 -j MASQUERADE
> $ ping -I red -c1 11.1.1.3
> ping: Warning: source address might be selected on device other than red.
> PING 11.1.1.3 (11.1.1.3) from 11.1.1.2 red: 56(84) bytes of data.
> ping: sendmsg: Operation not permitted
>
> Worse, the above causes random corruption resulting in a panic in random
> places (I have not seen a consistent backtrace).
>
> Call nf_reset to drop the conntrack info following the pass through the
> VRF device. The nf_reset is needed on Tx but not Rx because of the order
> in which NF_HOOK's are hit: on Rx the VRF device is after the real ingress
> device and on Tx it is is before the real egress device. Connection
> tracking should be tied to the real egress device and not the VRF device.
>
> Fixes: 8f58336d3f78a ("net: Add ethernet header for pass through VRF device")
> Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH net] net: vrf: Fix NAT within a VRF
From: David Miller @ 2016-12-17 15:46 UTC (permalink / raw)
To: dsa; +Cc: netdev
In-Reply-To: <1481742378-28165-1-git-send-email-dsa@cumulusnetworks.com>
From: David Ahern <dsa@cumulusnetworks.com>
Date: Wed, 14 Dec 2016 11:06:18 -0800
> Connection tracking with VRF is broken because the pass through the VRF
> device drops the connection tracking info. Removing the call to nf_reset
> allows DNAT and MASQUERADE to work across interfaces within a VRF.
>
> Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device")
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Applied and queued up for -stable, thanks David.
^ permalink raw reply
* Re: [PATCH net 0/2] net/sched: cls_flower: Fix mask handling
From: David Miller @ 2016-12-17 15:45 UTC (permalink / raw)
To: paulb; +Cc: netdev, jiri, ogerlitz, roid, shahark, hadarh
In-Reply-To: <1481734858-37474-1-git-send-email-paulb@mellanox.com>
From: Paul Blakey <paulb@mellanox.com>
Date: Wed, 14 Dec 2016 19:00:56 +0200
> The series fix how the mask is being handled.
I guess this is reasonable, series applied, thanks.
^ permalink raw reply
* Re: Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF
From: Theodore Ts'o @ 2016-12-17 15:41 UTC (permalink / raw)
To: kernel-hardening
Cc: Jason, linux, ak, davem, David.Laight, djb, ebiggers3, hannes,
jeanphilippe.aumasson, linux-crypto, linux-kernel, luto, netdev,
tom, torvalds, vegard.nossum
In-Reply-To: <20161217021503.32767.qmail@ns.sciencehorizons.net>
On Fri, Dec 16, 2016 at 09:15:03PM -0500, George Spelvin wrote:
> >> - Ted, Andy Lutorminski and I will try to figure out a construction of
> >> get_random_long() that we all like.
We don't have to find the most optimal solution right away; we can
approach this incrementally, after all.
So long as we replace get_random_{long,int}() with something which is
(a) strictly better in terms of security given today's use of MD5, and
(b) which is strictly *faster* than the current construction on 32-bit
and 64-bit systems, we can do that, and can try to make it be faster
while maintaining some minimum level of security which is sufficient
for all current users of get_random_{long,int}() and which can be
clearly artificulated for future users of get_random_{long,int}().
The main worry at this point I have is benchmarking siphash on a
32-bit system. It may be that simply batching the chacha20 output so
that we're using the urandom construction more efficiently is the
better way to go, since that *does* meet the criteron of strictly more
secure and strictly faster than the current MD5 solution. I'm open to
using siphash, but I want to see the the 32-bit numbers first.
As far as half-siphash is concerned, it occurs to me that the main
problem will be those users who need to guarantee that output can't be
guessed over a long period of time. For example, if you have a
long-running process, then the output needs to remain unguessable over
potentially months or years, or else you might be weakening the ASLR
protections. If on the other hand, the hash table or the process will
be going away in a matter of seconds or minutes, the requirements with
respect to cryptographic strength go down significantly.
Now, maybe this doesn't matter that much if we can guarantee (or make
assumptions) that the attacker doesn't have unlimited access the
output stream of get_random_{long,int}(), or if it's being used in an
anti-DOS use case where it ultimately only needs to be harder than
alternate ways of attacking the system.
Rekeying every five minutes doesn't necessarily help the with respect
to ASLR, but it might reduce the amount of the output stream that
would be available to the attacker in order to be able to attack the
get_random_{long,int}() generator, and it also reduces the value of
doing that attack to only compromising the ASLR for those processes
started within that five minute window.
Cheers,
- Ted
P.S. I'm using ASLR as an example use case, above; of course we will
need to make similar eximainations of the other uses of
get_random_{long,int}().
P.P.S. We might also want to think about potentially defining
get_random_{long,int}() to be unambiguously strong, and then creating
a get_weak_random_{long,int}() which on platforms where performance
might be a consideration, it uses a weaker algorithm perhaps with some
kind of rekeying interval.
^ permalink raw reply
* Re: ipv6: handle -EFAULT from skb_copy_bits
From: David Miller @ 2016-12-17 15:41 UTC (permalink / raw)
To: davej; +Cc: netdev
In-Reply-To: <20161214154729.g4yg4mxcgkvpe5w7@codemonkey.org.uk>
From: Dave Jones <davej@codemonkey.org.uk>
Date: Wed, 14 Dec 2016 10:47:29 -0500
> It seems to be possible to craft a packet for sendmsg that triggers
> the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:
>
> RIP: 0010:[<ffffffff817c6390>] [<ffffffff817c6390>] rawv6_sendmsg+0xc30/0xc40
> RSP: 0018:ffff881f6c4a7c18 EFLAGS: 00010282
> RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002
> RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00
> RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009
> R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030
> R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80
>
> Call Trace:
> [<ffffffff8118ba23>] ? unmap_page_range+0x693/0x830
> [<ffffffff81772697>] inet_sendmsg+0x67/0xa0
> [<ffffffff816d93f8>] sock_sendmsg+0x38/0x50
> [<ffffffff816d982f>] SYSC_sendto+0xef/0x170
> [<ffffffff816da27e>] SyS_sendto+0xe/0x10
> [<ffffffff81002910>] do_syscall_64+0x50/0xa0
> [<ffffffff817f7cbc>] entry_SYSCALL64_slow_path+0x25/0x25
>
> Handle this in rawv6_push_pending_frames and jump to the failure path.
>
> Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Hmmm, that's interesting. Becaue the code in __ip6_append_data(), which
sets up the ->cork.base.length value, seems to be defensively trying to
avoid this possibility.
For example, it checks things like:
if (cork->length + length > mtu - headersize && ipc6->dontfrag &&
(sk->sk_protocol == IPPROTO_UDP ||
sk->sk_protocol == IPPROTO_RAW)) {
This is why the transport offset plus the length should never exceed
the total length for that skb_copy_bits() call.
Perhaps this protocol check in the code above is incomplete? Do you
know what the sk->sk_protocol value was when that BUG triggered? That
might shine some light on what is really happening here.
Thanks.
^ permalink raw reply
* Re: [PATCH] net: mvpp2: fix dma unmapping of TX buffers for fragments
From: Thomas Petazzoni @ 2016-12-17 15:26 UTC (permalink / raw)
To: David Miller
Cc: netdev, linux-arm-kernel, jason, andrew, sebastian.hesselbarth,
gregory.clement, mw, stefanc, nadavh, hannah, yehuday,
raphael.glon, stable
In-Reply-To: <20161217.102057.1497114610120684110.davem@davemloft.net>
Hello,
On Sat, 17 Dec 2016 10:20:57 -0500 (EST), David Miller wrote:
> You're really destroying cache locality, and making things overly
> complicated, by having two arrays. Actually this is now the third in
> this structure alone. That's crazy.
>
> Just have one array for the TX ring software state:
>
> struct tx_buff_info {
> struct sk_buff *skb;
> dma_addr_t dma_addr;
> unsigned int size;
> };
>
> Then in the per-cpu TX struct:
>
> struct tx_buff_info *info;
>
> This way every data access by the cpu for processing a ring entry
> will be localized, increasing cache hit rates.
>
> This also significantly simplifies the code that allocates and
> frees this memory.
Yes, I was thinking of moving towards a single array, as it's indeed
crazy to have three arrays for that. However, since it's a fix going
into stable, I also wanted to keep it as simple/straightforward as
possible and avoid refactoring other parts of the code.
If you however believe moving to one array should be done as part of
the fix, I'll do this.
Thanks for your feedback,
Thomas Petazzoni
--
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
^ permalink raw reply
* Re: [PATCH v3] net: macb: Added PCI wrapper for Platform Driver.
From: David Miller @ 2016-12-17 15:24 UTC (permalink / raw)
To: bfolta
Cc: nicolas.ferre, niklas.cassel, alexandre.torgue, satananda.burla,
rvatsavayi, simon.horman, linux-kernel, netdev, rafalo
In-Reply-To: <SN1PR0701MB19516D29452ABA2695B2F061CC9A0@SN1PR0701MB1951.namprd07.prod.outlook.com>
From: Bartosz Folta <bfolta@cadence.com>
Date: Wed, 14 Dec 2016 06:39:15 +0000
> There are hardware PCI implementations of Cadence GEM network
> controller. This patch will allow to use such hardware with reuse of
> existing Platform Driver.
>
> Signed-off-by: Bartosz Folta <bfolta@cadence.com>
> ---
> Changed in v3:
> Fixed dependencies in Kconfig.
> ---
> Changed in v2:
> Respin to net-next. Changed patch formatting.
Applied.
^ permalink raw reply
* Re: [PATCH net] ibmveth: calculate gso_segs for large packets
From: David Miller @ 2016-12-17 15:23 UTC (permalink / raw)
To: tlfalcon
Cc: netdev, brking, marcelo.leitner, pradeeps, jmaxwell37, zdai,
eric.dumazet
In-Reply-To: <1481674509-14256-1-git-send-email-tlfalcon@linux.vnet.ibm.com>
From: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Date: Tue, 13 Dec 2016 18:15:09 -0600
> Include calculations to compute the number of segments
> that comprise an aggregated large packet.
>
> Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Applied.
^ permalink raw reply
* Re: [PATCH] [v2] net: qcom/emac: don't try to claim clocks on ACPI systems
From: David Miller @ 2016-12-17 15:23 UTC (permalink / raw)
To: timur; +Cc: f.fainelli, netdev, cov, alokc
In-Reply-To: <1481672942-20024-1-git-send-email-timur@codeaurora.org>
From: Timur Tabi <timur@codeaurora.org>
Date: Tue, 13 Dec 2016 17:49:02 -0600
> On ACPI systems, clocks are not available to drivers directly. They are
> handled exclusively by ACPI and/or firmware, so there is no clock driver.
> Calls to clk_get() always fail, so we should not even attempt to claim
> any clocks on ACPI systems.
>
> Signed-off-by: Timur Tabi <timur@codeaurora.org>
> ---
>
> Notes:
> v2: move check into functions
Applied, thanks.
^ permalink raw reply
* Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF
From: George Spelvin @ 2016-12-17 15:21 UTC (permalink / raw)
To: tom
Cc: ak, davem, David.Laight, djb, ebiggers3, hannes, Jason,
jeanphilippe.aumasson, kernel-hardening, linux-crypto,
linux-kernel, linux, luto, netdev, torvalds, tytso, vegard.nossum
In-Reply-To: <CALx6S351VFRZmEQphRQy6YtmZYPnOtTN7=XiNrJmhWJGv4HUBg@mail.gmail.com>
To follow up on my comments that your benchmark results were peculiar,
here's my benchmark code.
It just computes the hash of all n*(n+1)/2 possible non-empty substrings
of a buffer of n (called "max" below) bytes. "cpb" is "cycles per byte".
(The average length is (n+2)/3, c.f. https://oeis.org/A000292)
On x86-32, HSipHash is asymptotically twice the speed of SipHash,
rising to 2.5x for short strings:
SipHash/HSipHash benchmark, sizeof(long) = 4
SipHash: max= 4 cycles= 10495 cpb=524.7500 (sum=47a4f5554869fa97)
HSipHash: max= 4 cycles= 3400 cpb=170.0000 (sum=146a863e)
SipHash: max= 8 cycles= 24468 cpb=203.9000 (sum=21c41a86355affcc)
HSipHash: max= 8 cycles= 9237 cpb= 76.9750 (sum=d3b5e0cd)
SipHash: max= 16 cycles= 94622 cpb=115.9583 (sum=26d816b72721e48f)
HSipHash: max= 16 cycles= 34499 cpb= 42.2782 (sum=16bb7475)
SipHash: max= 32 cycles= 418767 cpb= 69.9811 (sum=dd5a97694b8a832d)
HSipHash: max= 32 cycles= 156695 cpb= 26.1857 (sum=eed00fcb)
SipHash: max= 64 cycles= 2119152 cpb= 46.3101 (sum=a2a725aecc09ed00)
HSipHash: max= 64 cycles= 1008678 cpb= 22.0428 (sum=99b9f4f)
SipHash: max= 128 cycles= 12728659 cpb= 35.5788 (sum=420878cd20272817)
HSipHash: max= 128 cycles= 5452931 cpb= 15.2419 (sum=f1f4ad18)
SipHash: max= 256 cycles= 38931946 cpb= 13.7615 (sum=e05dfb28b90dfd98)
HSipHash: max= 256 cycles= 13807312 cpb= 4.8805 (sum=ceeafcc1)
SipHash: max= 512 cycles= 205537380 cpb= 9.1346 (sum=7d129d4de145fbea)
HSipHash: max= 512 cycles= 103420960 cpb= 4.5963 (sum=7f15a313)
SipHash: max=1024 cycles=1540259472 cpb= 8.5817 (sum=cca7cbdc778ca8af)
HSipHash: max=1024 cycles= 796090824 cpb= 4.4355 (sum=d8f3374f)
On x86-64, SipHash is consistently faster, asymptotically approaching 2x
for long strings:
SipHash/HSipHash benchmark, sizeof(long) = 8
SipHash: max= 4 cycles= 2642 cpb=132.1000 (sum=47a4f5554869fa97)
HSipHash: max= 4 cycles= 2498 cpb=124.9000 (sum=146a863e)
SipHash: max= 8 cycles= 5270 cpb= 43.9167 (sum=21c41a86355affcc)
HSipHash: max= 8 cycles= 7140 cpb= 59.5000 (sum=d3b5e0cd)
SipHash: max= 16 cycles= 19950 cpb= 24.4485 (sum=26d816b72721e48f)
HSipHash: max= 16 cycles= 23546 cpb= 28.8554 (sum=16bb7475)
SipHash: max= 32 cycles= 80188 cpb= 13.4004 (sum=dd5a97694b8a832d)
HSipHash: max= 32 cycles= 101218 cpb= 16.9148 (sum=eed00fcb)
SipHash: max= 64 cycles= 373286 cpb= 8.1575 (sum=a2a725aecc09ed00)
HSipHash: max= 64 cycles= 535568 cpb= 11.7038 (sum=99b9f4f)
SipHash: max= 128 cycles= 2075224 cpb= 5.8006 (sum=420878cd20272817)
HSipHash: max= 128 cycles= 3336820 cpb= 9.3270 (sum=f1f4ad18)
SipHash: max= 256 cycles= 14276278 cpb= 5.0463 (sum=e05dfb28b90dfd98)
HSipHash: max= 256 cycles= 28847880 cpb= 10.1970 (sum=ceeafcc1)
SipHash: max= 512 cycles= 50135180 cpb= 2.2281 (sum=7d129d4de145fbea)
HSipHash: max= 512 cycles= 86145916 cpb= 3.8286 (sum=7f15a313)
SipHash: max=1024 cycles= 334111900 cpb= 1.8615 (sum=cca7cbdc778ca8af)
HSipHash: max=1024 cycles= 640432452 cpb= 3.5682 (sum=d8f3374f)
Here's the code; compile with -DSELFTEST. (The main purpose of
printing the sum is to prevent dead code elimination.)
#if SELFTEST
#include <stdint.h>
#include <stdlib.h>
static inline uint64_t rol64(uint64_t word, unsigned int shift)
{
return word << shift | word >> (64 - shift);
}
static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
return word << shift | word >> (32 - shift);
}
static inline uint64_t get_unaligned_le64(void const *p)
{
return *(uint64_t const *)p;
}
static inline uint32_t get_unaligned_le32(void const *p)
{
return *(uint32_t const *)p;
}
static inline uint64_t le64_to_cpup(uint64_t const *p)
{
return *p;
}
static inline uint32_t le32_to_cpup(uint32_t const *p)
{
return *p;
}
#else
#include <linux/bitops.h> /* For rol64 */
#include <linux/cryptohash.h>
#include <asm/byteorder.h>
#include <asm/unaligned.h>
#endif
/* The basic ARX mixing function, taken from Skein */
#define SIP_MIX(a, b, s) ((a) += (b), (b) = rol64(b, s), (b) ^= (a))
/*
* The complete SipRound. Note that, when unrolled twice like below,
* the 32-bit rotates drop out on 32-bit machines.
*/
#define SIP_ROUND(a, b, c, d) \
(SIP_MIX(a, b, 13), SIP_MIX(c, d, 16), (a) = rol64(a, 32), \
SIP_MIX(c, b, 17), SIP_MIX(a, d, 21), (c) = rol64(c, 32))
/*
* This is rolled up more than most implementations, resulting in about
* 55% the code size. Speed is a few precent slower. A crude benchmark
* (for (i=1; i <= max; i++) for (j = 0; j < 4096-i; j++) hash(buf+j, i);)
* produces the following timings (in usec):
*
* i386 i386 i386 x86_64 x86_64 x86_64 x86_64
* Length small unroll halfmd4 small unroll halfmd4 teahash
* 1..4 1069 1029 1608 195 160 399 690
* 1..8 2483 2381 3851 410 360 988 1659
* 1..12 4303 4152 6207 690 618 1642 2690
* 1..16 6122 5931 8668 968 876 2363 3786
* 1..20 8348 8137 11245 1323 1185 3162 5567
* 1..24 10580 10327 13935 1657 1504 4066 7635
* 1..28 13211 12956 16803 2069 1871 5028 9759
* 1..32 15843 15572 19725 2470 2260 6084 11932
* 1..36 18864 18609 24259 2934 2678 7566 14794
* 1..1024 5890194 6130242 10264816 881933 881244 3617392 7589036
*
* The performance penalty is quite minor, decreasing for long strings,
* and it's significantly faster than half_md4, so I'm going for the
* I-cache win.
*/
uint64_t
siphash24(char const *in, size_t len, uint32_t const seed[4])
{
uint64_t a = 0x736f6d6570736575; /* somepseu */
uint64_t b = 0x646f72616e646f6d; /* dorandom */
uint64_t c = 0x6c7967656e657261; /* lygenera */
uint64_t d = 0x7465646279746573; /* tedbytes */
uint64_t m = 0;
uint8_t padbyte = len;
m = seed[2] | (uint64_t)seed[3] << 32;
b ^= m;
d ^= m;
m = seed[0] | (uint64_t)seed[1] << 32;
/* a ^= m; is done in loop below */
c ^= m;
/*
* By using the same SipRound code for all iterations, we
* save space, at the expense of some branch prediction. But
* branch prediction is hard because of variable length anyway.
*/
len = len/8 + 3; /* Now number of rounds to perform */
do {
a ^= m;
switch (--len) {
unsigned bytes;
default: /* Full words */
d ^= m = get_unaligned_le64(in);
in += 8;
break;
case 2: /* Final partial word */
/*
* We'd like to do one 64-bit fetch rather than
* mess around with bytes, but reading past the end
* might hit a protection boundary. Fortunately,
* we know that protection boundaries are aligned,
* so we can consider only three cases:
* - The remainder occupies zero words
* - The remainder fits into one word
* - The remainder straddles two words
*/
bytes = padbyte & 7;
if (bytes == 0) {
m = 0;
} else {
unsigned offset = (unsigned)(uintptr_t)in & 7;
if (offset + bytes <= 8) {
m = le64_to_cpup((uint64_t const *)
(in - offset));
m >>= 8*offset;
} else {
m = get_unaligned_le64(in);
}
m &= ((uint64_t)1 << 8*bytes) - 1;
}
/* Could use | or +, but ^ allows associativity */
d ^= m ^= (uint64_t)padbyte << 56;
break;
case 1: /* Beginning of finalization */
m = 0;
c ^= 0xff;
/*FALLTHROUGH*/
case 0: /* Second half of finalization */
break;
}
SIP_ROUND(a, b, c, d);
SIP_ROUND(a, b, c, d);
} while (len);
return a ^ b ^ c ^ d;
}
#undef SIP_ROUND
#undef SIP_MIX
#define HSIP_MIX(a, b, s) ((a) += (b), (b) = rol32(b, s), (b) ^= (a))
/*
* These are the PRELIMINARY rotate constants suggested by
* Jean-Philippe Aumasson. Update to final when available.
*/
#define HSIP_ROUND(a, b, c, d) \
(HSIP_MIX(a, b, 5), HSIP_MIX(c, d, 8), (a) = rol32(a, 16), \
HSIP_MIX(c, b, 7), HSIP_MIX(a, d, 13), (c) = rol32(c, 16))
uint32_t
hsiphash24(char const *in, size_t len, uint32_t const key[2])
{
uint32_t c = key[0];
uint32_t d = key[1];
uint32_t a = 0x6c796765 ^ 0x736f6d65;
uint32_t b = d ^ 0x74656462 ^ 0x646f7261;
uint32_t m = c;
uint8_t padbyte = len;
/*
* By using the same SipRound code for all iterations, we
* save space, at the expense of some branch prediction. But
* branch prediction is hard because of variable length anyway.
*/
len = len/sizeof(m) + 3; /* Now number of rounds to perform */
do {
a ^= m;
switch (--len) {
unsigned bytes;
default: /* Full words */
d ^= m = get_unaligned_le32(in);
in += sizeof(m);
break;
case 2: /* Final partial word */
/*
* We'd like to do one 32-bit fetch rather than
* mess around with bytes, but reading past the end
* might hit a protection boundary. Fortunately,
* we know that protection boundaries are aligned,
* so we can consider only three cases:
* - The remainder occupies zero words
* - The remainder fits into one word
* - The remainder straddles two words
*/
bytes = padbyte & 3;
if (bytes == 0) {
m = 0;
} else {
unsigned offset = (unsigned)(uintptr_t)in & 3;
if (offset + bytes <= 4) {
m = le32_to_cpup((uint32_t const *)
(in - offset));
m >>= 8*offset;
} else {
m = get_unaligned_le32(in);
}
m &= ((uint32_t)1 << 8*bytes) - 1;
}
/* Could use | or +, but ^ allows associativity */
d ^= m ^= (uint32_t)padbyte << 24;
break;
case 1: /* Beginning of finalization */
m = 0;
c ^= 0xff;
/*FALLTHROUGH*/
case 0: /* Second half of finalization */
break;
}
HSIP_ROUND(a, b, c, d);
HSIP_ROUND(a, b, c, d);
} while (len);
return a ^ b ^ c ^ d;
// return c + d;
}
#undef HSIP_ROUND
#undef HSIP_MIX
/*
* No objection to EXPORT_SYMBOL, but we should probably figure out
* how the seed[] array should work first. Homework for the first
* person to want to call it from a module!
*/
#if SELFTEST
#include <stdio.h>
static uint64_t rdtsc()
{
uint32_t eax, edx;
asm volatile ("rdtsc" : "=a" (eax), "=d" (edx));
return (uint64_t)edx << 32 | eax;
}
int
main(void)
{
static char const buf[1024] = { 0 };
unsigned max;
static const uint32_t key[4] = { 1, 2, 3, 4 };
printf("SipHash/HSipHash benchmark, sizeof(long) = %u\n",
(unsigned)sizeof(long));
for (unsigned max = 4; max <= 1024; max *= 2) {
uint64_t sum1 = 0;
uint32_t sum2 = 0;
uint64_t cycles;
uint32_t bytes = 0;
/* A less lazy person could figure out the closed form */
for (int i = 1; i <= max; i++)
bytes += i * (max + 1 - i);
cycles = rdtsc();
for (int i = 1; i <= max; i++)
for (int j = 0; j <= max-i; j++)
sum1 += siphash24(buf+j, i, key);
cycles = rdtsc() - cycles;
printf(" SipHash: max=%4u cycles=%10llu cpb=%8.4f (sum=%llx)\n",
max, cycles, (double)cycles/bytes, sum1);
cycles = rdtsc();
for (int i = 1; i <= max; i++)
for (int j = 0; j <= max-i; j++)
sum2 += hsiphash24(buf+j, i, key);
cycles = rdtsc() - cycles;
printf("HSipHash: max=%4u cycles=%10llu cpb=%8.4f (sum=%lx)\n",
max, cycles, (double)cycles/bytes, sum2);
}
return 0;
}
#endif
^ permalink raw reply
* Re: [PATCH] net: mvpp2: fix dma unmapping of TX buffers for fragments
From: David Miller @ 2016-12-17 15:20 UTC (permalink / raw)
To: thomas.petazzoni
Cc: netdev, linux-arm-kernel, jason, andrew, sebastian.hesselbarth,
gregory.clement, mw, stefanc, nadavh, hannah, yehuday,
raphael.glon, stable
In-Reply-To: <1481647995-7213-1-git-send-email-thomas.petazzoni@free-electrons.com>
From: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Date: Tue, 13 Dec 2016 17:53:15 +0100
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index 1026c45..d168b13 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -791,6 +791,8 @@ struct mvpp2_txq_pcpu {
> /* Array of transmitted buffers' physical addresses */
> dma_addr_t *tx_buffs;
>
> + size_t *tx_data_size;
> +
You're really destroying cache locality, and making things overly
complicated, by having two arrays. Actually this is now the third in
this structure alone. That's crazy.
Just have one array for the TX ring software state:
struct tx_buff_info {
struct sk_buff *skb;
dma_addr_t dma_addr;
unsigned int size;
};
Then in the per-cpu TX struct:
struct tx_buff_info *info;
This way every data access by the cpu for processing a ring entry
will be localized, increasing cache hit rates.
This also significantly simplifies the code that allocates and
frees this memory.
^ permalink raw reply
* Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF
From: Jeffrey Walton @ 2016-12-17 14:55 UTC (permalink / raw)
To: Jason A. Donenfeld
Cc: Netdev, kernel-hardening, LKML, linux-crypto, David Laight,
Ted Tso, Hannes Frederic Sowa, Linus Torvalds, Eric Biggers,
Tom Herbert, George Spelvin, Vegard Nossum, ak, davem, luto,
Jean-Philippe Aumasson, Daniel J . Bernstein
In-Reply-To: <20161215203003.31989-2-Jason@zx2c4.com>
> diff --git a/lib/test_siphash.c b/lib/test_siphash.c
> new file mode 100644
> index 000000000000..93549e4e22c5
> --- /dev/null
> +++ b/lib/test_siphash.c
> @@ -0,0 +1,83 @@
> +/* Test cases for siphash.c
> + *
> + * Copyright (C) 2016 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + *
> + * This file is provided under a dual BSD/GPLv2 license.
> + *
> + * SipHash: a fast short-input PRF
> + * https://131002.net/siphash/
> + *
> + * This implementation is specifically for SipHash2-4.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/siphash.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +#include <linux/errno.h>
> +#include <linux/module.h>
> +
> +/* Test vectors taken from official reference source available at:
> + * https://131002.net/siphash/siphash24.c
> + */
> +static const u64 test_vectors[64] = {
> + 0x726fdb47dd0e0e31ULL, 0x74f839c593dc67fdULL, 0x0d6c8009d9a94f5aULL,
> + 0x85676696d7fb7e2dULL, 0xcf2794e0277187b7ULL, 0x18765564cd99a68dULL,
> + 0xcbc9466e58fee3ceULL, 0xab0200f58b01d137ULL, 0x93f5f5799a932462ULL,
> + 0x9e0082df0ba9e4b0ULL, 0x7a5dbbc594ddb9f3ULL, 0xf4b32f46226bada7ULL,
> + 0x751e8fbc860ee5fbULL, 0x14ea5627c0843d90ULL, 0xf723ca908e7af2eeULL,
> + 0xa129ca6149be45e5ULL, 0x3f2acc7f57c29bdbULL, 0x699ae9f52cbe4794ULL,
> + 0x4bc1b3f0968dd39cULL, 0xbb6dc91da77961bdULL, 0xbed65cf21aa2ee98ULL,
> + 0xd0f2cbb02e3b67c7ULL, 0x93536795e3a33e88ULL, 0xa80c038ccd5ccec8ULL,
> + 0xb8ad50c6f649af94ULL, 0xbce192de8a85b8eaULL, 0x17d835b85bbb15f3ULL,
> + 0x2f2e6163076bcfadULL, 0xde4daaaca71dc9a5ULL, 0xa6a2506687956571ULL,
> + 0xad87a3535c49ef28ULL, 0x32d892fad841c342ULL, 0x7127512f72f27cceULL,
> + 0xa7f32346f95978e3ULL, 0x12e0b01abb051238ULL, 0x15e034d40fa197aeULL,
> + 0x314dffbe0815a3b4ULL, 0x027990f029623981ULL, 0xcadcd4e59ef40c4dULL,
> + 0x9abfd8766a33735cULL, 0x0e3ea96b5304a7d0ULL, 0xad0c42d6fc585992ULL,
> + 0x187306c89bc215a9ULL, 0xd4a60abcf3792b95ULL, 0xf935451de4f21df2ULL,
> + 0xa9538f0419755787ULL, 0xdb9acddff56ca510ULL, 0xd06c98cd5c0975ebULL,
> + 0xe612a3cb9ecba951ULL, 0xc766e62cfcadaf96ULL, 0xee64435a9752fe72ULL,
> + 0xa192d576b245165aULL, 0x0a8787bf8ecb74b2ULL, 0x81b3e73d20b49b6fULL,
> + 0x7fa8220ba3b2eceaULL, 0x245731c13ca42499ULL, 0xb78dbfaf3a8d83bdULL,
> + 0xea1ad565322a1a0bULL, 0x60e61c23a3795013ULL, 0x6606d7e446282b93ULL,
> + 0x6ca4ecb15c5f91e1ULL, 0x9f626da15c9625f3ULL, 0xe51b38608ef25f57ULL,
> + 0x958a324ceb064572ULL
> +};
> +static const siphash_key_t test_key =
> + { 0x0706050403020100ULL , 0x0f0e0d0c0b0a0908ULL };
> +
> +static int __init siphash_test_init(void)
> +{
> + u8 in[64] __aligned(SIPHASH_ALIGNMENT);
> + u8 in_unaligned[65];
> + u8 i;
> + int ret = 0;
> +
> + for (i = 0; i < 64; ++i) {
> + in[i] = i;
> + in_unaligned[i + 1] = i;
> + if (siphash(in, i, test_key) != test_vectors[i]) {
> + pr_info("self-test aligned %u: FAIL\n", i + 1);
> + ret = -EINVAL;
> + }
> + if (siphash_unaligned(in_unaligned + 1, i, test_key) != test_vectors[i]) {
> + pr_info("self-test unaligned %u: FAIL\n", i + 1);
> + ret = -EINVAL;
> + }
> + }
> + if (!ret)
> + pr_info("self-tests: pass\n");
> + return ret;
> +}
> +
> +static void __exit siphash_test_exit(void)
> +{
> +}
> +
> +module_init(siphash_test_init);
> +module_exit(siphash_test_exit);
> +
> +MODULE_AUTHOR("Jason A. Donenfeld <Jason@zx2c4.com>");
> +MODULE_LICENSE("Dual BSD/GPL");
> --
> 2.11.0
>
I believe the output of SipHash depends upon endianness. Folks who
request a digest through the af_alg interface will likely expect a
byte array.
I think that means on little endian machines, values like element 0
must be reversed byte reversed:
0x726fdb47dd0e0e31ULL => 31,0e,0e,dd,47,db,6f,72
If I am not mistaken, that value (and other tv's) are returned here:
return (v0 ^ v1) ^ (v2 ^ v3);
It may be prudent to include the endian reversal in the test to ensure
big endian machines produce expected results. Some closely related
testing on an old Apple PowerMac G5 revealed that result needed to be
reversed before returning it to a caller.
Jeff
^ permalink raw reply
* DO YOU NEED A LOAN??
From: bancoleite @ 2016-12-17 14:48 UTC (permalink / raw)
To: Recipients
LOAN
^ permalink raw reply
* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-17 13:26 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Tom Herbert, Craig Gallek, Eric Dumazet,
Linux Kernel Network Developers
In-Reply-To: <e32b5375-38fb-1fd8-d97a-bf740e368205@stressinduktion.org>
> On Dec 17, 2016, at 6:09 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
>
>> On 16.12.2016 23:50, Josef Bacik wrote:
>>> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:
>>>>> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>>>
>>>>>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>>>>
>>>>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>>>> <hannes@stressinduktion.org> wrote:
>>>>>>>
>>>>>>> Hi Josef,
>>>>>>>
>>>>>>>> On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>>>>
>>>>>>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
>>>>>>>>> <kraigatgoog@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>>>>>>>>>> <tom@herbertland.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think there may be some suspicious code in
>>>>>>>>>>> inet_csk_get_port. At
>>>>>>>>>>> tb_found there is:
>>>>>>>>>>>
>>>>>>>>>>> if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>>>> (tb->fastreuseport > 0 &&
>>>>>>>>>>>
>>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>> sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>>>>>> uid))) &&
>>>>>>>>>>> smallest_size == -1)
>>>>>>>>>>> goto success;
>>>>>>>>>>> if
>>>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>>>> tb, true)) {
>>>>>>>>>>> if ((reuse ||
>>>>>>>>>>> (tb->fastreuseport > 0 &&
>>>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>>>
>>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>> uid_eq(tb->fastuid, uid))) &&
>>>>>>>>>>> smallest_size != -1 &&
>>>>>>>>>>> --attempts >=
>>>>>>>>>>> 0) {
>>>>>>>>>>> spin_unlock_bh(&head->lock);
>>>>>>>>>>> goto again;
>>>>>>>>>>> }
>>>>>>>>>>> goto fail_unlock;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> AFAICT there is redundancy in these two conditionals. The
>>>>>>>>>>> same
>>>>>>>>>>> clause
>>>>>>>>>>> is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>>>>>>>>>>> is true
>>>>>>>>>>> the
>>>>>>>>>>> first conditional should be hit, goto done, and the second
>>>>>>>>>>> will
>>>>>>>>>>> never
>>>>>>>>>>> evaluate that part to true-- unless the sk is changed (do
>>>>>>>>>>> we need
>>>>>>>>>>> READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>>>>
>>>>>>>>>> That's an interesting point... It looks like this function also
>>>>>>>>>> changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>>>> beginning
>>>>>>>>>> with several spin_lock(&head->lock) to exclusively
>>>>>>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps
>>>>>>>>>> the full
>>>>>>>>>> bh
>>>>>>>>>> disable variant was preventing the timers in your stack
>>>>>>>>>> trace from
>>>>>>>>>> running interleaved with this function before?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Could be, although dropping the lock shouldn't be able to
>>>>>>>>> affect the
>>>>>>>>> search state. TBH, I'm a little lost in reading function, the
>>>>>>>>> SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
>>>>>>>>> that
>>>>>>>>> function and also in every call to inet_csk_bind_conflict. I
>>>>>>>>> wonder
>>>>>>>>> if
>>>>>>>>> we can simply this under the assumption that SO_REUSEPORT is only
>>>>>>>>> allowed if the port number (snum) is explicitly specified.
>>>>>>>>
>>>>>>>>
>>>>>>>> Ok first I have data for you Hannes, here's the time distributions
>>>>>>>> before during and after the lockup (with all the debugging in
>>>>>>>> place
>>>>>>>> the
>>>>>>>> box eventually recovers). I've attached it as a text file
>>>>>>>> since it is
>>>>>>>> long.
>>>>>>>
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>>
>>>>>>>> Second is I was thinking about why we would spend so much time
>>>>>>>> doing
>>>>>>>> the
>>>>>>>> ->owners list, and obviously it's because of the massive amount of
>>>>>>>> timewait sockets on the owners list. I wrote the following
>>>>>>>> dumb patch
>>>>>>>> and tested it and the problem has disappeared completely. Now
>>>>>>>> I don't
>>>>>>>> know if this is right at all, but I thought it was weird we
>>>>>>>> weren't
>>>>>>>> copying the soreuseport option from the original socket onto
>>>>>>>> the twsk.
>>>>>>>> Is there are reason we aren't doing this currently? Does this
>>>>>>>> help
>>>>>>>> explain what is happening? Thanks,
>>>>>>>
>>>>>>>
>>>>>>> The patch is interesting and a good clue, but I am immediately a bit
>>>>>>> concerned that we don't copy/tag the socket with the uid also to
>>>>>>> keep
>>>>>>> the security properties for SO_REUSEPORT. I have to think a bit more
>>>>>>> about this.
>>>>>>>
>>>>>>> We have seen hangs during connect. I am afraid this patch
>>>>>>> wouldn't help
>>>>>>> there while also guaranteeing uniqueness.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Yeah so I looked at the code some more and actually my patch is
>>>>>> really
>>>>>> bad. If sk2->sk_reuseport is set we'll look at
>>>>>> sk2->sk_reuseport_cb, which
>>>>>> is outside of the timewait sock, so that's definitely bad.
>>>>>>
>>>>>> But we should at least be setting it to 0 so that we don't do this
>>>>>> normally. Unfortunately simply setting it to 0 doesn't fix the
>>>>>> problem. So
>>>>>> for some reason having ->sk_reuseport set to 1 on a timewait
>>>>>> socket makes
>>>>>> this problem non-existent, which is strange.
>>>>>>
>>>>>> So back to the drawing board I guess. I wonder if doing what craig
>>>>>> suggested and batching the timewait timer expires so it hurts less
>>>>>> would
>>>>>> accomplish the same results. Thanks,
>>>>>
>>>>>
>>>>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This
>>>>> is the
>>>>> code
>>>>>
>>>>> if ((!reuse || !sk2->sk_reuse ||
>>>>> sk2->sk_state == TCP_LISTEN) &&
>>>>> (!reuseport || !sk2->sk_reuseport ||
>>>>> rcu_access_pointer(sk->sk_reuseport_cb) ||
>>>>> (sk2->sk_state != TCP_TIME_WAIT &&
>>>>> !uid_eq(uid, sock_i_uid(sk2))))) {
>>>>>
>>>>> if (!sk2->sk_rcv_saddr ||
>>>>> !sk->sk_rcv_saddr
>>>>> ||
>>>>> sk2->sk_rcv_saddr ==
>>>>> sk->sk_rcv_saddr)
>>>>> break;
>>>>> }
>>>>>
>>>>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport
>>>>> == 1.
>>>>> But now we are using reuseport, so sk->sk_reuseport_cb should be
>>>>> non-NULL
>>>>> right? So really setting the timewait sock's sk_reuseport should
>>>>> have no
>>>>> bearing on how this loop plays out right? Thanks,
>>>>
>>>>
>>>>
>>>> So more messing around and I noticed that we basically don't do the
>>>> tb->fastreuseport logic at all if we've ended up with a non
>>>> SO_REUSEPORT
>>>> socket on that tb. So before I fully understood what I was doing I
>>>> fixed it
>>>> so that after we go through ->bind_conflict() once with a SO_REUSEPORT
>>>> socket, we reset tb->fastreuseport to 1 and set the uid to match the
>>>> uid of
>>>> the socket. This made the problem go away. Tom pointed out that if
>>>> we bind
>>>> to the same port on a different address and we have a non SO_REUSEPORT
>>>> socket with the same address on this tb then we'd be screwed with my
>>>> code.
>>>>
>>>> Which brings me to his proposed solution. We need another hash
>>>> table that
>>>> is indexed based on the binding address. Then each node corresponds
>>>> to one
>>>> address/port binding, with non-SO_REUSEPORT entries having only one
>>>> entry,
>>>> and normal SO_REUSEPORT entries having many. This cleans up the
>>>> need to
>>>> search all the possible sockets on any given tb, we just go and look
>>>> at the
>>>> one we care about. Does this make sense? Thanks,
>>>>
>>> Hi Josef,
>>>
>>> Thinking about it some more the hash table won't work because of the
>>> rules of binding different addresses to the same port. What I think we
>>> can do is to change inet_bind_bucket to be structure that contains all
>>> the information used to detect conflicts (reuse*, if, address, uid,
>>> etc.) and a list of sockets that share that exact same information--
>>> for instance all socket in timewait state create through some listener
>>> socket should wind up on single bucket. When we do the bind_conflict
>>> function we only should have to walk this buckets, not the full list
>>> of sockets.
>>>
>>> Thoughts on this?
>>
>> This sounds good, maybe tb->owners be a list of say
>>
>> struct inet_unique_shit {
>> struct sock_common sk;
>> struct hlist socks;
>> };
>>
>> Then we make inet_unique_shit like twsks', just copy the relevant
>> information, then hang the real sockets off of the socks hlist.
>> Something like that? Thanks,
>
> As a counter idea: can we push up a flag to the inet_bind_bucket that we
> check in the fast- way style that indicates that we have wildcarded
> non-reuse(port) in there, so we can skip the bind_bucket much more
> quickly? This wouldn't require a new data structure.
So take my current duct tape fix and augment it with more information in the bind bucket? I'm not sure how to make this work without at least having a list of the binded addrs as well to make sure we are really ok. I suppose we could save the fastreuseport address that last succeeded to make it work properly, but I'd have to make it protocol agnostic and then have a callback to have the protocol to make sure we don't have to do the bind_conflict run. Is that what you were thinking of? Thanks,
Josef
^ permalink raw reply
* Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF
From: George Spelvin @ 2016-12-17 12:42 UTC (permalink / raw)
To: Jason
Cc: ak, davem, David.Laight, djb, ebiggers3, hannes,
jeanphilippe.aumasson, kernel-hardening, linux-crypto,
linux-kernel, linux, luto, netdev, tom, torvalds, tytso,
vegard.nossum
In-Reply-To: <CAHmME9rxCYfwyF6EADWqpAEt+yqCPgCLUVH0FPdAy7r-oPnrRg@mail.gmail.com>
BTW, here's some SipHash code I wrote for Linux a while ago.
My target application was ext4 directory hashing, resulting in different
implementation choices, although I still think that a rolled-up
implementation like this is reasonable. Reducing I-cache impact speeds
up the calling code.
One thing I'd like to suggest you steal is the way it handles the
fetch of the final partial word. It's a lot smaller and faster than
an 8-way case statement.
#include <linux/bitops.h> /* For rol64 */
#include <linux/cryptohash.h>
#include <asm/byteorder.h>
#include <asm/unaligned.h>
/* The basic ARX mixing function, taken from Skein */
#define SIP_MIX(a, b, s) ((a) += (b), (b) = rol64(b, s), (b) ^= (a))
/*
* The complete SipRound. Note that, when unrolled twice like below,
* the 32-bit rotates drop out on 32-bit machines.
*/
#define SIP_ROUND(a, b, c, d) \
(SIP_MIX(a, b, 13), SIP_MIX(c, d, 16), (a) = rol64(a, 32), \
SIP_MIX(c, b, 17), SIP_MIX(a, d, 21), (c) = rol64(c, 32))
/*
* This is rolled up more than most implementations, resulting in about
* 55% the code size. Speed is a few precent slower. A crude benchmark
* (for (i=1; i <= max; i++) for (j = 0; j < 4096-i; j++) hash(buf+j, i);)
* produces the following timings (in usec):
*
* i386 i386 i386 x86_64 x86_64 x86_64 x86_64
* Length small unroll halfmd4 small unroll halfmd4 teahash
* 1..4 1069 1029 1608 195 160 399 690
* 1..8 2483 2381 3851 410 360 988 1659
* 1..12 4303 4152 6207 690 618 1642 2690
* 1..16 6122 5931 8668 968 876 2363 3786
* 1..20 8348 8137 11245 1323 1185 3162 5567
* 1..24 10580 10327 13935 1657 1504 4066 7635
* 1..28 13211 12956 16803 2069 1871 5028 9759
* 1..32 15843 15572 19725 2470 2260 6084 11932
* 1..36 18864 18609 24259 2934 2678 7566 14794
* 1..1024 5890194 6130242 10264816 881933 881244 3617392 7589036
*
* The performance penalty is quite minor, decreasing for long strings,
* and it's significantly faster than half_md4, so I'm going for the
* I-cache win.
*/
uint64_t
siphash24(char const *in, size_t len, uint32_t const seed[4])
{
uint64_t a = 0x736f6d6570736575; /* somepseu */
uint64_t b = 0x646f72616e646f6d; /* dorandom */
uint64_t c = 0x6c7967656e657261; /* lygenera */
uint64_t d = 0x7465646279746573; /* tedbytes */
uint64_t m = 0;
uint8_t padbyte = len;
/*
* Mix in the 128-bit hash seed. This is in a format convenient
* to the ext3/ext4 code. Please feel free to adapt the
* */
if (seed) {
m = seed[2] | (uint64_t)seed[3] << 32;
b ^= m;
d ^= m;
m = seed[0] | (uint64_t)seed[1] << 32;
/* a ^= m; is done in loop below */
c ^= m;
}
/*
* By using the same SipRound code for all iterations, we
* save space, at the expense of some branch prediction. But
* branch prediction is hard because of variable length anyway.
*/
len = len/8 + 3; /* Now number of rounds to perform */
do {
a ^= m;
switch (--len) {
unsigned bytes;
default: /* Full words */
d ^= m = get_unaligned_le64(in);
in += 8;
break;
case 2: /* Final partial word */
/*
* We'd like to do one 64-bit fetch rather than
* mess around with bytes, but reading past the end
* might hit a protection boundary. Fortunately,
* we know that protection boundaries are aligned,
* so we can consider only three cases:
* - The remainder occupies zero words
* - The remainder fits into one word
* - The remainder straddles two words
*/
bytes = padbyte & 7;
if (bytes == 0) {
m = 0;
} else {
unsigned offset = (unsigned)(uintptr_t)in & 7;
if (offset + bytes <= 8) {
m = le64_to_cpup((uint64_t const *)
(in - offset));
m >>= 8*offset;
} else {
m = get_unaligned_le64(in);
}
m &= ((uint64_t)1 << 8*bytes) - 1;
}
/* Could use | or +, but ^ allows associativity */
d ^= m ^= (uint64_t)padbyte << 56;
break;
case 1: /* Beginning of finalization */
m = 0;
c ^= 0xff;
/*FALLTHROUGH*/
case 0: /* Second half of finalization */
break;
}
SIP_ROUND(a, b, c, d);
SIP_ROUND(a, b, c, d);
} while (len);
return a ^ b ^ c ^ d;
}
#undef SIP_ROUND
#undef SIP_MIX
/*
* No objection to EXPORT_SYMBOL, but we should probably figure out
* how the seed[] array should work first. Homework for the first
* person to want to call it from a module!
*/
^ permalink raw reply
* Re: Soft lockup in inet_put_port on 4.6
From: Hannes Frederic Sowa @ 2016-12-17 11:08 UTC (permalink / raw)
To: Josef Bacik, Tom Herbert
Cc: Craig Gallek, Eric Dumazet, Linux Kernel Network Developers
In-Reply-To: <1481928610.17731.0@smtp.office365.com>
On 16.12.2016 23:50, Josef Bacik wrote:
> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@fb.com> wrote:
>>> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>>
>>>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@fb.com> wrote:
>>>>>
>>>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>>> <hannes@stressinduktion.org> wrote:
>>>>>>
>>>>>> Hi Josef,
>>>>>>
>>>>>> On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>>>
>>>>>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@herbertland.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
>>>>>>>> <kraigatgoog@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>>>>>>>>> <tom@herbertland.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I think there may be some suspicious code in
>>>>>>>>>> inet_csk_get_port. At
>>>>>>>>>> tb_found there is:
>>>>>>>>>>
>>>>>>>>>> if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>>> (tb->fastreuseport > 0 &&
>>>>>>>>>>
>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>> sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>>>>> uid))) &&
>>>>>>>>>> smallest_size == -1)
>>>>>>>>>> goto success;
>>>>>>>>>> if
>>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>>> tb, true)) {
>>>>>>>>>> if ((reuse ||
>>>>>>>>>> (tb->fastreuseport > 0 &&
>>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>>
>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>> uid_eq(tb->fastuid, uid))) &&
>>>>>>>>>> smallest_size != -1 &&
>>>>>>>>>> --attempts >=
>>>>>>>>>> 0) {
>>>>>>>>>> spin_unlock_bh(&head->lock);
>>>>>>>>>> goto again;
>>>>>>>>>> }
>>>>>>>>>> goto fail_unlock;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> AFAICT there is redundancy in these two conditionals. The
>>>>>>>>>> same
>>>>>>>>>> clause
>>>>>>>>>> is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>>>>>>>>>> is true
>>>>>>>>>> the
>>>>>>>>>> first conditional should be hit, goto done, and the second
>>>>>>>>>> will
>>>>>>>>>> never
>>>>>>>>>> evaluate that part to true-- unless the sk is changed (do
>>>>>>>>>> we need
>>>>>>>>>> READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>>>
>>>>>>>>> That's an interesting point... It looks like this function also
>>>>>>>>> changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>>> beginning
>>>>>>>>> with several spin_lock(&head->lock) to exclusively
>>>>>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps
>>>>>>>>> the full
>>>>>>>>> bh
>>>>>>>>> disable variant was preventing the timers in your stack
>>>>>>>>> trace from
>>>>>>>>> running interleaved with this function before?
>>>>>>>>
>>>>>>>>
>>>>>>>> Could be, although dropping the lock shouldn't be able to
>>>>>>>> affect the
>>>>>>>> search state. TBH, I'm a little lost in reading function, the
>>>>>>>> SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
>>>>>>>> that
>>>>>>>> function and also in every call to inet_csk_bind_conflict. I
>>>>>>>> wonder
>>>>>>>> if
>>>>>>>> we can simply this under the assumption that SO_REUSEPORT is only
>>>>>>>> allowed if the port number (snum) is explicitly specified.
>>>>>>>
>>>>>>>
>>>>>>> Ok first I have data for you Hannes, here's the time distributions
>>>>>>> before during and after the lockup (with all the debugging in
>>>>>>> place
>>>>>>> the
>>>>>>> box eventually recovers). I've attached it as a text file
>>>>>>> since it is
>>>>>>> long.
>>>>>>
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>>> Second is I was thinking about why we would spend so much time
>>>>>>> doing
>>>>>>> the
>>>>>>> ->owners list, and obviously it's because of the massive amount of
>>>>>>> timewait sockets on the owners list. I wrote the following
>>>>>>> dumb patch
>>>>>>> and tested it and the problem has disappeared completely. Now
>>>>>>> I don't
>>>>>>> know if this is right at all, but I thought it was weird we
>>>>>>> weren't
>>>>>>> copying the soreuseport option from the original socket onto
>>>>>>> the twsk.
>>>>>>> Is there are reason we aren't doing this currently? Does this
>>>>>>> help
>>>>>>> explain what is happening? Thanks,
>>>>>>
>>>>>>
>>>>>> The patch is interesting and a good clue, but I am immediately a bit
>>>>>> concerned that we don't copy/tag the socket with the uid also to
>>>>>> keep
>>>>>> the security properties for SO_REUSEPORT. I have to think a bit more
>>>>>> about this.
>>>>>>
>>>>>> We have seen hangs during connect. I am afraid this patch
>>>>>> wouldn't help
>>>>>> there while also guaranteeing uniqueness.
>>>>>
>>>>>
>>>>>
>>>>> Yeah so I looked at the code some more and actually my patch is
>>>>> really
>>>>> bad. If sk2->sk_reuseport is set we'll look at
>>>>> sk2->sk_reuseport_cb, which
>>>>> is outside of the timewait sock, so that's definitely bad.
>>>>>
>>>>> But we should at least be setting it to 0 so that we don't do this
>>>>> normally. Unfortunately simply setting it to 0 doesn't fix the
>>>>> problem. So
>>>>> for some reason having ->sk_reuseport set to 1 on a timewait
>>>>> socket makes
>>>>> this problem non-existent, which is strange.
>>>>>
>>>>> So back to the drawing board I guess. I wonder if doing what craig
>>>>> suggested and batching the timewait timer expires so it hurts less
>>>>> would
>>>>> accomplish the same results. Thanks,
>>>>
>>>>
>>>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This
>>>> is the
>>>> code
>>>>
>>>> if ((!reuse || !sk2->sk_reuse ||
>>>> sk2->sk_state == TCP_LISTEN) &&
>>>> (!reuseport || !sk2->sk_reuseport ||
>>>> rcu_access_pointer(sk->sk_reuseport_cb) ||
>>>> (sk2->sk_state != TCP_TIME_WAIT &&
>>>> !uid_eq(uid, sock_i_uid(sk2))))) {
>>>>
>>>> if (!sk2->sk_rcv_saddr ||
>>>> !sk->sk_rcv_saddr
>>>> ||
>>>> sk2->sk_rcv_saddr ==
>>>> sk->sk_rcv_saddr)
>>>> break;
>>>> }
>>>>
>>>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport
>>>> == 1.
>>>> But now we are using reuseport, so sk->sk_reuseport_cb should be
>>>> non-NULL
>>>> right? So really setting the timewait sock's sk_reuseport should
>>>> have no
>>>> bearing on how this loop plays out right? Thanks,
>>>
>>>
>>>
>>> So more messing around and I noticed that we basically don't do the
>>> tb->fastreuseport logic at all if we've ended up with a non
>>> SO_REUSEPORT
>>> socket on that tb. So before I fully understood what I was doing I
>>> fixed it
>>> so that after we go through ->bind_conflict() once with a SO_REUSEPORT
>>> socket, we reset tb->fastreuseport to 1 and set the uid to match the
>>> uid of
>>> the socket. This made the problem go away. Tom pointed out that if
>>> we bind
>>> to the same port on a different address and we have a non SO_REUSEPORT
>>> socket with the same address on this tb then we'd be screwed with my
>>> code.
>>>
>>> Which brings me to his proposed solution. We need another hash
>>> table that
>>> is indexed based on the binding address. Then each node corresponds
>>> to one
>>> address/port binding, with non-SO_REUSEPORT entries having only one
>>> entry,
>>> and normal SO_REUSEPORT entries having many. This cleans up the
>>> need to
>>> search all the possible sockets on any given tb, we just go and look
>>> at the
>>> one we care about. Does this make sense? Thanks,
>>>
>> Hi Josef,
>>
>> Thinking about it some more the hash table won't work because of the
>> rules of binding different addresses to the same port. What I think we
>> can do is to change inet_bind_bucket to be structure that contains all
>> the information used to detect conflicts (reuse*, if, address, uid,
>> etc.) and a list of sockets that share that exact same information--
>> for instance all socket in timewait state create through some listener
>> socket should wind up on single bucket. When we do the bind_conflict
>> function we only should have to walk this buckets, not the full list
>> of sockets.
>>
>> Thoughts on this?
>
> This sounds good, maybe tb->owners be a list of say
>
> struct inet_unique_shit {
> struct sock_common sk;
> struct hlist socks;
> };
>
> Then we make inet_unique_shit like twsks', just copy the relevant
> information, then hang the real sockets off of the socks hlist.
> Something like that? Thanks,
As a counter idea: can we push up a flag to the inet_bind_bucket that we
check in the fast- way style that indicates that we have wildcarded
non-reuse(port) in there, so we can skip the bind_bucket much more
quickly? This wouldn't require a new data structure.
Thanks,
Hannes
^ permalink raw reply
* mlx4: Bug in XDP_TX + 16 rx-queues
From: Martin KaFai Lau @ 2016-12-17 10:18 UTC (permalink / raw)
To: Saeed Mahameed, Tariq Toukan; +Cc: netdev@vger.kernel.org, Alexei Starovoitov
Hi All,
I have been debugging with XDP_TX and 16 rx-queues.
1) When 16 rx-queues is used and an XDP prog is doing XDP_TX,
it seems that the packet cannot be XDP_TX out if the pkt
is received from some particular CPUs (/rx-queues).
2) If 8 rx-queues is used, it does not have problem.
3) The 16 rx-queues problem also went away after reverting these
two patches:
15fca2c8eb41 net/mlx4_en: Add ethtool statistics for XDP cases
67f8b1dcb9ee net/mlx4_en: Refactor the XDP forwarding rings scheme
4) I can reproduce the problem by running samples/bof/xdp_ip_tunnel at
the receiver side. The sender side sends out TCP packets with
source port ranging from 1 to 1024. At the sender side also, do
a tcpdump to capture the ip-tunnel packet reflected by xdp_ip_tunnel.
With 8 rx-queues, I can get all 1024 packets back. With 16 rx-queues,
I can only get 512 packets back. It is a 40 CPUs machine.
I also checked the rx*_xdp_tx counters (from ethtool -S eth0) to ensure
the xdp prog has XDP_TX-ed it out.
Not saying that 67f8b1dcb9ee is 100% the cause because there are other
changes since then. It is merely a brain dump on what I have already
tried.
Tariq/Saeed, any thoughts? I can easily test some patches in
my setup.
Thanks,
--Martin
^ permalink raw reply
* Re: [PATCH net 2/2] bpf: fix overflow in prog accounting
From: Daniel Borkmann @ 2016-12-17 10:03 UTC (permalink / raw)
To: kbuild test robot; +Cc: kbuild-all, davem, ast, netdev
In-Reply-To: <201612171034.Pd1bvSuT%fengguang.wu@intel.com>
On 12/17/2016 03:52 AM, kbuild test robot wrote:
> Hi Daniel,
>
> [auto build test ERROR on net/master]
>
> url: https://github.com/0day-ci/linux/commits/Daniel-Borkmann/bpf-dynamically-allocate-digest-scratch-buffer/20161217-090046
> config: sparc-defconfig (attached as .config)
> compiler: sparc-linux-gcc (GCC) 6.2.0
> reproduce:
> wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=sparc
>
> All errors (new ones prefixed by >>):
>
> kernel/bpf/core.c: In function '__bpf_prog_charge':
>>> kernel/bpf/core.c:80:50: error: 'struct user_struct' has no member named 'locked_vm'; did you mean 'locked_shm'?
> user_bufs = atomic_long_add_return(pages, &user->locked_vm);
> ^~
> kernel/bpf/core.c:82:32: error: 'struct user_struct' has no member named 'locked_vm'; did you mean 'locked_shm'?
> atomic_long_sub(pages, &user->locked_vm);
> ^~
> kernel/bpf/core.c: In function '__bpf_prog_uncharge':
> kernel/bpf/core.c:93:31: error: 'struct user_struct' has no member named 'locked_vm'; did you mean 'locked_shm'?
> atomic_long_sub(pages, &user->locked_vm);
Argh, right, I'll send v2 later today.
^ permalink raw reply
* Re: [PATCH net 2/2] sctp: not copying duplicate addrs to the assoc's bind address list
From: Xin Long @ 2016-12-17 9:56 UTC (permalink / raw)
To: Neil Horman
Cc: network dev, linux-sctp, davem, Marcelo Ricardo Leitner,
Vlad Yasevich, Daniel Borkmann
In-Reply-To: <20160824103854.GA13154@hmsreliant.think-freely.org>
On Wed, Aug 24, 2016 at 6:38 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Wed, Aug 24, 2016 at 01:14:27PM +0800, Xin Long wrote:
>> >> > Ah, I see what you're doing. Ok, this makes some sense, at least on the receive
>> >> > side, when you get a cookie unpacked and modify the remote peers address list,
>> >> > it makes sense to check for duplicates. On the local side however, I would,
>> >> > instead of checking it when the list gets copied, I'd check it when the master
>> >> > list gets updated (in the NETDEV_UP event notifier for the local address list,
>> >>
>> >> I was thinking about to check it in the NETDEV_UP, yes it can make the
>> >> master list has no duplicated addresses. But what if two same addresses
>> >> events come, and they come from different NICs (though I can't point out
>> >> the valid use case), then we filter there.
>> >>
>> > That I think would be a bug in the protocol code. For the ipv4 case, all
>> > addresses are owned by the system and the same addresses added to multiple
>> > interfaces should not be allowed. The same is true of ipv6 case. The only
>> > exception there is a link local address and that should still be unique within
>> > the context of an address/dev tuple.
>> >
>> understand, just sounds a little harsh. :-)
>>
>> For now, does it make sense to you to just leave as the master list
>> is, and check
>> the duplicate address when sctp is really binding them ?
>>
> I would think so, yes.
Hi, Neil,
About this patch, I think we are on the page, right ?
If yes, I will repost v2, but other than improving some changelog,
no other change compare to v1. Do you agree ?
^ permalink raw reply
* Re: [TSN RFC v2 0/9] TSN driver for the kernel
From: Henrik Austad @ 2016-12-17 9:05 UTC (permalink / raw)
To: Richard Cochran
Cc: linux-kernel, Henrik Austad, linux-media, alsa-devel, netdev
In-Reply-To: <20161216220530.GA25258@netboy>
[-- Attachment #1: Type: text/plain, Size: 9109 bytes --]
Hi Richard,
On Fri, Dec 16, 2016 at 11:05:30PM +0100, Richard Cochran wrote:
> On Fri, Dec 16, 2016 at 06:59:04PM +0100, henrik@austad.us wrote:
> > The driver is directed via ConfigFS as we need userspace to handle
> > stream-reservation (MSRP), discovery and enumeration (IEEE 1722.1) and
> > whatever other management is needed.
>
> I complained about configfs before, but you didn't listen.
Yes you did, I remember quite well, and no, I didn't listen :)
At the time, there were other issues that I had to address. The
configfs-part is fairly isolated. As I tried to explain the last round,
the *reason* I've used ConfigFS thus far, is because it makes it pretty
easy from userspace to signal the driver to create a new alsa-device.
And the reason I haven't changed configfs, is because so far, that part has
worked fairly well and have made testing quite easy. At this stage, *this*
is what is helpful, not a perfect interface. This does not mean that
configfs is set in stone.
To clearify:
I'm sending out a new set now because, what I have works _fairly_ well for
testing and a way to see what you can do with AVB. Using spotify to play
music on random machines is quite entertaining.
It is by no means -done-, nor do I consider it done. I have been tight on
time, and instead of sitting in an office polishing on some code, I thought
it better to send out a new (and not done) set of patches so that others
could see it still being worked on. If this turned out to be noise-only, I
appologize!
> > 2 new fields in netdev_ops have been introduced, and the Intel
> > igb-driver has been updated (as this an AVB-capable NIC which is
> > available as a PCI-e card).
>
> The igb hacks show that you are on the wrong track. We can and should
> be able to support TSN without resorting to driver specific hacks and
> module parameters.
I was not able to find a sane way to change the mode of the NIC, some of
the settings required to enable Qav-mode must be done when bringing the
NIC up, so I needed hooks in _probe().
ANother elemnt needed is a way for tsn_core to ascertain if a NIC is
capable of TSN or not (this would be ndo_tsn_capable)
Then finally, you need to update values in a per-tx-queue manner when a new
stream is ready (hence ndo_tsn_link_configure).
What you mean by 'driver specific hacks' is not obvious though, TSN
requires a set of fairly standardized parameters (priority code points,
size of frames to send in a new stream and so on), adding this to the
hw-registers in the NIC is an operation that will be common for all
TSN-capable NICs.
> > Before reading on - this is not even beta, but I'd really appreciate if
> > people would comment on the overall architecture and perhaps provide
> > some pointers to where I should improve/fix/update
>
> As I said before about V1, this architecture stinks.
I like feedback when it's short, sweet and to the point
2 out of 3 ain't that bad ;)
> You appear to have continued hacking along and posted the same design
> again. Did you even address any of the points I raised back then?
So you did raise a lot of good points the last round, and no, I have not
had the time to address them properly. That does not mean I do not *want*
to (apart from configfs actually having worked quite nicely thus far and
'shim' being a name I like ;)
From the last round of discussion:
> 1. A proper userland stack for AVDECC, MAAP, FQTSS, and so on. The
> OpenAVB project does not offer much beyond simple examples.
Yes, I fully agree, as far as I know, no-one is working on this. That being
said, I have not paid much attention the userspace tooling lately. But all
of this must be handled in userspace, having avdecc in the kernel would be
an utter nightmare :)
> 2. A user space audio application that puts it all together, making
> use of the services in #1, the linuxptp gPTP service, the ALSA
> services, and the network connections. This program will have all
> the knowledge about packet formats, AV encodings, and the local HW
> capabilities. This program cannot yet be written, as we still need
> some kernel work in the audio and networking subsystems.
And therein lies the problem. It cannot yet be written, so we have to start
in *some* end. And as I repeatedly stated in June, I'm at an RFC here,
trying to spark some interest and lure other developers in :)
Also, I really do not want a media-application to care about _where_ the
frames are going. Sure, I see the issue of configuring a link, but that can
be done from _outside_ the media-application. VLC (or aplay, or totem, or
.. take your pick) should not have to worry about this.
Applications that require finer control over timestamping is easier to
adapt to AVB than all the others, I'd rather add special knobs for those
who are interested than adding a set of knobs that -all- applications must
be aware of.
Could be that we are talking about the same thing, just from different
perspectives.
> * Kernel Space
>
> 1. Providing frames with a future transmit time. For normal sockets,
> this can be in the CMESG data. For mmap'ed buffers, we will need a
> new format. (I think Arnd is working on a new layout.)
I need to revisit that discussion again I think.
> 2. Time based qdisc for transmitted frames. For MACs that support
> this (like the i210), we only have to place the frame into the
> correct queue. For normal HW, we want to be able to reserve a time
> window in which non-TSN frames are blocked. This is some work, but
> in the end it should be a generic solution that not only works
> "perfectly" with TSN HW but also provides best effort service using
> any NIC.
Yes, indeed, that would be one good solution, and quite a lot of work.
> 3. ALSA support for tunable AD/DA clocks. The rate of the Listener's
> DA clock must match that of the Talker and the other Listeners.
To nitpick a bit, all AD/DAs should match that of the gPTP grandmaster
(which in most settings would be the Talker). But yes, you need to adjust
the AD/DA. SRC is slow and painful, best to avoid.
> Either you adjust it in HW using a VCO or similar, or you do
> adaptive sample rate conversion in the application. (And that is
> another reason for *not* having a shared kernel buffer.) For the
> Talker, either you adjust the AD clock to match the PTP time, or
> you measure the frequency offset.
Yes, some hook into adjusting the clock is needed, I wonder if this is
possible via V4L2, or of the monitor-world is a completely different beast.
> 4. ALSA support for time triggered playback. The patch series
> completely ignore the critical issue of media clock recovery. The
> Listener must buffer the stream in order to play it exactly at a
> specified time. It cannot simply send the stream ASAP to the audio
> HW, because some other Listener might need longer. AFAICT, there is
> nothing in ALSA that allows you to say, sample X should be played at
> time Y.
Yes, and this requires a lot of change to ALSA (and probably something in
V4L2 as well?), so before we get to that, perhaps have a set of patches
that does this best effort and *then* work on getting time-triggered
playback into the kernel?
Another item that was brought up last round was getting timing-information
to/from ALSA, See driver/media/avb/avb_alsa.c, as a start it updates the
time for last incoming/outgoing frame so that userspace can get that
information. Probably buggy as heck :)
* Back to your email from last night*
> You are trying to put tons of code into the kernel that really belongs
> in user space, and at the same time, you omit critical functions that
> only the kernel can provide.
Some (well, to be honest, most) of the of the critical functions that my
driver omits, are omitted because they require substantial effort to
implement - and befor there's a need for this, that won't happen. So,
consider the TSN-driver such a need!
I'd love to use a qdisc that uses a time-triggered transmit, that would
drop the need for a lot of the stuff in tsn_core.c. The same goes for
time-triggered playback in media.
> > There are at least one AVB-driver (the AV-part of TSN) in the kernel
> > already.
>
> And which driver is that?
Ah, a proverbial slip of the changelog, we visited this the last iteration,
that would be the ravb-driver (which is an AVB capable NIC), but it does
not include much in the way of AVB-support *In* kernel. Sorry about that!
Since then, the iMX7 from NXP has arrived, and this also has HW-support for
TSN, but not in the kernel AFAICT.
So, the next issue I plan to tackle, is how I do buffers, the current
approach where tsn_core allocates memory is on its way out and I'll let the
shim (which means alsa/v4l2) will provide a buffer. Then I'll start looking
at qdisc.
Thanks!
--
Henrik Austad
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply
* Re: [TSN RFC v2 5/9] Add TSN header for the driver
From: Henrik Austad @ 2016-12-17 8:53 UTC (permalink / raw)
To: Richard Cochran
Cc: linux-kernel, Henrik Austad, linux-media, alsa-devel, netdev,
David S. Miller
In-Reply-To: <20161216220938.GB25258@netboy>
[-- Attachment #1: Type: text/plain, Size: 2338 bytes --]
On Fri, Dec 16, 2016 at 11:09:38PM +0100, Richard Cochran wrote:
> On Fri, Dec 16, 2016 at 06:59:09PM +0100, henrik@austad.us wrote:
> > +/*
> > + * List of current subtype fields in the common header of AVTPDU
> > + *
> > + * Note: AVTPDU is a remnant of the standards from when it was AVB.
> > + *
> > + * The list has been updated with the recent values from IEEE 1722, draft 16.
> > + */
> > +enum avtp_subtype {
> > + TSN_61883_IIDC = 0, /* IEC 61883/IIDC Format */
> > + TSN_MMA_STREAM, /* MMA Streams */
> > + TSN_AAF, /* AVTP Audio Format */
> > + TSN_CVF, /* Compressed Video Format */
> > + TSN_CRF, /* Clock Reference Format */
> > + TSN_TSCF, /* Time-Synchronous Control Format */
> > + TSN_SVF, /* SDI Video Format */
> > + TSN_RVF, /* Raw Video Format */
> > + /* 0x08 - 0x6D reserved */
> > + TSN_AEF_CONTINOUS = 0x6e, /* AES Encrypted Format Continous */
> > + TSN_VSF_STREAM, /* Vendor Specific Format Stream */
> > + /* 0x70 - 0x7e reserved */
> > + TSN_EF_STREAM = 0x7f, /* Experimental Format Stream */
> > + /* 0x80 - 0x81 reserved */
> > + TSN_NTSCF = 0x82, /* Non Time-Synchronous Control Format */
> > + /* 0x83 - 0xed reserved */
> > + TSN_ESCF = 0xec, /* ECC Signed Control Format */
> > + TSN_EECF, /* ECC Encrypted Control Format */
> > + TSN_AEF_DISCRETE, /* AES Encrypted Format Discrete */
> > + /* 0xef - 0xf9 reserved */
> > + TSN_ADP = 0xfa, /* AVDECC Discovery Protocol */
> > + TSN_AECP, /* AVDECC Enumeration and Control Protocol */
> > + TSN_ACMP, /* AVDECC Connection Management Protocol */
> > + /* 0xfd reserved */
> > + TSN_MAAP = 0xfe, /* MAAP Protocol */
> > + TSN_EF_CONTROL, /* Experimental Format Control */
> > +};
>
> The kernel shouldn't be in the business of assembling media packets.
No, but assembling the packets and shipping frames to a destination is not
neccessarily the same thing.
A nice workflow would be to signal to the shim that "I'm sending a
compressed video format" and then the shim/tsn_core will ship out the
frames over the network - and then you need to set TSN_CVF as subtype in
each header.
That does not that mean you should do H.264 encode/decode *in* the kernel
Perhaps this is better placed in include/uapi/tsn.h so that userspace and
kernel share the same header?
--
Henrik Austad
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox