* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Guenter Roeck @ 2018-11-01 15:28 UTC (permalink / raw)
To: Trond Myklebust
Cc: paul.burton@mips.com, linux-kernel@vger.kernel.org,
ralf@linux-mips.org, jlayton@kernel.org,
linuxppc-dev@lists.ozlabs.org, bfields@fieldses.org,
linux-mips@linux-mips.org, linux-nfs@vger.kernel.org,
akpm@linux-foundation.org, anna.schumaker@netapp.com,
jhogan@kernel.org, netdev@vger.kernel.org, davem@davemloft.net,
arnd@arndb.de, paulus@samba.org, mpe@ellerman.id.au
In-Reply-To: <d7fe095d8d1f848b5742a5b3e8cce9f89e0c1c8d.camel@hammerspace.com>
On Thu, Nov 01, 2018 at 06:30:08AM +0000, Trond Myklebust wrote:
[ ... ]
> >
> > For my part I agree that this would be a much better solution. The
> > argument
> > that it is not always absolutely guaranteed that atomics don't wrap
> > doesn't
> > really hold for me because it looks like they all do. On top of that,
> > there
> > is an explicit atomic_dec_if_positive() and
> > atomic_fetch_add_unless(),
> > which to me strongly suggests that they _are_ supposed to wrap.
> > Given the cost of adding a comparison to each atomic operation to
> > prevent it from wrapping, anything else would not really make sense
> > to me.
>
> That's a hypothesis, not a proven fact. There are architectures out
> there that do not wrap signed integers, hence my question.
>
If what you say is correct, the kernel is in big trouble on those architectures.
atomic_inc_return() is used all over the place in the kernel with the assumption
that each returned value differs from the previous value (ie the value is used
as cookie, session ID, or for similar purposes).
Guenter
^ permalink raw reply
* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Trond Myklebust @ 2018-11-01 15:22 UTC (permalink / raw)
To: mark.rutland@arm.com, peterz@infradead.org
Cc: linux-kernel@vger.kernel.org, ralf@linux-mips.org,
jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
bfields@fieldses.org, linux-mips@linux-mips.org,
linux@roeck-us.net, linux-nfs@vger.kernel.org,
akpm@linux-foundation.org, will.deacon@arm.com,
boqun.feng@gmail.com, paul.burton@mips.com,
anna.schumaker@netapp.com, jhogan@kernel.org,
netdev@vger.kernel.org, "davem@dave
In-Reply-To: <20181101145926.GE3178@hirez.programming.kicks-ass.net>
On Thu, 2018-11-01 at 15:59 +0100, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 01:18:46PM +0000, Mark Rutland wrote:
> > > My one question (and the reason why I went with cmpxchg() in the
> > > first
> > > place) would be about the overflow behaviour for
> > > atomic_fetch_inc() and
> > > friends. I believe those functions should be OK on x86, so that
> > > when we
> > > overflow the counter, it behaves like an unsigned value and wraps
> > > back
> > > around. Is that the case for all architectures?
> > >
> > > i.e. are atomic_t/atomic64_t always guaranteed to behave like
> > > u32/u64
> > > on increment?
> > >
> > > I could not find any documentation that explicitly stated that
> > > they
> > > should.
> >
> > Peter, Will, I understand that the atomic_t/atomic64_t ops are
> > required
> > to wrap per 2's-complement. IIUC the refcount code relies on this.
> >
> > Can you confirm?
>
> There is quite a bit of core code that hard assumes 2s-complement.
> Not
> only for atomics but for any signed integer type. Also see the kernel
> using -fno-strict-overflow which implies -fwrapv, which defines
> signed
> overflow to behave like 2s-complement (and rids us of that particular
> UB).
Fair enough, but there have also been bugfixes to explicitly fix unsafe
C standards assumptions for signed integers. See, for instance commit
5a581b367b5d "jiffies: Avoid undefined behavior from signed overflow"
from Paul McKenney.
Anyhow, if the atomic maintainers are willing to stand up and state for
the record that the atomic counters are guaranteed to wrap modulo 2^n
just like unsigned integers, then I'm happy to take Paul's patch.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
^ permalink raw reply
* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Masami Hiramatsu @ 2018-11-01 15:20 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Naveen N. Rao, Anil S Keshavamurthy, David S. Miller,
Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, Steven Rostedt, Shuah Khan, Alexei Starovoitov,
Daniel Borkmann, Brendan Gregg, Christian Brauner, Aleksa Sarai,
netdev, linux-doc, linux-kernel
In-Reply-To: <20181101083551.3805-2-cyphar@cyphar.com>
Hi Aleksa,
On Thu, 1 Nov 2018 19:35:50 +1100
Aleksa Sarai <cyphar@cyphar.com> wrote:
[...]
> diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc
Please split the test case as an independent patch.
> new file mode 100644
> index 000000000000..03146c6a1a3c
> --- /dev/null
> +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc
> @@ -0,0 +1,25 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0+
> +# description: Kretprobe dynamic event with a stacktrace
> +
> +[ -f kprobe_events ] || exit_unsupported # this is configurable
> +
> +echo 0 > events/enable
> +echo 1 > options/stacktrace
> +
> +echo 'r:teststackprobe sched_fork $retval' > kprobe_events
> +grep teststackprobe kprobe_events
> +test -d events/kprobes/teststackprobe
Hmm, what happen if we have 2 or more kretprobes on same stack?
It seems you just save stack in pre_handler, but that stack can already
includes another kretprobe's trampline address...
Thank you,
> +
> +clear_trace
> +echo 1 > events/kprobes/teststackprobe/enable
> +( echo "forked")
> +echo 0 > events/kprobes/teststackprobe/enable
> +
> +# Make sure we don't see kretprobe_trampoline and we see _do_fork.
> +! grep 'kretprobe' trace
> +grep '_do_fork' trace
> +
> +echo '-:teststackprobe' >> kprobe_events
> +clear_trace
> +test -d events/kprobes/teststackprobe && exit_fail || exit_pass
> --
> 2.19.1
>
--
Masami Hiramatsu <mhiramat@kernel.org>
^ permalink raw reply
* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Peter Zijlstra @ 2018-11-01 14:59 UTC (permalink / raw)
To: Mark Rutland
Cc: Trond Myklebust, linux@roeck-us.net, paul.burton@mips.com,
linux-kernel@vger.kernel.org, ralf@linux-mips.org,
jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
bfields@fieldses.org, linux-mips@linux-mips.org,
linux-nfs@vger.kernel.org, akpm@linux-foundation.org,
anna.schumaker@netapp.com, jhogan@kernel.org,
netdev@vger.kernel.org, davem@davemloft.net,
"arnd@arndb.de"
In-Reply-To: <20181101131846.biyilr2msonljmij@lakrids.cambridge.arm.com>
On Thu, Nov 01, 2018 at 01:18:46PM +0000, Mark Rutland wrote:
> > My one question (and the reason why I went with cmpxchg() in the first
> > place) would be about the overflow behaviour for atomic_fetch_inc() and
> > friends. I believe those functions should be OK on x86, so that when we
> > overflow the counter, it behaves like an unsigned value and wraps back
> > around. Is that the case for all architectures?
> >
> > i.e. are atomic_t/atomic64_t always guaranteed to behave like u32/u64
> > on increment?
> >
> > I could not find any documentation that explicitly stated that they
> > should.
>
> Peter, Will, I understand that the atomic_t/atomic64_t ops are required
> to wrap per 2's-complement. IIUC the refcount code relies on this.
>
> Can you confirm?
There is quite a bit of core code that hard assumes 2s-complement. Not
only for atomics but for any signed integer type. Also see the kernel
using -fno-strict-overflow which implies -fwrapv, which defines signed
overflow to behave like 2s-complement (and rids us of that particular
UB).
^ permalink raw reply
* Re: [PATCH v2 3/3] Bluetooth: hci_qca: Set HCI_QUIRK_USE_BDADDR_PROPERTY for wcn3990
From: Balakrishna Godavarthi @ 2018-11-01 14:37 UTC (permalink / raw)
To: Matthias Kaehlcke
Cc: Marcel Holtmann, Johan Hedberg, David S . Miller, Loic Poulain,
linux-bluetooth, netdev, linux-kernel, Brian Norris,
Dmitry Grinberg
In-Reply-To: <20181030004415.237101-4-mka@chromium.org>
On 2018-10-30 06:14, Matthias Kaehlcke wrote:
> Set quirk for wcn3990 to read BD_ADDR from a firmware node property.
>
> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> ---
> Changes in v2:
> - patch added to the series
>
> tested with https://lore.kernel.org/patchwork/patch/1003830
> ("Bluetooth: hci_qca: Add helper to set device address.")
> ---
> drivers/bluetooth/hci_qca.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
> index f036c8f98ea3..0535833caa52 100644
> --- a/drivers/bluetooth/hci_qca.c
> +++ b/drivers/bluetooth/hci_qca.c
> @@ -1193,6 +1193,7 @@ static int qca_setup(struct hci_uart *hu)
> * setup for every hci up.
> */
> set_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks);
> + set_bit(HCI_QUIRK_USE_BDADDR_PROPERTY, &hdev->quirks);
> hu->hdev->shutdown = qca_power_off;
> ret = qca_wcn3990_init(hu);
> if (ret)
Tested-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
--
Regards
Balakrishna.
^ permalink raw reply
* Re: [PATCH v2 1/3] Bluetooth: Add quirk for reading BD_ADDR from fwnode property
From: Balakrishna Godavarthi @ 2018-11-01 14:37 UTC (permalink / raw)
To: Matthias Kaehlcke
Cc: Marcel Holtmann, Johan Hedberg, David S . Miller, Loic Poulain,
linux-bluetooth, netdev, linux-kernel, Brian Norris,
Dmitry Grinberg
In-Reply-To: <20181030004415.237101-2-mka@chromium.org>
On 2018-10-30 06:14, Matthias Kaehlcke wrote:
> Add HCI_QUIRK_USE_BDADDR_PROPERTY to allow controllers to retrieve
> the public Bluetooth address from the firmware node property
> 'local-bd-address'. If quirk is set and the property does not exist
> or is invalid the controller is marked as unconfigured.
>
> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> Reviewed-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
> ---
> Changes in v2:
> - added check for return value of ->setup()
> - only read BD_ADDR from the property if it isn't assigned yet. This
> is needed to support configuration from user space
> - refactored the branch of the new quirk to get rid of 'bd_addr_set'
> - added 'Reviewed-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>'
> tag
> ---
> include/net/bluetooth/hci.h | 12 ++++++++++
> net/bluetooth/hci_core.c | 45 +++++++++++++++++++++++++++++++++++++
> net/bluetooth/mgmt.c | 6 +++--
> 3 files changed, 61 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
> index c36dc1e20556..fbba43e9bef5 100644
> --- a/include/net/bluetooth/hci.h
> +++ b/include/net/bluetooth/hci.h
> @@ -158,6 +158,18 @@ enum {
> */
> HCI_QUIRK_INVALID_BDADDR,
>
> + /* When this quirk is set, the public Bluetooth address
> + * initially reported by HCI Read BD Address command
> + * is considered invalid. The public BD Address can be
> + * specified in the fwnode property 'local-bd-address'.
> + * If this property does not exist or is invalid controller
> + * configuration is required before this device can be used.
> + *
> + * This quirk can be set before hci_register_dev is called or
> + * during the hdev->setup vendor callback.
> + */
> + HCI_QUIRK_USE_BDADDR_PROPERTY,
> +
> /* When this quirk is set, the duplicate filtering during
> * scanning is based on Bluetooth devices addresses. To allow
> * RSSI based updates, restart scanning if needed.
> diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
> index 7352fe85674b..d4149005a661 100644
> --- a/net/bluetooth/hci_core.c
> +++ b/net/bluetooth/hci_core.c
> @@ -30,6 +30,7 @@
> #include <linux/rfkill.h>
> #include <linux/debugfs.h>
> #include <linux/crypto.h>
> +#include <linux/property.h>
> #include <asm/unaligned.h>
>
> #include <net/bluetooth/bluetooth.h>
> @@ -1355,6 +1356,36 @@ int hci_inquiry(void __user *arg)
> return err;
> }
>
> +/**
> + * hci_dev_get_bd_addr_from_property - Get the Bluetooth Device
> Address
> + * (BD_ADDR) for a HCI device from
> + * a firmware node property.
> + * @hdev: The HCI device
> + *
> + * Search the firmware node for 'local-bd-address'.
> + *
> + * All-zero BD addresses are rejected, because those could be
> properties
> + * that exist in the firmware tables, but were not updated by the
> firmware. For
> + * example, the DTS could define 'local-bd-address', with zero BD
> addresses.
> + */
> +static int hci_dev_get_bd_addr_from_property(struct hci_dev *hdev)
> +{
> + struct fwnode_handle *fwnode = dev_fwnode(hdev->dev.parent);
> + bdaddr_t ba;
> + int ret;
> +
> + ret = fwnode_property_read_u8_array(fwnode, "local-bd-address",
> + (u8 *)&ba, sizeof(ba));
> + if (ret < 0)
> + return ret;
> + if (!bacmp(&ba, BDADDR_ANY))
> + return -ENODATA;
> +
> + hdev->public_addr = ba;
> +
> + return 0;
> +}
> +
> static int hci_dev_do_open(struct hci_dev *hdev)
> {
> int ret = 0;
> @@ -1422,6 +1453,20 @@ static int hci_dev_do_open(struct hci_dev *hdev)
> if (hdev->setup)
> ret = hdev->setup(hdev);
>
> + if (ret)
> + goto setup_failed;
> +
> + if (test_bit(HCI_QUIRK_USE_BDADDR_PROPERTY, &hdev->quirks)) {
> + if (!bacmp(&hdev->public_addr, BDADDR_ANY))
> + hci_dev_get_bd_addr_from_property(hdev);
> +
> + if (!bacmp(&hdev->public_addr, BDADDR_ANY) ||
> + !hdev->set_bdaddr ||
> + hdev->set_bdaddr(hdev, &hdev->public_addr))
> + hci_dev_set_flag(hdev, HCI_UNCONFIGURED);
> + }
> +
> +setup_failed:
> /* The transport driver can set these quirks before
> * creating the HCI device or in its setup callback.
> *
> diff --git a/net/bluetooth/mgmt.c b/net/bluetooth/mgmt.c
> index ccce954f8146..fae84353d030 100644
> --- a/net/bluetooth/mgmt.c
> +++ b/net/bluetooth/mgmt.c
> @@ -551,7 +551,8 @@ static bool is_configured(struct hci_dev *hdev)
> !hci_dev_test_flag(hdev, HCI_EXT_CONFIGURED))
> return false;
>
> - if (test_bit(HCI_QUIRK_INVALID_BDADDR, &hdev->quirks) &&
> + if ((test_bit(HCI_QUIRK_INVALID_BDADDR, &hdev->quirks) ||
> + test_bit(HCI_QUIRK_USE_BDADDR_PROPERTY, &hdev->quirks)) &&
> !bacmp(&hdev->public_addr, BDADDR_ANY))
> return false;
>
> @@ -566,7 +567,8 @@ static __le32 get_missing_options(struct hci_dev
> *hdev)
> !hci_dev_test_flag(hdev, HCI_EXT_CONFIGURED))
> options |= MGMT_OPTION_EXTERNAL_CONFIG;
>
> - if (test_bit(HCI_QUIRK_INVALID_BDADDR, &hdev->quirks) &&
> + if ((test_bit(HCI_QUIRK_INVALID_BDADDR, &hdev->quirks) ||
> + test_bit(HCI_QUIRK_USE_BDADDR_PROPERTY, &hdev->quirks)) &&
> !bacmp(&hdev->public_addr, BDADDR_ANY))
> options |= MGMT_OPTION_PUBLIC_ADDRESS;
Tested by dts entry and also via btmgnt tool
Tested-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
--
Regards
Balakrishna.
^ permalink raw reply
* Re: [PATCH net v6] net/ipv6: Add anycast addresses to a global hashtable
From: Stephen Hemminger @ 2018-11-01 5:34 UTC (permalink / raw)
To: Jeff Barnhill; +Cc: netdev, davem, kuznet, yoshfuji
In-Reply-To: <20181101001438.9446-1-0xeffeff@gmail.com>
On Thu, 1 Nov 2018 00:14:38 +0000
Jeff Barnhill <0xeffeff@gmail.com> wrote:
> diff --git a/include/net/addrconf.h b/include/net/addrconf.h
> index 14b789a123e7..799af1a037d1 100644
> --- a/include/net/addrconf.h
> +++ b/include/net/addrconf.h
> @@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
> const struct in6_addr *addr);
> bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
> const struct in6_addr *addr);
> +int anycast_init(void);
> +void anycast_cleanup(void);
One minor nit that should be fixed.
To avoid any potential naming conflicts, please prefix all ipv6 global symbols
with ipv6_
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Christoph Paasch @ 2018-11-01 3:50 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <0ce864f0-38b9-59cc-18ea-e071afca347d@gmail.com>
On 31/10/18 - 17:53:22, Eric Dumazet wrote:
> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> > Implementations of Quic might want to create a separate socket for each
> > Quic-connection by creating a connected UDP-socket.
> >
>
> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
>
> It would add a huge overhead in term of memory usage in the kernel,
> and lots of epoll events to manage (say a QUIC server with one million flows, receiving
> very few packets per second per flow)
>
> Maybe you could elaborate on the need of having one UDP socket per connection.
I let Leif chime in on that as the ask came from him. Leif & his team are
implementing Quic in the Apache Traffic Server.
One advantage I can see is that it would allow to benefit from fq_pacing as
one could set sk_pacing_rate simply on the socket. That way there is no need
to implement the pacing in the user-space anymore.
> > To achieve that on the server-side, a "master-socket" needs to wait for
> > incoming new connections and then creates a new socket that will be a
> > connected UDP-socket. To create that latter one, the server needs to
> > first bind() and then connect(). However, after the bind() the server
> > might already receive traffic on that new socket that is unrelated to the
> > Quic-connection at hand. Only after the connect() a full 4-tuple match
> > is happening. So, one can't really create this kind of a server that has
> > a connected UDP-socket per Quic connection.
> >
> > So, what is needed is an "atomic bind & connect" that basically
> > prevents any incoming traffic until the connect() call has been issued
> > at which point the full 4-tuple is known.
> >
> >
> > This patchset implements this functionality and exposes a socket-option
> > to do this.
> >
> > Usage would be:
> >
> > int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> >
> > int val = 1;
> > setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
> >
> > bind(fd, (struct sockaddr *)&src, sizeof(src));
> >
> > /* At this point, incoming traffic will never match on this socket */
> >
> > connect(fd, (struct sockaddr *)&dst, sizeof(dst));
> >
> > /* Only now incoming traffic will reach the socket */
> >
> >
> >
> > There is literally an infinite number of ways on how to implement it,
> > which is why I first send it out as an RFC. With this approach here I
> > chose the least invasive one, just preventing the match on the incoming
> > path.
> >
> >
> > The reason for choosing a SOL_SOCKET socket-option and not at the
> > SOL_UDP-level is because that functionality actually could be useful for
> > other protocols as well. E.g., TCP wants to better use the full 4-tuple space
> > by binding to the source-IP and the destination-IP at the same time.
>
> Passive TCP flows can not benefit from this idea.
>
> Active TCP flows can already do that, I do not really understand what you are suggesting.
What we had here is that we wanted to let a server initiate more than 64K
connections *while* binding also to a source-IP.
With TCP the bind() would then pick a source-port and we ended up hitting the
64K limit. If we could do an atomic "bind + connect", source-port selection
could ensure that the 4-tuple is unique.
Or has something changed in recent times that allows to use the 4-tuple
matching when doing this with TCP?
Christoph
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Eric Dumazet @ 2018-11-01 5:17 UTC (permalink / raw)
To: Christoph Paasch; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <d04b03fd-7a1a-2f0e-6031-f13ccc6b6b48@gmail.com>
On 10/31/2018 10:08 PM, Eric Dumazet wrote:
> Our plan is to use EDT model for UDP packets, so that we can
> still use one (not connected) UDP socket per cpu/thread.
>
> We added in linux-4.20 the EDT model for TCP, and I intend to add the remaining part for sch_fq for 4.21.
>
> UDP can use an ancillary message (SCM_TXTIME) to attach to the skb (which can be a GSO btw) a tstamp,
> and pacing will happen just fine.
>
List of EDT patches in reverse order if you want to take a look :
3f80e08f40cdb308589a49077c87632fa4508b21 tcp: add tcp_reset_xmit_timer() helper
4c16128b6271e70c8743178e90cccee147858503 net: loopback: clear skb->tstamp before netif_rx()
79861919b8896e14b8e5707242721f2312c57ae4 tcp: fix TCP_REPAIR xmit queue setup
825e1c523d5000f067a1614e4a66bb282a2d373c tcp: cdg: use tcp high resolution clock cache
864e5c090749448e879e86bec06ee396aa2c19c5 tcp: optimize tcp internal pacing
7baf33bdac37da65ddce3adf4daa8c7805802174 net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
a7a2563064e963bc5e3f39f533974f2730c0ff56 tcp: mitigate scheduling jitter in EDT pacing model
76a9ebe811fb3d0605cb084f1ae6be5610541865 net: extend sk_pacing_rate to unsigned long
5f6188a8003d080e3753b8f14f4a5a2325ae1ff6 tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
fb420d5d91c1274d5966917725e71f27ed092a85 tcp/fq: move back to CLOCK_MONOTONIC
90caf67b01fabdd51b6cdeeb23b29bf73901df90 net_sched: sch_fq: remove dead code dealing with retransmits
c092dd5f4a7f4e4dbbcc8cf2e50b516bf07e432f tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
ab408b6dc7449c0f791e9e5f8de72fa7428584f2 tcp: switch tcp and sch_fq to new earliest departure time model
fd2bca2aa7893586887b2370e90e85bd0abc805e tcp: switch internal pacing timer to CLOCK_TAI
d3edd06ea8ea9e03de6567fda80b8be57e21a537 tcp: provide earliest departure time in skb->tstamp
9799ccb0e984a5c1311b22a212e7ff96e8b736de tcp: add tcp_wstamp_ns socket field
142537e419234c396890a22806b8644dce21b132 net_sched: sch_fq: switch to CLOCK_TAI
2fd66ffba50716fc5ab481c48db643af3bda2276 tcp: introduce tcp_skb_timestamp_us() helper
72b0094f918294e6cb8cf5c3b4520d928fbb1a57 tcp: switch tcp_clock_ns() to CLOCK_TAI base
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Eric Dumazet @ 2018-11-01 5:08 UTC (permalink / raw)
To: Christoph Paasch; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <20181101035050.GO80792@MacBook-Pro-19.local>
On 10/31/2018 08:50 PM, Christoph Paasch wrote:
> On 31/10/18 - 17:53:22, Eric Dumazet wrote:
>> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
>>> Implementations of Quic might want to create a separate socket for each
>>> Quic-connection by creating a connected UDP-socket.
>>>
>>
>> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
>>
>> It would add a huge overhead in term of memory usage in the kernel,
>> and lots of epoll events to manage (say a QUIC server with one million flows, receiving
>> very few packets per second per flow)
>>
>> Maybe you could elaborate on the need of having one UDP socket per connection.
>
> I let Leif chime in on that as the ask came from him. Leif & his team are
> implementing Quic in the Apache Traffic Server.
>
>
> One advantage I can see is that it would allow to benefit from fq_pacing as
> one could set sk_pacing_rate simply on the socket. That way there is no need
> to implement the pacing in the user-space anymore.
Our plan is to use EDT model for UDP packets, so that we can
still use one (not connected) UDP socket per cpu/thread.
We added in linux-4.20 the EDT model for TCP, and I intend to add the remaining part for sch_fq for 4.21.
UDP can use an ancillary message (SCM_TXTIME) to attach to the skb (which can be a GSO btw) a tstamp,
and pacing will happen just fine.
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Christoph Paasch @ 2018-11-01 5:07 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <74393778-62e8-76f5-3bfc-ae280b407278@gmail.com>
> On Oct 31, 2018, at 10:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
>> On 10/31/2018 08:50 PM, Christoph Paasch wrote:
>>
>> What we had here is that we wanted to let a server initiate more than 64K
>> connections *while* binding also to a source-IP.
>> With TCP the bind() would then pick a source-port and we ended up hitting the
>> 64K limit. If we could do an atomic "bind + connect", source-port selection
>> could ensure that the 4-tuple is unique.
>>
>> Or has something changed in recent times that allows to use the 4-tuple
>> matching when doing this with TCP?
>
>
> Well, yes, although it is not really recent (this came with linux-4.2)
>
> You can now bind to an address only, and let the sport being automatically chosen at connect()
Oh, I didn’t knew about this socket option. Thanks :-)
Christoph
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
>
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Eric Dumazet @ 2018-11-01 5:04 UTC (permalink / raw)
To: Christoph Paasch; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <20181101035050.GO80792@MacBook-Pro-19.local>
On 10/31/2018 08:50 PM, Christoph Paasch wrote:
> What we had here is that we wanted to let a server initiate more than 64K
> connections *while* binding also to a source-IP.
> With TCP the bind() would then pick a source-port and we ended up hitting the
> 64K limit. If we could do an atomic "bind + connect", source-port selection
> could ensure that the 4-tuple is unique.
>
> Or has something changed in recent times that allows to use the 4-tuple
> matching when doing this with TCP?
Well, yes, although it is not really recent (this came with linux-4.2)
You can now bind to an address only, and let the sport being automatically chosen at connect()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
^ permalink raw reply
* RE: [PATCH v2] usbnet: smsc95xx: disable carrier check while suspending
From: RaghuramChary.Jallipalli @ 2018-11-01 13:58 UTC (permalink / raw)
To: frieder.schrempf, steve.glendinning, UNGLinuxDriver
Cc: davem, netdev, linux-usb, linux-kernel, stable
In-Reply-To: <1541022739-24678-1-git-send-email-frieder.schrempf@kontron.de>
> We need to make sure, that the carrier check polling is disabled while
> suspending. Otherwise we can end up with usbnet_read_cmd() being issued
> when only usbnet_read_cmd_nopm() is allowed. If this happens, read
> operations lock up.
>
> Fixes: d69d169493 ("usbnet: smsc95xx: fix link detection for disabled
> autonegotiation")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de>
Reviewed-by: Raghuram Chary J <RaghuramChary.Jallipalli@microchip.com>
-Raghu
^ permalink raw reply
* [PATCH net] rxrpc: Fix lockup due to no error backoff after ack transmit error
From: David Howells @ 2018-11-01 13:39 UTC (permalink / raw)
To: netdev; +Cc: linux-afs, linux-kernel, dhowells
If the network becomes (partially) unavailable, say by disabling IPv6, the
background ACK transmission routine can get itself into a tizzy by
proposing immediate ACK retransmission. Since we're in the call event
processor, that happens immediately without returning to the workqueue
manager.
The condition should clear after a while when either the network comes back
or the call times out.
Fix this by:
(1) When re-proposing an ACK on failed Tx, don't schedule it immediately.
This will allow a certain amount of time to elapse before we try
again.
(2) Enforce a return to the workqueue manager after a certain number of
iterations of the call processing loop.
(3) Add a backoff delay that increases the delay on deferred ACKs by a
jiffy per failed transmission to a limit of HZ. The backoff delay is
cleared on a successful return from kernel_sendmsg().
(4) Cancel calls immediately if the opening sendmsg fails. The layer
above can arrange retransmission or rotate to another server.
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/rxrpc/ar-internal.h | 1 +
net/rxrpc/call_event.c | 18 ++++++++++++++----
net/rxrpc/output.c | 35 +++++++++++++++++++++++++++++++----
3 files changed, 46 insertions(+), 8 deletions(-)
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 382196e57a26..bc628acf4f4f 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -611,6 +611,7 @@ struct rxrpc_call {
* not hard-ACK'd packet follows this.
*/
rxrpc_seq_t tx_top; /* Highest Tx slot allocated. */
+ u16 tx_backoff; /* Delay to insert due to Tx failure */
/* TCP-style slow-start congestion control [RFC5681]. Since the SMSS
* is fixed, we keep these numbers in terms of segments (ie. DATA
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index 8e7434e92097..468efc3660c0 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -123,6 +123,7 @@ static void __rxrpc_propose_ACK(struct rxrpc_call *call, u8 ack_reason,
else
ack_at = expiry;
+ ack_at += READ_ONCE(call->tx_backoff);
ack_at += now;
if (time_before(ack_at, call->ack_at)) {
WRITE_ONCE(call->ack_at, ack_at);
@@ -311,6 +312,7 @@ void rxrpc_process_call(struct work_struct *work)
container_of(work, struct rxrpc_call, processor);
rxrpc_serial_t *send_ack;
unsigned long now, next, t;
+ unsigned int iterations = 0;
rxrpc_see_call(call);
@@ -319,6 +321,11 @@ void rxrpc_process_call(struct work_struct *work)
call->debug_id, rxrpc_call_states[call->state], call->events);
recheck_state:
+ /* Limit the number of times we do this before returning to the manager */
+ iterations++;
+ if (iterations > 5)
+ goto requeue;
+
if (test_and_clear_bit(RXRPC_CALL_EV_ABORT, &call->events)) {
rxrpc_send_abort_packet(call);
goto recheck_state;
@@ -447,13 +454,16 @@ void rxrpc_process_call(struct work_struct *work)
rxrpc_reduce_call_timer(call, next, now, rxrpc_timer_restart);
/* other events may have been raised since we started checking */
- if (call->events && call->state < RXRPC_CALL_COMPLETE) {
- __rxrpc_queue_call(call);
- goto out;
- }
+ if (call->events && call->state < RXRPC_CALL_COMPLETE)
+ goto requeue;
out_put:
rxrpc_put_call(call, rxrpc_call_put);
out:
_leave("");
+ return;
+
+requeue:
+ __rxrpc_queue_call(call);
+ goto out;
}
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index 189418888839..736aa9281100 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -34,6 +34,21 @@ struct rxrpc_abort_buffer {
static const char rxrpc_keepalive_string[] = "";
+/*
+ * Increase Tx backoff on transmission failure and clear it on success.
+ */
+static void rxrpc_tx_backoff(struct rxrpc_call *call, int ret)
+{
+ if (ret < 0) {
+ u16 tx_backoff = READ_ONCE(call->tx_backoff);
+
+ if (tx_backoff < HZ)
+ WRITE_ONCE(call->tx_backoff, tx_backoff + 1);
+ } else {
+ WRITE_ONCE(call->tx_backoff, 0);
+ }
+}
+
/*
* Arrange for a keepalive ping a certain time after we last transmitted. This
* lets the far side know we're still interested in this call and helps keep
@@ -210,6 +225,7 @@ int rxrpc_send_ack_packet(struct rxrpc_call *call, bool ping,
else
trace_rxrpc_tx_packet(call->debug_id, &pkt->whdr,
rxrpc_tx_point_call_ack);
+ rxrpc_tx_backoff(call, ret);
if (call->state < RXRPC_CALL_COMPLETE) {
if (ret < 0) {
@@ -218,7 +234,7 @@ int rxrpc_send_ack_packet(struct rxrpc_call *call, bool ping,
rxrpc_propose_ACK(call, pkt->ack.reason,
ntohs(pkt->ack.maxSkew),
ntohl(pkt->ack.serial),
- true, true,
+ false, true,
rxrpc_propose_ack_retry_tx);
} else {
spin_lock_bh(&call->lock);
@@ -300,7 +316,7 @@ int rxrpc_send_abort_packet(struct rxrpc_call *call)
else
trace_rxrpc_tx_packet(call->debug_id, &pkt.whdr,
rxrpc_tx_point_call_abort);
-
+ rxrpc_tx_backoff(call, ret);
rxrpc_put_connection(conn);
return ret;
@@ -413,6 +429,7 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb,
else
trace_rxrpc_tx_packet(call->debug_id, &whdr,
rxrpc_tx_point_call_data_nofrag);
+ rxrpc_tx_backoff(call, ret);
if (ret == -EMSGSIZE)
goto send_fragmentable;
@@ -445,9 +462,18 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb,
rxrpc_reduce_call_timer(call, expect_rx_by, nowj,
rxrpc_timer_set_for_normal);
}
- }
- rxrpc_set_keepalive(call);
+ rxrpc_set_keepalive(call);
+ } else {
+ /* Cancel the call if the initial transmission fails,
+ * particularly if that's due to network routing issues that
+ * aren't going away anytime soon. The layer above can arrange
+ * the retransmission.
+ */
+ if (!test_and_set_bit(RXRPC_CALL_BEGAN_RX_TIMER, &call->flags))
+ rxrpc_set_call_completion(call, RXRPC_CALL_LOCAL_ERROR,
+ RX_USER_ABORT, ret);
+ }
_leave(" = %d [%u]", ret, call->peer->maxdata);
return ret;
@@ -506,6 +532,7 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb,
else
trace_rxrpc_tx_packet(call->debug_id, &whdr,
rxrpc_tx_point_call_data_frag);
+ rxrpc_tx_backoff(call, ret);
up_write(&conn->params.local->defrag_sem);
goto done;
^ permalink raw reply related
* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Mark Rutland @ 2018-11-01 13:18 UTC (permalink / raw)
To: Trond Myklebust
Cc: linux@roeck-us.net, paul.burton@mips.com,
linux-kernel@vger.kernel.org, ralf@linux-mips.org,
jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
bfields@fieldses.org, linux-mips@linux-mips.org,
linux-nfs@vger.kernel.org, akpm@linux-foundation.org,
anna.schumaker@netapp.com, jhogan@kernel.org,
netdev@vger.kernel.org, davem@davemloft.net, arnd@arndb.de,
"paulus@samba.org" <paulu
In-Reply-To: <4e2438a23d2edf03368950a72ec058d1d299c32e.camel@hammerspace.com>
[adding the atomic maintainers]
On Thu, Nov 01, 2018 at 12:17:31AM +0000, Trond Myklebust wrote:
> On Wed, 2018-10-31 at 23:32 +0000, Paul Burton wrote:
> > (Copying SunRPC & net maintainers.)
> >
> > Hi Guenter,
> >
> > On Wed, Oct 31, 2018 at 03:02:53PM -0700, Guenter Roeck wrote:
> > > The alternatives I can see are
> > > - Do not use cmpxchg64() outside architecture code (ie drop its use
> > > from the offending driver, and keep doing the same whenever the
> > > problem comes up again).
> > > or
> > > - Introduce something like ARCH_HAS_CMPXCHG64 and use it to
> > > determine if cmpxchg64 is supported or not.
> > >
> > > Any preference ?
> >
> > My preference would be option 1 - avoiding cmpxchg64() where
> > possible in generic code. I wouldn't be opposed to the Kconfig
> > option if there are cases where cmpxchg64() can really help
> > performance though.
> >
> > The last time I'm aware of this coming up the affected driver was
> > modified to avoid cmpxchg64() [1].
> >
> > In this particular case I have no idea why
> > net/sunrpc/auth_gss/gss_krb5_seal.c is using cmpxchg64() at all.
> > It's essentially reinventing atomic64_fetch_inc() which is already
> > provided everywhere via CONFIG_GENERIC_ATOMIC64 & the spinlock
> > approach. At least for atomic64_* functions the assumption that all
> > access will be performed using those same functions seems somewhat
> > reasonable.
> >
> > So how does the below look? Trond?
>
> My one question (and the reason why I went with cmpxchg() in the first
> place) would be about the overflow behaviour for atomic_fetch_inc() and
> friends. I believe those functions should be OK on x86, so that when we
> overflow the counter, it behaves like an unsigned value and wraps back
> around. Is that the case for all architectures?
>
> i.e. are atomic_t/atomic64_t always guaranteed to behave like u32/u64
> on increment?
>
> I could not find any documentation that explicitly stated that they
> should.
Peter, Will, I understand that the atomic_t/atomic64_t ops are required
to wrap per 2's-complement. IIUC the refcount code relies on this.
Can you confirm?
Thanks,
Mark.
> Cheers
> Trond
>
> > Thanks,
> > Paul
> >
> > [1] https://patchwork.ozlabs.org/cover/891284/
> >
> > ---
> > diff --git a/include/linux/sunrpc/gss_krb5.h
> > b/include/linux/sunrpc/gss_krb5.h
> > index 131424cefc6a..02c0412e368c 100644
> > --- a/include/linux/sunrpc/gss_krb5.h
> > +++ b/include/linux/sunrpc/gss_krb5.h
> > @@ -107,8 +107,8 @@ struct krb5_ctx {
> > u8 Ksess[GSS_KRB5_MAX_KEYLEN]; /*
> > session key */
> > u8 cksum[GSS_KRB5_MAX_KEYLEN];
> > s32 endtime;
> > - u32 seq_send;
> > - u64 seq_send64;
> > + atomic_t seq_send;
> > + atomic64_t seq_send64;
> > struct xdr_netobj mech_used;
> > u8 initiator_sign[GSS_KRB5_MAX_KEYLEN];
> > u8 acceptor_sign[GSS_KRB5_MAX_KEYLEN];
> > @@ -118,9 +118,6 @@ struct krb5_ctx {
> > u8 acceptor_integ[GSS_KRB5_MAX_KEYLEN];
> > };
> >
> > -extern u32 gss_seq_send_fetch_and_inc(struct krb5_ctx *ctx);
> > -extern u64 gss_seq_send64_fetch_and_inc(struct krb5_ctx *ctx);
> > -
> > /* The length of the Kerberos GSS token header */
> > #define GSS_KRB5_TOK_HDR_LEN (16)
> >
> > diff --git a/net/sunrpc/auth_gss/gss_krb5_mech.c
> > b/net/sunrpc/auth_gss/gss_krb5_mech.c
> > index 7f0424dfa8f6..eab71fc7af3e 100644
> > --- a/net/sunrpc/auth_gss/gss_krb5_mech.c
> > +++ b/net/sunrpc/auth_gss/gss_krb5_mech.c
> > @@ -274,6 +274,7 @@ get_key(const void *p, const void *end,
> > static int
> > gss_import_v1_context(const void *p, const void *end, struct
> > krb5_ctx *ctx)
> > {
> > + u32 seq_send;
> > int tmp;
> >
> > p = simple_get_bytes(p, end, &ctx->initiate, sizeof(ctx-
> > >initiate));
> > @@ -315,9 +316,10 @@ gss_import_v1_context(const void *p, const void
> > *end, struct krb5_ctx *ctx)
> > p = simple_get_bytes(p, end, &ctx->endtime, sizeof(ctx-
> > >endtime));
> > if (IS_ERR(p))
> > goto out_err;
> > - p = simple_get_bytes(p, end, &ctx->seq_send, sizeof(ctx-
> > >seq_send));
> > + p = simple_get_bytes(p, end, &seq_send, sizeof(seq_send));
> > if (IS_ERR(p))
> > goto out_err;
> > + atomic_set(&ctx->seq_send, seq_send);
> > p = simple_get_netobj(p, end, &ctx->mech_used);
> > if (IS_ERR(p))
> > goto out_err;
> > @@ -607,6 +609,7 @@ static int
> > gss_import_v2_context(const void *p, const void *end, struct
> > krb5_ctx *ctx,
> > gfp_t gfp_mask)
> > {
> > + u64 seq_send64;
> > int keylen;
> >
> > p = simple_get_bytes(p, end, &ctx->flags, sizeof(ctx->flags));
> > @@ -617,14 +620,15 @@ gss_import_v2_context(const void *p, const void
> > *end, struct krb5_ctx *ctx,
> > p = simple_get_bytes(p, end, &ctx->endtime, sizeof(ctx-
> > >endtime));
> > if (IS_ERR(p))
> > goto out_err;
> > - p = simple_get_bytes(p, end, &ctx->seq_send64, sizeof(ctx-
> > >seq_send64));
> > + p = simple_get_bytes(p, end, &seq_send64, sizeof(seq_send64));
> > if (IS_ERR(p))
> > goto out_err;
> > + atomic64_set(&ctx->seq_send64, seq_send64);
> > /* set seq_send for use by "older" enctypes */
> > - ctx->seq_send = ctx->seq_send64;
> > - if (ctx->seq_send64 != ctx->seq_send) {
> > - dprintk("%s: seq_send64 %lx, seq_send %x overflow?\n",
> > __func__,
> > - (unsigned long)ctx->seq_send64, ctx->seq_send);
> > + atomic_set(&ctx->seq_send, seq_send64);
> > + if (seq_send64 != atomic_read(&ctx->seq_send)) {
> > + dprintk("%s: seq_send64 %llx, seq_send %x overflow?\n",
> > __func__,
> > + seq_send64, atomic_read(&ctx->seq_send));
> > p = ERR_PTR(-EINVAL);
> > goto out_err;
> > }
> > diff --git a/net/sunrpc/auth_gss/gss_krb5_seal.c
> > b/net/sunrpc/auth_gss/gss_krb5_seal.c
> > index b4adeb06660b..48fe4a591b54 100644
> > --- a/net/sunrpc/auth_gss/gss_krb5_seal.c
> > +++ b/net/sunrpc/auth_gss/gss_krb5_seal.c
> > @@ -123,30 +123,6 @@ setup_token_v2(struct krb5_ctx *ctx, struct
> > xdr_netobj *token)
> > return krb5_hdr;
> > }
> >
> > -u32
> > -gss_seq_send_fetch_and_inc(struct krb5_ctx *ctx)
> > -{
> > - u32 old, seq_send = READ_ONCE(ctx->seq_send);
> > -
> > - do {
> > - old = seq_send;
> > - seq_send = cmpxchg(&ctx->seq_send, old, old + 1);
> > - } while (old != seq_send);
> > - return seq_send;
> > -}
> > -
> > -u64
> > -gss_seq_send64_fetch_and_inc(struct krb5_ctx *ctx)
> > -{
> > - u64 old, seq_send = READ_ONCE(ctx->seq_send);
> > -
> > - do {
> > - old = seq_send;
> > - seq_send = cmpxchg64(&ctx->seq_send64, old, old + 1);
> > - } while (old != seq_send);
> > - return seq_send;
> > -}
> > -
> > static u32
> > gss_get_mic_v1(struct krb5_ctx *ctx, struct xdr_buf *text,
> > struct xdr_netobj *token)
> > @@ -177,7 +153,7 @@ gss_get_mic_v1(struct krb5_ctx *ctx, struct
> > xdr_buf *text,
> >
> > memcpy(ptr + GSS_KRB5_TOK_HDR_LEN, md5cksum.data,
> > md5cksum.len);
> >
> > - seq_send = gss_seq_send_fetch_and_inc(ctx);
> > + seq_send = atomic_fetch_inc(&ctx->seq_send);
> >
> > if (krb5_make_seq_num(ctx, ctx->seq, ctx->initiate ? 0 : 0xff,
> > seq_send, ptr + GSS_KRB5_TOK_HDR_LEN, ptr
> > + 8))
> > @@ -205,7 +181,7 @@ gss_get_mic_v2(struct krb5_ctx *ctx, struct
> > xdr_buf *text,
> >
> > /* Set up the sequence number. Now 64-bits in clear
> > * text and w/o direction indicator */
> > - seq_send_be64 = cpu_to_be64(gss_seq_send64_fetch_and_inc(ctx));
> > + seq_send_be64 = cpu_to_be64(atomic64_fetch_inc(&ctx-
> > >seq_send64));
> > memcpy(krb5_hdr + 8, (char *) &seq_send_be64, 8);
> >
> > if (ctx->initiate) {
> > diff --git a/net/sunrpc/auth_gss/gss_krb5_wrap.c
> > b/net/sunrpc/auth_gss/gss_krb5_wrap.c
> > index 962fa84e6db1..5cdde6cb703a 100644
> > --- a/net/sunrpc/auth_gss/gss_krb5_wrap.c
> > +++ b/net/sunrpc/auth_gss/gss_krb5_wrap.c
> > @@ -228,7 +228,7 @@ gss_wrap_kerberos_v1(struct krb5_ctx *kctx, int
> > offset,
> >
> > memcpy(ptr + GSS_KRB5_TOK_HDR_LEN, md5cksum.data,
> > md5cksum.len);
> >
> > - seq_send = gss_seq_send_fetch_and_inc(kctx);
> > + seq_send = atomic_fetch_inc(&kctx->seq_send);
> >
> > /* XXX would probably be more efficient to compute checksum
> > * and encrypt at the same time: */
> > @@ -475,7 +475,7 @@ gss_wrap_kerberos_v2(struct krb5_ctx *kctx, u32
> > offset,
> > *be16ptr++ = 0;
> >
> > be64ptr = (__be64 *)be16ptr;
> > - *be64ptr = cpu_to_be64(gss_seq_send64_fetch_and_inc(kctx));
> > + *be64ptr = cpu_to_be64(atomic64_fetch_inc(&kctx->seq_send64));
> >
> > err = (*kctx->gk5e->encrypt_v2)(kctx, offset, buf, pages);
> > if (err)
> --
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space
>
>
^ permalink raw reply
* Re: [PATCH net] net: hns3: Fix for out-of-bounds access when setting pfc back pressure
From: John Garry @ 2018-11-01 12:49 UTC (permalink / raw)
To: Yunsheng Lin, davem
Cc: huangdaode, xuwei5, liguozhu, Yisen.Zhuang, linuxarm, salil.mehta,
lipeng321, netdev, linux-kernel
In-Reply-To: <1541074363-98630-1-git-send-email-linyunsheng@huawei.com>
On 01/11/2018 12:12, Yunsheng Lin wrote:
> The vport should be initialized to hdev->vport for each bp group,
> otherwise it will cause out-of-bounds access and bp setting not
> correct problem.
>
> [ 35.254124] BUG: KASAN: slab-out-of-bounds in hclge_pause_setup_hw+0x2a0/0x3f8 [hclge]
> [ 35.254126] Read of size 2 at addr ffff803b6651581a by task kworker/0:1/14
>
> [ 35.254132] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 4.19.0-rc7-hulk+ #85
> [ 35.254133] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI RC0 - B052 (V0.52) 09/14/2018
> [ 35.254141] Workqueue: events work_for_cpu_fn
> [ 35.254144] Call trace:
> [ 35.254147] dump_backtrace+0x0/0x2f0
> [ 35.254149] show_stack+0x24/0x30
> [ 35.254154] dump_stack+0x110/0x184
> [ 35.254157] print_address_description+0x168/0x2b0
> [ 35.254160] kasan_report+0x184/0x310
> [ 35.254162] __asan_load2+0x7c/0xa0
> [ 35.254170] hclge_pause_setup_hw+0x2a0/0x3f8 [hclge]
> [ 35.254177] hclge_tm_init_hw+0x794/0x9f0 [hclge]
> [ 35.254184] hclge_tm_schd_init+0x48/0x58 [hclge]
> [ 35.254191] hclge_init_ae_dev+0x778/0x1168 [hclge]
> [ 35.254196] hnae3_register_ae_dev+0x14c/0x298 [hnae3]
> [ 35.254206] hns3_probe+0x88/0xa8 [hns3]
> [ 35.254210] local_pci_probe+0x7c/0xf0
> [ 35.254212] work_for_cpu_fn+0x34/0x50
> [ 35.254214] process_one_work+0x4d4/0xa38
> [ 35.254216] worker_thread+0x55c/0x8d8
> [ 35.254219] kthread+0x1b0/0x1b8
> [ 35.254222] ret_from_fork+0x10/0x1c
>
> [ 35.254224] The buggy address belongs to the page:
> [ 35.254228] page:ffff7e00ed994400 count:1 mapcount:0 mapping:0000000000000000 index:0x0 compound_mapcount: 0
> [ 35.273835] flags: 0xfffff8000008000(head)
> [ 35.282007] raw: 0fffff8000008000 dead000000000100 dead000000000200 0000000000000000
> [ 35.282010] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> [ 35.282012] page dumped because: kasan: bad access detected
>
> [ 35.282014] Memory state around the buggy address:
> [ 35.282017] ffff803b66515700: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282019] ffff803b66515780: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282021] >ffff803b66515800: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282022] ^
> [ 35.282024] ffff803b66515880: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282026] ffff803b66515900: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282028] ==================================================================
> [ 35.282029] Disabling lock debugging due to kernel taint
> [ 35.282747] hclge driver initialization finished.
>
> Fixes: 67bf2541f4b9 ("net: hns3: Fixes the back pressure setting when sriov is enabled")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
> index aa5cb98..0c9c6ae 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
> @@ -1168,11 +1168,12 @@ static int hclge_pfc_setup_hw(struct hclge_dev *hdev)
> */
> static int hclge_bp_setup_hw(struct hclge_dev *hdev, u8 tc)
> {
> - struct hclge_vport *vport = hdev->vport;
> u32 i, k, qs_bitmap;
> int ret;
>
> for (i = 0; i < HCLGE_BP_GRP_NUM; i++) {
> + struct hclge_vport *vport = hdev->vport;
> +
It seems to me that the bug is fixed but the code is not much better.
Since variable vport is only referenced in the inner loop, I would
define it there.
> qs_bitmap = 0;
>
> for (k = 0; k < hdev->num_alloc_vport; k++) {
>
static int hclge_bp_setup_hw(struct hclge_dev *hdev, u8 tc)
{
- struct hclge_vport *vport = hdev->vport;
- u32 i, k, qs_bitmap;
- int ret;
+ int ret, i;
for (i = 0; i < HCLGE_BP_GRP_NUM; i++) {
- qs_bitmap = 0;
+ u32 qs_bitmap = 0;
+ int k;
for (k = 0; k < hdev->num_alloc_vport; k++) {
+ struct hclge_vport *vport = &hdev->vport[k];
u16 qs_id = vport->qs_offset + tc;
u8 grp, sub_grp;
...
if (i == grp)
qs_bitmap |= (1 << sub_grp);
-
- vport++;
... so this seems better and safer, as you're explicitly limiting vport
within bounds of hdev->num_alloc_vport.
^ permalink raw reply
* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: David Ahern @ 2018-11-01 3:37 UTC (permalink / raw)
To: Paweł Staszewski, netdev
In-Reply-To: <61697e49-e839-befc-8330-fc00187c48ee@itcare.pl>
On 10/31/18 3:57 PM, Paweł Staszewski wrote:
> Hi
>
> So maybee someone will be interested how linux kernel handles normal
> traffic (not pktgen :) )
>
>
> Server HW configuration:
>
> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>
> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
>
>
> Server software:
>
> FRR - as routing daemon
>
> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa
> node)
>
> enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node)
>
>
> Maximum traffic that server can handle:
>
> Bandwidth
>
> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> input: /proc/net/dev type: rate
> \ iface Rx Tx Total
> ==============================================================================
>
> enp175s0f1: 28.51 Gb/s 37.24 Gb/s
> 65.74 Gb/s
> enp175s0f0: 38.07 Gb/s 28.44 Gb/s
> 66.51 Gb/s
> ------------------------------------------------------------------------------
>
> total: 66.58 Gb/s 65.67 Gb/s
> 132.25 Gb/s
>
>
> Packets per second:
>
> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> input: /proc/net/dev type: rate
> - iface Rx Tx Total
> ==============================================================================
>
> enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s
> enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s
> ------------------------------------------------------------------------------
>
> total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s
>
>
> After reaching that limits nics on the upstream side (more RX traffic)
> start to drop packets
>
>
> I just dont understand that server can't handle more bandwidth
> (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX
> side are increasing.
>
> Was thinking that maybee reached some pcie x16 limit - but x16 8GT is
> 126Gbit - and also when testing with pktgen i can reach more bw and pps
> (like 4x more comparing to normal internet traffic)
>
> And wondering if there is something that can be improved here.
This is mainly a forwarding use case? Seems so based on the perf report.
I suspect forwarding with XDP would show pretty good improvement. You
need the vlan changes I have queued up though.
^ permalink raw reply
* Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
From: David Ahern @ 2018-11-01 2:53 UTC (permalink / raw)
To: Jeff Barnhill, davem; +Cc: netdev, Alexey Kuznetsov, yoshfuji
In-Reply-To: <CAL6e_pfzPzt=rAxjWKAWHQqdrqejZ5e6vA1YoB3nGyc3_jeJeA@mail.gmail.com>
On 10/31/18 6:02 PM, Jeff Barnhill wrote:
> I'll follow this email with a new patch using ifacaddr6 instead of
> creating a new struct. I ended up using fib6_nh.nh_dev to get the net,
> instead of adding a back pointer to idev. It seems that idev was
> recently removed in lieu of this, so if this is incorrect, please let
> me know. Hopefully, I got the locking correct.
That's correct. Make sure that the anycast code can not be accessed for
reject routes which will not have a device set. Should be ok, but double
check.
^ permalink raw reply
* [PATCH v1 net] net: dsa: microchip: initialize mutex before use
From: Tristram.Ha @ 2018-11-01 2:49 UTC (permalink / raw)
To: David S. Miller
Cc: Tristram Ha, Andrew Lunn, Florian Fainelli, Pavel Machek,
UNGLinuxDriver, netdev
From: Tristram Ha <Tristram.Ha@microchip.com>
Initialize mutex before use. Avoid kernel complaint when
CONFIG_DEBUG_LOCK_ALLOC is enabled.
Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
---
v1
- Remove comment
drivers/net/dsa/microchip/ksz_common.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index 54e0ca6..86b6464 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev)
{
int i;
- mutex_init(&dev->reg_mutex);
- mutex_init(&dev->stats_mutex);
- mutex_init(&dev->alu_mutex);
- mutex_init(&dev->vlan_mutex);
-
dev->ds->ops = &ksz_switch_ops;
for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) {
@@ -1206,6 +1201,11 @@ int ksz_switch_register(struct ksz_device *dev)
if (dev->pdata)
dev->chip_id = dev->pdata->chip_id;
+ mutex_init(&dev->reg_mutex);
+ mutex_init(&dev->stats_mutex);
+ mutex_init(&dev->alu_mutex);
+ mutex_init(&dev->vlan_mutex);
+
if (ksz_switch_detect(dev))
return -EINVAL;
--
1.9.1
^ permalink raw reply related
* Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed
From: David Ahern @ 2018-11-01 2:48 UTC (permalink / raw)
To: Stefano Brivio; +Cc: Yoann P., Stephen Hemminger, netdev
In-Reply-To: <20181030183420.181b3d51@redhat.com>
[ sorry, too many distractions and I forgot to respond ]
On 10/30/18 11:34 AM, Stefano Brivio wrote:
> On Tue, 30 Oct 2018 10:34:45 -0600
> David Ahern <dsahern@gmail.com> wrote:
>
>> A more flexible approach is to use format strings to allow users to
>> customize the output order and whitespace as well. So for ss and your
>> column list (winging it here):
>>
>> netid = %N
>> state = %S
>> recv Q = %Qr
>> send Q = %Qs
>> local address = %Al
>> lport port = %Pl
>> remote address = %Ar
>> remote port = %Pr
>> process data = %p
>> ...
>>
>> then a format string could be: "%S %Qr %Qs %Al:%Pl %Ar:%Pr %p\n"
>
> I like the idea indeed, but I see two issues with ss:
>
> - the current column abstraction is rather lightweight, things are
> already buffered in the defined column order so we don't have to jump
> back and forth in the buffer while rendering. Doing that needs some
> extra care to avoid a performance hit, but it's probably doable, I
> can put that on my to-do list
The ss command is always a pain; it's much better in newer releases but
I am always having to adjust terminal width.
>
> - how would you model automatic spacing in a format string? Should we
> support width specifiers? Disable automatic spacing if a format
> string is given? It might even make sense to allow partial automatic
Follow the format string for whitespace and order, and
yes, on the width specifiers if possible.
> spacing with a special character in the format string, that is:
>
> "%S.%Qr.%Qs %Al:%Pl %Ar:%Pr %p\n"
>
> would mean "align everything to the right, distribute remaining
> whitespace between %S, %Qr and %Qs". But it looks rather complicated
> at a glance.
>
My concern here is that once this goes in for 1 command, the others in
iproute2 need to follow suit - meaning same syntax style for all
commands. Given that I'd prefer we get a reasonable consensus on syntax
that will work across commands -- ss, ip, tc. If it is as simple as
column names with a fixed order, that is fine but just give proper
consideration given the impact.
The 'ip' syntax for example gets ugly quick with the various link types
and their options. We don't need to allow every detail of each link type
to be customized, but there are common attributes for links (e.g., mtu,
ifindex, link flags, lladdr), addresses, and link types such as bridges
and bonds where we can improve the amount of data thrown at a user -- a
better, more customizable version of what the brief option targeted.
^ permalink raw reply
* Re: [PATCH net] openvswitch: Fix push/pop ethernet validation
From: David Miller @ 2018-11-01 1:38 UTC (permalink / raw)
To: jcaamano; +Cc: netdev, pshelar
In-Reply-To: <20181031175203.23808-1-jcaamano@suse.com>
From: Jaime Caamaño Ruiz <jcaamano@suse.com>
Date: Wed, 31 Oct 2018 18:52:03 +0100
> When there are both pop and push ethernet header actions among the
> actions to be applied to a packet, an unexpected EINVAL (Invalid
> argument) error is obtained. This is due to mac_proto not being reset
> correctly when those actions are validated.
>
> Reported-at:
> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047554.html
> Fixes: 91820da6ae85 ("openvswitch: add Ethernet push and pop actions")
> Signed-off-by: Jaime Caamaño Ruiz <jcaamano@suse.com>
Applied and queued up for -stable.
^ permalink raw reply
* Geschäfts- / Projektkredit 1,5%
From: SafetyNet Credit @ 2018-11-01 1:29 UTC (permalink / raw)
--
Schönen Tag.
Wir hoffen, Sie gut zu treffen.
Benötigen Sie einen dringenden Kredit für ein Geschäft oder ein Projekt?
Wir bieten Kredite zu 1,5% und wir können Ihre Transaktion innerhalb von
3 bis 5 Arbeitstagen abschließen.
Wenn Sie ernsthaft an einem Kredit interessiert sind, empfehlen wir
Ihnen, unten die Einzelheiten zur Bearbeitung Ihrer Transaktion
anzugeben.
Vollständiger Name:..............
Darlehensbetrag: ..............
Darlehensdauer: ..............
Darlehen Zweck:..............
Telefon:..............
Wir erwarten Ihre Darlehensdaten wie oben beschrieben für die Abwicklung
Ihrer Transaktion.
Freundliche Grüße.
Wilson Rog.
Buchhalter / Berater
^ permalink raw reply
* Re: [net 0/8][pull request] Intel Wired LAN Driver Updates 2018-10-31
From: David Miller @ 2018-11-01 1:23 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20181031194254.16417-1-jeffrey.t.kirsher@intel.com>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 31 Oct 2018 12:42:46 -0700
> This series contains a various collection of fixes.
...
> The following are changes since commit a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2:
> net: mvpp2: Fix affinity hint allocation
> and are available in the git repository at:
> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE
Pulled, thanks Jeff.
^ permalink raw reply
* Re: [PATCH bpf 2/4] bpf: don't set id on after map lookup with ptr_to_map_val return
From: Roman Gushchin @ 2018-11-01 1:11 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: ast@kernel.org, netdev@vger.kernel.org
In-Reply-To: <20181031230555.3371-3-daniel@iogearbox.net>
On Thu, Nov 01, 2018 at 12:05:53AM +0100, Daniel Borkmann wrote:
> In the verifier there is no such semantics where registers with
> PTR_TO_MAP_VALUE type have an id assigned to them. This is only
> used in PTR_TO_MAP_VALUE_OR_NULL and later on nullified once the
> test against NULL has been pattern matched and type transformed
> into PTR_TO_MAP_VALUE.
>
> Fixes: 3e6a4b3e0289 ("bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Roman Gushchin <guro@fb.com>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
Looks good to me.
Acked-by: Roman Gushchin <guro@fb.com>
Thanks!
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Eric Dumazet @ 2018-11-01 0:53 UTC (permalink / raw)
To: Christoph Paasch, netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <20181031232635.33750-1-cpaasch@apple.com>
On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> Implementations of Quic might want to create a separate socket for each
> Quic-connection by creating a connected UDP-socket.
>
Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
It would add a huge overhead in term of memory usage in the kernel,
and lots of epoll events to manage (say a QUIC server with one million flows, receiving
very few packets per second per flow)
Maybe you could elaborate on the need of having one UDP socket per connection.
> To achieve that on the server-side, a "master-socket" needs to wait for
> incoming new connections and then creates a new socket that will be a
> connected UDP-socket. To create that latter one, the server needs to
> first bind() and then connect(). However, after the bind() the server
> might already receive traffic on that new socket that is unrelated to the
> Quic-connection at hand. Only after the connect() a full 4-tuple match
> is happening. So, one can't really create this kind of a server that has
> a connected UDP-socket per Quic connection.
>
> So, what is needed is an "atomic bind & connect" that basically
> prevents any incoming traffic until the connect() call has been issued
> at which point the full 4-tuple is known.
>
>
> This patchset implements this functionality and exposes a socket-option
> to do this.
>
> Usage would be:
>
> int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
>
> int val = 1;
> setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
>
> bind(fd, (struct sockaddr *)&src, sizeof(src));
>
> /* At this point, incoming traffic will never match on this socket */
>
> connect(fd, (struct sockaddr *)&dst, sizeof(dst));
>
> /* Only now incoming traffic will reach the socket */
>
>
>
> There is literally an infinite number of ways on how to implement it,
> which is why I first send it out as an RFC. With this approach here I
> chose the least invasive one, just preventing the match on the incoming
> path.
>
>
> The reason for choosing a SOL_SOCKET socket-option and not at the
> SOL_UDP-level is because that functionality actually could be useful for
> other protocols as well. E.g., TCP wants to better use the full 4-tuple space
> by binding to the source-IP and the destination-IP at the same time.
Passive TCP flows can not benefit from this idea.
Active TCP flows can already do that, I do not really understand what you are suggesting.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox