* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Christoph Paasch @ 2018-11-01 3:50 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <0ce864f0-38b9-59cc-18ea-e071afca347d@gmail.com>
On 31/10/18 - 17:53:22, Eric Dumazet wrote:
> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> > Implementations of Quic might want to create a separate socket for each
> > Quic-connection by creating a connected UDP-socket.
> >
>
> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
>
> It would add a huge overhead in term of memory usage in the kernel,
> and lots of epoll events to manage (say a QUIC server with one million flows, receiving
> very few packets per second per flow)
>
> Maybe you could elaborate on the need of having one UDP socket per connection.
I let Leif chime in on that as the ask came from him. Leif & his team are
implementing Quic in the Apache Traffic Server.
One advantage I can see is that it would allow to benefit from fq_pacing as
one could set sk_pacing_rate simply on the socket. That way there is no need
to implement the pacing in the user-space anymore.
> > To achieve that on the server-side, a "master-socket" needs to wait for
> > incoming new connections and then creates a new socket that will be a
> > connected UDP-socket. To create that latter one, the server needs to
> > first bind() and then connect(). However, after the bind() the server
> > might already receive traffic on that new socket that is unrelated to the
> > Quic-connection at hand. Only after the connect() a full 4-tuple match
> > is happening. So, one can't really create this kind of a server that has
> > a connected UDP-socket per Quic connection.
> >
> > So, what is needed is an "atomic bind & connect" that basically
> > prevents any incoming traffic until the connect() call has been issued
> > at which point the full 4-tuple is known.
> >
> >
> > This patchset implements this functionality and exposes a socket-option
> > to do this.
> >
> > Usage would be:
> >
> > int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> >
> > int val = 1;
> > setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
> >
> > bind(fd, (struct sockaddr *)&src, sizeof(src));
> >
> > /* At this point, incoming traffic will never match on this socket */
> >
> > connect(fd, (struct sockaddr *)&dst, sizeof(dst));
> >
> > /* Only now incoming traffic will reach the socket */
> >
> >
> >
> > There is literally an infinite number of ways on how to implement it,
> > which is why I first send it out as an RFC. With this approach here I
> > chose the least invasive one, just preventing the match on the incoming
> > path.
> >
> >
> > The reason for choosing a SOL_SOCKET socket-option and not at the
> > SOL_UDP-level is because that functionality actually could be useful for
> > other protocols as well. E.g., TCP wants to better use the full 4-tuple space
> > by binding to the source-IP and the destination-IP at the same time.
>
> Passive TCP flows can not benefit from this idea.
>
> Active TCP flows can already do that, I do not really understand what you are suggesting.
What we had here is that we wanted to let a server initiate more than 64K
connections *while* binding also to a source-IP.
With TCP the bind() would then pick a source-port and we ended up hitting the
64K limit. If we could do an atomic "bind + connect", source-port selection
could ensure that the 4-tuple is unique.
Or has something changed in recent times that allows to use the 4-tuple
matching when doing this with TCP?
Christoph
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Eric Dumazet @ 2018-11-01 5:17 UTC (permalink / raw)
To: Christoph Paasch; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <d04b03fd-7a1a-2f0e-6031-f13ccc6b6b48@gmail.com>
On 10/31/2018 10:08 PM, Eric Dumazet wrote:
> Our plan is to use EDT model for UDP packets, so that we can
> still use one (not connected) UDP socket per cpu/thread.
>
> We added in linux-4.20 the EDT model for TCP, and I intend to add the remaining part for sch_fq for 4.21.
>
> UDP can use an ancillary message (SCM_TXTIME) to attach to the skb (which can be a GSO btw) a tstamp,
> and pacing will happen just fine.
>
List of EDT patches in reverse order if you want to take a look :
3f80e08f40cdb308589a49077c87632fa4508b21 tcp: add tcp_reset_xmit_timer() helper
4c16128b6271e70c8743178e90cccee147858503 net: loopback: clear skb->tstamp before netif_rx()
79861919b8896e14b8e5707242721f2312c57ae4 tcp: fix TCP_REPAIR xmit queue setup
825e1c523d5000f067a1614e4a66bb282a2d373c tcp: cdg: use tcp high resolution clock cache
864e5c090749448e879e86bec06ee396aa2c19c5 tcp: optimize tcp internal pacing
7baf33bdac37da65ddce3adf4daa8c7805802174 net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
a7a2563064e963bc5e3f39f533974f2730c0ff56 tcp: mitigate scheduling jitter in EDT pacing model
76a9ebe811fb3d0605cb084f1ae6be5610541865 net: extend sk_pacing_rate to unsigned long
5f6188a8003d080e3753b8f14f4a5a2325ae1ff6 tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
fb420d5d91c1274d5966917725e71f27ed092a85 tcp/fq: move back to CLOCK_MONOTONIC
90caf67b01fabdd51b6cdeeb23b29bf73901df90 net_sched: sch_fq: remove dead code dealing with retransmits
c092dd5f4a7f4e4dbbcc8cf2e50b516bf07e432f tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
ab408b6dc7449c0f791e9e5f8de72fa7428584f2 tcp: switch tcp and sch_fq to new earliest departure time model
fd2bca2aa7893586887b2370e90e85bd0abc805e tcp: switch internal pacing timer to CLOCK_TAI
d3edd06ea8ea9e03de6567fda80b8be57e21a537 tcp: provide earliest departure time in skb->tstamp
9799ccb0e984a5c1311b22a212e7ff96e8b736de tcp: add tcp_wstamp_ns socket field
142537e419234c396890a22806b8644dce21b132 net_sched: sch_fq: switch to CLOCK_TAI
2fd66ffba50716fc5ab481c48db643af3bda2276 tcp: introduce tcp_skb_timestamp_us() helper
72b0094f918294e6cb8cf5c3b4520d928fbb1a57 tcp: switch tcp_clock_ns() to CLOCK_TAI base
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Eric Dumazet @ 2018-11-01 5:08 UTC (permalink / raw)
To: Christoph Paasch; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <20181101035050.GO80792@MacBook-Pro-19.local>
On 10/31/2018 08:50 PM, Christoph Paasch wrote:
> On 31/10/18 - 17:53:22, Eric Dumazet wrote:
>> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
>>> Implementations of Quic might want to create a separate socket for each
>>> Quic-connection by creating a connected UDP-socket.
>>>
>>
>> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
>>
>> It would add a huge overhead in term of memory usage in the kernel,
>> and lots of epoll events to manage (say a QUIC server with one million flows, receiving
>> very few packets per second per flow)
>>
>> Maybe you could elaborate on the need of having one UDP socket per connection.
>
> I let Leif chime in on that as the ask came from him. Leif & his team are
> implementing Quic in the Apache Traffic Server.
>
>
> One advantage I can see is that it would allow to benefit from fq_pacing as
> one could set sk_pacing_rate simply on the socket. That way there is no need
> to implement the pacing in the user-space anymore.
Our plan is to use EDT model for UDP packets, so that we can
still use one (not connected) UDP socket per cpu/thread.
We added in linux-4.20 the EDT model for TCP, and I intend to add the remaining part for sch_fq for 4.21.
UDP can use an ancillary message (SCM_TXTIME) to attach to the skb (which can be a GSO btw) a tstamp,
and pacing will happen just fine.
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Christoph Paasch @ 2018-11-01 5:07 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <74393778-62e8-76f5-3bfc-ae280b407278@gmail.com>
> On Oct 31, 2018, at 10:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
>> On 10/31/2018 08:50 PM, Christoph Paasch wrote:
>>
>> What we had here is that we wanted to let a server initiate more than 64K
>> connections *while* binding also to a source-IP.
>> With TCP the bind() would then pick a source-port and we ended up hitting the
>> 64K limit. If we could do an atomic "bind + connect", source-port selection
>> could ensure that the 4-tuple is unique.
>>
>> Or has something changed in recent times that allows to use the 4-tuple
>> matching when doing this with TCP?
>
>
> Well, yes, although it is not really recent (this came with linux-4.2)
>
> You can now bind to an address only, and let the sport being automatically chosen at connect()
Oh, I didn’t knew about this socket option. Thanks :-)
Christoph
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
>
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Eric Dumazet @ 2018-11-01 5:04 UTC (permalink / raw)
To: Christoph Paasch; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <20181101035050.GO80792@MacBook-Pro-19.local>
On 10/31/2018 08:50 PM, Christoph Paasch wrote:
> What we had here is that we wanted to let a server initiate more than 64K
> connections *while* binding also to a source-IP.
> With TCP the bind() would then pick a source-port and we ended up hitting the
> 64K limit. If we could do an atomic "bind + connect", source-port selection
> could ensure that the 4-tuple is unique.
>
> Or has something changed in recent times that allows to use the 4-tuple
> matching when doing this with TCP?
Well, yes, although it is not really recent (this came with linux-4.2)
You can now bind to an address only, and let the sport being automatically chosen at connect()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
^ permalink raw reply
* RE: [PATCH v2] usbnet: smsc95xx: disable carrier check while suspending
From: RaghuramChary.Jallipalli @ 2018-11-01 13:58 UTC (permalink / raw)
To: frieder.schrempf, steve.glendinning, UNGLinuxDriver
Cc: davem, netdev, linux-usb, linux-kernel, stable
In-Reply-To: <1541022739-24678-1-git-send-email-frieder.schrempf@kontron.de>
> We need to make sure, that the carrier check polling is disabled while
> suspending. Otherwise we can end up with usbnet_read_cmd() being issued
> when only usbnet_read_cmd_nopm() is allowed. If this happens, read
> operations lock up.
>
> Fixes: d69d169493 ("usbnet: smsc95xx: fix link detection for disabled
> autonegotiation")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de>
Reviewed-by: Raghuram Chary J <RaghuramChary.Jallipalli@microchip.com>
-Raghu
^ permalink raw reply
* [PATCH net] rxrpc: Fix lockup due to no error backoff after ack transmit error
From: David Howells @ 2018-11-01 13:39 UTC (permalink / raw)
To: netdev; +Cc: linux-afs, linux-kernel, dhowells
If the network becomes (partially) unavailable, say by disabling IPv6, the
background ACK transmission routine can get itself into a tizzy by
proposing immediate ACK retransmission. Since we're in the call event
processor, that happens immediately without returning to the workqueue
manager.
The condition should clear after a while when either the network comes back
or the call times out.
Fix this by:
(1) When re-proposing an ACK on failed Tx, don't schedule it immediately.
This will allow a certain amount of time to elapse before we try
again.
(2) Enforce a return to the workqueue manager after a certain number of
iterations of the call processing loop.
(3) Add a backoff delay that increases the delay on deferred ACKs by a
jiffy per failed transmission to a limit of HZ. The backoff delay is
cleared on a successful return from kernel_sendmsg().
(4) Cancel calls immediately if the opening sendmsg fails. The layer
above can arrange retransmission or rotate to another server.
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/rxrpc/ar-internal.h | 1 +
net/rxrpc/call_event.c | 18 ++++++++++++++----
net/rxrpc/output.c | 35 +++++++++++++++++++++++++++++++----
3 files changed, 46 insertions(+), 8 deletions(-)
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 382196e57a26..bc628acf4f4f 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -611,6 +611,7 @@ struct rxrpc_call {
* not hard-ACK'd packet follows this.
*/
rxrpc_seq_t tx_top; /* Highest Tx slot allocated. */
+ u16 tx_backoff; /* Delay to insert due to Tx failure */
/* TCP-style slow-start congestion control [RFC5681]. Since the SMSS
* is fixed, we keep these numbers in terms of segments (ie. DATA
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index 8e7434e92097..468efc3660c0 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -123,6 +123,7 @@ static void __rxrpc_propose_ACK(struct rxrpc_call *call, u8 ack_reason,
else
ack_at = expiry;
+ ack_at += READ_ONCE(call->tx_backoff);
ack_at += now;
if (time_before(ack_at, call->ack_at)) {
WRITE_ONCE(call->ack_at, ack_at);
@@ -311,6 +312,7 @@ void rxrpc_process_call(struct work_struct *work)
container_of(work, struct rxrpc_call, processor);
rxrpc_serial_t *send_ack;
unsigned long now, next, t;
+ unsigned int iterations = 0;
rxrpc_see_call(call);
@@ -319,6 +321,11 @@ void rxrpc_process_call(struct work_struct *work)
call->debug_id, rxrpc_call_states[call->state], call->events);
recheck_state:
+ /* Limit the number of times we do this before returning to the manager */
+ iterations++;
+ if (iterations > 5)
+ goto requeue;
+
if (test_and_clear_bit(RXRPC_CALL_EV_ABORT, &call->events)) {
rxrpc_send_abort_packet(call);
goto recheck_state;
@@ -447,13 +454,16 @@ void rxrpc_process_call(struct work_struct *work)
rxrpc_reduce_call_timer(call, next, now, rxrpc_timer_restart);
/* other events may have been raised since we started checking */
- if (call->events && call->state < RXRPC_CALL_COMPLETE) {
- __rxrpc_queue_call(call);
- goto out;
- }
+ if (call->events && call->state < RXRPC_CALL_COMPLETE)
+ goto requeue;
out_put:
rxrpc_put_call(call, rxrpc_call_put);
out:
_leave("");
+ return;
+
+requeue:
+ __rxrpc_queue_call(call);
+ goto out;
}
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index 189418888839..736aa9281100 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -34,6 +34,21 @@ struct rxrpc_abort_buffer {
static const char rxrpc_keepalive_string[] = "";
+/*
+ * Increase Tx backoff on transmission failure and clear it on success.
+ */
+static void rxrpc_tx_backoff(struct rxrpc_call *call, int ret)
+{
+ if (ret < 0) {
+ u16 tx_backoff = READ_ONCE(call->tx_backoff);
+
+ if (tx_backoff < HZ)
+ WRITE_ONCE(call->tx_backoff, tx_backoff + 1);
+ } else {
+ WRITE_ONCE(call->tx_backoff, 0);
+ }
+}
+
/*
* Arrange for a keepalive ping a certain time after we last transmitted. This
* lets the far side know we're still interested in this call and helps keep
@@ -210,6 +225,7 @@ int rxrpc_send_ack_packet(struct rxrpc_call *call, bool ping,
else
trace_rxrpc_tx_packet(call->debug_id, &pkt->whdr,
rxrpc_tx_point_call_ack);
+ rxrpc_tx_backoff(call, ret);
if (call->state < RXRPC_CALL_COMPLETE) {
if (ret < 0) {
@@ -218,7 +234,7 @@ int rxrpc_send_ack_packet(struct rxrpc_call *call, bool ping,
rxrpc_propose_ACK(call, pkt->ack.reason,
ntohs(pkt->ack.maxSkew),
ntohl(pkt->ack.serial),
- true, true,
+ false, true,
rxrpc_propose_ack_retry_tx);
} else {
spin_lock_bh(&call->lock);
@@ -300,7 +316,7 @@ int rxrpc_send_abort_packet(struct rxrpc_call *call)
else
trace_rxrpc_tx_packet(call->debug_id, &pkt.whdr,
rxrpc_tx_point_call_abort);
-
+ rxrpc_tx_backoff(call, ret);
rxrpc_put_connection(conn);
return ret;
@@ -413,6 +429,7 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb,
else
trace_rxrpc_tx_packet(call->debug_id, &whdr,
rxrpc_tx_point_call_data_nofrag);
+ rxrpc_tx_backoff(call, ret);
if (ret == -EMSGSIZE)
goto send_fragmentable;
@@ -445,9 +462,18 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb,
rxrpc_reduce_call_timer(call, expect_rx_by, nowj,
rxrpc_timer_set_for_normal);
}
- }
- rxrpc_set_keepalive(call);
+ rxrpc_set_keepalive(call);
+ } else {
+ /* Cancel the call if the initial transmission fails,
+ * particularly if that's due to network routing issues that
+ * aren't going away anytime soon. The layer above can arrange
+ * the retransmission.
+ */
+ if (!test_and_set_bit(RXRPC_CALL_BEGAN_RX_TIMER, &call->flags))
+ rxrpc_set_call_completion(call, RXRPC_CALL_LOCAL_ERROR,
+ RX_USER_ABORT, ret);
+ }
_leave(" = %d [%u]", ret, call->peer->maxdata);
return ret;
@@ -506,6 +532,7 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb,
else
trace_rxrpc_tx_packet(call->debug_id, &whdr,
rxrpc_tx_point_call_data_frag);
+ rxrpc_tx_backoff(call, ret);
up_write(&conn->params.local->defrag_sem);
goto done;
^ permalink raw reply related
* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Mark Rutland @ 2018-11-01 13:18 UTC (permalink / raw)
To: Trond Myklebust
Cc: linux@roeck-us.net, paul.burton@mips.com,
linux-kernel@vger.kernel.org, ralf@linux-mips.org,
jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
bfields@fieldses.org, linux-mips@linux-mips.org,
linux-nfs@vger.kernel.org, akpm@linux-foundation.org,
anna.schumaker@netapp.com, jhogan@kernel.org,
netdev@vger.kernel.org, davem@davemloft.net, arnd@arndb.de,
"paulus@samba.org" <paulu
In-Reply-To: <4e2438a23d2edf03368950a72ec058d1d299c32e.camel@hammerspace.com>
[adding the atomic maintainers]
On Thu, Nov 01, 2018 at 12:17:31AM +0000, Trond Myklebust wrote:
> On Wed, 2018-10-31 at 23:32 +0000, Paul Burton wrote:
> > (Copying SunRPC & net maintainers.)
> >
> > Hi Guenter,
> >
> > On Wed, Oct 31, 2018 at 03:02:53PM -0700, Guenter Roeck wrote:
> > > The alternatives I can see are
> > > - Do not use cmpxchg64() outside architecture code (ie drop its use
> > > from the offending driver, and keep doing the same whenever the
> > > problem comes up again).
> > > or
> > > - Introduce something like ARCH_HAS_CMPXCHG64 and use it to
> > > determine if cmpxchg64 is supported or not.
> > >
> > > Any preference ?
> >
> > My preference would be option 1 - avoiding cmpxchg64() where
> > possible in generic code. I wouldn't be opposed to the Kconfig
> > option if there are cases where cmpxchg64() can really help
> > performance though.
> >
> > The last time I'm aware of this coming up the affected driver was
> > modified to avoid cmpxchg64() [1].
> >
> > In this particular case I have no idea why
> > net/sunrpc/auth_gss/gss_krb5_seal.c is using cmpxchg64() at all.
> > It's essentially reinventing atomic64_fetch_inc() which is already
> > provided everywhere via CONFIG_GENERIC_ATOMIC64 & the spinlock
> > approach. At least for atomic64_* functions the assumption that all
> > access will be performed using those same functions seems somewhat
> > reasonable.
> >
> > So how does the below look? Trond?
>
> My one question (and the reason why I went with cmpxchg() in the first
> place) would be about the overflow behaviour for atomic_fetch_inc() and
> friends. I believe those functions should be OK on x86, so that when we
> overflow the counter, it behaves like an unsigned value and wraps back
> around. Is that the case for all architectures?
>
> i.e. are atomic_t/atomic64_t always guaranteed to behave like u32/u64
> on increment?
>
> I could not find any documentation that explicitly stated that they
> should.
Peter, Will, I understand that the atomic_t/atomic64_t ops are required
to wrap per 2's-complement. IIUC the refcount code relies on this.
Can you confirm?
Thanks,
Mark.
> Cheers
> Trond
>
> > Thanks,
> > Paul
> >
> > [1] https://patchwork.ozlabs.org/cover/891284/
> >
> > ---
> > diff --git a/include/linux/sunrpc/gss_krb5.h
> > b/include/linux/sunrpc/gss_krb5.h
> > index 131424cefc6a..02c0412e368c 100644
> > --- a/include/linux/sunrpc/gss_krb5.h
> > +++ b/include/linux/sunrpc/gss_krb5.h
> > @@ -107,8 +107,8 @@ struct krb5_ctx {
> > u8 Ksess[GSS_KRB5_MAX_KEYLEN]; /*
> > session key */
> > u8 cksum[GSS_KRB5_MAX_KEYLEN];
> > s32 endtime;
> > - u32 seq_send;
> > - u64 seq_send64;
> > + atomic_t seq_send;
> > + atomic64_t seq_send64;
> > struct xdr_netobj mech_used;
> > u8 initiator_sign[GSS_KRB5_MAX_KEYLEN];
> > u8 acceptor_sign[GSS_KRB5_MAX_KEYLEN];
> > @@ -118,9 +118,6 @@ struct krb5_ctx {
> > u8 acceptor_integ[GSS_KRB5_MAX_KEYLEN];
> > };
> >
> > -extern u32 gss_seq_send_fetch_and_inc(struct krb5_ctx *ctx);
> > -extern u64 gss_seq_send64_fetch_and_inc(struct krb5_ctx *ctx);
> > -
> > /* The length of the Kerberos GSS token header */
> > #define GSS_KRB5_TOK_HDR_LEN (16)
> >
> > diff --git a/net/sunrpc/auth_gss/gss_krb5_mech.c
> > b/net/sunrpc/auth_gss/gss_krb5_mech.c
> > index 7f0424dfa8f6..eab71fc7af3e 100644
> > --- a/net/sunrpc/auth_gss/gss_krb5_mech.c
> > +++ b/net/sunrpc/auth_gss/gss_krb5_mech.c
> > @@ -274,6 +274,7 @@ get_key(const void *p, const void *end,
> > static int
> > gss_import_v1_context(const void *p, const void *end, struct
> > krb5_ctx *ctx)
> > {
> > + u32 seq_send;
> > int tmp;
> >
> > p = simple_get_bytes(p, end, &ctx->initiate, sizeof(ctx-
> > >initiate));
> > @@ -315,9 +316,10 @@ gss_import_v1_context(const void *p, const void
> > *end, struct krb5_ctx *ctx)
> > p = simple_get_bytes(p, end, &ctx->endtime, sizeof(ctx-
> > >endtime));
> > if (IS_ERR(p))
> > goto out_err;
> > - p = simple_get_bytes(p, end, &ctx->seq_send, sizeof(ctx-
> > >seq_send));
> > + p = simple_get_bytes(p, end, &seq_send, sizeof(seq_send));
> > if (IS_ERR(p))
> > goto out_err;
> > + atomic_set(&ctx->seq_send, seq_send);
> > p = simple_get_netobj(p, end, &ctx->mech_used);
> > if (IS_ERR(p))
> > goto out_err;
> > @@ -607,6 +609,7 @@ static int
> > gss_import_v2_context(const void *p, const void *end, struct
> > krb5_ctx *ctx,
> > gfp_t gfp_mask)
> > {
> > + u64 seq_send64;
> > int keylen;
> >
> > p = simple_get_bytes(p, end, &ctx->flags, sizeof(ctx->flags));
> > @@ -617,14 +620,15 @@ gss_import_v2_context(const void *p, const void
> > *end, struct krb5_ctx *ctx,
> > p = simple_get_bytes(p, end, &ctx->endtime, sizeof(ctx-
> > >endtime));
> > if (IS_ERR(p))
> > goto out_err;
> > - p = simple_get_bytes(p, end, &ctx->seq_send64, sizeof(ctx-
> > >seq_send64));
> > + p = simple_get_bytes(p, end, &seq_send64, sizeof(seq_send64));
> > if (IS_ERR(p))
> > goto out_err;
> > + atomic64_set(&ctx->seq_send64, seq_send64);
> > /* set seq_send for use by "older" enctypes */
> > - ctx->seq_send = ctx->seq_send64;
> > - if (ctx->seq_send64 != ctx->seq_send) {
> > - dprintk("%s: seq_send64 %lx, seq_send %x overflow?\n",
> > __func__,
> > - (unsigned long)ctx->seq_send64, ctx->seq_send);
> > + atomic_set(&ctx->seq_send, seq_send64);
> > + if (seq_send64 != atomic_read(&ctx->seq_send)) {
> > + dprintk("%s: seq_send64 %llx, seq_send %x overflow?\n",
> > __func__,
> > + seq_send64, atomic_read(&ctx->seq_send));
> > p = ERR_PTR(-EINVAL);
> > goto out_err;
> > }
> > diff --git a/net/sunrpc/auth_gss/gss_krb5_seal.c
> > b/net/sunrpc/auth_gss/gss_krb5_seal.c
> > index b4adeb06660b..48fe4a591b54 100644
> > --- a/net/sunrpc/auth_gss/gss_krb5_seal.c
> > +++ b/net/sunrpc/auth_gss/gss_krb5_seal.c
> > @@ -123,30 +123,6 @@ setup_token_v2(struct krb5_ctx *ctx, struct
> > xdr_netobj *token)
> > return krb5_hdr;
> > }
> >
> > -u32
> > -gss_seq_send_fetch_and_inc(struct krb5_ctx *ctx)
> > -{
> > - u32 old, seq_send = READ_ONCE(ctx->seq_send);
> > -
> > - do {
> > - old = seq_send;
> > - seq_send = cmpxchg(&ctx->seq_send, old, old + 1);
> > - } while (old != seq_send);
> > - return seq_send;
> > -}
> > -
> > -u64
> > -gss_seq_send64_fetch_and_inc(struct krb5_ctx *ctx)
> > -{
> > - u64 old, seq_send = READ_ONCE(ctx->seq_send);
> > -
> > - do {
> > - old = seq_send;
> > - seq_send = cmpxchg64(&ctx->seq_send64, old, old + 1);
> > - } while (old != seq_send);
> > - return seq_send;
> > -}
> > -
> > static u32
> > gss_get_mic_v1(struct krb5_ctx *ctx, struct xdr_buf *text,
> > struct xdr_netobj *token)
> > @@ -177,7 +153,7 @@ gss_get_mic_v1(struct krb5_ctx *ctx, struct
> > xdr_buf *text,
> >
> > memcpy(ptr + GSS_KRB5_TOK_HDR_LEN, md5cksum.data,
> > md5cksum.len);
> >
> > - seq_send = gss_seq_send_fetch_and_inc(ctx);
> > + seq_send = atomic_fetch_inc(&ctx->seq_send);
> >
> > if (krb5_make_seq_num(ctx, ctx->seq, ctx->initiate ? 0 : 0xff,
> > seq_send, ptr + GSS_KRB5_TOK_HDR_LEN, ptr
> > + 8))
> > @@ -205,7 +181,7 @@ gss_get_mic_v2(struct krb5_ctx *ctx, struct
> > xdr_buf *text,
> >
> > /* Set up the sequence number. Now 64-bits in clear
> > * text and w/o direction indicator */
> > - seq_send_be64 = cpu_to_be64(gss_seq_send64_fetch_and_inc(ctx));
> > + seq_send_be64 = cpu_to_be64(atomic64_fetch_inc(&ctx-
> > >seq_send64));
> > memcpy(krb5_hdr + 8, (char *) &seq_send_be64, 8);
> >
> > if (ctx->initiate) {
> > diff --git a/net/sunrpc/auth_gss/gss_krb5_wrap.c
> > b/net/sunrpc/auth_gss/gss_krb5_wrap.c
> > index 962fa84e6db1..5cdde6cb703a 100644
> > --- a/net/sunrpc/auth_gss/gss_krb5_wrap.c
> > +++ b/net/sunrpc/auth_gss/gss_krb5_wrap.c
> > @@ -228,7 +228,7 @@ gss_wrap_kerberos_v1(struct krb5_ctx *kctx, int
> > offset,
> >
> > memcpy(ptr + GSS_KRB5_TOK_HDR_LEN, md5cksum.data,
> > md5cksum.len);
> >
> > - seq_send = gss_seq_send_fetch_and_inc(kctx);
> > + seq_send = atomic_fetch_inc(&kctx->seq_send);
> >
> > /* XXX would probably be more efficient to compute checksum
> > * and encrypt at the same time: */
> > @@ -475,7 +475,7 @@ gss_wrap_kerberos_v2(struct krb5_ctx *kctx, u32
> > offset,
> > *be16ptr++ = 0;
> >
> > be64ptr = (__be64 *)be16ptr;
> > - *be64ptr = cpu_to_be64(gss_seq_send64_fetch_and_inc(kctx));
> > + *be64ptr = cpu_to_be64(atomic64_fetch_inc(&kctx->seq_send64));
> >
> > err = (*kctx->gk5e->encrypt_v2)(kctx, offset, buf, pages);
> > if (err)
> --
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space
>
>
^ permalink raw reply
* Re: [PATCH net] net: hns3: Fix for out-of-bounds access when setting pfc back pressure
From: John Garry @ 2018-11-01 12:49 UTC (permalink / raw)
To: Yunsheng Lin, davem
Cc: huangdaode, xuwei5, liguozhu, Yisen.Zhuang, linuxarm, salil.mehta,
lipeng321, netdev, linux-kernel
In-Reply-To: <1541074363-98630-1-git-send-email-linyunsheng@huawei.com>
On 01/11/2018 12:12, Yunsheng Lin wrote:
> The vport should be initialized to hdev->vport for each bp group,
> otherwise it will cause out-of-bounds access and bp setting not
> correct problem.
>
> [ 35.254124] BUG: KASAN: slab-out-of-bounds in hclge_pause_setup_hw+0x2a0/0x3f8 [hclge]
> [ 35.254126] Read of size 2 at addr ffff803b6651581a by task kworker/0:1/14
>
> [ 35.254132] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 4.19.0-rc7-hulk+ #85
> [ 35.254133] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI RC0 - B052 (V0.52) 09/14/2018
> [ 35.254141] Workqueue: events work_for_cpu_fn
> [ 35.254144] Call trace:
> [ 35.254147] dump_backtrace+0x0/0x2f0
> [ 35.254149] show_stack+0x24/0x30
> [ 35.254154] dump_stack+0x110/0x184
> [ 35.254157] print_address_description+0x168/0x2b0
> [ 35.254160] kasan_report+0x184/0x310
> [ 35.254162] __asan_load2+0x7c/0xa0
> [ 35.254170] hclge_pause_setup_hw+0x2a0/0x3f8 [hclge]
> [ 35.254177] hclge_tm_init_hw+0x794/0x9f0 [hclge]
> [ 35.254184] hclge_tm_schd_init+0x48/0x58 [hclge]
> [ 35.254191] hclge_init_ae_dev+0x778/0x1168 [hclge]
> [ 35.254196] hnae3_register_ae_dev+0x14c/0x298 [hnae3]
> [ 35.254206] hns3_probe+0x88/0xa8 [hns3]
> [ 35.254210] local_pci_probe+0x7c/0xf0
> [ 35.254212] work_for_cpu_fn+0x34/0x50
> [ 35.254214] process_one_work+0x4d4/0xa38
> [ 35.254216] worker_thread+0x55c/0x8d8
> [ 35.254219] kthread+0x1b0/0x1b8
> [ 35.254222] ret_from_fork+0x10/0x1c
>
> [ 35.254224] The buggy address belongs to the page:
> [ 35.254228] page:ffff7e00ed994400 count:1 mapcount:0 mapping:0000000000000000 index:0x0 compound_mapcount: 0
> [ 35.273835] flags: 0xfffff8000008000(head)
> [ 35.282007] raw: 0fffff8000008000 dead000000000100 dead000000000200 0000000000000000
> [ 35.282010] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> [ 35.282012] page dumped because: kasan: bad access detected
>
> [ 35.282014] Memory state around the buggy address:
> [ 35.282017] ffff803b66515700: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282019] ffff803b66515780: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282021] >ffff803b66515800: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282022] ^
> [ 35.282024] ffff803b66515880: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282026] ffff803b66515900: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
> [ 35.282028] ==================================================================
> [ 35.282029] Disabling lock debugging due to kernel taint
> [ 35.282747] hclge driver initialization finished.
>
> Fixes: 67bf2541f4b9 ("net: hns3: Fixes the back pressure setting when sriov is enabled")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
> index aa5cb98..0c9c6ae 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
> @@ -1168,11 +1168,12 @@ static int hclge_pfc_setup_hw(struct hclge_dev *hdev)
> */
> static int hclge_bp_setup_hw(struct hclge_dev *hdev, u8 tc)
> {
> - struct hclge_vport *vport = hdev->vport;
> u32 i, k, qs_bitmap;
> int ret;
>
> for (i = 0; i < HCLGE_BP_GRP_NUM; i++) {
> + struct hclge_vport *vport = hdev->vport;
> +
It seems to me that the bug is fixed but the code is not much better.
Since variable vport is only referenced in the inner loop, I would
define it there.
> qs_bitmap = 0;
>
> for (k = 0; k < hdev->num_alloc_vport; k++) {
>
static int hclge_bp_setup_hw(struct hclge_dev *hdev, u8 tc)
{
- struct hclge_vport *vport = hdev->vport;
- u32 i, k, qs_bitmap;
- int ret;
+ int ret, i;
for (i = 0; i < HCLGE_BP_GRP_NUM; i++) {
- qs_bitmap = 0;
+ u32 qs_bitmap = 0;
+ int k;
for (k = 0; k < hdev->num_alloc_vport; k++) {
+ struct hclge_vport *vport = &hdev->vport[k];
u16 qs_id = vport->qs_offset + tc;
u8 grp, sub_grp;
...
if (i == grp)
qs_bitmap |= (1 << sub_grp);
-
- vport++;
... so this seems better and safer, as you're explicitly limiting vport
within bounds of hdev->num_alloc_vport.
^ permalink raw reply
* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: David Ahern @ 2018-11-01 3:37 UTC (permalink / raw)
To: Paweł Staszewski, netdev
In-Reply-To: <61697e49-e839-befc-8330-fc00187c48ee@itcare.pl>
On 10/31/18 3:57 PM, Paweł Staszewski wrote:
> Hi
>
> So maybee someone will be interested how linux kernel handles normal
> traffic (not pktgen :) )
>
>
> Server HW configuration:
>
> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>
> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
>
>
> Server software:
>
> FRR - as routing daemon
>
> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa
> node)
>
> enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node)
>
>
> Maximum traffic that server can handle:
>
> Bandwidth
>
> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> input: /proc/net/dev type: rate
> \ iface Rx Tx Total
> ==============================================================================
>
> enp175s0f1: 28.51 Gb/s 37.24 Gb/s
> 65.74 Gb/s
> enp175s0f0: 38.07 Gb/s 28.44 Gb/s
> 66.51 Gb/s
> ------------------------------------------------------------------------------
>
> total: 66.58 Gb/s 65.67 Gb/s
> 132.25 Gb/s
>
>
> Packets per second:
>
> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> input: /proc/net/dev type: rate
> - iface Rx Tx Total
> ==============================================================================
>
> enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s
> enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s
> ------------------------------------------------------------------------------
>
> total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s
>
>
> After reaching that limits nics on the upstream side (more RX traffic)
> start to drop packets
>
>
> I just dont understand that server can't handle more bandwidth
> (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX
> side are increasing.
>
> Was thinking that maybee reached some pcie x16 limit - but x16 8GT is
> 126Gbit - and also when testing with pktgen i can reach more bw and pps
> (like 4x more comparing to normal internet traffic)
>
> And wondering if there is something that can be improved here.
This is mainly a forwarding use case? Seems so based on the perf report.
I suspect forwarding with XDP would show pretty good improvement. You
need the vlan changes I have queued up though.
^ permalink raw reply
* Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
From: David Ahern @ 2018-11-01 2:53 UTC (permalink / raw)
To: Jeff Barnhill, davem; +Cc: netdev, Alexey Kuznetsov, yoshfuji
In-Reply-To: <CAL6e_pfzPzt=rAxjWKAWHQqdrqejZ5e6vA1YoB3nGyc3_jeJeA@mail.gmail.com>
On 10/31/18 6:02 PM, Jeff Barnhill wrote:
> I'll follow this email with a new patch using ifacaddr6 instead of
> creating a new struct. I ended up using fib6_nh.nh_dev to get the net,
> instead of adding a back pointer to idev. It seems that idev was
> recently removed in lieu of this, so if this is incorrect, please let
> me know. Hopefully, I got the locking correct.
That's correct. Make sure that the anycast code can not be accessed for
reject routes which will not have a device set. Should be ok, but double
check.
^ permalink raw reply
* [PATCH v1 net] net: dsa: microchip: initialize mutex before use
From: Tristram.Ha @ 2018-11-01 2:49 UTC (permalink / raw)
To: David S. Miller
Cc: Tristram Ha, Andrew Lunn, Florian Fainelli, Pavel Machek,
UNGLinuxDriver, netdev
From: Tristram Ha <Tristram.Ha@microchip.com>
Initialize mutex before use. Avoid kernel complaint when
CONFIG_DEBUG_LOCK_ALLOC is enabled.
Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
---
v1
- Remove comment
drivers/net/dsa/microchip/ksz_common.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index 54e0ca6..86b6464 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev)
{
int i;
- mutex_init(&dev->reg_mutex);
- mutex_init(&dev->stats_mutex);
- mutex_init(&dev->alu_mutex);
- mutex_init(&dev->vlan_mutex);
-
dev->ds->ops = &ksz_switch_ops;
for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) {
@@ -1206,6 +1201,11 @@ int ksz_switch_register(struct ksz_device *dev)
if (dev->pdata)
dev->chip_id = dev->pdata->chip_id;
+ mutex_init(&dev->reg_mutex);
+ mutex_init(&dev->stats_mutex);
+ mutex_init(&dev->alu_mutex);
+ mutex_init(&dev->vlan_mutex);
+
if (ksz_switch_detect(dev))
return -EINVAL;
--
1.9.1
^ permalink raw reply related
* Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed
From: David Ahern @ 2018-11-01 2:48 UTC (permalink / raw)
To: Stefano Brivio; +Cc: Yoann P., Stephen Hemminger, netdev
In-Reply-To: <20181030183420.181b3d51@redhat.com>
[ sorry, too many distractions and I forgot to respond ]
On 10/30/18 11:34 AM, Stefano Brivio wrote:
> On Tue, 30 Oct 2018 10:34:45 -0600
> David Ahern <dsahern@gmail.com> wrote:
>
>> A more flexible approach is to use format strings to allow users to
>> customize the output order and whitespace as well. So for ss and your
>> column list (winging it here):
>>
>> netid = %N
>> state = %S
>> recv Q = %Qr
>> send Q = %Qs
>> local address = %Al
>> lport port = %Pl
>> remote address = %Ar
>> remote port = %Pr
>> process data = %p
>> ...
>>
>> then a format string could be: "%S %Qr %Qs %Al:%Pl %Ar:%Pr %p\n"
>
> I like the idea indeed, but I see two issues with ss:
>
> - the current column abstraction is rather lightweight, things are
> already buffered in the defined column order so we don't have to jump
> back and forth in the buffer while rendering. Doing that needs some
> extra care to avoid a performance hit, but it's probably doable, I
> can put that on my to-do list
The ss command is always a pain; it's much better in newer releases but
I am always having to adjust terminal width.
>
> - how would you model automatic spacing in a format string? Should we
> support width specifiers? Disable automatic spacing if a format
> string is given? It might even make sense to allow partial automatic
Follow the format string for whitespace and order, and
yes, on the width specifiers if possible.
> spacing with a special character in the format string, that is:
>
> "%S.%Qr.%Qs %Al:%Pl %Ar:%Pr %p\n"
>
> would mean "align everything to the right, distribute remaining
> whitespace between %S, %Qr and %Qs". But it looks rather complicated
> at a glance.
>
My concern here is that once this goes in for 1 command, the others in
iproute2 need to follow suit - meaning same syntax style for all
commands. Given that I'd prefer we get a reasonable consensus on syntax
that will work across commands -- ss, ip, tc. If it is as simple as
column names with a fixed order, that is fine but just give proper
consideration given the impact.
The 'ip' syntax for example gets ugly quick with the various link types
and their options. We don't need to allow every detail of each link type
to be customized, but there are common attributes for links (e.g., mtu,
ifindex, link flags, lladdr), addresses, and link types such as bridges
and bonds where we can improve the amount of data thrown at a user -- a
better, more customizable version of what the brief option targeted.
^ permalink raw reply
* Re: [PATCH net] openvswitch: Fix push/pop ethernet validation
From: David Miller @ 2018-11-01 1:38 UTC (permalink / raw)
To: jcaamano; +Cc: netdev, pshelar
In-Reply-To: <20181031175203.23808-1-jcaamano@suse.com>
From: Jaime Caamaño Ruiz <jcaamano@suse.com>
Date: Wed, 31 Oct 2018 18:52:03 +0100
> When there are both pop and push ethernet header actions among the
> actions to be applied to a packet, an unexpected EINVAL (Invalid
> argument) error is obtained. This is due to mac_proto not being reset
> correctly when those actions are validated.
>
> Reported-at:
> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-October/047554.html
> Fixes: 91820da6ae85 ("openvswitch: add Ethernet push and pop actions")
> Signed-off-by: Jaime Caamaño Ruiz <jcaamano@suse.com>
Applied and queued up for -stable.
^ permalink raw reply
* Geschäfts- / Projektkredit 1,5%
From: SafetyNet Credit @ 2018-11-01 1:29 UTC (permalink / raw)
--
Schönen Tag.
Wir hoffen, Sie gut zu treffen.
Benötigen Sie einen dringenden Kredit für ein Geschäft oder ein Projekt?
Wir bieten Kredite zu 1,5% und wir können Ihre Transaktion innerhalb von
3 bis 5 Arbeitstagen abschließen.
Wenn Sie ernsthaft an einem Kredit interessiert sind, empfehlen wir
Ihnen, unten die Einzelheiten zur Bearbeitung Ihrer Transaktion
anzugeben.
Vollständiger Name:..............
Darlehensbetrag: ..............
Darlehensdauer: ..............
Darlehen Zweck:..............
Telefon:..............
Wir erwarten Ihre Darlehensdaten wie oben beschrieben für die Abwicklung
Ihrer Transaktion.
Freundliche Grüße.
Wilson Rog.
Buchhalter / Berater
^ permalink raw reply
* Re: [net 0/8][pull request] Intel Wired LAN Driver Updates 2018-10-31
From: David Miller @ 2018-11-01 1:23 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20181031194254.16417-1-jeffrey.t.kirsher@intel.com>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 31 Oct 2018 12:42:46 -0700
> This series contains a various collection of fixes.
...
> The following are changes since commit a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2:
> net: mvpp2: Fix affinity hint allocation
> and are available in the git repository at:
> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE
Pulled, thanks Jeff.
^ permalink raw reply
* Re: [PATCH bpf 2/4] bpf: don't set id on after map lookup with ptr_to_map_val return
From: Roman Gushchin @ 2018-11-01 1:11 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: ast@kernel.org, netdev@vger.kernel.org
In-Reply-To: <20181031230555.3371-3-daniel@iogearbox.net>
On Thu, Nov 01, 2018 at 12:05:53AM +0100, Daniel Borkmann wrote:
> In the verifier there is no such semantics where registers with
> PTR_TO_MAP_VALUE type have an id assigned to them. This is only
> used in PTR_TO_MAP_VALUE_OR_NULL and later on nullified once the
> test against NULL has been pattern matched and type transformed
> into PTR_TO_MAP_VALUE.
>
> Fixes: 3e6a4b3e0289 ("bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Roman Gushchin <guro@fb.com>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
Looks good to me.
Acked-by: Roman Gushchin <guro@fb.com>
Thanks!
^ permalink raw reply
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Eric Dumazet @ 2018-11-01 0:53 UTC (permalink / raw)
To: Christoph Paasch, netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <20181031232635.33750-1-cpaasch@apple.com>
On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> Implementations of Quic might want to create a separate socket for each
> Quic-connection by creating a connected UDP-socket.
>
Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
It would add a huge overhead in term of memory usage in the kernel,
and lots of epoll events to manage (say a QUIC server with one million flows, receiving
very few packets per second per flow)
Maybe you could elaborate on the need of having one UDP socket per connection.
> To achieve that on the server-side, a "master-socket" needs to wait for
> incoming new connections and then creates a new socket that will be a
> connected UDP-socket. To create that latter one, the server needs to
> first bind() and then connect(). However, after the bind() the server
> might already receive traffic on that new socket that is unrelated to the
> Quic-connection at hand. Only after the connect() a full 4-tuple match
> is happening. So, one can't really create this kind of a server that has
> a connected UDP-socket per Quic connection.
>
> So, what is needed is an "atomic bind & connect" that basically
> prevents any incoming traffic until the connect() call has been issued
> at which point the full 4-tuple is known.
>
>
> This patchset implements this functionality and exposes a socket-option
> to do this.
>
> Usage would be:
>
> int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
>
> int val = 1;
> setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
>
> bind(fd, (struct sockaddr *)&src, sizeof(src));
>
> /* At this point, incoming traffic will never match on this socket */
>
> connect(fd, (struct sockaddr *)&dst, sizeof(dst));
>
> /* Only now incoming traffic will reach the socket */
>
>
>
> There is literally an infinite number of ways on how to implement it,
> which is why I first send it out as an RFC. With this approach here I
> chose the least invasive one, just preventing the match on the incoming
> path.
>
>
> The reason for choosing a SOL_SOCKET socket-option and not at the
> SOL_UDP-level is because that functionality actually could be useful for
> other protocols as well. E.g., TCP wants to better use the full 4-tuple space
> by binding to the source-IP and the destination-IP at the same time.
Passive TCP flows can not benefit from this idea.
Active TCP flows can already do that, I do not really understand what you are suggesting.
^ permalink raw reply
* Re: pull-request: bpf 2018-11-01
From: David Miller @ 2018-11-01 0:39 UTC (permalink / raw)
To: daniel; +Cc: ast, netdev
In-Reply-To: <20181101002841.6267-1-daniel@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 1 Nov 2018 01:28:41 +0100
> The following pull-request contains BPF updates for your *net* tree.
>
> The main changes are:
>
> 1) Fix tcp_bpf_recvmsg() to return -EAGAIN instead of 0 in non-blocking
> case when no data is available yet, from John.
>
> 2) Fix a compilation error in libbpf_attach_type_by_name() when compiled
> with clang 3.8, from Andrey.
>
> 3) Fix a partial copy of map pointer on scalar alu and remove id
> generation for RET_PTR_TO_MAP_VALUE return types, from Daniel.
>
> 4) Add unlimited memlock limit for kernel selftest's flow_dissector_load
> program, from Yonghong.
>
> 5) Fix ping for some BPF shell based kselftests where distro does not
> ship "ping -6" anymore, from Li.
>
> Please consider pulling these changes from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Pulled, thanks Daniel.
^ permalink raw reply
* pull-request: bpf 2018-11-01
From: Daniel Borkmann @ 2018-11-01 0:28 UTC (permalink / raw)
To: davem; +Cc: daniel, ast, netdev
Hi David,
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix tcp_bpf_recvmsg() to return -EAGAIN instead of 0 in non-blocking
case when no data is available yet, from John.
2) Fix a compilation error in libbpf_attach_type_by_name() when compiled
with clang 3.8, from Andrey.
3) Fix a partial copy of map pointer on scalar alu and remove id
generation for RET_PTR_TO_MAP_VALUE return types, from Daniel.
4) Add unlimited memlock limit for kernel selftest's flow_dissector_load
program, from Yonghong.
5) Fix ping for some BPF shell based kselftests where distro does not
ship "ping -6" anymore, from Li.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Thanks a lot!
----------------------------------------------------------------
The following changes since commit a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2:
net: mvpp2: Fix affinity hint allocation (2018-10-30 11:34:41 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
for you to fetch changes up to dfeb8f4c9692fd5e6c3eef19c2e4ae5338dbdb01:
Merge branch 'verifier-fixes' (2018-10-31 16:53:18 -0700)
----------------------------------------------------------------
Alexei Starovoitov (1):
Merge branch 'verifier-fixes'
Andrey Ignatov (1):
libbpf: Fix compile error in libbpf_attach_type_by_name
Daniel Borkmann (4):
bpf: fix partial copy of map_ptr when dst is scalar
bpf: don't set id on after map lookup with ptr_to_map_val return
bpf: add various test cases to test_verifier
bpf: test make sure to run unpriv test cases in test_verifier
John Fastabend (1):
bpf: tcp_bpf_recvmsg should return EAGAIN when nonblocking and no data
Li Zhijian (1):
kselftests/bpf: use ping6 as the default ipv6 ping binary if it exists
Yonghong Song (1):
tools/bpf: add unlimited rlimit for flow_dissector_load
include/linux/bpf_verifier.h | 3 +
kernel/bpf/verifier.c | 21 +-
net/ipv4/tcp_bpf.c | 1 +
tools/lib/bpf/libbpf.c | 13 +-
tools/testing/selftests/bpf/flow_dissector_load.c | 2 +
tools/testing/selftests/bpf/test_skb_cgroup_id.sh | 3 +-
tools/testing/selftests/bpf/test_sock_addr.sh | 3 +-
tools/testing/selftests/bpf/test_verifier.c | 321 +++++++++++++++++++---
8 files changed, 319 insertions(+), 48 deletions(-)
^ permalink raw reply
* [PATCH net v6] net/ipv6: Add anycast addresses to a global hashtable
From: Jeff Barnhill @ 2018-11-01 0:14 UTC (permalink / raw)
To: netdev; +Cc: davem, kuznet, yoshfuji, Jeff Barnhill
In-Reply-To: <CAL6e_pfzPzt=rAxjWKAWHQqdrqejZ5e6vA1YoB3nGyc3_jeJeA@mail.gmail.com>
icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device. This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().
Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses. This is based on inet6_addr_lst in addrconf.c.
Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
---
include/net/addrconf.h | 2 ++
include/net/if_inet6.h | 2 ++
net/ipv6/af_inet6.c | 5 ++++
net/ipv6/anycast.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++---
4 files changed, 85 insertions(+), 4 deletions(-)
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 14b789a123e7..799af1a037d1 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
const struct in6_addr *addr);
bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
const struct in6_addr *addr);
+int anycast_init(void);
+void anycast_cleanup(void);
/* Device notifier */
int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d7578cf49c3a..c9c78c15bce0 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -146,10 +146,12 @@ struct ifacaddr6 {
struct in6_addr aca_addr;
struct fib6_info *aca_rt;
struct ifacaddr6 *aca_next;
+ struct hlist_node aca_addr_lst;
int aca_users;
refcount_t aca_refcnt;
unsigned long aca_cstamp;
unsigned long aca_tstamp;
+ struct rcu_head rcu;
};
#define IFA_HOST IPV6_ADDR_LOOPBACK
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3f4d61017a69..ddc8a6dbfba2 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
err = ip6_flowlabel_init();
if (err)
goto ip6_flowlabel_fail;
+ err = anycast_init();
+ if (err)
+ goto anycast_fail;
err = addrconf_init();
if (err)
goto addrconf_fail;
@@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
ipv6_exthdrs_fail:
addrconf_cleanup();
addrconf_fail:
+ anycast_cleanup();
+anycast_fail:
ip6_flowlabel_cleanup();
ip6_flowlabel_fail:
ndisc_late_cleanup();
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 4e0ff7031edd..f6c4c8ac184c 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -44,8 +44,22 @@
#include <net/checksum.h>
+#define IN6_ADDR_HSIZE_SHIFT 8
+#define IN6_ADDR_HSIZE BIT(IN6_ADDR_HSIZE_SHIFT)
+/* anycast address hash table
+ */
+static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
+static DEFINE_SPINLOCK(acaddr_hash_lock);
+
static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr *addr);
+static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+ u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
+
+ return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
+}
+
/*
* socket join an anycast group
*/
@@ -204,16 +218,39 @@ void ipv6_sock_ac_close(struct sock *sk)
rtnl_unlock();
}
+static void ipv6_add_acaddr_hash(struct net *net, struct ifacaddr6 *aca)
+{
+ unsigned int hash = inet6_acaddr_hash(net, &aca->aca_addr);
+
+ spin_lock(&acaddr_hash_lock);
+ hlist_add_head_rcu(&aca->aca_addr_lst, &inet6_acaddr_lst[hash]);
+ spin_unlock(&acaddr_hash_lock);
+}
+
+static void ipv6_del_acaddr_hash(struct ifacaddr6 *aca)
+{
+ spin_lock(&acaddr_hash_lock);
+ hlist_del_init_rcu(&aca->aca_addr_lst);
+ spin_unlock(&acaddr_hash_lock);
+}
+
static void aca_get(struct ifacaddr6 *aca)
{
refcount_inc(&aca->aca_refcnt);
}
+static void aca_free_rcu(struct rcu_head *h)
+{
+ struct ifacaddr6 *aca = container_of(h, struct ifacaddr6, rcu);
+
+ fib6_info_release(aca->aca_rt);
+ kfree(aca);
+}
+
static void aca_put(struct ifacaddr6 *ac)
{
if (refcount_dec_and_test(&ac->aca_refcnt)) {
- fib6_info_release(ac->aca_rt);
- kfree(ac);
+ call_rcu(&ac->rcu, aca_free_rcu);
}
}
@@ -229,6 +266,7 @@ static struct ifacaddr6 *aca_alloc(struct fib6_info *f6i,
aca->aca_addr = *addr;
fib6_info_hold(f6i);
aca->aca_rt = f6i;
+ INIT_HLIST_NODE(&aca->aca_addr_lst);
aca->aca_users = 1;
/* aca_tstamp should be updated upon changes */
aca->aca_cstamp = aca->aca_tstamp = jiffies;
@@ -285,6 +323,8 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
aca_get(aca);
write_unlock_bh(&idev->lock);
+ ipv6_add_acaddr_hash(net, aca);
+
ip6_ins_rt(net, f6i);
addrconf_join_solict(idev->dev, &aca->aca_addr);
@@ -325,6 +365,7 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr)
else
idev->ac_list = aca->aca_next;
write_unlock_bh(&idev->lock);
+ ipv6_del_acaddr_hash(aca);
addrconf_leave_solict(idev, &aca->aca_addr);
ip6_del_rt(dev_net(idev->dev), aca->aca_rt);
@@ -352,6 +393,8 @@ void ipv6_ac_destroy_dev(struct inet6_dev *idev)
idev->ac_list = aca->aca_next;
write_unlock_bh(&idev->lock);
+ ipv6_del_acaddr_hash(aca);
+
addrconf_leave_solict(idev, &aca->aca_addr);
ip6_del_rt(dev_net(idev->dev), aca->aca_rt);
@@ -390,17 +433,25 @@ static bool ipv6_chk_acast_dev(struct net_device *dev, const struct in6_addr *ad
bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
const struct in6_addr *addr)
{
+ unsigned int hash = inet6_acaddr_hash(net, addr);
+ struct net_device *nh_dev;
+ struct ifacaddr6 *aca;
bool found = false;
rcu_read_lock();
if (dev)
found = ipv6_chk_acast_dev(dev, addr);
else
- for_each_netdev_rcu(net, dev)
- if (ipv6_chk_acast_dev(dev, addr)) {
+ hlist_for_each_entry_rcu(aca, &inet6_acaddr_lst[hash],
+ aca_addr_lst) {
+ nh_dev = fib6_info_nh_dev(aca->aca_rt);
+ if (!nh_dev || !net_eq(dev_net(nh_dev), net))
+ continue;
+ if (ipv6_addr_equal(&aca->aca_addr, addr)) {
found = true;
break;
}
+ }
rcu_read_unlock();
return found;
}
@@ -539,4 +590,25 @@ void ac6_proc_exit(struct net *net)
{
remove_proc_entry("anycast6", net->proc_net);
}
+
+/* Init / cleanup code
+ */
+int __init anycast_init(void)
+{
+ int i;
+
+ for (i = 0; i < IN6_ADDR_HSIZE; i++)
+ INIT_HLIST_HEAD(&inet6_acaddr_lst[i]);
+ return 0;
+}
+
+void anycast_cleanup(void)
+{
+ int i;
+
+ spin_lock(&acaddr_hash_lock);
+ for (i = 0; i < IN6_ADDR_HSIZE; i++)
+ WARN_ON(!hlist_empty(&inet6_acaddr_lst[i]));
+ spin_unlock(&acaddr_hash_lock);
+}
#endif
--
2.14.1
^ permalink raw reply related
* Re: [PATCH bpf 0/4] BPF fixes and tests
From: Alexei Starovoitov @ 2018-11-01 0:08 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: ast, netdev
In-Reply-To: <20181031230555.3371-1-daniel@iogearbox.net>
On Thu, Nov 01, 2018 at 12:05:51AM +0100, Daniel Borkmann wrote:
> The series contains two fixes in BPF core and test cases. For details
> please see individual patches. Thanks!
>
> Daniel Borkmann (4):
> bpf: fix partial copy of map_ptr when dst is scalar
> bpf: don't set id on after map lookup with ptr_to_map_val return
> bpf: add various test cases to test_verifier
> bpf: test make sure to run unpriv test cases in test_verifier
>
> include/linux/bpf_verifier.h | 3 +
> kernel/bpf/verifier.c | 21 +-
> tools/testing/selftests/bpf/test_verifier.c | 321 +++++++++++++++++++++++++---
> 3 files changed, 305 insertions(+), 40 deletions(-)
Applied to bpf tree, Thanks
... and we achieved very nice milestone... crossed 1000 tests in test_verifier :)
Summary: 1012 PASSED, 0 SKIPPED, 0 FAILED
^ permalink raw reply
* Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
From: Jeff Barnhill @ 2018-11-01 0:02 UTC (permalink / raw)
To: davem; +Cc: David Ahern, netdev, Alexey Kuznetsov, yoshfuji
In-Reply-To: <20181030.161916.2155476722804506340.davem@davemloft.net>
I'll follow this email with a new patch using ifacaddr6 instead of
creating a new struct. I ended up using fib6_nh.nh_dev to get the net,
instead of adding a back pointer to idev. It seems that idev was
recently removed in lieu of this, so if this is incorrect, please let
me know. Hopefully, I got the locking correct.
Thanks,
Jeff
On Tue, Oct 30, 2018 at 7:19 PM David Miller <davem@davemloft.net> wrote:
>
> From: David Ahern <dsahern@gmail.com>
> Date: Tue, 30 Oct 2018 16:06:46 -0600
>
> > or make the table per namespace.
>
> This will increase namespace create/destroy cost, so I'd rather not
> for something like this.
^ permalink raw reply
* [PATCH v3 2/2] trace: remove kretprobed checks
From: Aleksa Sarai @ 2018-11-01 8:35 UTC (permalink / raw)
To: Naveen N. Rao, Anil S Keshavamurthy, David S. Miller,
Masami Hiramatsu, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, Steven Rostedt, Shuah Khan, Alexei Starovoitov,
Daniel Borkmann
Cc: Aleksa Sarai, Aleksa Sarai, Christian Brauner, Brendan Gregg,
netdev, linux-doc, linux-kernel, linux-kselftest
In-Reply-To: <20181101083551.3805-1-cyphar@cyphar.com>
This is effectively a reversion of commit 76094a2cf46e ("ftrace:
distinguish kretprobe'd functions in trace logs"), as the checking of
kretprobe_trampoline *for tracing* is no longer necessary with the new
kretprobe stack trace changes.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
kernel/trace/trace_output.c | 34 ++++------------------------------
1 file changed, 4 insertions(+), 30 deletions(-)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 6e6cc64faa38..951de16bd4fd 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -321,36 +321,14 @@ int trace_output_call(struct trace_iterator *iter, char *name, char *fmt, ...)
}
EXPORT_SYMBOL_GPL(trace_output_call);
-#ifdef CONFIG_KRETPROBES
-static inline const char *kretprobed(const char *name)
-{
- static const char tramp_name[] = "kretprobe_trampoline";
- int size = sizeof(tramp_name);
-
- if (strncmp(tramp_name, name, size) == 0)
- return "[unknown/kretprobe'd]";
- return name;
-}
-#else
-static inline const char *kretprobed(const char *name)
-{
- return name;
-}
-#endif /* CONFIG_KRETPROBES */
-
static void
seq_print_sym_short(struct trace_seq *s, const char *fmt, unsigned long address)
{
char str[KSYM_SYMBOL_LEN];
#ifdef CONFIG_KALLSYMS
- const char *name;
-
kallsyms_lookup(address, NULL, NULL, NULL, str);
-
- name = kretprobed(str);
-
- if (name && strlen(name)) {
- trace_seq_printf(s, fmt, name);
+ if (strlen(str)) {
+ trace_seq_printf(s, fmt, str);
return;
}
#endif
@@ -364,13 +342,9 @@ seq_print_sym_offset(struct trace_seq *s, const char *fmt,
{
char str[KSYM_SYMBOL_LEN];
#ifdef CONFIG_KALLSYMS
- const char *name;
-
sprint_symbol(str, address);
- name = kretprobed(str);
-
- if (name && strlen(name)) {
- trace_seq_printf(s, fmt, name);
+ if (strlen(str)) {
+ trace_seq_printf(s, fmt, str);
return;
}
#endif
--
2.19.1
^ permalink raw reply related
* [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Aleksa Sarai @ 2018-11-01 8:35 UTC (permalink / raw)
To: Naveen N. Rao, Anil S Keshavamurthy, David S. Miller,
Masami Hiramatsu, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, Steven Rostedt, Shuah Khan, Alexei Starovoitov,
Daniel Borkmann
Cc: Aleksa Sarai, Brendan Gregg, Christian Brauner, Aleksa Sarai,
netdev, linux-doc, linux-kernel, linux-kselftest
In-Reply-To: <20181101083551.3805-1-cyphar@cyphar.com>
Historically, kretprobe has always produced unusable stack traces
(kretprobe_trampoline is the only entry in most cases, because of the
funky stack pointer overwriting). This has caused quite a few annoyances
when using tracing to debug problems[1] -- since return values are only
available with kretprobes but stack traces were only usable for kprobes,
users had to probe both and then manually associate them.
With the advent of bpf_trace, users would have been able to do this
association in bpf, but this was less than ideal (because
bpf_get_stackid would still produce rubbish and programs that didn't
know better would get silly results). The main usecase for stack traces
(at least with bpf_trace) is for DTrace-style aggregation on stack
traces (both entry and exit). Therefore we cannot simply correct the
stack trace on exit -- we must stash away the stack trace and return the
entry stack trace when it is requested.
[1]: https://github.com/iovisor/bpftrace/issues/101
Cc: Brendan Gregg <bgregg@netflix.com>
Cc: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Documentation/kprobes.txt | 6 +-
include/linux/kprobes.h | 27 +++++
kernel/events/callchain.c | 8 +-
kernel/kprobes.c | 101 +++++++++++++++++-
kernel/trace/trace.c | 11 +-
.../test.d/kprobe/kretprobe_stacktrace.tc | 25 +++++
6 files changed, 173 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc
diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt
index 10f4499e677c..1965585848f4 100644
--- a/Documentation/kprobes.txt
+++ b/Documentation/kprobes.txt
@@ -597,7 +597,11 @@ address with the trampoline's address, stack backtraces and calls
to __builtin_return_address() will typically yield the trampoline's
address instead of the real return address for kretprobed functions.
(As far as we can tell, __builtin_return_address() is used only
-for instrumentation and error reporting.)
+for instrumentation and error reporting.) However, since return probes
+are used extensively in tracing (where stack backtraces are useful),
+return probes will stash away the stack backtrace during function entry
+so that return probe handlers can use the entry backtrace instead of
+having a trace with just kretprobe_trampoline.
If the number of times a function is called does not match the number
of times it returns, registering a return probe on that function may
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index e909413e4e38..1a1629544e56 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -40,6 +40,8 @@
#include <linux/rcupdate.h>
#include <linux/mutex.h>
#include <linux/ftrace.h>
+#include <linux/stacktrace.h>
+#include <linux/perf_event.h>
#include <asm/kprobes.h>
#ifdef CONFIG_KPROBES
@@ -168,11 +170,18 @@ struct kretprobe {
raw_spinlock_t lock;
};
+#define KRETPROBE_TRACE_SIZE 127
+struct kretprobe_trace {
+ int nr_entries;
+ unsigned long entries[KRETPROBE_TRACE_SIZE];
+};
+
struct kretprobe_instance {
struct hlist_node hlist;
struct kretprobe *rp;
kprobe_opcode_t *ret_addr;
struct task_struct *task;
+ struct kretprobe_trace entry;
char data[0];
};
@@ -371,6 +380,12 @@ void unregister_kretprobe(struct kretprobe *rp);
int register_kretprobes(struct kretprobe **rps, int num);
void unregister_kretprobes(struct kretprobe **rps, int num);
+struct kretprobe_instance *current_kretprobe_instance(void);
+void kretprobe_save_stack_trace(struct kretprobe_instance *ri,
+ struct stack_trace *trace);
+void kretprobe_perf_callchain_kernel(struct kretprobe_instance *ri,
+ struct perf_callchain_entry_ctx *ctx);
+
void kprobe_flush_task(struct task_struct *tk);
void recycle_rp_inst(struct kretprobe_instance *ri, struct hlist_head *head);
@@ -397,6 +412,18 @@ static inline struct kprobe *kprobe_running(void)
{
return NULL;
}
+static inline struct kretprobe_instance *current_kretprobe_instance(void)
+{
+ return NULL;
+}
+static inline void kretprobe_save_stack_trace(struct kretprobe_instance *ri,
+ struct stack_trace *trace)
+{
+}
+static inline void kretprobe_perf_callchain_kernel(struct kretprobe_instance *ri,
+ struct perf_callchain_entry_ctx *ctx)
+{
+}
static inline int register_kprobe(struct kprobe *p)
{
return -ENOSYS;
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 24a77c34e9ad..98edcd8a6987 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -12,6 +12,7 @@
#include <linux/perf_event.h>
#include <linux/slab.h>
#include <linux/sched/task_stack.h>
+#include <linux/kprobes.h>
#include "internal.h"
@@ -197,9 +198,14 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
ctx.contexts_maxed = false;
if (kernel && !user_mode(regs)) {
+ struct kretprobe_instance *ri = current_kretprobe_instance();
+
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
- perf_callchain_kernel(&ctx, regs);
+ if (ri)
+ kretprobe_perf_callchain_kernel(ri, &ctx);
+ else
+ perf_callchain_kernel(&ctx, regs);
}
if (user) {
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 90e98e233647..fca3964d18cd 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1206,6 +1206,16 @@ __releases(hlist_lock)
}
NOKPROBE_SYMBOL(kretprobe_table_unlock);
+static bool kretprobe_hash_is_locked(struct task_struct *tsk)
+{
+ unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS);
+ raw_spinlock_t *hlist_lock;
+
+ hlist_lock = kretprobe_table_lock_ptr(hash);
+ return raw_spin_is_locked(hlist_lock);
+}
+NOKPROBE_SYMBOL(kretprobe_hash_is_locked);
+
/*
* This function is called from finish_task_switch when task tk becomes dead,
* so that we can recycle any function-return probe instances associated
@@ -1800,6 +1810,13 @@ unsigned long __weak arch_deref_entry_point(void *entry)
return (unsigned long)entry;
}
+static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs);
+
+static inline bool kprobe_is_retprobe(struct kprobe *kp)
+{
+ return kp->pre_handler == pre_handler_kretprobe;
+}
+
#ifdef CONFIG_KRETPROBES
/*
* This kprobe pre_handler is registered with every kretprobe. When probe
@@ -1826,6 +1843,8 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
hash = hash_ptr(current, KPROBE_HASH_BITS);
raw_spin_lock_irqsave(&rp->lock, flags);
if (!hlist_empty(&rp->free_instances)) {
+ struct stack_trace trace = {};
+
ri = hlist_entry(rp->free_instances.first,
struct kretprobe_instance, hlist);
hlist_del(&ri->hlist);
@@ -1834,6 +1853,11 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
ri->rp = rp;
ri->task = current;
+ trace.entries = &ri->entry.entries[0];
+ trace.max_entries = KRETPROBE_TRACE_SIZE;
+ save_stack_trace_regs(regs, &trace);
+ ri->entry.nr_entries = trace.nr_entries;
+
if (rp->entry_handler && rp->entry_handler(ri, regs)) {
raw_spin_lock_irqsave(&rp->lock, flags);
hlist_add_head(&ri->hlist, &rp->free_instances);
@@ -1856,6 +1880,65 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
}
NOKPROBE_SYMBOL(pre_handler_kretprobe);
+/*
+ * Return the kretprobe_instance associated with the current_kprobe. Calling
+ * this is only reasonable from within a kretprobe handler context (otherwise
+ * return NULL).
+ *
+ * Must be called within a kretprobe_hash_lock(current, ...) context.
+ */
+struct kretprobe_instance *current_kretprobe_instance(void)
+{
+ struct kprobe *kp;
+ struct kretprobe *rp;
+ struct kretprobe_instance *ri;
+ struct hlist_head *head;
+ unsigned long hash = hash_ptr(current, KPROBE_HASH_BITS);
+
+ kp = kprobe_running();
+ if (!kp || !kprobe_is_retprobe(kp))
+ return NULL;
+ if (WARN_ON(!kretprobe_hash_is_locked(current)))
+ return NULL;
+
+ rp = container_of(kp, struct kretprobe, kp);
+ head = &kretprobe_inst_table[hash];
+
+ hlist_for_each_entry(ri, head, hlist) {
+ if (ri->task == current && ri->rp == rp)
+ return ri;
+ }
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(current_kretprobe_instance);
+NOKPROBE_SYMBOL(current_kretprobe_instance);
+
+void kretprobe_save_stack_trace(struct kretprobe_instance *ri,
+ struct stack_trace *trace)
+{
+ int i;
+ struct kretprobe_trace *krt = &ri->entry;
+
+ for (i = trace->skip; i < krt->nr_entries; i++) {
+ if (trace->nr_entries >= trace->max_entries)
+ break;
+ trace->entries[trace->nr_entries++] = krt->entries[i];
+ }
+}
+
+void kretprobe_perf_callchain_kernel(struct kretprobe_instance *ri,
+ struct perf_callchain_entry_ctx *ctx)
+{
+ int i;
+ struct kretprobe_trace *krt = &ri->entry;
+
+ for (i = 0; i < krt->nr_entries; i++) {
+ if (krt->entries[i] == ULONG_MAX)
+ break;
+ perf_callchain_store(ctx, (u64) krt->entries[i]);
+ }
+}
+
bool __weak arch_kprobe_on_func_entry(unsigned long offset)
{
return !offset;
@@ -2005,6 +2088,22 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
}
NOKPROBE_SYMBOL(pre_handler_kretprobe);
+struct kretprobe_instance *current_kretprobe_instance(void)
+{
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(current_kretprobe_instance);
+NOKPROBE_SYMBOL(current_kretprobe_instance);
+
+void kretprobe_save_stack_trace(struct kretprobe_instance *ri,
+ struct stack_trace *trace)
+{
+}
+
+void kretprobe_perf_callchain_kernel(struct kretprobe_instance *ri,
+ struct perf_callchain_entry_ctx *ctx)
+{
+}
#endif /* CONFIG_KRETPROBES */
/* Set the kprobe gone and remove its instruction buffer. */
@@ -2241,7 +2340,7 @@ static void report_probe(struct seq_file *pi, struct kprobe *p,
char *kprobe_type;
void *addr = p->addr;
- if (p->pre_handler == pre_handler_kretprobe)
+ if (kprobe_is_retprobe(p))
kprobe_type = "r";
else
kprobe_type = "k";
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bf6f1d70484d..2210d38a4dbf 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -42,6 +42,7 @@
#include <linux/nmi.h>
#include <linux/fs.h>
#include <linux/trace.h>
+#include <linux/kprobes.h>
#include <linux/sched/clock.h>
#include <linux/sched/rt.h>
@@ -2590,6 +2591,7 @@ static void __ftrace_trace_stack(struct ring_buffer *buffer,
struct ring_buffer_event *event;
struct stack_entry *entry;
struct stack_trace trace;
+ struct kretprobe_instance *ri = current_kretprobe_instance();
int use_stack;
int size = FTRACE_STACK_ENTRIES;
@@ -2626,7 +2628,9 @@ static void __ftrace_trace_stack(struct ring_buffer *buffer,
trace.entries = this_cpu_ptr(ftrace_stack.calls);
trace.max_entries = FTRACE_STACK_MAX_ENTRIES;
- if (regs)
+ if (ri)
+ kretprobe_save_stack_trace(ri, &trace);
+ else if (regs)
save_stack_trace_regs(regs, &trace);
else
save_stack_trace(&trace);
@@ -2653,7 +2657,10 @@ static void __ftrace_trace_stack(struct ring_buffer *buffer,
else {
trace.max_entries = FTRACE_STACK_ENTRIES;
trace.entries = entry->caller;
- if (regs)
+
+ if (ri)
+ kretprobe_save_stack_trace(ri, &trace);
+ else if (regs)
save_stack_trace_regs(regs, &trace);
else
save_stack_trace(&trace);
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc
new file mode 100644
index 000000000000..03146c6a1a3c
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc
@@ -0,0 +1,25 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0+
+# description: Kretprobe dynamic event with a stacktrace
+
+[ -f kprobe_events ] || exit_unsupported # this is configurable
+
+echo 0 > events/enable
+echo 1 > options/stacktrace
+
+echo 'r:teststackprobe sched_fork $retval' > kprobe_events
+grep teststackprobe kprobe_events
+test -d events/kprobes/teststackprobe
+
+clear_trace
+echo 1 > events/kprobes/teststackprobe/enable
+( echo "forked")
+echo 0 > events/kprobes/teststackprobe/enable
+
+# Make sure we don't see kretprobe_trampoline and we see _do_fork.
+! grep 'kretprobe' trace
+grep '_do_fork' trace
+
+echo '-:teststackprobe' >> kprobe_events
+clear_trace
+test -d events/kprobes/teststackprobe && exit_fail || exit_pass
--
2.19.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox