* Re: [PATCH v2 net-next 3/7] dccp: do not assume DCCP code is non preemptible
From: Soheil Hassas Yeganeh @ 2016-04-29 13:21 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, netdev, Alexei Starovoitov,
Marcelo Ricardo Leitner, Eric Dumazet
In-Reply-To: <1461899449-8096-4-git-send-email-edumazet@google.com>
On Thu, Apr 28, 2016 at 11:10 PM, Eric Dumazet <edumazet@google.com> wrote:
> DCCP uses the generic backlog code, and this will soon
> be changed to not disable BH when protocol is called back.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---
> net/dccp/input.c | 2 +-
> net/dccp/ipv4.c | 4 ++--
> net/dccp/ipv6.c | 4 ++--
> net/dccp/options.c | 2 +-
> 4 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/net/dccp/input.c b/net/dccp/input.c
> index 2437ecc13b82..ba347184bda9 100644
> --- a/net/dccp/input.c
> +++ b/net/dccp/input.c
> @@ -359,7 +359,7 @@ send_sync:
> goto discard;
> }
>
> - __DCCP_INC_STATS(DCCP_MIB_INERRS);
> + DCCP_INC_STATS(DCCP_MIB_INERRS);
> discard:
> __kfree_skb(skb);
> return 0;
> diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
> index a8164272e0f4..5c7e413a3ae4 100644
> --- a/net/dccp/ipv4.c
> +++ b/net/dccp/ipv4.c
> @@ -533,8 +533,8 @@ static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
> bh_unlock_sock(ctl_sk);
>
> if (net_xmit_eval(err) == 0) {
> - __DCCP_INC_STATS(DCCP_MIB_OUTSEGS);
> - __DCCP_INC_STATS(DCCP_MIB_OUTRSTS);
> + DCCP_INC_STATS(DCCP_MIB_OUTSEGS);
> + DCCP_INC_STATS(DCCP_MIB_OUTRSTS);
> }
> out:
> dst_release(dst);
> diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
> index 0f4eb4ea57a5..d176f4e66369 100644
> --- a/net/dccp/ipv6.c
> +++ b/net/dccp/ipv6.c
> @@ -277,8 +277,8 @@ static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb)
> if (!IS_ERR(dst)) {
> skb_dst_set(skb, dst);
> ip6_xmit(ctl_sk, skb, &fl6, NULL, 0);
> - __DCCP_INC_STATS(DCCP_MIB_OUTSEGS);
> - __DCCP_INC_STATS(DCCP_MIB_OUTRSTS);
> + DCCP_INC_STATS(DCCP_MIB_OUTSEGS);
> + DCCP_INC_STATS(DCCP_MIB_OUTRSTS);
> return;
> }
>
> diff --git a/net/dccp/options.c b/net/dccp/options.c
> index b82b7ee9a1d2..74d29c56c367 100644
> --- a/net/dccp/options.c
> +++ b/net/dccp/options.c
> @@ -253,7 +253,7 @@ out_nonsensical_length:
> return 0;
>
> out_invalid_option:
> - __DCCP_INC_STATS(DCCP_MIB_INVALIDOPT);
> + DCCP_INC_STATS(DCCP_MIB_INVALIDOPT);
> rc = DCCP_RESET_CODE_OPTION_ERROR;
> out_featneg_failed:
> DCCP_WARN("DCCP(%p): Option %d (len=%d) error=%u\n", sk, opt, len, rc);
> --
> 2.8.0.rc3.226.g39d4020
>
^ permalink raw reply
* Re: [PATCH v2 net-next 4/7] udp: prepare for non BH masking at backlog processing
From: Soheil Hassas Yeganeh @ 2016-04-29 13:23 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, netdev, Alexei Starovoitov,
Marcelo Ricardo Leitner, Eric Dumazet
In-Reply-To: <1461899449-8096-5-git-send-email-edumazet@google.com>
On Thu, Apr 28, 2016 at 11:10 PM, Eric Dumazet <edumazet@google.com> wrote:
> UDP uses the generic socket backlog code, and this will soon
> be changed to not disable BH when protocol is called back.
>
> We need to use appropriate SNMP accessors.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---
> net/ipv4/udp.c | 4 ++--
> net/ipv6/udp.c | 4 ++--
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 093284c5c03b..f67f52ba4809 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1514,9 +1514,9 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>
> /* Note that an ENOMEM error is charged twice */
> if (rc == -ENOMEM)
> - __UDP_INC_STATS(sock_net(sk), UDP_MIB_RCVBUFERRORS,
> + UDP_INC_STATS(sock_net(sk), UDP_MIB_RCVBUFERRORS,
> is_udplite);
> - __UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
> + UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
> kfree_skb(skb);
> trace_udp_fail_queue_rcv_skb(rc, sk);
> return -1;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 1ba5a74ac18f..f911c63f79e6 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -570,9 +570,9 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>
> /* Note that an ENOMEM error is charged twice */
> if (rc == -ENOMEM)
> - __UDP6_INC_STATS(sock_net(sk),
> + UDP6_INC_STATS(sock_net(sk),
> UDP_MIB_RCVBUFERRORS, is_udplite);
> - __UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
> + UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
> kfree_skb(skb);
> return -1;
> }
> --
> 2.8.0.rc3.226.g39d4020
>
^ permalink raw reply
* Re: [RFC PATCH net] ipv6/ila: fix nlsize calculation for lwtunnel
From: Nicolas Dichtel @ 2016-04-29 13:35 UTC (permalink / raw)
To: Tom Herbert, David Miller; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S36jkPqx5U+B9akRYZ+xiRhozk_AdOmKBEr4N9mhwTKQMw@mail.gmail.com>
Le 28/04/2016 18:07, Tom Herbert a écrit :
> On Wed, Apr 27, 2016 at 12:20 PM, David Miller <davem@davemloft.net> wrote:
>> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> Date: Fri, 22 Apr 2016 17:58:02 +0200
>>
>>> The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.
>>>
>>> Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module")
>>> CC: Tom Herbert <tom@herbertland.com>
>>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>>> ---
>>>
>>> Tom, when I read the comment, I feel I'm misssing something, but what?
>>
>> Tom, seriously, please look at this.
>>
> Yes, it is an issue. Need to do add size for ILA_ATTR_LOCATOR and
> ILA_ATTR_CSUM_MODE. Also not we made ILA_ATTR_CSUM_MODE u64 in fill
> encap info. I will send a path shortly.
David, I saw that you didn't apply this patch. Note that Tom's patch is
for net-next while this one is for net.
I think that both patches are needed (and they will conflict after the next merge).
Regards,
Nicolas
^ permalink raw reply
* Re: [PATCH v3 0/2] sctp: delay calls to sk_data_ready() as much as possible
From: Neil Horman @ 2016-04-29 13:36 UTC (permalink / raw)
To: marcelo.leitner
Cc: David Miller, netdev, vyasevich, linux-sctp, David.Laight, jkbs
In-Reply-To: <20160428204659.GA2276@localhost.localdomain>
On Thu, Apr 28, 2016 at 05:46:59PM -0300, marcelo.leitner@gmail.com wrote:
> On Thu, Apr 14, 2016 at 05:19:00PM -0300, marcelo.leitner@gmail.com wrote:
> > On Thu, Apr 14, 2016 at 04:03:51PM -0400, Neil Horman wrote:
> > > On Thu, Apr 14, 2016 at 02:59:16PM -0400, David Miller wrote:
> > > > From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > > Date: Thu, 14 Apr 2016 14:00:49 -0300
> > > >
> > > > > Em 14-04-2016 10:03, Neil Horman escreveu:
> > > > >> On Wed, Apr 13, 2016 at 11:05:32PM -0400, David Miller wrote:
> > > > >>> From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > > >>> Date: Fri, 8 Apr 2016 16:41:26 -0300
> > > > >>>
> > > > >>>> 1st patch is a preparation for the 2nd. The idea is to not call
> > > > >>>> ->sk_data_ready() for every data chunk processed while processing
> > > > >>>> packets but only once before releasing the socket.
> > > > >>>>
> > > > >>>> v2: patchset re-checked, small changelog fixes
> > > > >>>> v3: on patch 2, make use of local vars to make it more readable
> > > > >>>
> > > > >>> Applied to net-next, but isn't this reduced overhead coming at the
> > > > >>> expense of latency? What if that lower latency is important to the
> > > > >>> application and/or consumer?
> > > > >> Thats a fair point, but I'd make the counter argument that, as it
> > > > >> currently
> > > > >> stands, any latency introduced (or removed), is an artifact of our
> > > > >> implementation rather than a designed feature of it. That is to say,
> > > > >> we make no
> > > > >> guarantees at the application level regarding how long it takes to
> > > > >> signal data
> > > > >> readines from the time we get data off the wire, so I would rather see
> > > > >> our
> > > > >> throughput raised if we can, as thats been sctp's more pressing
> > > > >> achilles heel.
> > > > >>
> > > > >>
> > > > >> Thats not to say I'd like to enable lower latency, but I'd rather have
> > > > >> this now,
> > > > >> and start pondering how to design that in. Perhaps we can convert the
> > > > >> pending
> > > > >> flag to a counter to count the number of events we enqueue, and call
> > > > >> sk_data_ready every time we reach a sysctl defined threshold.
> > > > >
> > > > > That and also that there is no chance of the application reading the
> > > > > first chunks before all current ToDo's are performed by either the bh
> > > > > or backlog handlers for that packet. Socket lock won't be cycled in
> > > > > between chunks so the application is going to wait all the processing
> > > > > one way or another.
> > > >
> > > > But it takes time to signal the wakeup to the remote cpu the process
> > > > was running on, schedule out the current process on that cpu (if it
> > > > has in fact lost it's timeslice), and then finally look at the socket
> > > > queue.
> > > >
> > > > Of course this is all assuming the process was sleeping in the first
> > > > place, either in recv or more likely poll.
> > > >
> > > > I really think signalling early helps performance.
> > > >
> > >
> > > Early, yes, often, not so much :). Perhaps what would be adventageous would be
> > > to signal at the start of a set of enqueues, rather than at the end. That would
> > > be equivalent in terms of not signaling more than needed, but would eliminate
> > > the signaling on every chunk. Perhaps what you could do Marcelo would be to
> > > change the sense of the signal_ready flag to be a has_signaled flag. e.g. call
> > > sk_data_ready in ulp_event_tail like we used to, but only if the has_signaled
> > > flag isn't set, then set the flag, and clear it at the end of the command
> > > interpreter.
> > >
> > > That would be a best of both worlds solution, as long as theres no chance of
> > > race with user space reading from the socket before we were done enqueuing (i.e.
> > > you have to guarantee that the socket lock stays held, which I think we do).
> >
> > That is my feeling too. Will work on it. Thanks :-)
>
> I did the change and tested it on real machines set all for performance.
> I couldn't spot any difference between both implementations.
>
> Set RSS and queue irq affinity for a cpu and taskset netperf and another
> app I wrote to run on another cpu. It hits socket backlog quite often
> but still do direct processing every now and then.
>
> With current state, netperf, scenario above. Results of perf sched
> record for the CPUs in use, reported by perf sched latency:
>
> Task | Runtime ms | Switches | Average delay ms |
> Maximum delay ms | Maximum delay at |
> netserver:3205 | 9999.490 ms | 10 | avg: 0.003 ms |
> max: 0.004 ms | max at: 69087.753356 s
>
> another run
> netserver:3483 | 9999.412 ms | 15 | avg: 0.003 ms |
> max: 0.004 ms | max at: 69194.749814 s
>
> With the patch below, same test:
> netserver:2643 | 10000.110 ms | 14 | avg: 0.003 ms |
> max: 0.004 ms | max at: 172.006315 s
>
> another run:
> netserver:2698 | 10000.049 ms | 15 | avg: 0.003 ms |
> max: 0.004 ms | max at: 368.061672 s
>
> I'll be happy to do more tests if you have any suggestions on how/what
> to test.
>
> ---8<---
>
I think this looks reasonable, but can you post it properly please, as a patch
against the head of teh net-next tree, rather than a diff from your previous
work (which wasn't comitted)
Thanks!
Neil
^ permalink raw reply
* Re: [PATCH v2 net-next 6/7] net: do not block BH while processing socket backlog
From: Soheil Hassas Yeganeh @ 2016-04-29 13:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, netdev, Alexei Starovoitov,
Marcelo Ricardo Leitner, Eric Dumazet
In-Reply-To: <1461899449-8096-7-git-send-email-edumazet@google.com>
On Thu, Apr 28, 2016 at 11:10 PM, Eric Dumazet <edumazet@google.com> wrote:
> Socket backlog processing is a major latency source.
>
> With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
> holding cpu for more than 5 ms, and packets being dropped by the NIC
> once ring buffer is filled.
>
> All users are now ready to be called from process context,
> we can unblock BH and let interrupts be serviced faster.
>
> cond_resched_softirq() could be removed, as it has no more user.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---
> net/core/sock.c | 22 ++++++++--------------
> 1 file changed, 8 insertions(+), 14 deletions(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index e16a5db853c6..70744dbb6c3f 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2019,33 +2019,27 @@ static void __release_sock(struct sock *sk)
> __releases(&sk->sk_lock.slock)
> __acquires(&sk->sk_lock.slock)
> {
> - struct sk_buff *skb = sk->sk_backlog.head;
> + struct sk_buff *skb, *next;
>
> - do {
> + while ((skb = sk->sk_backlog.head) != NULL) {
> sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
> - bh_unlock_sock(sk);
>
> - do {
> - struct sk_buff *next = skb->next;
> + spin_unlock_bh(&sk->sk_lock.slock);
>
> + do {
> + next = skb->next;
> prefetch(next);
> WARN_ON_ONCE(skb_dst_is_noref(skb));
> skb->next = NULL;
> sk_backlog_rcv(sk, skb);
>
> - /*
> - * We are in process context here with softirqs
> - * disabled, use cond_resched_softirq() to preempt.
> - * This is safe to do because we've taken the backlog
> - * queue private:
> - */
> - cond_resched_softirq();
> + cond_resched();
>
> skb = next;
> } while (skb != NULL);
>
> - bh_lock_sock(sk);
> - } while ((skb = sk->sk_backlog.head) != NULL);
> + spin_lock_bh(&sk->sk_lock.slock);
> + }
>
> /*
> * Doing the zeroing here guarantee we can not loop forever
> --
> 2.8.0.rc3.226.g39d4020
>
This is great! very nice patch.
^ permalink raw reply
* Re: [PATCH v3 0/2] sctp: delay calls to sk_data_ready() as much as possible
From: marcelo.leitner @ 2016-04-29 13:47 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, netdev, vyasevich, linux-sctp, David.Laight, jkbs
In-Reply-To: <20160429133637.GA31121@hmsreliant.think-freely.org>
On Fri, Apr 29, 2016 at 09:36:37AM -0400, Neil Horman wrote:
> On Thu, Apr 28, 2016 at 05:46:59PM -0300, marcelo.leitner@gmail.com wrote:
> > On Thu, Apr 14, 2016 at 05:19:00PM -0300, marcelo.leitner@gmail.com wrote:
> > > On Thu, Apr 14, 2016 at 04:03:51PM -0400, Neil Horman wrote:
> > > > On Thu, Apr 14, 2016 at 02:59:16PM -0400, David Miller wrote:
> > > > > From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > > > Date: Thu, 14 Apr 2016 14:00:49 -0300
> > > > >
> > > > > > Em 14-04-2016 10:03, Neil Horman escreveu:
> > > > > >> On Wed, Apr 13, 2016 at 11:05:32PM -0400, David Miller wrote:
> > > > > >>> From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > > > >>> Date: Fri, 8 Apr 2016 16:41:26 -0300
> > > > > >>>
> > > > > >>>> 1st patch is a preparation for the 2nd. The idea is to not call
> > > > > >>>> ->sk_data_ready() for every data chunk processed while processing
> > > > > >>>> packets but only once before releasing the socket.
> > > > > >>>>
> > > > > >>>> v2: patchset re-checked, small changelog fixes
> > > > > >>>> v3: on patch 2, make use of local vars to make it more readable
> > > > > >>>
> > > > > >>> Applied to net-next, but isn't this reduced overhead coming at the
> > > > > >>> expense of latency? What if that lower latency is important to the
> > > > > >>> application and/or consumer?
> > > > > >> Thats a fair point, but I'd make the counter argument that, as it
> > > > > >> currently
> > > > > >> stands, any latency introduced (or removed), is an artifact of our
> > > > > >> implementation rather than a designed feature of it. That is to say,
> > > > > >> we make no
> > > > > >> guarantees at the application level regarding how long it takes to
> > > > > >> signal data
> > > > > >> readines from the time we get data off the wire, so I would rather see
> > > > > >> our
> > > > > >> throughput raised if we can, as thats been sctp's more pressing
> > > > > >> achilles heel.
> > > > > >>
> > > > > >>
> > > > > >> Thats not to say I'd like to enable lower latency, but I'd rather have
> > > > > >> this now,
> > > > > >> and start pondering how to design that in. Perhaps we can convert the
> > > > > >> pending
> > > > > >> flag to a counter to count the number of events we enqueue, and call
> > > > > >> sk_data_ready every time we reach a sysctl defined threshold.
> > > > > >
> > > > > > That and also that there is no chance of the application reading the
> > > > > > first chunks before all current ToDo's are performed by either the bh
> > > > > > or backlog handlers for that packet. Socket lock won't be cycled in
> > > > > > between chunks so the application is going to wait all the processing
> > > > > > one way or another.
> > > > >
> > > > > But it takes time to signal the wakeup to the remote cpu the process
> > > > > was running on, schedule out the current process on that cpu (if it
> > > > > has in fact lost it's timeslice), and then finally look at the socket
> > > > > queue.
> > > > >
> > > > > Of course this is all assuming the process was sleeping in the first
> > > > > place, either in recv or more likely poll.
> > > > >
> > > > > I really think signalling early helps performance.
> > > > >
> > > >
> > > > Early, yes, often, not so much :). Perhaps what would be adventageous would be
> > > > to signal at the start of a set of enqueues, rather than at the end. That would
> > > > be equivalent in terms of not signaling more than needed, but would eliminate
> > > > the signaling on every chunk. Perhaps what you could do Marcelo would be to
> > > > change the sense of the signal_ready flag to be a has_signaled flag. e.g. call
> > > > sk_data_ready in ulp_event_tail like we used to, but only if the has_signaled
> > > > flag isn't set, then set the flag, and clear it at the end of the command
> > > > interpreter.
> > > >
> > > > That would be a best of both worlds solution, as long as theres no chance of
> > > > race with user space reading from the socket before we were done enqueuing (i.e.
> > > > you have to guarantee that the socket lock stays held, which I think we do).
> > >
> > > That is my feeling too. Will work on it. Thanks :-)
> >
> > I did the change and tested it on real machines set all for performance.
> > I couldn't spot any difference between both implementations.
> >
> > Set RSS and queue irq affinity for a cpu and taskset netperf and another
> > app I wrote to run on another cpu. It hits socket backlog quite often
> > but still do direct processing every now and then.
> >
> > With current state, netperf, scenario above. Results of perf sched
> > record for the CPUs in use, reported by perf sched latency:
> >
> > Task | Runtime ms | Switches | Average delay ms |
> > Maximum delay ms | Maximum delay at |
> > netserver:3205 | 9999.490 ms | 10 | avg: 0.003 ms |
> > max: 0.004 ms | max at: 69087.753356 s
> >
> > another run
> > netserver:3483 | 9999.412 ms | 15 | avg: 0.003 ms |
> > max: 0.004 ms | max at: 69194.749814 s
> >
> > With the patch below, same test:
> > netserver:2643 | 10000.110 ms | 14 | avg: 0.003 ms |
> > max: 0.004 ms | max at: 172.006315 s
> >
> > another run:
> > netserver:2698 | 10000.049 ms | 15 | avg: 0.003 ms |
> > max: 0.004 ms | max at: 368.061672 s
> >
> > I'll be happy to do more tests if you have any suggestions on how/what
> > to test.
> >
> > ---8<---
> >
> I think this looks reasonable, but can you post it properly please, as a patch
> against the head of teh net-next tree, rather than a diff from your previous
> work (which wasn't comitted)
The idea was to not officially post it yet, more just as a reference,
because I can't see any gains from it. I'm reluctant just due to that,
no strong opinion here on one way or another.
If you think it's better anyway to signal it early, I'll properly repost
it.
Thanks,
Marcelo
^ permalink raw reply
* Re: [PATCH v2 net-next 1/7] tcp: do not assume TCP code is non preemptible
From: Eric Dumazet @ 2016-04-29 14:37 UTC (permalink / raw)
To: Soheil Hassas Yeganeh
Cc: Eric Dumazet, David S . Miller, netdev, Alexei Starovoitov,
Marcelo Ricardo Leitner
In-Reply-To: <CACSApvYruzhRVLWbHowsPpfnN0f1vhqRywt-2j1vCOFxmmV9+A@mail.gmail.com>
On Fri, 2016-04-29 at 09:18 -0400, Soheil Hassas Yeganeh wrote:
> On Thu, Apr 28, 2016 at 11:10 PM, Eric Dumazet <edumazet@google.com> wrote:
> > + NET_ADD_STATS(sock_net(sk),
> > + LINUX_MIB_TCPHYSTARTTRAINCWND,
> > + pp>>sn__cwdd);
> nit: shouldn't this be tp->snd_cwnd?
Interesting... my git tree has the proper thing, but the file on my SSD
after git format-patch got mangled somehow...
Thanks for noticing !
^ permalink raw reply
* [PATCH net v2 1/1] tipc: only process unicast on intended node
From: Jon Maloy @ 2016-04-29 14:40 UTC (permalink / raw)
To: davem
Cc: netdev, Paul Gortmaker, parthasarathy.bhuvaragan, richard.alpe,
ying.xue, maloy, tipc-discussion, Hamish Martin, Jon Maloy
From: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
We have observed complete lock up of broadcast-link transmission due to
unacknowledged packets never being removed from the 'transmq' queue. This
is traced to nodes having their ack field set beyond the sequence number
of packets that have actually been transmitted to them.
Consider an example where node 1 has sent 10 packets to node 2 on a
link and node 3 has sent 20 packets to node 2 on another link. We
see examples of an ack from node 2 destined for node 3 being treated as
an ack from node 2 at node 1. This leads to the ack on the node 1 to node
2 link being increased to 20 even though we have only sent 10 packets.
When node 1 does get around to sending further packets, none of the
packets with sequence numbers less than 21 are actually removed from the
transmq.
To resolve this we reinstate some code lost in commit d999297c3dbb ("tipc:
reduce locking scope during packet reception") which ensures that only
messages destined for the receiving node are processed by that node. This
prevents the sequence numbers from getting out of sync and resolves the
packet leakage, thereby resolving the broadcast-link transmission
lock-ups we observed.
While we are aware that this change only patches over a root problem that
we still haven't identified, this is a sanity test that it is always
legitimate to do. It will remain in the code even after we identify and
fix the real problem.
Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: John Thompson <john.thompson@alliedtelesis.co.nz>
Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
net/tipc/node.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/tipc/node.c b/net/tipc/node.c
index ace178f..9aaa1bc 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1444,6 +1444,7 @@ void tipc_rcv(struct net *net, struct sk_buff *skb, struct tipc_bearer *b)
int bearer_id = b->identity;
struct tipc_link_entry *le;
u16 bc_ack = msg_bcast_ack(hdr);
+ u32 self = tipc_own_addr(net);
int rc = 0;
__skb_queue_head_init(&xmitq);
@@ -1460,6 +1461,10 @@ void tipc_rcv(struct net *net, struct sk_buff *skb, struct tipc_bearer *b)
return tipc_node_bc_rcv(net, skb, bearer_id);
}
+ /* Discard unicast link messages destined for another node */
+ if (unlikely(!msg_short(hdr) && (msg_destnode(hdr) != self)))
+ goto discard;
+
/* Locate neighboring node that sent packet */
n = tipc_node_find(net, msg_prevnode(hdr));
if (unlikely(!n))
--
1.9.1
^ permalink raw reply related
* Re: [PATCH v2 net-next 1/7] tcp: do not assume TCP code is non preemptible
From: Soheil Hassas Yeganeh @ 2016-04-29 14:41 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eric Dumazet, David S . Miller, netdev, Alexei Starovoitov,
Marcelo Ricardo Leitner
In-Reply-To: <1461940645.5535.144.camel@edumazet-glaptop3.roam.corp.google.com>
On Fri, Apr 29, 2016 at 10:37 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-04-29 at 09:18 -0400, Soheil Hassas Yeganeh wrote:
>> On Thu, Apr 28, 2016 at 11:10 PM, Eric Dumazet <edumazet@google.com> wrote:
>
>> > + NET_ADD_STATS(sock_net(sk),
>> > + LINUX_MIB_TCPHYSTARTTRAINCWND,
>> > + pp>>sn__cwdd);
>> nit: shouldn't this be tp->snd_cwnd?
>
> Interesting... my git tree has the proper thing, but the file on my SSD
> after git format-patch got mangled somehow...
>
> Thanks for noticing !
Thanks for the patches!
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
^ permalink raw reply
* Re: [PATCH net-next] net: dsa: mv88e6xxx: replace ds with ps where possible
From: Vivien Didelot @ 2016-04-29 14:55 UTC (permalink / raw)
To: Andrew Lunn; +Cc: netdev, linux-kernel, kernel, David S. Miller
In-Reply-To: <20160429130645.GG4053@lunn.ch>
Hi Andrew,
Andrew Lunn <andrew@lunn.ch> writes:
> On Thu, Apr 28, 2016 at 09:24:06PM -0400, Vivien Didelot wrote:
>> From: Andrew Lunn <andrew@lunn.ch>
>>
>> The dsa_switch structure ds is actually needed in very few places,
>> mostly during setup of the switch. The private structure ps is however
>> needed nearly everywhere. Pass ps, not ds internally.
>>
>> [vd: rebased Andrew's patch.]
>
> Hi Vivien
>
> Thanks for picking up this patch and rebasing it.
>
> I would generally put comments like that bellow the ---. They don't
> need to be in the commit log.
I've seen that '[<initials-or-firstname>: <message>]' is often used to
inject notes in the message, but that might not be a standard.
Thanks for the note,
Vivien
^ permalink raw reply
* Re: [PATCH nf-next 8/9] netfilter: conntrack: use a single hashtable for all namespaces
From: Florian Westphal @ 2016-04-29 15:04 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel, netdev
In-Reply-To: <1461863628-23350-9-git-send-email-fw@strlen.de>
Florian Westphal <fw@strlen.de> wrote:
> We already include netns address in the hash and compare the netns pointers
> during lookup, so even if namespaces have overlapping addresses entries
> will be spread across the table.
>
> Assuming 64k bucket size, this change saves 0.5 mbyte per namespace on a
> 64bit system.
>
> NAT bysrc and expectation hash is still per namespace, those will
> changed too soon.
>
> Future patch will also make conntrack object slab cache global again.
>
> @@ -1527,7 +1528,6 @@ i_see_dead_people:
> }
>
> list_for_each_entry(net, net_exit_list, exit_list) {
> - nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size);
Removing this is ok, but nf_ct_free_hashtable() must now be called in
nf_conntrack_cleanup_end().
I'll wait with v2 for a couple of days.
^ permalink raw reply
* Re: [PATCH] mdio_bus: Fix MDIO bus scanning in __mdiobus_register()
From: Sergei Shtylyov @ 2016-04-29 15:46 UTC (permalink / raw)
To: Marek Vasut, netdev
Cc: Arnd Bergmann, David S . Miller, Dinh Nguyen, Florian Fainelli
In-Reply-To: <5723493C.1040801@denx.de>
Hello.
On 04/29/2016 02:45 PM, Marek Vasut wrote:
>> First of all, thank you for the patch!
>> You beat me to it (and not only me). :-)
>
> Heh, hacking at night has it's perks :)
I know, I'm just supposed to work on different things. :-)
>> On 4/29/2016 4:09 AM, Marek Vasut wrote:
>>
>>> Since commit b74766a0a0feeef5c779709cc5d109451c0d5b17 in linux-next,
>>> ( phylib: don't return NULL from get_phy_device() ), phy_get_device()
>>
>> scripts/checkpatch.pl now enforces certain commit citing style, yours
>> doesn't quite match it.
>
> Ha, I didn't know that checkpatch can now warn about this too, nice. Is
> that in next already ? I just tried checkpatch and it doesn't warn about it.
It's been merged long ago. Actually, I've just run this script from the
current Linus' tree on your patch, here's the results:
ERROR: Please use git commit description style 'commit <12+ chars of sha1>
("<title line>")' - ie: 'commit fatal: bad o ("cc5d109451c0d5b17")'
#4:
Since commit b74766a0a0feeef5c779709cc5d109451c0d5b17 in linux-next,
WARNING: 'immediatelly' may be misspelled - perhaps 'immediately'?
#23:
'if (IS_ERR(phydev))' condition and the loop exits immediatelly if the
total: 1 errors, 1 warnings, 0 checks, 8 lines checked
> Anyway, regarding this format, do you want V2 ? Originally, I had the
> full commit info in the message, but that was just taking space and
> it is not the commit which is important in the message, so I trimmed
> it down.
I'll let DaveM decide.
>>> will return ERR_PTR(-ENODEV) instead of NULL if the PHY device ID is
>>> all ones.
>>>
>>> This causes problem with stmmac driver and likely some other drivers
>>> which call mdiobus_register(). I triggered this bug on SoCFPGA MCVEVK
>>> board with linux-next 20160427 and 20160428. In case of the stmmac, if
>>> there is no PHY node specified in the DT for the stmmac block, the stmmac
>>> driver ( drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c function
>>> stmmac_mdio_register() ) will call mdiobus_register() , which will
>>> register the MDIO bus and probe for the PHY.
>>>
>>> The mdiobus_register() resp. __mdiobus_register() iterates over all of
>>> the addresses on the MDIO bus and calls mdiobus_scan() for each of them,
>>> which invokes get_phy_device(). Before the aforementioned patch, the
>>> mdiobus_scan() would return NULL if no PHY was found on a given address
>>> and mdiobus_register() would continue and try the next PHY address. Now,
>>> mdiobus_scan() returns ERR_PTR(-ENODEV), which is caught by the
>>> 'if (IS_ERR(phydev))' condition and the loop exits immediatelly if the
>>> PHY address does not contain PHY.
>>>
>>> Repair this by explicitly checking for the ERR_PTR(-ENODEV) and if this
>>> error comes around, continue with the next PHY address.
>>>
>>> Signed-off-by: Marek Vasut <marex@denx.de>
>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>> Cc: David S. Miller <davem@davemloft.net>
>>> Cc: Dinh Nguyen <dinguyen@opensource.altera.com>
>>> Cc: Florian Fainelli <f.fainelli@gmail.com>
>>> Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
>>
>> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
>>
>>> ---
>>> drivers/net/phy/mdio_bus.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> NOTE: I don't quite like this explicit check , but I don't have better
>>> idea now.
>>
>> It's fine. I was going to do just the same :-)
>
> OK, I'm glad I'm not alone on this one :)
>
>>> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
>>> index 499003ee..388f992 100644
>>> --- a/drivers/net/phy/mdio_bus.c
>>> +++ b/drivers/net/phy/mdio_bus.c
>>> @@ -333,7 +333,7 @@ int __mdiobus_register(struct mii_bus *bus, struct
>>> module *owner)
>>> struct phy_device *phydev;
>>>
>>> phydev = mdiobus_scan(bus, i);
>>> - if (IS_ERR(phydev)) {
>>> + if (IS_ERR(phydev) && (PTR_ERR(phydev) != -ENODEV)) {
>>
>> Parens around the second operand of && are not really needed though...
>
> While I agree, I also prefer to make things obvious when reading the
> code by adding the parenthesis. It's a matter of taste I think. Just let
> me know if I should spin V2 without them :)
Again, let DaveM decide. But I don't think the code being patched uses
parens in such cases... I wouldn't, for sure.
> Thanks for the review!
You've saved me some work (but I still need to analyze the other call paths).
>> [...]
MBR, Sergei
^ permalink raw reply
* Re: [PATCH v3 0/2] sctp: delay calls to sk_data_ready() as much as possible
From: Neil Horman @ 2016-04-29 16:10 UTC (permalink / raw)
To: marcelo.leitner
Cc: David Miller, netdev, vyasevich, linux-sctp, David.Laight, jkbs
In-Reply-To: <20160429134725.GB5676@localhost.localdomain>
On Fri, Apr 29, 2016 at 10:47:25AM -0300, marcelo.leitner@gmail.com wrote:
> On Fri, Apr 29, 2016 at 09:36:37AM -0400, Neil Horman wrote:
> > On Thu, Apr 28, 2016 at 05:46:59PM -0300, marcelo.leitner@gmail.com wrote:
> > > On Thu, Apr 14, 2016 at 05:19:00PM -0300, marcelo.leitner@gmail.com wrote:
> > > > On Thu, Apr 14, 2016 at 04:03:51PM -0400, Neil Horman wrote:
> > > > > On Thu, Apr 14, 2016 at 02:59:16PM -0400, David Miller wrote:
> > > > > > From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > > > > Date: Thu, 14 Apr 2016 14:00:49 -0300
> > > > > >
> > > > > > > Em 14-04-2016 10:03, Neil Horman escreveu:
> > > > > > >> On Wed, Apr 13, 2016 at 11:05:32PM -0400, David Miller wrote:
> > > > > > >>> From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > > > > >>> Date: Fri, 8 Apr 2016 16:41:26 -0300
> > > > > > >>>
> > > > > > >>>> 1st patch is a preparation for the 2nd. The idea is to not call
> > > > > > >>>> ->sk_data_ready() for every data chunk processed while processing
> > > > > > >>>> packets but only once before releasing the socket.
> > > > > > >>>>
> > > > > > >>>> v2: patchset re-checked, small changelog fixes
> > > > > > >>>> v3: on patch 2, make use of local vars to make it more readable
> > > > > > >>>
> > > > > > >>> Applied to net-next, but isn't this reduced overhead coming at the
> > > > > > >>> expense of latency? What if that lower latency is important to the
> > > > > > >>> application and/or consumer?
> > > > > > >> Thats a fair point, but I'd make the counter argument that, as it
> > > > > > >> currently
> > > > > > >> stands, any latency introduced (or removed), is an artifact of our
> > > > > > >> implementation rather than a designed feature of it. That is to say,
> > > > > > >> we make no
> > > > > > >> guarantees at the application level regarding how long it takes to
> > > > > > >> signal data
> > > > > > >> readines from the time we get data off the wire, so I would rather see
> > > > > > >> our
> > > > > > >> throughput raised if we can, as thats been sctp's more pressing
> > > > > > >> achilles heel.
> > > > > > >>
> > > > > > >>
> > > > > > >> Thats not to say I'd like to enable lower latency, but I'd rather have
> > > > > > >> this now,
> > > > > > >> and start pondering how to design that in. Perhaps we can convert the
> > > > > > >> pending
> > > > > > >> flag to a counter to count the number of events we enqueue, and call
> > > > > > >> sk_data_ready every time we reach a sysctl defined threshold.
> > > > > > >
> > > > > > > That and also that there is no chance of the application reading the
> > > > > > > first chunks before all current ToDo's are performed by either the bh
> > > > > > > or backlog handlers for that packet. Socket lock won't be cycled in
> > > > > > > between chunks so the application is going to wait all the processing
> > > > > > > one way or another.
> > > > > >
> > > > > > But it takes time to signal the wakeup to the remote cpu the process
> > > > > > was running on, schedule out the current process on that cpu (if it
> > > > > > has in fact lost it's timeslice), and then finally look at the socket
> > > > > > queue.
> > > > > >
> > > > > > Of course this is all assuming the process was sleeping in the first
> > > > > > place, either in recv or more likely poll.
> > > > > >
> > > > > > I really think signalling early helps performance.
> > > > > >
> > > > >
> > > > > Early, yes, often, not so much :). Perhaps what would be adventageous would be
> > > > > to signal at the start of a set of enqueues, rather than at the end. That would
> > > > > be equivalent in terms of not signaling more than needed, but would eliminate
> > > > > the signaling on every chunk. Perhaps what you could do Marcelo would be to
> > > > > change the sense of the signal_ready flag to be a has_signaled flag. e.g. call
> > > > > sk_data_ready in ulp_event_tail like we used to, but only if the has_signaled
> > > > > flag isn't set, then set the flag, and clear it at the end of the command
> > > > > interpreter.
> > > > >
> > > > > That would be a best of both worlds solution, as long as theres no chance of
> > > > > race with user space reading from the socket before we were done enqueuing (i.e.
> > > > > you have to guarantee that the socket lock stays held, which I think we do).
> > > >
> > > > That is my feeling too. Will work on it. Thanks :-)
> > >
> > > I did the change and tested it on real machines set all for performance.
> > > I couldn't spot any difference between both implementations.
> > >
> > > Set RSS and queue irq affinity for a cpu and taskset netperf and another
> > > app I wrote to run on another cpu. It hits socket backlog quite often
> > > but still do direct processing every now and then.
> > >
> > > With current state, netperf, scenario above. Results of perf sched
> > > record for the CPUs in use, reported by perf sched latency:
> > >
> > > Task | Runtime ms | Switches | Average delay ms |
> > > Maximum delay ms | Maximum delay at |
> > > netserver:3205 | 9999.490 ms | 10 | avg: 0.003 ms |
> > > max: 0.004 ms | max at: 69087.753356 s
> > >
> > > another run
> > > netserver:3483 | 9999.412 ms | 15 | avg: 0.003 ms |
> > > max: 0.004 ms | max at: 69194.749814 s
> > >
> > > With the patch below, same test:
> > > netserver:2643 | 10000.110 ms | 14 | avg: 0.003 ms |
> > > max: 0.004 ms | max at: 172.006315 s
> > >
> > > another run:
> > > netserver:2698 | 10000.049 ms | 15 | avg: 0.003 ms |
> > > max: 0.004 ms | max at: 368.061672 s
> > >
> > > I'll be happy to do more tests if you have any suggestions on how/what
> > > to test.
> > >
> > > ---8<---
> > >
> > I think this looks reasonable, but can you post it properly please, as a patch
> > against the head of teh net-next tree, rather than a diff from your previous
> > work (which wasn't comitted)
>
> The idea was to not officially post it yet, more just as a reference,
> because I can't see any gains from it. I'm reluctant just due to that,
> no strong opinion here on one way or another.
>
> If you think it's better anyway to signal it early, I'll properly repost
> it.
>
Yeah, your results seem to me to indicate that for your test at least, signaling
early vs. late doesn't make alot of difference, but Dave I think made a point in
principle in that allowing processes to wake up when we start enqueuing can be
better in some situations. So all other things being equal, I'd say go with the
method that you have here.
Best
Neil
> Thanks,
> Marcelo
>
>
^ permalink raw reply
* Re: [PATCH v3 0/2] sctp: delay calls to sk_data_ready() as much as possible
From: marcelo.leitner @ 2016-04-29 16:28 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, netdev, vyasevich, linux-sctp, David.Laight, jkbs
In-Reply-To: <20160429161031.GB31121@hmsreliant.think-freely.org>
On Fri, Apr 29, 2016 at 12:10:31PM -0400, Neil Horman wrote:
> On Fri, Apr 29, 2016 at 10:47:25AM -0300, marcelo.leitner@gmail.com wrote:
> > On Fri, Apr 29, 2016 at 09:36:37AM -0400, Neil Horman wrote:
> > > On Thu, Apr 28, 2016 at 05:46:59PM -0300, marcelo.leitner@gmail.com wrote:
> > > > On Thu, Apr 14, 2016 at 05:19:00PM -0300, marcelo.leitner@gmail.com wrote:
> > > > > On Thu, Apr 14, 2016 at 04:03:51PM -0400, Neil Horman wrote:
> > > > > > On Thu, Apr 14, 2016 at 02:59:16PM -0400, David Miller wrote:
> > > > > > > From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > > > > > Date: Thu, 14 Apr 2016 14:00:49 -0300
> > > > > > >
> > > > > > > > Em 14-04-2016 10:03, Neil Horman escreveu:
> > > > > > > >> On Wed, Apr 13, 2016 at 11:05:32PM -0400, David Miller wrote:
> > > > > > > >>> From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > > > > > >>> Date: Fri, 8 Apr 2016 16:41:26 -0300
> > > > > > > >>>
> > > > > > > >>>> 1st patch is a preparation for the 2nd. The idea is to not call
> > > > > > > >>>> ->sk_data_ready() for every data chunk processed while processing
> > > > > > > >>>> packets but only once before releasing the socket.
> > > > > > > >>>>
> > > > > > > >>>> v2: patchset re-checked, small changelog fixes
> > > > > > > >>>> v3: on patch 2, make use of local vars to make it more readable
> > > > > > > >>>
> > > > > > > >>> Applied to net-next, but isn't this reduced overhead coming at the
> > > > > > > >>> expense of latency? What if that lower latency is important to the
> > > > > > > >>> application and/or consumer?
> > > > > > > >> Thats a fair point, but I'd make the counter argument that, as it
> > > > > > > >> currently
> > > > > > > >> stands, any latency introduced (or removed), is an artifact of our
> > > > > > > >> implementation rather than a designed feature of it. That is to say,
> > > > > > > >> we make no
> > > > > > > >> guarantees at the application level regarding how long it takes to
> > > > > > > >> signal data
> > > > > > > >> readines from the time we get data off the wire, so I would rather see
> > > > > > > >> our
> > > > > > > >> throughput raised if we can, as thats been sctp's more pressing
> > > > > > > >> achilles heel.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Thats not to say I'd like to enable lower latency, but I'd rather have
> > > > > > > >> this now,
> > > > > > > >> and start pondering how to design that in. Perhaps we can convert the
> > > > > > > >> pending
> > > > > > > >> flag to a counter to count the number of events we enqueue, and call
> > > > > > > >> sk_data_ready every time we reach a sysctl defined threshold.
> > > > > > > >
> > > > > > > > That and also that there is no chance of the application reading the
> > > > > > > > first chunks before all current ToDo's are performed by either the bh
> > > > > > > > or backlog handlers for that packet. Socket lock won't be cycled in
> > > > > > > > between chunks so the application is going to wait all the processing
> > > > > > > > one way or another.
> > > > > > >
> > > > > > > But it takes time to signal the wakeup to the remote cpu the process
> > > > > > > was running on, schedule out the current process on that cpu (if it
> > > > > > > has in fact lost it's timeslice), and then finally look at the socket
> > > > > > > queue.
> > > > > > >
> > > > > > > Of course this is all assuming the process was sleeping in the first
> > > > > > > place, either in recv or more likely poll.
> > > > > > >
> > > > > > > I really think signalling early helps performance.
> > > > > > >
> > > > > >
> > > > > > Early, yes, often, not so much :). Perhaps what would be adventageous would be
> > > > > > to signal at the start of a set of enqueues, rather than at the end. That would
> > > > > > be equivalent in terms of not signaling more than needed, but would eliminate
> > > > > > the signaling on every chunk. Perhaps what you could do Marcelo would be to
> > > > > > change the sense of the signal_ready flag to be a has_signaled flag. e.g. call
> > > > > > sk_data_ready in ulp_event_tail like we used to, but only if the has_signaled
> > > > > > flag isn't set, then set the flag, and clear it at the end of the command
> > > > > > interpreter.
> > > > > >
> > > > > > That would be a best of both worlds solution, as long as theres no chance of
> > > > > > race with user space reading from the socket before we were done enqueuing (i.e.
> > > > > > you have to guarantee that the socket lock stays held, which I think we do).
> > > > >
> > > > > That is my feeling too. Will work on it. Thanks :-)
> > > >
> > > > I did the change and tested it on real machines set all for performance.
> > > > I couldn't spot any difference between both implementations.
> > > >
> > > > Set RSS and queue irq affinity for a cpu and taskset netperf and another
> > > > app I wrote to run on another cpu. It hits socket backlog quite often
> > > > but still do direct processing every now and then.
> > > >
> > > > With current state, netperf, scenario above. Results of perf sched
> > > > record for the CPUs in use, reported by perf sched latency:
> > > >
> > > > Task | Runtime ms | Switches | Average delay ms |
> > > > Maximum delay ms | Maximum delay at |
> > > > netserver:3205 | 9999.490 ms | 10 | avg: 0.003 ms |
> > > > max: 0.004 ms | max at: 69087.753356 s
> > > >
> > > > another run
> > > > netserver:3483 | 9999.412 ms | 15 | avg: 0.003 ms |
> > > > max: 0.004 ms | max at: 69194.749814 s
> > > >
> > > > With the patch below, same test:
> > > > netserver:2643 | 10000.110 ms | 14 | avg: 0.003 ms |
> > > > max: 0.004 ms | max at: 172.006315 s
> > > >
> > > > another run:
> > > > netserver:2698 | 10000.049 ms | 15 | avg: 0.003 ms |
> > > > max: 0.004 ms | max at: 368.061672 s
> > > >
> > > > I'll be happy to do more tests if you have any suggestions on how/what
> > > > to test.
> > > >
> > > > ---8<---
> > > >
> > > I think this looks reasonable, but can you post it properly please, as a patch
> > > against the head of teh net-next tree, rather than a diff from your previous
> > > work (which wasn't comitted)
> >
> > The idea was to not officially post it yet, more just as a reference,
> > because I can't see any gains from it. I'm reluctant just due to that,
> > no strong opinion here on one way or another.
> >
> > If you think it's better anyway to signal it early, I'll properly repost
> > it.
> >
> Yeah, your results seem to me to indicate that for your test at least, signaling
> early vs. late doesn't make alot of difference, but Dave I think made a point in
> principle in that allowing processes to wake up when we start enqueuing can be
> better in some situations. So all other things being equal, I'd say go with the
> method that you have here.
Okay, I'll rebase the patch and post it properly. Thanks Neil!
Marcelo
^ permalink raw reply
* [PATCH 0/3] Use timespec64 for select like timeouts
From: Deepa Dinamani @ 2016-04-29 16:39 UTC (permalink / raw)
To: linux-kernel
Cc: arnd, y2038, linux-fsdevel, John Stultz, Thomas Gleixner,
Alexander Viro, David S. Miller, netdev
The series is part of y2038 changes.
This changes a few syscalls that have common functions to use
struct timespec64 instead of struct timespec.
This does not include changes to system call uapi interfaces.
Those will be in a different series.
Thanks to Arnd Bergmann for comments on the patches.
Deepa Dinamani (3):
time: Add missing implementation for timespec64_add_safe()
fs: poll/select/recvmmsg: use timespec64 for timeout events
time: Remove timespec_add_safe()
fs/eventpoll.c | 12 ++++-----
fs/select.c | 67 ++++++++++++++++++++++++++++----------------------
include/linux/poll.h | 11 +++++----
include/linux/time64.h | 17 ++++++-------
kernel/time/time.c | 21 ++++++++++++++++
net/socket.c | 8 +++---
6 files changed, 82 insertions(+), 54 deletions(-)
--
1.9.1
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
^ permalink raw reply
* [PATCH 2/3] fs: poll/select/recvmmsg: use timespec64 for timeout events
From: Deepa Dinamani @ 2016-04-29 16:39 UTC (permalink / raw)
To: linux-kernel
Cc: arnd, y2038, linux-fsdevel, Alexander Viro, David S. Miller,
netdev
In-Reply-To: <1461947989-21926-1-git-send-email-deepa.kernel@gmail.com>
struct timespec is not y2038 safe.
Even though timespec might be sufficient to represent
timeouts, use struct timespec64 here as the plan is to
get rid of all timespec reference in the kernel.
The patch transitions the common functions:
poll_select_set_timeout() and select_estimate_accuracy()
to use timespec64. And, all the syscalls that use these
functions are transitioned in the same patch.
The restart block parameters for poll uses monotonic time.
Use timespec64 here as well to assign timeout value. This
parameter in the restart block need not change because
this only holds the monotonic timestamp at which timeout
should occur. And, unsigned long data type should be big
enough for this timestamp.
The system call interfaces will be handled in a separate
series.
Compat interfaces need not change as timespec64 is an
alias to struct timespec on a 64 bit system.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
fs/eventpoll.c | 12 +++++-----
fs/select.c | 67 +++++++++++++++++++++++++++++-----------------------
include/linux/poll.h | 11 +++++----
net/socket.c | 8 ++++---
4 files changed, 54 insertions(+), 44 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 8a74a2a..10db912 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1583,15 +1583,15 @@ static int ep_send_events(struct eventpoll *ep,
return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
}
-static inline struct timespec ep_set_mstimeout(long ms)
+static inline struct timespec64 ep_set_mstimeout(long ms)
{
- struct timespec now, ts = {
+ struct timespec64 now, ts = {
.tv_sec = ms / MSEC_PER_SEC,
.tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
};
- ktime_get_ts(&now);
- return timespec_add_safe(now, ts);
+ ktime_get_ts64(&now);
+ return timespec64_add_safe(now, ts);
}
/**
@@ -1621,11 +1621,11 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
ktime_t expires, *to = NULL;
if (timeout > 0) {
- struct timespec end_time = ep_set_mstimeout(timeout);
+ struct timespec64 end_time = ep_set_mstimeout(timeout);
slack = select_estimate_accuracy(&end_time);
to = &expires;
- *to = timespec_to_ktime(end_time);
+ *to = timespec64_to_ktime(end_time);
} else if (timeout == 0) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
diff --git a/fs/select.c b/fs/select.c
index 8692939..8ed9da5 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -47,7 +47,7 @@
#define MAX_SLACK (100 * NSEC_PER_MSEC)
-static long __estimate_accuracy(struct timespec *tv)
+static long __estimate_accuracy(struct timespec64 *tv)
{
long slack;
int divfactor = 1000;
@@ -70,10 +70,10 @@ static long __estimate_accuracy(struct timespec *tv)
return slack;
}
-u64 select_estimate_accuracy(struct timespec *tv)
+u64 select_estimate_accuracy(struct timespec64 *tv)
{
u64 ret;
- struct timespec now;
+ struct timespec64 now;
/*
* Realtime tasks get a slack of 0 for obvious reasons.
@@ -82,8 +82,8 @@ u64 select_estimate_accuracy(struct timespec *tv)
if (rt_task(current))
return 0;
- ktime_get_ts(&now);
- now = timespec_sub(*tv, now);
+ ktime_get_ts64(&now);
+ now = timespec64_sub(*tv, now);
ret = __estimate_accuracy(&now);
if (ret < current->timer_slack_ns)
return current->timer_slack_ns;
@@ -260,7 +260,7 @@ EXPORT_SYMBOL(poll_schedule_timeout);
/**
* poll_select_set_timeout - helper function to setup the timeout value
- * @to: pointer to timespec variable for the final timeout
+ * @to: pointer to timespec64 variable for the final timeout
* @sec: seconds (from user space)
* @nsec: nanoseconds (from user space)
*
@@ -269,26 +269,28 @@ EXPORT_SYMBOL(poll_schedule_timeout);
*
* Returns -EINVAL if sec/nsec are not normalized. Otherwise 0.
*/
-int poll_select_set_timeout(struct timespec *to, long sec, long nsec)
+int poll_select_set_timeout(struct timespec64 *to, time64_t sec, long nsec)
{
- struct timespec ts = {.tv_sec = sec, .tv_nsec = nsec};
+ struct timespec64 ts = {.tv_sec = sec, .tv_nsec = nsec};
- if (!timespec_valid(&ts))
+ if (!timespec64_valid(&ts))
return -EINVAL;
/* Optimize for the zero timeout value here */
if (!sec && !nsec) {
to->tv_sec = to->tv_nsec = 0;
} else {
- ktime_get_ts(to);
- *to = timespec_add_safe(*to, ts);
+ ktime_get_ts64(to);
+ *to = timespec64_add_safe(*to, ts);
}
return 0;
}
-static int poll_select_copy_remaining(struct timespec *end_time, void __user *p,
+static int poll_select_copy_remaining(struct timespec64 *end_time,
+ void __user *p,
int timeval, int ret)
{
+ struct timespec64 rts64;
struct timespec rts;
struct timeval rtv;
@@ -302,16 +304,18 @@ static int poll_select_copy_remaining(struct timespec *end_time, void __user *p,
if (!end_time->tv_sec && !end_time->tv_nsec)
return ret;
- ktime_get_ts(&rts);
- rts = timespec_sub(*end_time, rts);
- if (rts.tv_sec < 0)
- rts.tv_sec = rts.tv_nsec = 0;
+ ktime_get_ts64(&rts64);
+ rts64 = timespec64_sub(*end_time, rts64);
+ if (rts64.tv_sec < 0)
+ rts64.tv_sec = rts64.tv_nsec = 0;
+
+ rts = timespec64_to_timespec(rts64);
if (timeval) {
if (sizeof(rtv) > sizeof(rtv.tv_sec) + sizeof(rtv.tv_usec))
memset(&rtv, 0, sizeof(rtv));
- rtv.tv_sec = rts.tv_sec;
- rtv.tv_usec = rts.tv_nsec / NSEC_PER_USEC;
+ rtv.tv_sec = rts64.tv_sec;
+ rtv.tv_usec = rts64.tv_nsec / NSEC_PER_USEC;
if (!copy_to_user(p, &rtv, sizeof(rtv)))
return ret;
@@ -396,7 +400,7 @@ static inline void wait_key_set(poll_table *wait, unsigned long in,
wait->_key |= POLLOUT_SET;
}
-int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
+int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
{
ktime_t expire, *to = NULL;
struct poll_wqueues table;
@@ -522,7 +526,7 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
* pointer to the expiry value.
*/
if (end_time && !to) {
- expire = timespec_to_ktime(*end_time);
+ expire = timespec64_to_ktime(*end_time);
to = &expire;
}
@@ -545,7 +549,7 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
* I'm trying ERESTARTNOHAND which restart only when you want to.
*/
int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
- fd_set __user *exp, struct timespec *end_time)
+ fd_set __user *exp, struct timespec64 *end_time)
{
fd_set_bits fds;
void *bits;
@@ -622,7 +626,7 @@ out_nofds:
SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp,
fd_set __user *, exp, struct timeval __user *, tvp)
{
- struct timespec end_time, *to = NULL;
+ struct timespec64 end_time, *to = NULL;
struct timeval tv;
int ret;
@@ -648,15 +652,17 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp,
const sigset_t __user *sigmask, size_t sigsetsize)
{
sigset_t ksigmask, sigsaved;
- struct timespec ts, end_time, *to = NULL;
+ struct timespec ts;
+ struct timespec64 ts64, end_time, *to = NULL;
int ret;
if (tsp) {
if (copy_from_user(&ts, tsp, sizeof(ts)))
return -EFAULT;
+ ts64 = timespec_to_timespec64(ts);
to = &end_time;
- if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec))
+ if (poll_select_set_timeout(to, ts64.tv_sec, ts64.tv_nsec))
return -EINVAL;
}
@@ -779,7 +785,7 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait,
}
static int do_poll(struct poll_list *list, struct poll_wqueues *wait,
- struct timespec *end_time)
+ struct timespec64 *end_time)
{
poll_table* pt = &wait->pt;
ktime_t expire, *to = NULL;
@@ -854,7 +860,7 @@ static int do_poll(struct poll_list *list, struct poll_wqueues *wait,
* pointer to the expiry value.
*/
if (end_time && !to) {
- expire = timespec_to_ktime(*end_time);
+ expire = timespec64_to_ktime(*end_time);
to = &expire;
}
@@ -868,7 +874,7 @@ static int do_poll(struct poll_list *list, struct poll_wqueues *wait,
sizeof(struct pollfd))
int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
- struct timespec *end_time)
+ struct timespec64 *end_time)
{
struct poll_wqueues table;
int err = -EFAULT, fdcount, len, size;
@@ -936,7 +942,7 @@ static long do_restart_poll(struct restart_block *restart_block)
{
struct pollfd __user *ufds = restart_block->poll.ufds;
int nfds = restart_block->poll.nfds;
- struct timespec *to = NULL, end_time;
+ struct timespec64 *to = NULL, end_time;
int ret;
if (restart_block->poll.has_timeout) {
@@ -957,7 +963,7 @@ static long do_restart_poll(struct restart_block *restart_block)
SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
int, timeout_msecs)
{
- struct timespec end_time, *to = NULL;
+ struct timespec64 end_time, *to = NULL;
int ret;
if (timeout_msecs >= 0) {
@@ -993,7 +999,8 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
size_t, sigsetsize)
{
sigset_t ksigmask, sigsaved;
- struct timespec ts, end_time, *to = NULL;
+ struct timespec ts;
+ struct timespec64 end_time, *to = NULL;
int ret;
if (tsp) {
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 9fb4f40..37b057b 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -96,7 +96,7 @@ extern void poll_initwait(struct poll_wqueues *pwq);
extern void poll_freewait(struct poll_wqueues *pwq);
extern int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
ktime_t *expires, unsigned long slack);
-extern u64 select_estimate_accuracy(struct timespec *tv);
+extern u64 select_estimate_accuracy(struct timespec64 *tv);
static inline int poll_schedule(struct poll_wqueues *pwq, int state)
@@ -153,12 +153,13 @@ void zero_fd_set(unsigned long nr, unsigned long *fdset)
#define MAX_INT64_SECONDS (((s64)(~((u64)0)>>1)/HZ)-1)
-extern int do_select(int n, fd_set_bits *fds, struct timespec *end_time);
+extern int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time);
extern int do_sys_poll(struct pollfd __user * ufds, unsigned int nfds,
- struct timespec *end_time);
+ struct timespec64 *end_time);
extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
- fd_set __user *exp, struct timespec *end_time);
+ fd_set __user *exp, struct timespec64 *end_time);
-extern int poll_select_set_timeout(struct timespec *to, long sec, long nsec);
+extern int poll_select_set_timeout(struct timespec64 *to, time64_t sec,
+ long nsec);
#endif /* _LINUX_POLL_H */
diff --git a/net/socket.c b/net/socket.c
index 5dbb0bb..bdfe115 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2171,7 +2171,8 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
struct mmsghdr __user *entry;
struct compat_mmsghdr __user *compat_entry;
struct msghdr msg_sys;
- struct timespec end_time;
+ struct timespec64 end_time;
+ struct timespec64 timeout64;
if (timeout &&
poll_select_set_timeout(&end_time, timeout->tv_sec,
@@ -2223,8 +2224,9 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
flags |= MSG_DONTWAIT;
if (timeout) {
- ktime_get_ts(timeout);
- *timeout = timespec_sub(end_time, *timeout);
+ ktime_get_ts64(&timeout64);
+ *timeout = timespec64_to_timespec(
+ timespec64_sub(end_time, timeout64));
if (timeout->tv_sec < 0) {
timeout->tv_sec = timeout->tv_nsec = 0;
break;
--
1.9.1
^ permalink raw reply related
* [PATCH] sctp: signal sk_data_ready earlier on data chunks reception
From: Marcelo Ricardo Leitner @ 2016-04-29 17:17 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Neil Horman, Vlad Yasevich, linux-sctp
Dave Miller pointed out that fb586f25300f ("sctp: delay calls to
sk_data_ready() as much as possible") may insert latency specially if
the receiving application is running on another CPU and that it would be
better if we signalled as early as possible.
This patch thus basically inverts the logic on fb586f25300f and signals
it as early as possible, similar to what we had before.
Fixes: fb586f25300f ("sctp: delay calls to sk_data_ready() as much as possible")
Reported-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
include/net/sctp/structs.h | 2 +-
net/sctp/sm_sideeffect.c | 7 +++----
net/sctp/ulpqueue.c | 25 ++++++++++++++++---------
3 files changed, 20 insertions(+), 14 deletions(-)
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 558bae3cbe0d5107d52c8cb31b324cfd5479def0..16b013a6191cf1c416e4dd1aeb1707a8569ea49b 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -218,7 +218,7 @@ struct sctp_sock {
frag_interleave:1,
recvrcvinfo:1,
recvnxtinfo:1,
- pending_data_ready:1;
+ data_ready_signalled:1;
atomic_t pd_mode;
/* Receive to here while partial delivery is in effect. */
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index e8f0112f9b28472c39c4c91dcb28576373c858e7..aa37122593684d8501fdca15983fbd8620fabe07 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -1741,10 +1741,9 @@ out:
} else if (local_cork)
error = sctp_outq_uncork(&asoc->outqueue, gfp);
- if (sp->pending_data_ready) {
- sk->sk_data_ready(sk);
- sp->pending_data_ready = 0;
- }
+ if (sp->data_ready_signalled)
+ sp->data_ready_signalled = 0;
+
return error;
nomem:
error = -ENOMEM;
diff --git a/net/sctp/ulpqueue.c b/net/sctp/ulpqueue.c
index ec12a8920e5fd7a0f26d19f1695bc2feeae41518..ec166d2bd2d95d9aa69369da2ead9437da4ce8ed 100644
--- a/net/sctp/ulpqueue.c
+++ b/net/sctp/ulpqueue.c
@@ -194,6 +194,7 @@ static int sctp_ulpq_clear_pd(struct sctp_ulpq *ulpq)
int sctp_ulpq_tail_event(struct sctp_ulpq *ulpq, struct sctp_ulpevent *event)
{
struct sock *sk = ulpq->asoc->base.sk;
+ struct sctp_sock *sp = sctp_sk(sk);
struct sk_buff_head *queue, *skb_list;
struct sk_buff *skb = sctp_event2skb(event);
int clear_pd = 0;
@@ -211,7 +212,7 @@ int sctp_ulpq_tail_event(struct sctp_ulpq *ulpq, struct sctp_ulpevent *event)
sk_incoming_cpu_update(sk);
}
/* Check if the user wishes to receive this event. */
- if (!sctp_ulpevent_is_enabled(event, &sctp_sk(sk)->subscribe))
+ if (!sctp_ulpevent_is_enabled(event, &sp->subscribe))
goto out_free;
/* If we are in partial delivery mode, post to the lobby until
@@ -219,7 +220,7 @@ int sctp_ulpq_tail_event(struct sctp_ulpq *ulpq, struct sctp_ulpevent *event)
* the association the cause of the partial delivery.
*/
- if (atomic_read(&sctp_sk(sk)->pd_mode) == 0) {
+ if (atomic_read(&sp->pd_mode) == 0) {
queue = &sk->sk_receive_queue;
} else {
if (ulpq->pd_mode) {
@@ -231,7 +232,7 @@ int sctp_ulpq_tail_event(struct sctp_ulpq *ulpq, struct sctp_ulpevent *event)
if ((event->msg_flags & MSG_NOTIFICATION) ||
(SCTP_DATA_NOT_FRAG ==
(event->msg_flags & SCTP_DATA_FRAG_MASK)))
- queue = &sctp_sk(sk)->pd_lobby;
+ queue = &sp->pd_lobby;
else {
clear_pd = event->msg_flags & MSG_EOR;
queue = &sk->sk_receive_queue;
@@ -242,10 +243,10 @@ int sctp_ulpq_tail_event(struct sctp_ulpq *ulpq, struct sctp_ulpevent *event)
* can queue this to the receive queue instead
* of the lobby.
*/
- if (sctp_sk(sk)->frag_interleave)
+ if (sp->frag_interleave)
queue = &sk->sk_receive_queue;
else
- queue = &sctp_sk(sk)->pd_lobby;
+ queue = &sp->pd_lobby;
}
}
@@ -264,8 +265,10 @@ int sctp_ulpq_tail_event(struct sctp_ulpq *ulpq, struct sctp_ulpevent *event)
if (clear_pd)
sctp_ulpq_clear_pd(ulpq);
- if (queue == &sk->sk_receive_queue)
- sctp_sk(sk)->pending_data_ready = 1;
+ if (queue == &sk->sk_receive_queue && !sp->data_ready_signalled) {
+ sp->data_ready_signalled = 1;
+ sk->sk_data_ready(sk);
+ }
return 1;
out_free:
@@ -1126,11 +1129,13 @@ void sctp_ulpq_abort_pd(struct sctp_ulpq *ulpq, gfp_t gfp)
{
struct sctp_ulpevent *ev = NULL;
struct sock *sk;
+ struct sctp_sock *sp;
if (!ulpq->pd_mode)
return;
sk = ulpq->asoc->base.sk;
+ sp = sctp_sk(sk);
if (sctp_ulpevent_type_enabled(SCTP_PARTIAL_DELIVERY_EVENT,
&sctp_sk(sk)->subscribe))
ev = sctp_ulpevent_make_pdapi(ulpq->asoc,
@@ -1140,6 +1145,8 @@ void sctp_ulpq_abort_pd(struct sctp_ulpq *ulpq, gfp_t gfp)
__skb_queue_tail(&sk->sk_receive_queue, sctp_event2skb(ev));
/* If there is data waiting, send it up the socket now. */
- if (sctp_ulpq_clear_pd(ulpq) || ev)
- sctp_sk(sk)->pending_data_ready = 1;
+ if ((sctp_ulpq_clear_pd(ulpq) || ev) && !sp->data_ready_signalled) {
+ sp->data_ready_signalled = 1;
+ sk->sk_data_ready(sk);
+ }
}
--
2.5.0
^ permalink raw reply related
* Re: [net-next PATCH 0/8] Fix Tunnel features and enable GSO partial for Mellanox adapters
From: David Miller @ 2016-04-29 17:31 UTC (permalink / raw)
To: aduyck; +Cc: talal, netdev, galp, ogerlitz, eranbe
In-Reply-To: <20160425182442.11331.88349.stgit@ahduyck-xeon-server>
From: Alexander Duyck <aduyck@mirantis.com>
Date: Mon, 25 Apr 2016 11:30:54 -0700
> This patch series is meant to allow us to get the best performance possible
> for Mellanox ConnectX-3/4 adapters in terms of VXLAN tunnels.
I'm going to mark this as "deferred" in patchwork, so Alex why don't you just
respin these and repost next week when you get final feedback from the Mellanox
folks?
THanks.
^ permalink raw reply
* Re: [net-next PATCH 0/8] Fix Tunnel features and enable GSO partial for Mellanox adapters
From: Alex Duyck @ 2016-04-29 17:32 UTC (permalink / raw)
To: David Miller
Cc: talal, Linux Kernel Network Developers, galp, ogerlitz,
Eran Ben Elisha
In-Reply-To: <20160429.133107.1317675802100306452.davem@davemloft.net>
On Fri, Apr 29, 2016 at 10:31 AM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <aduyck@mirantis.com>
> Date: Mon, 25 Apr 2016 11:30:54 -0700
>
>> This patch series is meant to allow us to get the best performance possible
>> for Mellanox ConnectX-3/4 adapters in terms of VXLAN tunnels.
>
> I'm going to mark this as "deferred" in patchwork, so Alex why don't you just
> respin these and repost next week when you get final feedback from the Mellanox
> folks?
>
> THanks.
Okay. Will do.
Thanks.
- Alex
^ permalink raw reply
* [PATCHv3] netem: Segment GSO packets on enqueue
From: Neil Horman @ 2016-04-29 17:35 UTC (permalink / raw)
To: netdev; +Cc: Neil Horman, Jamal Hadi Salim, David S. Miller, netem,
eric.dumazet
In-Reply-To: <1461692618-21333-1-git-send-email-nhorman@tuxdriver.com>
This was recently reported to me, and reproduced on the latest net kernel, when
attempting to run netperf from a host that had a netem qdisc attached to the
egress interface:
[ 788.073771] ------------[ cut here ]------------
[ 788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
[ 788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
data_len=0 gso_size=1448 gso_type=1 ip_summed=3
[ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
dm_mod
[ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W
------------ 3.10.0-327.el7.x86_64 #1
[ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
[ 788.542260] ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
ffffffff816351f1
[ 788.576332] ffff880437c036a8 ffffffff8107b200 ffff880633e74200
ffff880231674000
[ 788.611943] 0000000000000001 0000000000000003 0000000000000000
ffff880437c03710
[ 788.647241] Call Trace:
[ 788.658817] <IRQ> [<ffffffff816351f1>] dump_stack+0x19/0x1b
[ 788.686193] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[ 788.713803] [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[ 788.741314] [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
[ 788.767018] [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
[ 788.796117] [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
[ 788.823392] [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
[ 788.854487] [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
[ 788.880870] [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
...
The problem occurs because netem is not prepared to handle GSO packets (as it
uses skb_checksum_help in its enqueue path, which cannot manipulate these
frames).
The solution I think is to simply segment the skb in a simmilar fashion to the
way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
the first segment, and enqueue the remaining ones.
tested successfully by myself on the latest net kernel, to whcih this applies
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netem@lists.linux-foundation.org
CC: eric.dumazet@gmail.com
----
Change Notes:
V2) As per request from Eric Dumazet, I rewrote this to limit the need to
segment the skb. Instead of doing so unilaterally, we only do so now when
the netem qdisc requires determines that a packet must be corrupted, thus
avoidign the failure in skb_checksum_help. This still leaves open concerns
wth statistical measurements made on GSO packets being dropped or reordered
(i.e. they are counted as a single packet rather than multiple packets), but
i'd rather fix the immediate problem before we go rewriting everything to fix
that larger issue
V3) Added back missing call to qdisc_tree_reduce_backlog that I misplaced in
the V2 change.
---
net/sched/sch_netem.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 57 insertions(+), 3 deletions(-)
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 9640bb3..c4d8133 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -395,6 +395,25 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
sch->q.qlen++;
}
+/* netem can't properly corrupt a megapacket (like we get from GSO), so instead
+ * when we statistically choose to corrupt one, we instead segment it, returning
+ * the first packet to be corrupted, and re-enqueue the remaining frames
+ */
+static struct sk_buff* netem_segment(struct sk_buff *skb, struct Qdisc *sch)
+{
+ struct sk_buff *segs;
+ netdev_features_t features = netif_skb_features(skb);
+
+ segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+
+ if (IS_ERR_OR_NULL(segs)) {
+ qdisc_reshape_fail(skb, sch);
+ return NULL;
+ }
+ consume_skb(skb);
+ return segs;
+}
+
/*
* Insert one skb into qdisc.
* Note: parent depends on return value to account for queue length.
@@ -407,7 +426,11 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
/* We don't fill cb now as skb_unshare() may invalidate it */
struct netem_skb_cb *cb;
struct sk_buff *skb2;
+ struct sk_buff *segs = NULL;
+ unsigned int len = 0, prev_len = qdisc_pkt_len(skb);
+ int nb = 0;
int count = 1;
+ int rc = NET_XMIT_SUCCESS;
/* Random duplication */
if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor))
@@ -453,10 +476,22 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
* do it now in software before we mangle it.
*/
if (q->corrupt && q->corrupt >= get_crandom(&q->corrupt_cor)) {
+ if (skb_is_gso(skb)) {
+ segs = netem_segment(skb, sch);
+ if (!segs)
+ return NET_XMIT_DROP;
+ } else
+ segs = skb;
+
+ skb = segs;
+ segs = segs->next;
+
if (!(skb = skb_unshare(skb, GFP_ATOMIC)) ||
(skb->ip_summed == CHECKSUM_PARTIAL &&
- skb_checksum_help(skb)))
- return qdisc_drop(skb, sch);
+ skb_checksum_help(skb))) {
+ rc = qdisc_drop(skb, sch);
+ goto finish_segs;
+ }
skb->data[prandom_u32() % skb_headlen(skb)] ^=
1<<(prandom_u32() % 8);
@@ -516,7 +551,26 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
sch->qstats.requeues++;
}
- return NET_XMIT_SUCCESS;
+finish_segs:
+ if (segs) {
+ while (segs) {
+ skb2 = segs->next;
+ segs->next = NULL;
+ qdisc_skb_cb(segs)->pkt_len = segs->len;
+ len += segs->len;
+ rc = qdisc_enqueue(segs, sch);
+ if (rc != NET_XMIT_SUCCESS) {
+ if (net_xmit_drop_count(rc))
+ qdisc_qstats_drop(sch);
+ } else
+ nb++;
+ segs = skb2;
+ }
+ sch->q.qlen += nb;
+ if (nb > 1)
+ qdisc_tree_reduce_backlog(sch, 1 - nb, prev_len - len);
+ }
+ return rc;
}
static unsigned int netem_drop(struct Qdisc *sch)
--
2.5.5
^ permalink raw reply related
* Re: [PATCH net-next 0/4] net: hns: update DT properties according to Rob's comments
From: David Miller @ 2016-04-29 17:39 UTC (permalink / raw)
To: Yisen.Zhuang
Cc: devicetree, netdev, linux-arm-kernel, robh+dt, pawel.moll,
mark.rutland, ijc+devicetree, galak, will.deacon, catalin.marinas,
yankejian, huangdaode, salil.mehta, lipeng321, liguozhu,
xieqianqian, xuwei5, linuxarm
In-Reply-To: <1461827344-52230-1-git-send-email-Yisen.Zhuang@huawei.com>
From: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Date: Thu, 28 Apr 2016 15:09:00 +0800
> There are some inappropriate properties definition in hns DT. We update the definition
> according to Rob's review comments and fix some typos in binding.
>
> For more details, please see individual patches.
Series applied, thanks.
BTW, you reposted your 10 part series for adding the dsaf-debug device, but I
already applied the series you posted for that nearly a week ago.
^ permalink raw reply
* [PATCH] net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case
From: Tim Bingham @ 2016-04-29 17:30 UTC (permalink / raw)
To: David S. Miller, netdev; +Cc: Jason A. Donenfeld, Tim Bingham
Prior to commit d92cff89a0c8 ("net_dbg_ratelimited: turn into no-op
when !DEBUG") the implementation of net_dbg_ratelimited() was buggy
for both the DEBUG and CONFIG_DYNAMIC_DEBUG cases.
The bug was that net_ratelimit() was being called and, despite
returning true, nothing was being printed to the console. This
resulted in messages like the following -
"net_ratelimit: %d callbacks suppressed"
with no other output nearby.
After commit d92cff89a0c8 ("net_dbg_ratelimited: turn into no-op when
!DEBUG") the bug is fixed for the DEBUG case. However, there's no
output at all for CONFIG_DYNAMIC_DEBUG case.
This patch restores debug output (if enabled) for the
CONFIG_DYNAMIC_DEBUG case.
Add a definition of net_dbg_ratelimited() for the CONFIG_DYNAMIC_DEBUG
case. The implementation takes care to check that dynamic debugging is
enabled before calling net_ratelimit().
Fixes: d92cff89a0c8 ("net_dbg_ratelimited: turn into no-op when !DEBUG")
Signed-off-by: Tim Bingham <tbingham@akamai.com>
---
include/linux/net.h | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/linux/net.h b/include/linux/net.h
index 49175e4..f840d77 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -246,7 +246,15 @@ do { \
net_ratelimited_function(pr_warn, fmt, ##__VA_ARGS__)
#define net_info_ratelimited(fmt, ...) \
net_ratelimited_function(pr_info, fmt, ##__VA_ARGS__)
-#if defined(DEBUG)
+#if defined(CONFIG_DYNAMIC_DEBUG)
+#define net_dbg_ratelimited(fmt, ...) \
+do { \
+ DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+ if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT) && \
+ net_ratelimit()) \
+ __dynamic_pr_debug(&descriptor, fmt, ##__VA_ARGS__); \
+} while (0)
+#elif defined(DEBUG)
#define net_dbg_ratelimited(fmt, ...) \
net_ratelimited_function(pr_debug, fmt, ##__VA_ARGS__)
#else
--
1.9.1
^ permalink raw reply related
* Re: [PATCH net-next 0/2] cxgb4/cxgb4vf: add support for mbox cmd logging
From: David Miller @ 2016-04-29 17:42 UTC (permalink / raw)
To: hariprasad; +Cc: netdev, leedom, swise, nirranjan, santosh
In-Reply-To: <1461829999-20270-1-git-send-email-hariprasad@chelsio.com>
From: Hariprasad Shenai <hariprasad@chelsio.com>
Date: Thu, 28 Apr 2016 13:23:17 +0530
> This patch series adds support for logging mailbox commands and replies
> for debugging purpose for both PF and VF driver.
>
> This patch series has been created against net-next tree and includes
> patches on cxgb4 and cxgb4vf driver.
>
> We have included all the maintainers of respective drivers. Kindly review
> the change and let us know in case of any review comments.
Series applied, thanks.
^ permalink raw reply
* [PATCH] i40e: fix misleading indentation
From: Arnd Bergmann @ 2016-04-29 17:44 UTC (permalink / raw)
To: Jeff Kirsher
Cc: Arnd Bergmann, Jesse Brandeburg, Shannon Nelson, Carolyn Wyborny,
Don Skidmore, Bruce Allan, John Ronciak, Mitch Williams,
Anjali Singhai Jain, Greg Rose, intel-wired-lan, netdev,
linux-kernel
Newly added code in i40e_vc_config_promiscuous_mode_msg() is indented
in a way that gcc rightly complains about:
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c: In function 'i40e_vc_config_promiscuous_mode_msg':
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:1543:4: error: this 'if' clause does not guard... [-Werror=misleading-indentation]
if (f->vlan >= 0 && f->vlan <= I40E_MAX_VLANID)
^~
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:1550:5: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'if'
aq_err = pf->hw.aq.asq_last_status;
>From the context, it looks like the aq_err assignment was meant to be
inside of the conditional expression, so I'm adding the appropriate
curly braces now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5676a8b9cd9a ("i40e: Add VF promiscuous mode driver support")
---
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index c226c2dad247..30f9cc404707 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -1540,7 +1540,7 @@ static int i40e_vc_config_promiscuous_mode_msg(struct i40e_vf *vf,
} else if (i40e_getnum_vf_vsi_vlan_filters(vsi)) {
list_for_each_entry(f, &vsi->mac_filter_list, list) {
aq_ret = 0;
- if (f->vlan >= 0 && f->vlan <= I40E_MAX_VLANID)
+ if (f->vlan >= 0 && f->vlan <= I40E_MAX_VLANID) {
aq_ret =
i40e_aq_set_vsi_uc_promisc_on_vlan(hw,
vsi->seid,
@@ -1548,6 +1548,7 @@ static int i40e_vc_config_promiscuous_mode_msg(struct i40e_vf *vf,
f->vlan,
NULL);
aq_err = pf->hw.aq.asq_last_status;
+ }
if (aq_ret)
dev_err(&pf->pdev->dev,
"Could not add VLAN %d to Unicast promiscuous domain err %s aq_err %s\n",
--
2.7.0
^ permalink raw reply related
* Re: [PATCHv3] netem: Segment GSO packets on enqueue
From: Eric Dumazet @ 2016-04-29 18:09 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, Jamal Hadi Salim, David S. Miller, netem
In-Reply-To: <1461951348-3920-1-git-send-email-nhorman@tuxdriver.com>
On Fri, 2016-04-29 at 13:35 -0400, Neil Horman wrote:
> This was recently reported to me, and reproduced on the latest net kernel, when
> attempting to run netperf from a host that had a netem qdisc attached to the
> egress interface:
>
> /*
> * Insert one skb into qdisc.
> * Note: parent depends on return value to account for queue length.
> @@ -407,7 +426,11 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
> /* We don't fill cb now as skb_unshare() may invalidate it */
> struct netem_skb_cb *cb;
> struct sk_buff *skb2;
> + struct sk_buff *segs = NULL;
> + unsigned int len = 0, prev_len = qdisc_pkt_len(skb);
> + int nb = 0;
> int count = 1;
> + int rc = NET_XMIT_SUCCESS;
>
> /* Random duplication */
> if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor))
> @@ -453,10 +476,22 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
> * do it now in software before we mangle it.
> */
> if (q->corrupt && q->corrupt >= get_crandom(&q->corrupt_cor)) {
> + if (skb_is_gso(skb)) {
> + segs = netem_segment(skb, sch);
> + if (!segs)
> + return NET_XMIT_DROP;
> + } else
> + segs = skb;
> +
> + skb = segs;
> + segs = segs->next;
> +
> if (!(skb = skb_unshare(skb, GFP_ATOMIC)) ||
> (skb->ip_summed == CHECKSUM_PARTIAL &&
> - skb_checksum_help(skb)))
> - return qdisc_drop(skb, sch);
> + skb_checksum_help(skb))) {
> + rc = qdisc_drop(skb, sch);
> + goto finish_segs;
> + }
>
> skb->data[prandom_u32() % skb_headlen(skb)] ^=
> 1<<(prandom_u32() % 8);
> @@ -516,7 +551,26 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
> sch->qstats.requeues++;
> }
>
> - return NET_XMIT_SUCCESS;
> +finish_segs:
> + if (segs) {
> + while (segs) {
> + skb2 = segs->next;
> + segs->next = NULL;
> + qdisc_skb_cb(segs)->pkt_len = segs->len;
> + len += segs->len;
Wrong operation if packet is dropped by qdisc_enqueue() ?
I would use
u32 last_len = segs->len;
> + rc = qdisc_enqueue(segs, sch);
> + if (rc != NET_XMIT_SUCCESS) {
> + if (net_xmit_drop_count(rc))
> + qdisc_qstats_drop(sch);
> + } else
} else {
nb++;
len += last_len;
}
> + nb++;
> + segs = skb2;
> + }
> + sch->q.qlen += nb;
> + if (nb > 1)
> + qdisc_tree_reduce_backlog(sch, 1 - nb, prev_len - len);
> + }
> + return rc;
Seems you should return NET_XMIT_SUCCESS instead of status of last
segment ?
> }
Thanks.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox