* [PATCH net-next 5/7] tcp: optimize tcp internal pacing
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
To: David S . Miller, Neal Cardwell, Yuchung Cheng,
Soheil Hassas Yeganeh, Gasper Zejn
Cc: netdev, Eric Dumazet, Eric Dumazet
In-Reply-To: <20181015163758.232436-1-edumazet@google.com>
When TCP implements its own pacing (when no fq packet scheduler is used),
it is arming high resolution timer after a packet is sent.
But in many cases (like TCP_RR kind of workloads), this high resolution
timer expires before the application attempts to write the following
packet. This overhead also happens when the flow is ACK clocked and
cwnd limited instead of being limited by the pacing rate.
This leads to extra overhead (high number of IRQ)
Now tcp_wstamp_ns is reserved for the pacing timer only
(after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"),
we can setup the timer only when a packet is about to be sent,
and if tcp_wstamp_ns is in the future.
This leads to a ~10% performance increase in TCP_RR workloads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_output.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5474c9854f252e50cdb1136435417873861d7618..d212e4cbc68902e873afb4a12b43b467ccd6069b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -975,16 +975,6 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
-static void tcp_internal_pacing(struct sock *sk)
-{
- if (!tcp_needs_internal_pacing(sk))
- return;
- hrtimer_start(&tcp_sk(sk)->pacing_timer,
- ns_to_ktime(tcp_sk(sk)->tcp_wstamp_ns),
- HRTIMER_MODE_ABS_PINNED_SOFT);
- sock_hold(sk);
-}
-
static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
u64 prior_wstamp)
{
@@ -1005,8 +995,6 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
/* take into account OS jitter */
len_ns -= min_t(u64, len_ns / 2, credit);
tp->tcp_wstamp_ns += len_ns;
-
- tcp_internal_pacing(sk);
}
}
list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
@@ -2186,10 +2174,23 @@ static int tcp_mtu_probe(struct sock *sk)
return -1;
}
-static bool tcp_pacing_check(const struct sock *sk)
+static bool tcp_pacing_check(struct sock *sk)
{
- return tcp_needs_internal_pacing(sk) &&
- hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (!tcp_needs_internal_pacing(sk))
+ return false;
+
+ if (tp->tcp_wstamp_ns <= tp->tcp_clock_cache)
+ return false;
+
+ if (!hrtimer_is_queued(&tp->pacing_timer)) {
+ hrtimer_start(&tp->pacing_timer,
+ ns_to_ktime(tp->tcp_wstamp_ns),
+ HRTIMER_MODE_ABS_PINNED_SOFT);
+ sock_hold(sk);
+ }
+ return true;
}
/* TCP Small Queues :
--
2.19.0.605.g01d371f741-goog
^ permalink raw reply related
* [PATCH net-next 4/7] net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
To: David S . Miller, Neal Cardwell, Yuchung Cheng,
Soheil Hassas Yeganeh, Gasper Zejn
Cc: netdev, Eric Dumazet, Eric Dumazet
In-Reply-To: <20181015163758.232436-1-edumazet@google.com>
With the new EDT model, sch_fq no longer has to special
case TCP pure acks, since their skb->tstamp will allow them
being sent without pacing delay.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/sched/sch_fq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 3923d14095335df61c270f69e50cb7cbfde4c796..4b1af706896c07e5a0fe6d542dfcd530acdcf8f5 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -444,7 +444,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
}
skb = f->head;
- if (skb && !skb_is_tcp_pure_ack(skb)) {
+ if (skb) {
u64 time_next_packet = max_t(u64, ktime_to_ns(skb->tstamp),
f->time_next_packet);
--
2.19.0.605.g01d371f741-goog
^ permalink raw reply related
* [PATCH net-next 3/7] tcp: mitigate scheduling jitter in EDT pacing model
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
To: David S . Miller, Neal Cardwell, Yuchung Cheng,
Soheil Hassas Yeganeh, Gasper Zejn
Cc: netdev, Eric Dumazet, Eric Dumazet
In-Reply-To: <20181015163758.232436-1-edumazet@google.com>
In commit fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers
drifts") we added a mitigation for scheduling jitter in fq packet scheduler.
This patch does the same in TCP stack, now it is using EDT model.
Note that this mitigation is valid for both external (fq packet scheduler)
or internal TCP pacing.
This uses the same strategy than the above commit, allowing
a time credit of half the packet currently sent.
Consider following case :
An skb is sent, after an idle period of 300 usec.
The air-time (skb->len/pacing_rate) is 500 usec
Instead of setting the pacing timer to now+500 usec,
it will use now+min(500/2, 300) -> now+250usec
This is like having a token bucket with a depth of half
an skb.
Tested:
tc qdisc replace dev eth0 root pfifo_fast
Before
netperf -P0 -H remote -- -q 1000000000 # 8000Mbit
540000 262144 262144 10.00 7710.43
After :
netperf -P0 -H remote -- -q 1000000000 # 8000 Mbit
540000 262144 262144 10.00 7999.75 # Much closer to 8000Mbit target
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_output.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f4aa4109334a043d02b17b18bef346d805dab501..5474c9854f252e50cdb1136435417873861d7618 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -985,7 +985,8 @@ static void tcp_internal_pacing(struct sock *sk)
sock_hold(sk);
}
-static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
+static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
+ u64 prior_wstamp)
{
struct tcp_sock *tp = tcp_sk(sk);
@@ -998,7 +999,12 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
* this is a minor annoyance.
*/
if (rate != ~0UL && rate && tp->data_segs_out >= 10) {
- tp->tcp_wstamp_ns += div64_ul((u64)skb->len * NSEC_PER_SEC, rate);
+ u64 len_ns = div64_ul((u64)skb->len * NSEC_PER_SEC, rate);
+ u64 credit = tp->tcp_wstamp_ns - prior_wstamp;
+
+ /* take into account OS jitter */
+ len_ns -= min_t(u64, len_ns / 2, credit);
+ tp->tcp_wstamp_ns += len_ns;
tcp_internal_pacing(sk);
}
@@ -1029,6 +1035,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
struct sk_buff *oskb = NULL;
struct tcp_md5sig_key *md5;
struct tcphdr *th;
+ u64 prior_wstamp;
int err;
BUG_ON(!skb || !tcp_skb_pcount(skb));
@@ -1050,7 +1057,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
return -ENOBUFS;
}
- /* TODO: might take care of jitter here */
+ prior_wstamp = tp->tcp_wstamp_ns;
tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
@@ -1169,7 +1176,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
err = net_xmit_eval(err);
}
if (!err && oskb) {
- tcp_update_skb_after_send(sk, oskb);
+ tcp_update_skb_after_send(sk, oskb, prior_wstamp);
tcp_rate_skb_sent(sk, oskb);
}
return err;
@@ -2321,7 +2328,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
/* "skb_mstamp" is used as a start point for the retransmit timer */
- tcp_update_skb_after_send(sk, skb);
+ tcp_update_skb_after_send(sk, skb, tp->tcp_wstamp_ns);
goto repair; /* Skip network transmission */
}
@@ -2896,7 +2903,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
} tcp_skb_tsorted_restore(skb);
if (!err) {
- tcp_update_skb_after_send(sk, skb);
+ tcp_update_skb_after_send(sk, skb, tp->tcp_wstamp_ns);
tcp_rate_skb_sent(sk, skb);
}
} else {
--
2.19.0.605.g01d371f741-goog
^ permalink raw reply related
* [PATCH net-next 2/7] net: extend sk_pacing_rate to unsigned long
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
To: David S . Miller, Neal Cardwell, Yuchung Cheng,
Soheil Hassas Yeganeh, Gasper Zejn
Cc: netdev, Eric Dumazet, Eric Dumazet
In-Reply-To: <20181015163758.232436-1-edumazet@google.com>
sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.
We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
This patch adds no cost for 32bit kernels.
The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.
Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741
timer:(on,003ms,0) ino:91863 sk:2 <->
skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
rcvmss:536 advmss:1448
cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
segs_in:3916318 data_segs_out:177279175
bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
send 28045.5Mbps lastrcv:73333
pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
notsent:2085120 minrtt:0.013
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sock.h | 4 ++--
net/core/filter.c | 4 ++--
net/core/sock.c | 9 +++++----
net/ipv4/tcp.c | 10 +++++-----
net/ipv4/tcp_bbr.c | 6 +++---
net/ipv4/tcp_output.c | 19 +++++++++++--------
net/sched/sch_fq.c | 20 ++++++++++++--------
7 files changed, 40 insertions(+), 32 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 751549ac0a849144ab0382203ee5c877374523e2..cfaf261936c8787b3a65ce832fd9c871697d00f4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -422,8 +422,8 @@ struct sock {
struct timer_list sk_timer;
__u32 sk_priority;
__u32 sk_mark;
- u32 sk_pacing_rate; /* bytes per second */
- u32 sk_max_pacing_rate;
+ unsigned long sk_pacing_rate; /* bytes per second */
+ unsigned long sk_max_pacing_rate;
struct page_frag sk_frag;
netdev_features_t sk_route_caps;
netdev_features_t sk_route_nocaps;
diff --git a/net/core/filter.c b/net/core/filter.c
index 4bbc6567fcb818e91617bfa9a2fd7fbebbd129f8..80da21b097b8d05eb7b9fa92afa86762334ac0ae 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3927,8 +3927,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
break;
- case SO_MAX_PACING_RATE:
- sk->sk_max_pacing_rate = val;
+ case SO_MAX_PACING_RATE: /* 32bit version */
+ sk->sk_max_pacing_rate = (val == ~0U) ? ~0UL : val;
sk->sk_pacing_rate = min(sk->sk_pacing_rate,
sk->sk_max_pacing_rate);
break;
diff --git a/net/core/sock.c b/net/core/sock.c
index 7e8796a6a0892efbb7dfce67d12b8062b2d5daa9..fdf9fc7d3f9875f2718575078a0f263674c80b4f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -998,7 +998,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
cmpxchg(&sk->sk_pacing_status,
SK_PACING_NONE,
SK_PACING_NEEDED);
- sk->sk_max_pacing_rate = val;
+ sk->sk_max_pacing_rate = (val == ~0U) ? ~0UL : val;
sk->sk_pacing_rate = min(sk->sk_pacing_rate,
sk->sk_max_pacing_rate);
break;
@@ -1336,7 +1336,8 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
#endif
case SO_MAX_PACING_RATE:
- v.val = sk->sk_max_pacing_rate;
+ /* 32bit version */
+ v.val = min_t(unsigned long, sk->sk_max_pacing_rate, ~0U);
break;
case SO_INCOMING_CPU:
@@ -2810,8 +2811,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_ll_usec = sysctl_net_busy_read;
#endif
- sk->sk_max_pacing_rate = ~0U;
- sk->sk_pacing_rate = ~0U;
+ sk->sk_max_pacing_rate = ~0UL;
+ sk->sk_pacing_rate = ~0UL;
sk->sk_pacing_shift = 10;
sk->sk_incoming_cpu = -1;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 43ef83b2330e6238a55c9843580a585d87708e0c..b8ba8fa34effac5138aea76b0d0fc2a9f1c05c4f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3111,10 +3111,10 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
{
const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */
const struct inet_connection_sock *icsk = inet_csk(sk);
+ unsigned long rate;
u32 now;
u64 rate64;
bool slow;
- u32 rate;
memset(info, 0, sizeof(*info));
if (sk->sk_type != SOCK_STREAM)
@@ -3124,11 +3124,11 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
/* Report meaningful fields for all TCP states, including listeners */
rate = READ_ONCE(sk->sk_pacing_rate);
- rate64 = rate != ~0U ? rate : ~0ULL;
+ rate64 = (rate != ~0UL) ? rate : ~0ULL;
info->tcpi_pacing_rate = rate64;
rate = READ_ONCE(sk->sk_max_pacing_rate);
- rate64 = rate != ~0U ? rate : ~0ULL;
+ rate64 = (rate != ~0UL) ? rate : ~0ULL;
info->tcpi_max_pacing_rate = rate64;
info->tcpi_reordering = tp->reordering;
@@ -3254,8 +3254,8 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
const struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *stats;
struct tcp_info info;
+ unsigned long rate;
u64 rate64;
- u32 rate;
stats = alloc_skb(tcp_opt_stats_get_size(), GFP_ATOMIC);
if (!stats)
@@ -3274,7 +3274,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
tp->total_retrans, TCP_NLA_PAD);
rate = READ_ONCE(sk->sk_pacing_rate);
- rate64 = rate != ~0U ? rate : ~0ULL;
+ rate64 = (rate != ~0UL) ? rate : ~0ULL;
nla_put_u64_64bit(stats, TCP_NLA_PACING_RATE, rate64, TCP_NLA_PAD);
rate64 = tcp_compute_delivery_rate(tp);
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index a5786e3e2c16ce53a332f29c9a55b9a641eec791..33f4358615e6d63b5c98a30484f12ffae66334a2 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -219,7 +219,7 @@ static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
}
/* Convert a BBR bw and gain factor to a pacing rate in bytes per second. */
-static u32 bbr_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain)
+static unsigned long bbr_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain)
{
u64 rate = bw;
@@ -258,7 +258,7 @@ static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
{
struct tcp_sock *tp = tcp_sk(sk);
struct bbr *bbr = inet_csk_ca(sk);
- u32 rate = bbr_bw_to_pacing_rate(sk, bw, gain);
+ unsigned long rate = bbr_bw_to_pacing_rate(sk, bw, gain);
if (unlikely(!bbr->has_seen_rtt && tp->srtt_us))
bbr_init_pacing_rate_from_rtt(sk);
@@ -280,7 +280,7 @@ static u32 bbr_tso_segs_goal(struct sock *sk)
/* Sort of tcp_tso_autosize() but ignoring
* driver provided sk_gso_max_size.
*/
- bytes = min_t(u32, sk->sk_pacing_rate >> sk->sk_pacing_shift,
+ bytes = min_t(unsigned long, sk->sk_pacing_rate >> sk->sk_pacing_shift,
GSO_MAX_SIZE - 1 - MAX_TCP_HEADER);
segs = max_t(u32, bytes / tp->mss_cache, bbr_min_tso_segs(sk));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f14df66a0c858dcb22b8924b9691c375eb5fcbc5..f4aa4109334a043d02b17b18bef346d805dab501 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -991,14 +991,14 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
if (sk->sk_pacing_status != SK_PACING_NONE) {
- u32 rate = sk->sk_pacing_rate;
+ unsigned long rate = sk->sk_pacing_rate;
/* Original sch_fq does not pace first 10 MSS
* Note that tp->data_segs_out overflows after 2^32 packets,
* this is a minor annoyance.
*/
- if (rate != ~0U && rate && tp->data_segs_out >= 10) {
- tp->tcp_wstamp_ns += div_u64((u64)skb->len * NSEC_PER_SEC, rate);
+ if (rate != ~0UL && rate && tp->data_segs_out >= 10) {
+ tp->tcp_wstamp_ns += div64_ul((u64)skb->len * NSEC_PER_SEC, rate);
tcp_internal_pacing(sk);
}
@@ -1704,8 +1704,9 @@ static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
{
u32 bytes, segs;
- bytes = min(sk->sk_pacing_rate >> sk->sk_pacing_shift,
- sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+ bytes = min_t(unsigned long,
+ sk->sk_pacing_rate >> sk->sk_pacing_shift,
+ sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
/* Goal is to send at least one packet per ms,
* not one big TSO packet every 100 ms.
@@ -2198,10 +2199,12 @@ static bool tcp_pacing_check(const struct sock *sk)
static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
unsigned int factor)
{
- unsigned int limit;
+ unsigned long limit;
- limit = max(2 * skb->truesize, sk->sk_pacing_rate >> sk->sk_pacing_shift);
- limit = min_t(u32, limit,
+ limit = max_t(unsigned long,
+ 2 * skb->truesize,
+ sk->sk_pacing_rate >> sk->sk_pacing_shift);
+ limit = min_t(unsigned long, limit,
sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes);
limit <<= factor;
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 338222a6c664b1825aaada4355e2fc0a01db9c73..3923d14095335df61c270f69e50cb7cbfde4c796 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -92,8 +92,8 @@ struct fq_sched_data {
u32 quantum;
u32 initial_quantum;
u32 flow_refill_delay;
- u32 flow_max_rate; /* optional max rate per flow */
u32 flow_plimit; /* max packets per flow */
+ unsigned long flow_max_rate; /* optional max rate per flow */
u32 orphan_mask; /* mask for orphaned skb */
u32 low_rate_threshold;
struct rb_root *fq_root;
@@ -416,7 +416,8 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
struct fq_flow_head *head;
struct sk_buff *skb;
struct fq_flow *f;
- u32 rate, plen;
+ unsigned long rate;
+ u32 plen;
skb = fq_dequeue_head(sch, &q->internal);
if (skb)
@@ -485,11 +486,11 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
if (f->credit > 0)
goto out;
}
- if (rate != ~0U) {
+ if (rate != ~0UL) {
u64 len = (u64)plen * NSEC_PER_SEC;
if (likely(rate))
- do_div(len, rate);
+ len = div64_ul(len, rate);
/* Since socket rate can change later,
* clamp the delay to 1 second.
* Really, providers of too big packets should be fixed !
@@ -701,9 +702,11 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
pr_warn_ratelimited("sch_fq: defrate %u ignored.\n",
nla_get_u32(tb[TCA_FQ_FLOW_DEFAULT_RATE]));
- if (tb[TCA_FQ_FLOW_MAX_RATE])
- q->flow_max_rate = nla_get_u32(tb[TCA_FQ_FLOW_MAX_RATE]);
+ if (tb[TCA_FQ_FLOW_MAX_RATE]) {
+ u32 rate = nla_get_u32(tb[TCA_FQ_FLOW_MAX_RATE]);
+ q->flow_max_rate = (rate == ~0U) ? ~0UL : rate;
+ }
if (tb[TCA_FQ_LOW_RATE_THRESHOLD])
q->low_rate_threshold =
nla_get_u32(tb[TCA_FQ_LOW_RATE_THRESHOLD]);
@@ -766,7 +769,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt,
q->quantum = 2 * psched_mtu(qdisc_dev(sch));
q->initial_quantum = 10 * psched_mtu(qdisc_dev(sch));
q->flow_refill_delay = msecs_to_jiffies(40);
- q->flow_max_rate = ~0U;
+ q->flow_max_rate = ~0UL;
q->time_next_delayed_flow = ~0ULL;
q->rate_enable = 1;
q->new_flows.first = NULL;
@@ -802,7 +805,8 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
nla_put_u32(skb, TCA_FQ_QUANTUM, q->quantum) ||
nla_put_u32(skb, TCA_FQ_INITIAL_QUANTUM, q->initial_quantum) ||
nla_put_u32(skb, TCA_FQ_RATE_ENABLE, q->rate_enable) ||
- nla_put_u32(skb, TCA_FQ_FLOW_MAX_RATE, q->flow_max_rate) ||
+ nla_put_u32(skb, TCA_FQ_FLOW_MAX_RATE,
+ min_t(unsigned long, q->flow_max_rate, ~0U)) ||
nla_put_u32(skb, TCA_FQ_FLOW_REFILL_DELAY,
jiffies_to_usecs(q->flow_refill_delay)) ||
nla_put_u32(skb, TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
--
2.19.0.605.g01d371f741-goog
^ permalink raw reply related
* [PATCH net-next 1/7] tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
To: David S . Miller, Neal Cardwell, Yuchung Cheng,
Soheil Hassas Yeganeh, Gasper Zejn
Cc: netdev, Eric Dumazet, Eric Dumazet
In-Reply-To: <20181015163758.232436-1-edumazet@google.com>
In EDT design, I made the mistake of using tcp_wstamp_ns
to store the last tcp_clock_ns() sample and to store the
pacing virtual timer.
This causes major regressions at high speed flows.
Introduce tcp_clock_cache to store last tcp_clock_ns().
This is needed because some arches have slow high-resolution
kernel time service.
tcp_wstamp_ns is only updated when a packet is sent.
Note that we can remove tcp_mstamp in the future since
tcp_mstamp is essentially tcp_clock_cache/1000, so the
apparent socket size increase is temporary.
Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
include/linux/tcp.h | 1 +
net/ipv4/tcp_output.c | 9 ++++++---
net/ipv4/tcp_timer.c | 2 +-
3 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 848f5b25e178288ce870637b68a692ab88dc7d4d..8ed77bb4ed8636e9294389a011529fd9a667dce4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -249,6 +249,7 @@ struct tcp_sock {
u32 tlp_high_seq; /* snd_nxt at the time of TLP retransmit. */
u64 tcp_wstamp_ns; /* departure time for next sent data packet */
+ u64 tcp_clock_cache; /* cache last tcp_clock_ns() (see tcp_mstamp_refresh()) */
/* RTT measurement */
u64 tcp_mstamp; /* most recent packet received/sent */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 059b67af28b137fb9566eaef370b270fc424bffb..f14df66a0c858dcb22b8924b9691c375eb5fcbc5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -52,9 +52,8 @@ void tcp_mstamp_refresh(struct tcp_sock *tp)
{
u64 val = tcp_clock_ns();
- /* departure time for next data packet */
- if (val > tp->tcp_wstamp_ns)
- tp->tcp_wstamp_ns = val;
+ if (val > tp->tcp_clock_cache)
+ tp->tcp_clock_cache = val;
val = div_u64(val, NSEC_PER_USEC);
if (val > tp->tcp_mstamp)
@@ -1050,6 +1049,10 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
if (unlikely(!skb))
return -ENOBUFS;
}
+
+ /* TODO: might take care of jitter here */
+ tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
+
skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
inet = inet_sk(sk);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 61023d50cd604d5e19464a32c33b65d29c75c81e..676020663ce80a79341ad1a05352742cc8dd5850 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -360,7 +360,7 @@ static void tcp_probe_timer(struct sock *sk)
*/
start_ts = tcp_skb_timestamp(skb);
if (!start_ts)
- skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+ skb->skb_mstamp_ns = tp->tcp_clock_cache;
else if (icsk->icsk_user_timeout &&
(s32)(tcp_time_stamp(tp) - start_ts) > icsk->icsk_user_timeout)
goto abort;
--
2.19.0.605.g01d371f741-goog
^ permalink raw reply related
* [PATCH net-next 0/7] tcp: second round for EDT conversion
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
To: David S . Miller, Neal Cardwell, Yuchung Cheng,
Soheil Hassas Yeganeh, Gasper Zejn
Cc: netdev, Eric Dumazet, Eric Dumazet
First round of EDT patches left TCP stack in a non optimal state.
- High speed flows suffered from loss of performance, addressed
by the first patch of this series.
- Second patch brings pacing to the current state of networking,
since we now reach ~100 Gbit on a single TCP flow.
- Third patch implements a mitigation for scheduling delays,
like the one we did in sch_fq in the past.
- Fourth patch removes one special case in sch_fq for ACK packets.
- Fifth patch removes a serious perfomance cost for TCP internal
pacing. We should setup the high resolution timer only if
really needed.
- Sixth patch fixes a typo in BBR.
- Last patch is one minor change in cdg congestion control.
Neal Cardwell also has a patch series fixing BBR after
EDT adoption.
Eric Dumazet (6):
tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
net: extend sk_pacing_rate to unsigned long
tcp: mitigate scheduling jitter in EDT pacing model
net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
tcp: optimize tcp internal pacing
tcp: cdg: use tcp high resolution clock cache
Neal Cardwell (1):
tcp_bbr: fix typo in bbr_pacing_margin_percent
include/linux/tcp.h | 1 +
include/net/sock.h | 4 +--
net/core/filter.c | 4 +--
net/core/sock.c | 9 +++---
net/ipv4/tcp.c | 10 +++---
net/ipv4/tcp_bbr.c | 10 +++---
net/ipv4/tcp_cdg.c | 2 +-
net/ipv4/tcp_output.c | 72 ++++++++++++++++++++++++++-----------------
net/ipv4/tcp_timer.c | 2 +-
net/sched/sch_fq.c | 22 +++++++------
10 files changed, 78 insertions(+), 58 deletions(-)
--
2.19.0.605.g01d371f741-goog
^ permalink raw reply
* Re: [PATCH iproute2] macsec: fix off-by-one when parsing attributes
From: Stephen Hemminger @ 2018-10-15 16:36 UTC (permalink / raw)
To: Sabrina Dubroca; +Cc: netdev, David Ahern
In-Reply-To: <9149b45616600eaa71f1dbbb101f7fe169c13efe.1539357364.git.sd@queasysnail.net>
On Fri, 12 Oct 2018 17:34:12 +0200
Sabrina Dubroca <sd@queasysnail.net> wrote:
> I seem to have had a massive brainfart with uses of
> parse_rtattr_nested(). The rtattr* array must have MAX+1 elements, and
> the call to parse_rtattr_nested must have MAX as its bound. Let's fix
> those.
>
> Fixes: b26fc590ce62 ("ip: add MACsec support")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Applied,
How did it ever work??
^ permalink raw reply
* Re: [PATCH iproute2] json: make 0xhex handle u64
From: Stephen Hemminger @ 2018-10-15 16:34 UTC (permalink / raw)
To: Sabrina Dubroca; +Cc: netdev, David Ahern
In-Reply-To: <4342ff99b4111ea2b2867e51e0eb15cbf5b02ad1.1539357365.git.sd@queasysnail.net>
On Fri, 12 Oct 2018 17:34:32 +0200
Sabrina Dubroca <sd@queasysnail.net> wrote:
> Stephen converted macsec's sci to use 0xhex, but 0xhex handles
> unsigned int's, not 64 bits ints. Thus, the output of the "ip macsec
> show" command is mangled, with half of the SCI replaced with 0s:
>
> # ip macsec show
> 11: macsec0: [...]
> cipher suite: GCM-AES-128, using ICV length 16
> TXSC: 0000000001560001 on SA 0
>
> # ip -d link show macsec0
> 11: macsec0@ens3: [...]
> link/ether 52:54:00:12:01:56 brd ff:ff:ff:ff:ff:ff promiscuity 0
> macsec sci 5254001201560001 [...]
>
> where TXSC and sci should match.
>
> Fixes: c0b904de6211 ("macsec: support JSON")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Thanks for finding this. We should add JSON (and macsec) to tests.
^ permalink raw reply
* Re: [iproute PATCH] bridge: fdb: Fix for missing keywords in non-JSON output
From: Stephen Hemminger @ 2018-10-15 16:31 UTC (permalink / raw)
To: Phil Sutter; +Cc: netdev
In-Reply-To: <20181009124408.7571-1-phil@nwl.cc>
On Tue, 9 Oct 2018 14:44:08 +0200
Phil Sutter <phil@nwl.cc> wrote:
> While migrating to JSON print library, some keywords were dropped from
> standard output by accident. Add them back to unbreak output parsers.
>
> Fixes: c7c1a1ef51aea ("bridge: colorize output and use JSON print library")
> Signed-off-by: Phil Sutter <phil@nwl.cc>
Good catch. Applied.
^ permalink raw reply
* Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast
From: Eric Dumazet @ 2018-10-15 16:23 UTC (permalink / raw)
To: Eric Dumazet, Gasper Zejn; +Cc: Eric Dumazet, Kevin Yang, netdev
In-Reply-To: <CANn89i+OoMLmx4jRTSRX8Svka0FRo-hPM-CqOxb9NJhV9KS7iQ@mail.gmail.com>
On 10/15/2018 07:50 AM, Eric Dumazet wrote:
> On Mon, Oct 15, 2018 at 3:26 AM Gasper Zejn <zelo.zejn@gmail.com> wrote:
>>
>>
>> I've tried to isolate the issue as best I could. There seems to be an
>> issue if the TCP socket has keepalive set and send queue is not empty
>> and the route goes away.
>>
>> https://github.com/zejn/bbr_pfifo_interrupts_issue
>>
>> Hope this helps,
>> Gasper
>
> This is awesome Gasper, I will take a look thanks.
>
> Note that we are about to send a patch series (targeting net-next) to
> polish the EDT patch series that was merged last month for linux-4.20.
> TCP internal pacing is going to be much better performance-wise.
>
Yeah, I believe that :
Commit c092dd5f4a7f4e4dbbcc8cf2e50b516bf07e432f ("tcp: switch
tcp_internal_pacing() to tcp_wstamp_ns")
has incidentally fixed the issue.
That is because it calls tcp_internal_pacing() from
tcp_update_skb_after_send() which is called only if the packet was
correctly sent by IP layer.
Before this patch, tcp_internal_pacing() was called from
__tcp_transmit_skb() before we attempted to send the clone
and the clone could be dropped in IP layer (lack of route for example)
right away.
So in case the packet was not sent because of a route problem, the high resolution
timer would kick soon after and TCP xmit path would be entered again, triggering this loop problem.
I am going to send the 2nd round of EDT patches, so that you can try David Miller net-next tree
with all the patches we believe are needed for 4.20. Once proven to work, we might have to backport
the series to 4.18 and 4.19
Thanks !
^ permalink raw reply
* Re: [Bug 201423] New: eth0: hw csum failure
From: Stephen Hemminger @ 2018-10-15 16:21 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, rossi.f
In-Reply-To: <CANn89iLA+rdFNXXdzogLHF1FqYg3CjpwXJbscWTJ8Bk8bN2Scw@mail.gmail.com>
On Mon, 15 Oct 2018 08:41:47 -0700
Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> >
> >
> >
> > Begin forwarded message:
> >
> > Date: Sun, 14 Oct 2018 10:42:48 +0000
> > From: bugzilla-daemon@bugzilla.kernel.org
> > To: stephen@networkplumber.org
> > Subject: [Bug 201423] New: eth0: hw csum failure
> >
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=201423
> >
> > Bug ID: 201423
> > Summary: eth0: hw csum failure
> > Product: Networking
> > Version: 2.5
> > Kernel Version: 4.19.0-rc7
> > Hardware: Intel
> > OS: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: normal
> > Priority: P1
> > Component: Other
> > Assignee: stephen@networkplumber.org
> > Reporter: rossi.f@inwind.it
> > Regression: No
> >
> > I have a P6T DELUXE V2 motherboard and using the sky2 driver for the ethernet
> > ports. I get the following error message:
> >
> > [ 433.727397] eth0: hw csum failure
> > [ 433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
> > [ 433.727406] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 1202 12/22/2010
> > [ 433.727407] Call Trace:
> > [ 433.727409] <IRQ>
> > [ 433.727415] dump_stack+0x46/0x5b
> > [ 433.727419] __skb_checksum_complete+0xb0/0xc0
> > [ 433.727423] tcp_v4_rcv+0x528/0xb60
> > [ 433.727426] ? ipt_do_table+0x2d0/0x400
> > [ 433.727429] ip_local_deliver_finish+0x5a/0x110
> > [ 433.727430] ip_local_deliver+0xe1/0xf0
> > [ 433.727431] ? ip_sublist_rcv_finish+0x60/0x60
> > [ 433.727432] ip_rcv+0xca/0xe0
> > [ 433.727434] ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [ 433.727436] __netif_receive_skb_one_core+0x4b/0x70
> > [ 433.727438] netif_receive_skb_internal+0x4e/0x130
> > [ 433.727439] napi_gro_receive+0x6a/0x80
> > [ 433.727442] sky2_poll+0x707/0xd20
> > [ 433.727446] ? rcu_check_callbacks+0x1b4/0x900
> > [ 433.727447] net_rx_action+0x237/0x380
> > [ 433.727449] __do_softirq+0xdc/0x1e0
> > [ 433.727452] irq_exit+0xa9/0xb0
> > [ 433.727453] do_IRQ+0x45/0xc0
> > [ 433.727455] common_interrupt+0xf/0xf
> > [ 433.727456] </IRQ>
> > [ 433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200
> > [ 433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f
> > ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1
> > 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> > [ 433.727462] RSP: 0000:ffffc900000a3e98 EFLAGS: 00000282 ORIG_RAX:
> > ffffffffffffffde
> > [ 433.727463] RAX: ffff880237b1f280 RBX: 0000000000000004 RCX:
> > 000000000000001f
> > [ 433.727464] RDX: 20c49ba5e353f7cf RSI: 000000002fe419c1 RDI:
> > 0000000000000000
> > [ 433.727465] RBP: ffff880237b263a0 R08: 0000000000000714 R09:
> > 000000650512105d
> > [ 433.727465] R10: 00000000ffffffff R11: 0000000000000342 R12:
> > 00000064fc2a8b1c
> > [ 433.727466] R13: 00000064fc25b35f R14: 0000000000000004 R15:
> > ffffffff8204af20
> > [ 433.727468] ? cpuidle_enter_state+0x119/0x200
> > [ 433.727471] do_idle+0x1bf/0x200
> > [ 433.727473] cpu_startup_entry+0x6a/0x70
> > [ 433.727475] start_secondary+0x17f/0x1c0
> > [ 433.727476] secondary_startup_64+0xa4/0xb0
> > [ 441.662954] eth0: hw csum failure
> > [ 441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19
> > [ 441.662960] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 1202 12/22/2010
> > [ 441.662960] Call Trace:
> > [ 441.662963] <IRQ>
> > [ 441.662968] dump_stack+0x46/0x5b
> > [ 441.662972] __skb_checksum_complete+0xb0/0xc0
> > [ 441.662975] tcp_v4_rcv+0x528/0xb60
> > [ 441.662979] ? ipt_do_table+0x2d0/0x400
> > [ 441.662981] ip_local_deliver_finish+0x5a/0x110
> > [ 441.662983] ip_local_deliver+0xe1/0xf0
> > [ 441.662985] ? ip_sublist_rcv_finish+0x60/0x60
> > [ 441.662986] ip_rcv+0xca/0xe0
> > [ 441.662988] ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [ 441.662990] __netif_receive_skb_one_core+0x4b/0x70
> > [ 441.662993] netif_receive_skb_internal+0x4e/0x130
> > [ 441.662994] napi_gro_receive+0x6a/0x80
> > [ 441.662998] sky2_poll+0x707/0xd20
> > [ 441.663000] net_rx_action+0x237/0x380
> > [ 441.663002] __do_softirq+0xdc/0x1e0
> > [ 441.663005] irq_exit+0xa9/0xb0
> > [ 441.663007] do_IRQ+0x45/0xc0
> > [ 441.663009] common_interrupt+0xf/0xf
> > [ 441.663010] </IRQ>
> > [ 441.663012] RIP: 0010:merge+0x22/0xb0
> > [ 441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 53
> > 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> 85 c9
> > 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48
> > [ 441.663015] RSP: 0018:ffffc9000090b988 EFLAGS: 00000246 ORIG_RAX:
> > ffffffffffffffde
> > [ 441.663017] RAX: 0000000000000000 RBX: ffff88021ab2d408 RCX:
> > ffff88021ab2d408
> > [ 441.663018] RDX: ffff88021ab2d388 RSI: ffffffffa021c440 RDI:
> > 0000000000000000
> > [ 441.663019] RBP: ffff88021ab2d388 R08: 0000000000005ecf R09:
> > 0000000000008500
> > [ 441.663020] R10: ffffea000877ec00 R11: ffff880236803500 R12:
> > ffffffffa021c440
> > [ 441.663021] R13: ffff88021ab2d448 R14: 0000000000000004 R15:
> > ffffc9000090b9e0
> > [ 441.663048] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
> > [ 441.663063] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
> > [ 441.663065] ? merge+0x57/0xb0
> > [ 441.663080] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
> > [ 441.663082] list_sort+0x8b/0x230
> > [ 441.663094] radeon_cs_parser_fini+0xdf/0x110 [radeon]
> > [ 441.663110] radeon_cs_ioctl+0x2a4/0x710 [radeon]
> > [ 441.663113] ? __switch_to_asm+0x34/0x70
> > [ 441.663114] ? __switch_to_asm+0x40/0x70
> > [ 441.663130] ? radeon_cs_parser_init+0x20/0x20 [radeon]
> > [ 441.663141] drm_ioctl_kernel+0xa3/0xe0 [drm]
> > [ 441.663149] drm_ioctl+0x2e2/0x380 [drm]
> > [ 441.663164] ? radeon_cs_parser_init+0x20/0x20 [radeon]
> > [ 441.663168] ? page_add_new_anon_rmap+0x42/0x70
> > [ 441.663171] do_vfs_ioctl+0x9a/0x600
> > [ 441.663173] ksys_ioctl+0x35/0x60
> > [ 441.663175] __x64_sys_ioctl+0x11/0x20
> > [ 441.663177] do_syscall_64+0x3d/0xf0
> > [ 441.663179] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 441.663180] RIP: 0033:0x7f9377377f37
> > [ 441.663182] Code: 00 00 00 75 0c 48 c7 c0 ff ff ff ff 48 83 c4 18 c3 e8 ad
> > db 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 10 00 00 00 0f 05 <48> 3d 01
> > f0 ff ff 73 01 c3 48 8b 0d 21 4f 2c 00 f7 d8 64 89 01 48
> > [ 441.663183] RSP: 002b:00007f92c3130d28 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000010
> > [ 441.663185] RAX: ffffffffffffffda RBX: 0000564498327ec0 RCX:
> > 00007f9377377f37
> > [ 441.663186] RDX: 0000564498337ec8 RSI: 00000000c0206466 RDI:
> > 0000000000000010
> > [ 441.663186] RBP: 0000564498337ec8 R08: 0000000000000000 R09:
> > 0000000000000000
> > [ 441.663187] R10: 0000000000000000 R11: 0000000000000246 R12:
> > 00000000c0206466
> > [ 441.663188] R13: 0000000000000010 R14: 0000000000000000 R15:
> > 0000564497a38120
> > [ 462.833418] eth0: hw csum failure
> > [ 462.833428] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
> > [ 462.833429] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 1202 12/22/2010
> > [ 462.833429] Call Trace:
> > [ 462.833432] <IRQ>
> > [ 462.833438] dump_stack+0x46/0x5b
> > [ 462.833442] __skb_checksum_complete+0xb0/0xc0
> > [ 462.833446] tcp_v4_rcv+0x528/0xb60
> > [ 462.833449] ? ipt_do_table+0x2d0/0x400
> > [ 462.833452] ip_local_deliver_finish+0x5a/0x110
> > [ 462.833454] ip_local_deliver+0xe1/0xf0
> > [ 462.833455] ? ip_sublist_rcv_finish+0x60/0x60
> > [ 462.833457] ip_rcv+0xca/0xe0
> > [ 462.833459] ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [ 462.833461] __netif_receive_skb_one_core+0x4b/0x70
> > [ 462.833464] netif_receive_skb_internal+0x4e/0x130
> > [ 462.833466] napi_gro_receive+0x6a/0x80
> > [ 462.833469] sky2_poll+0x707/0xd20
> > [ 462.833471] net_rx_action+0x237/0x380
> > [ 462.833474] __do_softirq+0xdc/0x1e0
> > [ 462.833477] irq_exit+0xa9/0xb0
> > [ 462.833479] do_IRQ+0x45/0xc0
> > [ 462.833481] common_interrupt+0xf/0xf
> > [ 462.833482] </IRQ>
> > [ 462.833486] RIP: 0010:cpuidle_enter_state+0x124/0x200
> > [ 462.833488] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f
> > ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1
> > 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> > [ 462.833489] RSP: 0018:ffffc900000a3e98 EFLAGS: 00000282 ORIG_RAX:
> > ffffffffffffffde
> > [ 462.833491] RAX: ffff880237b1f280 RBX: 0000000000000004 RCX:
> > 000000000000001f
> > [ 462.833492] RDX: 20c49ba5e353f7cf RSI: 000000002fe419c1 RDI:
> > 0000000000000000
> > [ 462.833493] RBP: ffff880237b263a0 R08: 0000000000000000 R09:
> > 0000000000000000
> > [ 462.833494] R10: 00000000ffffffff R11: 0000000000000273 R12:
> > 0000006bc3052131
> > [ 462.833495] R13: 0000006bc2f99f57 R14: 0000000000000004 R15:
> > ffffffff8204af20
> > [ 462.833498] ? cpuidle_enter_state+0x119/0x200
> > [ 462.833503] do_idle+0x1bf/0x200
> > [ 462.833506] cpu_startup_entry+0x6a/0x70
> > [ 462.833510] start_secondary+0x17f/0x1c0
> > [ 462.833513] secondary_startup_64+0xa4/0xb0
> >
> > Something is changed between 4.17.12 and 4.18, after bisecting the problem I
> > got the following first bad commit:
> >
> > commit 88078d98d1bb085d72af8437707279e203524fa5
> > Author: Eric Dumazet <edumazet@google.com>
> > Date: Wed Apr 18 11:43:15 2018 -0700
> >
> > net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
> >
> > After working on IP defragmentation lately, I found that some large
> > packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
> > zero paddings on the last (small) fragment.
> >
> > While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
> > to CHECKSUM_NONE, forcing a full csum validation, even if all prior
> > fragments had CHECKSUM_COMPLETE set.
> >
> > We can instead compute the checksum of the part we are trimming,
> > usually smaller than the part we keep.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> >
>
> Thanks for bisecting !
>
> This commit is known to expose some NIC/driver bugs.
>
> Look at commit 12b03558cef6d655d0d394f5e98a6fd07c1f6c0f
> ("net: sungem: fix rx checksum support") for one driver needing a fix.
>
> I assume SKY2_HW_NEW_LE is not set on your NIC ?
There are two variants of this chip, one does 1's compliment checksum, and
the other one does TCP checksum. Maybe the 1's compliment version is incorrectly
including the CRC.
Side note, not sure why but the driver only calls gro for checksummed packets.
Is that necessary?
^ permalink raw reply
* Re: Fw: [Bug 201423] New: eth0: hw csum failure
From: Dave Stevenson @ 2018-10-15 16:12 UTC (permalink / raw)
To: edumazet
Cc: stephen, netdev, rossi.f, Woojung Huh,
Microchip Linux Driver Support, Steve Glendinning
In-Reply-To: <CANn89iLA+rdFNXXdzogLHF1FqYg3CjpwXJbscWTJ8Bk8bN2Scw@mail.gmail.com>
Hi Eric.
On Mon, 15 Oct 2018 at 16:42, Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> >
> >
> >
> > Begin forwarded message:
> >
> > Date: Sun, 14 Oct 2018 10:42:48 +0000
> > From: bugzilla-daemon@bugzilla.kernel.org
> > To: stephen@networkplumber.org
> > Subject: [Bug 201423] New: eth0: hw csum failure
> >
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=201423
> >
> > Bug ID: 201423
> > Summary: eth0: hw csum failure
> > Product: Networking
> > Version: 2.5
> > Kernel Version: 4.19.0-rc7
> > Hardware: Intel
> > OS: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: normal
> > Priority: P1
> > Component: Other
> > Assignee: stephen@networkplumber.org
> > Reporter: rossi.f@inwind.it
> > Regression: No
> >
> > I have a P6T DELUXE V2 motherboard and using the sky2 driver for the ethernet
> > ports. I get the following error message:
> >
> > [ 433.727397] eth0: hw csum failure
> > [ 433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
> > [ 433.727406] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 1202 12/22/2010
> > [ 433.727407] Call Trace:
> > [ 433.727409] <IRQ>
> > [ 433.727415] dump_stack+0x46/0x5b
> > [ 433.727419] __skb_checksum_complete+0xb0/0xc0
> > [ 433.727423] tcp_v4_rcv+0x528/0xb60
> > [ 433.727426] ? ipt_do_table+0x2d0/0x400
> > [ 433.727429] ip_local_deliver_finish+0x5a/0x110
> > [ 433.727430] ip_local_deliver+0xe1/0xf0
> > [ 433.727431] ? ip_sublist_rcv_finish+0x60/0x60
> > [ 433.727432] ip_rcv+0xca/0xe0
> > [ 433.727434] ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [ 433.727436] __netif_receive_skb_one_core+0x4b/0x70
> > [ 433.727438] netif_receive_skb_internal+0x4e/0x130
> > [ 433.727439] napi_gro_receive+0x6a/0x80
> > [ 433.727442] sky2_poll+0x707/0xd20
> > [ 433.727446] ? rcu_check_callbacks+0x1b4/0x900
> > [ 433.727447] net_rx_action+0x237/0x380
> > [ 433.727449] __do_softirq+0xdc/0x1e0
> > [ 433.727452] irq_exit+0xa9/0xb0
> > [ 433.727453] do_IRQ+0x45/0xc0
> > [ 433.727455] common_interrupt+0xf/0xf
> > [ 433.727456] </IRQ>
> > [ 433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200
> > [ 433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f
> > ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1
> > 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> > [ 433.727462] RSP: 0000:ffffc900000a3e98 EFLAGS: 00000282 ORIG_RAX:
> > ffffffffffffffde
> > [ 433.727463] RAX: ffff880237b1f280 RBX: 0000000000000004 RCX:
> > 000000000000001f
> > [ 433.727464] RDX: 20c49ba5e353f7cf RSI: 000000002fe419c1 RDI:
> > 0000000000000000
> > [ 433.727465] RBP: ffff880237b263a0 R08: 0000000000000714 R09:
> > 000000650512105d
> > [ 433.727465] R10: 00000000ffffffff R11: 0000000000000342 R12:
> > 00000064fc2a8b1c
> > [ 433.727466] R13: 00000064fc25b35f R14: 0000000000000004 R15:
> > ffffffff8204af20
> > [ 433.727468] ? cpuidle_enter_state+0x119/0x200
> > [ 433.727471] do_idle+0x1bf/0x200
> > [ 433.727473] cpu_startup_entry+0x6a/0x70
> > [ 433.727475] start_secondary+0x17f/0x1c0
> > [ 433.727476] secondary_startup_64+0xa4/0xb0
> > [ 441.662954] eth0: hw csum failure
> > [ 441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19
> > [ 441.662960] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 1202 12/22/2010
> > [ 441.662960] Call Trace:
> > [ 441.662963] <IRQ>
> > [ 441.662968] dump_stack+0x46/0x5b
> > [ 441.662972] __skb_checksum_complete+0xb0/0xc0
> > [ 441.662975] tcp_v4_rcv+0x528/0xb60
> > [ 441.662979] ? ipt_do_table+0x2d0/0x400
> > [ 441.662981] ip_local_deliver_finish+0x5a/0x110
> > [ 441.662983] ip_local_deliver+0xe1/0xf0
> > [ 441.662985] ? ip_sublist_rcv_finish+0x60/0x60
> > [ 441.662986] ip_rcv+0xca/0xe0
> > [ 441.662988] ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [ 441.662990] __netif_receive_skb_one_core+0x4b/0x70
> > [ 441.662993] netif_receive_skb_internal+0x4e/0x130
> > [ 441.662994] napi_gro_receive+0x6a/0x80
> > [ 441.662998] sky2_poll+0x707/0xd20
> > [ 441.663000] net_rx_action+0x237/0x380
> > [ 441.663002] __do_softirq+0xdc/0x1e0
> > [ 441.663005] irq_exit+0xa9/0xb0
> > [ 441.663007] do_IRQ+0x45/0xc0
> > [ 441.663009] common_interrupt+0xf/0xf
> > [ 441.663010] </IRQ>
> > [ 441.663012] RIP: 0010:merge+0x22/0xb0
> > [ 441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 53
> > 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> 85 c9
> > 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48
> > [ 441.663015] RSP: 0018:ffffc9000090b988 EFLAGS: 00000246 ORIG_RAX:
> > ffffffffffffffde
> > [ 441.663017] RAX: 0000000000000000 RBX: ffff88021ab2d408 RCX:
> > ffff88021ab2d408
> > [ 441.663018] RDX: ffff88021ab2d388 RSI: ffffffffa021c440 RDI:
> > 0000000000000000
> > [ 441.663019] RBP: ffff88021ab2d388 R08: 0000000000005ecf R09:
> > 0000000000008500
> > [ 441.663020] R10: ffffea000877ec00 R11: ffff880236803500 R12:
> > ffffffffa021c440
> > [ 441.663021] R13: ffff88021ab2d448 R14: 0000000000000004 R15:
> > ffffc9000090b9e0
> > [ 441.663048] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
> > [ 441.663063] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
> > [ 441.663065] ? merge+0x57/0xb0
> > [ 441.663080] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
> > [ 441.663082] list_sort+0x8b/0x230
> > [ 441.663094] radeon_cs_parser_fini+0xdf/0x110 [radeon]
> > [ 441.663110] radeon_cs_ioctl+0x2a4/0x710 [radeon]
> > [ 441.663113] ? __switch_to_asm+0x34/0x70
> > [ 441.663114] ? __switch_to_asm+0x40/0x70
> > [ 441.663130] ? radeon_cs_parser_init+0x20/0x20 [radeon]
> > [ 441.663141] drm_ioctl_kernel+0xa3/0xe0 [drm]
> > [ 441.663149] drm_ioctl+0x2e2/0x380 [drm]
> > [ 441.663164] ? radeon_cs_parser_init+0x20/0x20 [radeon]
> > [ 441.663168] ? page_add_new_anon_rmap+0x42/0x70
> > [ 441.663171] do_vfs_ioctl+0x9a/0x600
> > [ 441.663173] ksys_ioctl+0x35/0x60
> > [ 441.663175] __x64_sys_ioctl+0x11/0x20
> > [ 441.663177] do_syscall_64+0x3d/0xf0
> > [ 441.663179] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 441.663180] RIP: 0033:0x7f9377377f37
> > [ 441.663182] Code: 00 00 00 75 0c 48 c7 c0 ff ff ff ff 48 83 c4 18 c3 e8 ad
> > db 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 10 00 00 00 0f 05 <48> 3d 01
> > f0 ff ff 73 01 c3 48 8b 0d 21 4f 2c 00 f7 d8 64 89 01 48
> > [ 441.663183] RSP: 002b:00007f92c3130d28 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000010
> > [ 441.663185] RAX: ffffffffffffffda RBX: 0000564498327ec0 RCX:
> > 00007f9377377f37
> > [ 441.663186] RDX: 0000564498337ec8 RSI: 00000000c0206466 RDI:
> > 0000000000000010
> > [ 441.663186] RBP: 0000564498337ec8 R08: 0000000000000000 R09:
> > 0000000000000000
> > [ 441.663187] R10: 0000000000000000 R11: 0000000000000246 R12:
> > 00000000c0206466
> > [ 441.663188] R13: 0000000000000010 R14: 0000000000000000 R15:
> > 0000564497a38120
> > [ 462.833418] eth0: hw csum failure
> > [ 462.833428] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
> > [ 462.833429] Hardware name: System manufacturer System Product Name/P6T
> > DELUXE V2, BIOS 1202 12/22/2010
> > [ 462.833429] Call Trace:
> > [ 462.833432] <IRQ>
> > [ 462.833438] dump_stack+0x46/0x5b
> > [ 462.833442] __skb_checksum_complete+0xb0/0xc0
> > [ 462.833446] tcp_v4_rcv+0x528/0xb60
> > [ 462.833449] ? ipt_do_table+0x2d0/0x400
> > [ 462.833452] ip_local_deliver_finish+0x5a/0x110
> > [ 462.833454] ip_local_deliver+0xe1/0xf0
> > [ 462.833455] ? ip_sublist_rcv_finish+0x60/0x60
> > [ 462.833457] ip_rcv+0xca/0xe0
> > [ 462.833459] ? ip_rcv_finish_core.isra.0+0x300/0x300
> > [ 462.833461] __netif_receive_skb_one_core+0x4b/0x70
> > [ 462.833464] netif_receive_skb_internal+0x4e/0x130
> > [ 462.833466] napi_gro_receive+0x6a/0x80
> > [ 462.833469] sky2_poll+0x707/0xd20
> > [ 462.833471] net_rx_action+0x237/0x380
> > [ 462.833474] __do_softirq+0xdc/0x1e0
> > [ 462.833477] irq_exit+0xa9/0xb0
> > [ 462.833479] do_IRQ+0x45/0xc0
> > [ 462.833481] common_interrupt+0xf/0xf
> > [ 462.833482] </IRQ>
> > [ 462.833486] RIP: 0010:cpuidle_enter_state+0x124/0x200
> > [ 462.833488] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f
> > ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1
> > 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> > [ 462.833489] RSP: 0018:ffffc900000a3e98 EFLAGS: 00000282 ORIG_RAX:
> > ffffffffffffffde
> > [ 462.833491] RAX: ffff880237b1f280 RBX: 0000000000000004 RCX:
> > 000000000000001f
> > [ 462.833492] RDX: 20c49ba5e353f7cf RSI: 000000002fe419c1 RDI:
> > 0000000000000000
> > [ 462.833493] RBP: ffff880237b263a0 R08: 0000000000000000 R09:
> > 0000000000000000
> > [ 462.833494] R10: 00000000ffffffff R11: 0000000000000273 R12:
> > 0000006bc3052131
> > [ 462.833495] R13: 0000006bc2f99f57 R14: 0000000000000004 R15:
> > ffffffff8204af20
> > [ 462.833498] ? cpuidle_enter_state+0x119/0x200
> > [ 462.833503] do_idle+0x1bf/0x200
> > [ 462.833506] cpu_startup_entry+0x6a/0x70
> > [ 462.833510] start_secondary+0x17f/0x1c0
> > [ 462.833513] secondary_startup_64+0xa4/0xb0
> >
> > Something is changed between 4.17.12 and 4.18, after bisecting the problem I
> > got the following first bad commit:
> >
> > commit 88078d98d1bb085d72af8437707279e203524fa5
> > Author: Eric Dumazet <edumazet@google.com>
> > Date: Wed Apr 18 11:43:15 2018 -0700
> >
> > net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
> >
> > After working on IP defragmentation lately, I found that some large
> > packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
> > zero paddings on the last (small) fragment.
> >
> > While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
> > to CHECKSUM_NONE, forcing a full csum validation, even if all prior
> > fragments had CHECKSUM_COMPLETE set.
> >
> > We can instead compute the checksum of the part we are trimming,
> > usually smaller than the part we keep.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> >
>
> Thanks for bisecting !
>
> This commit is known to expose some NIC/driver bugs.
>
> Look at commit 12b03558cef6d655d0d394f5e98a6fd07c1f6c0f
> ("net: sungem: fix rx checksum support") for one driver needing a fix.
>
> I assume SKY2_HW_NEW_LE is not set on your NIC ?
Just to say that we've also just hit this with both the LAN78xx and
SMSC9514 drivers, ie all Raspberry Pis with onboard ethernet. Likewise
that commit had been pinpointed as the cause, or at least exposing an
underlying issue.
As the patch has been backported in 4.14.71 it's hitting LTS users too.
Thanks for the pointer on sungem. I'll have a look into what's going
on and see if we can sort it, although I have cc'ed in the maintainers
of those chips in case they are already on the case.
Cheers.
Dave
^ permalink raw reply
* Re: [bpf-next PATCH v2 2/2] bpf: bpftool, add flag to allow non-compat map definitions
From: Jakub Kicinski @ 2018-10-15 16:06 UTC (permalink / raw)
To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20181015151753.9258.75330.stgit@john-Precision-Tower-5810>
On Mon, 15 Oct 2018 08:17:53 -0700, John Fastabend wrote:
> Multiple map definition structures exist and user may have non-zero
> fields in their definition that are not recognized by bpftool and
> libbpf. The normal behavior is to then fail loading the map. Although
> this is a good default behavior users may still want to load the map
> for debugging or other reasons. This patch adds a --mapcompat flag
> that can be used to override the default behavior and allow loading
> the map even when it has additional non-zero fields.
>
> For now the only user is 'bpftool prog' we can switch over other
> subcommands as needed. The library exposes an API that consumes
> a flags field now but I kept the original API around also in case
> users of the API don't want to expose this. The flags field is an
> int in case we need more control over how the API call handles
> errors/features/etc in the future.
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
No strong opinion on the functionality, but may I be a grump and again
request adding the new option to completions and the man page? :)
^ permalink raw reply
* Re: [bpf-next PATCH v2 1/2] bpf: bpftool, add support for attaching programs to maps
From: Jakub Kicinski @ 2018-10-15 16:04 UTC (permalink / raw)
To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20181015151748.9258.66154.stgit@john-Precision-Tower-5810>
On Mon, 15 Oct 2018 08:17:48 -0700, John Fastabend wrote:
> Sock map/hash introduce support for attaching programs to maps. To
> date I have been doing this with custom tooling but this is less than
> ideal as we shift to using bpftool as the single CLI for our BPF uses.
> This patch adds new sub commands 'attach' and 'detach' to the 'prog'
> command to attach programs to maps and then detach them.
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
^ permalink raw reply
* Re: Bug in MACSec - stops passing traffic after approx 5TB
From: Josh Coombs @ 2018-10-15 15:45 UTC (permalink / raw)
To: sd; +Cc: netdev
In-Reply-To: <CACcUnf8uYKKbE8o=LvT7cENKiuB_sBMD+2VMCCOnwa=2W=Ri4w@mail.gmail.com>
And confirmed, starting with a high packet number results in a very
short testbed run, 296 packets and then nothing, just as you surmised.
Sorry for raising the alarm falsely. Looks like I need to roll my own
build of wpa_supplicant as the ubuntu builds don't include the macsec
driver, haven't tested Gentoo's ebuilds yet to see if they do.
Josh Coombs
On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs <jcoombs@staff.gwi.net> wrote:
>
> On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> >
> > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote:
> > > I initially mistook this for a traffic control issue, but after
> > > stripping the test beds down to just the MACSec component, I can still
> > > replicate the issue. After approximately 5TB of transfer / 4 billion
> > > packets over a MACSec link it stops passing traffic.
> >
> > I think you're just hitting packet number exhaustion. After 2^32
> > packets, the packet number would wrap to 0 and start being reused,
> > which breaks the crypto used by macsec. Before this point, you have to
> > add a new SA, and tell the macsec device to switch to it.
>
> I had not considered that, I naively thought as long as I didn't
> specify a replay window, it'd roll the PN over on it's own and life
> would be good. I'll test that theory tomorrow, should be easy to
> prove out.
>
> > That's why you should be using wpa_supplicant. It will monitor the
> > growth of the packet number, and handle the rekey for you.
>
> Thank you for the heads up, I'll read up on this as well.
>
> Josh C
^ permalink raw reply
* Re: Fw: [Bug 201423] New: eth0: hw csum failure
From: Eric Dumazet @ 2018-10-15 15:41 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, rossi.f
In-Reply-To: <20181015081519.0bf076bc@xeon-e3>
On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
>
>
> Begin forwarded message:
>
> Date: Sun, 14 Oct 2018 10:42:48 +0000
> From: bugzilla-daemon@bugzilla.kernel.org
> To: stephen@networkplumber.org
> Subject: [Bug 201423] New: eth0: hw csum failure
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=201423
>
> Bug ID: 201423
> Summary: eth0: hw csum failure
> Product: Networking
> Version: 2.5
> Kernel Version: 4.19.0-rc7
> Hardware: Intel
> OS: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: Other
> Assignee: stephen@networkplumber.org
> Reporter: rossi.f@inwind.it
> Regression: No
>
> I have a P6T DELUXE V2 motherboard and using the sky2 driver for the ethernet
> ports. I get the following error message:
>
> [ 433.727397] eth0: hw csum failure
> [ 433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
> [ 433.727406] Hardware name: System manufacturer System Product Name/P6T
> DELUXE V2, BIOS 1202 12/22/2010
> [ 433.727407] Call Trace:
> [ 433.727409] <IRQ>
> [ 433.727415] dump_stack+0x46/0x5b
> [ 433.727419] __skb_checksum_complete+0xb0/0xc0
> [ 433.727423] tcp_v4_rcv+0x528/0xb60
> [ 433.727426] ? ipt_do_table+0x2d0/0x400
> [ 433.727429] ip_local_deliver_finish+0x5a/0x110
> [ 433.727430] ip_local_deliver+0xe1/0xf0
> [ 433.727431] ? ip_sublist_rcv_finish+0x60/0x60
> [ 433.727432] ip_rcv+0xca/0xe0
> [ 433.727434] ? ip_rcv_finish_core.isra.0+0x300/0x300
> [ 433.727436] __netif_receive_skb_one_core+0x4b/0x70
> [ 433.727438] netif_receive_skb_internal+0x4e/0x130
> [ 433.727439] napi_gro_receive+0x6a/0x80
> [ 433.727442] sky2_poll+0x707/0xd20
> [ 433.727446] ? rcu_check_callbacks+0x1b4/0x900
> [ 433.727447] net_rx_action+0x237/0x380
> [ 433.727449] __do_softirq+0xdc/0x1e0
> [ 433.727452] irq_exit+0xa9/0xb0
> [ 433.727453] do_IRQ+0x45/0xc0
> [ 433.727455] common_interrupt+0xf/0xf
> [ 433.727456] </IRQ>
> [ 433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200
> [ 433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f
> ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1
> 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> [ 433.727462] RSP: 0000:ffffc900000a3e98 EFLAGS: 00000282 ORIG_RAX:
> ffffffffffffffde
> [ 433.727463] RAX: ffff880237b1f280 RBX: 0000000000000004 RCX:
> 000000000000001f
> [ 433.727464] RDX: 20c49ba5e353f7cf RSI: 000000002fe419c1 RDI:
> 0000000000000000
> [ 433.727465] RBP: ffff880237b263a0 R08: 0000000000000714 R09:
> 000000650512105d
> [ 433.727465] R10: 00000000ffffffff R11: 0000000000000342 R12:
> 00000064fc2a8b1c
> [ 433.727466] R13: 00000064fc25b35f R14: 0000000000000004 R15:
> ffffffff8204af20
> [ 433.727468] ? cpuidle_enter_state+0x119/0x200
> [ 433.727471] do_idle+0x1bf/0x200
> [ 433.727473] cpu_startup_entry+0x6a/0x70
> [ 433.727475] start_secondary+0x17f/0x1c0
> [ 433.727476] secondary_startup_64+0xa4/0xb0
> [ 441.662954] eth0: hw csum failure
> [ 441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19
> [ 441.662960] Hardware name: System manufacturer System Product Name/P6T
> DELUXE V2, BIOS 1202 12/22/2010
> [ 441.662960] Call Trace:
> [ 441.662963] <IRQ>
> [ 441.662968] dump_stack+0x46/0x5b
> [ 441.662972] __skb_checksum_complete+0xb0/0xc0
> [ 441.662975] tcp_v4_rcv+0x528/0xb60
> [ 441.662979] ? ipt_do_table+0x2d0/0x400
> [ 441.662981] ip_local_deliver_finish+0x5a/0x110
> [ 441.662983] ip_local_deliver+0xe1/0xf0
> [ 441.662985] ? ip_sublist_rcv_finish+0x60/0x60
> [ 441.662986] ip_rcv+0xca/0xe0
> [ 441.662988] ? ip_rcv_finish_core.isra.0+0x300/0x300
> [ 441.662990] __netif_receive_skb_one_core+0x4b/0x70
> [ 441.662993] netif_receive_skb_internal+0x4e/0x130
> [ 441.662994] napi_gro_receive+0x6a/0x80
> [ 441.662998] sky2_poll+0x707/0xd20
> [ 441.663000] net_rx_action+0x237/0x380
> [ 441.663002] __do_softirq+0xdc/0x1e0
> [ 441.663005] irq_exit+0xa9/0xb0
> [ 441.663007] do_IRQ+0x45/0xc0
> [ 441.663009] common_interrupt+0xf/0xf
> [ 441.663010] </IRQ>
> [ 441.663012] RIP: 0010:merge+0x22/0xb0
> [ 441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 53
> 48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> 85 c9
> 74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48
> [ 441.663015] RSP: 0018:ffffc9000090b988 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffde
> [ 441.663017] RAX: 0000000000000000 RBX: ffff88021ab2d408 RCX:
> ffff88021ab2d408
> [ 441.663018] RDX: ffff88021ab2d388 RSI: ffffffffa021c440 RDI:
> 0000000000000000
> [ 441.663019] RBP: ffff88021ab2d388 R08: 0000000000005ecf R09:
> 0000000000008500
> [ 441.663020] R10: ffffea000877ec00 R11: ffff880236803500 R12:
> ffffffffa021c440
> [ 441.663021] R13: ffff88021ab2d448 R14: 0000000000000004 R15:
> ffffc9000090b9e0
> [ 441.663048] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
> [ 441.663063] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
> [ 441.663065] ? merge+0x57/0xb0
> [ 441.663080] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
> [ 441.663082] list_sort+0x8b/0x230
> [ 441.663094] radeon_cs_parser_fini+0xdf/0x110 [radeon]
> [ 441.663110] radeon_cs_ioctl+0x2a4/0x710 [radeon]
> [ 441.663113] ? __switch_to_asm+0x34/0x70
> [ 441.663114] ? __switch_to_asm+0x40/0x70
> [ 441.663130] ? radeon_cs_parser_init+0x20/0x20 [radeon]
> [ 441.663141] drm_ioctl_kernel+0xa3/0xe0 [drm]
> [ 441.663149] drm_ioctl+0x2e2/0x380 [drm]
> [ 441.663164] ? radeon_cs_parser_init+0x20/0x20 [radeon]
> [ 441.663168] ? page_add_new_anon_rmap+0x42/0x70
> [ 441.663171] do_vfs_ioctl+0x9a/0x600
> [ 441.663173] ksys_ioctl+0x35/0x60
> [ 441.663175] __x64_sys_ioctl+0x11/0x20
> [ 441.663177] do_syscall_64+0x3d/0xf0
> [ 441.663179] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 441.663180] RIP: 0033:0x7f9377377f37
> [ 441.663182] Code: 00 00 00 75 0c 48 c7 c0 ff ff ff ff 48 83 c4 18 c3 e8 ad
> db 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 10 00 00 00 0f 05 <48> 3d 01
> f0 ff ff 73 01 c3 48 8b 0d 21 4f 2c 00 f7 d8 64 89 01 48
> [ 441.663183] RSP: 002b:00007f92c3130d28 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 441.663185] RAX: ffffffffffffffda RBX: 0000564498327ec0 RCX:
> 00007f9377377f37
> [ 441.663186] RDX: 0000564498337ec8 RSI: 00000000c0206466 RDI:
> 0000000000000010
> [ 441.663186] RBP: 0000564498337ec8 R08: 0000000000000000 R09:
> 0000000000000000
> [ 441.663187] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00000000c0206466
> [ 441.663188] R13: 0000000000000010 R14: 0000000000000000 R15:
> 0000564497a38120
> [ 462.833418] eth0: hw csum failure
> [ 462.833428] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
> [ 462.833429] Hardware name: System manufacturer System Product Name/P6T
> DELUXE V2, BIOS 1202 12/22/2010
> [ 462.833429] Call Trace:
> [ 462.833432] <IRQ>
> [ 462.833438] dump_stack+0x46/0x5b
> [ 462.833442] __skb_checksum_complete+0xb0/0xc0
> [ 462.833446] tcp_v4_rcv+0x528/0xb60
> [ 462.833449] ? ipt_do_table+0x2d0/0x400
> [ 462.833452] ip_local_deliver_finish+0x5a/0x110
> [ 462.833454] ip_local_deliver+0xe1/0xf0
> [ 462.833455] ? ip_sublist_rcv_finish+0x60/0x60
> [ 462.833457] ip_rcv+0xca/0xe0
> [ 462.833459] ? ip_rcv_finish_core.isra.0+0x300/0x300
> [ 462.833461] __netif_receive_skb_one_core+0x4b/0x70
> [ 462.833464] netif_receive_skb_internal+0x4e/0x130
> [ 462.833466] napi_gro_receive+0x6a/0x80
> [ 462.833469] sky2_poll+0x707/0xd20
> [ 462.833471] net_rx_action+0x237/0x380
> [ 462.833474] __do_softirq+0xdc/0x1e0
> [ 462.833477] irq_exit+0xa9/0xb0
> [ 462.833479] do_IRQ+0x45/0xc0
> [ 462.833481] common_interrupt+0xf/0xf
> [ 462.833482] </IRQ>
> [ 462.833486] RIP: 0010:cpuidle_enter_state+0x124/0x200
> [ 462.833488] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f
> ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1
> 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> [ 462.833489] RSP: 0018:ffffc900000a3e98 EFLAGS: 00000282 ORIG_RAX:
> ffffffffffffffde
> [ 462.833491] RAX: ffff880237b1f280 RBX: 0000000000000004 RCX:
> 000000000000001f
> [ 462.833492] RDX: 20c49ba5e353f7cf RSI: 000000002fe419c1 RDI:
> 0000000000000000
> [ 462.833493] RBP: ffff880237b263a0 R08: 0000000000000000 R09:
> 0000000000000000
> [ 462.833494] R10: 00000000ffffffff R11: 0000000000000273 R12:
> 0000006bc3052131
> [ 462.833495] R13: 0000006bc2f99f57 R14: 0000000000000004 R15:
> ffffffff8204af20
> [ 462.833498] ? cpuidle_enter_state+0x119/0x200
> [ 462.833503] do_idle+0x1bf/0x200
> [ 462.833506] cpu_startup_entry+0x6a/0x70
> [ 462.833510] start_secondary+0x17f/0x1c0
> [ 462.833513] secondary_startup_64+0xa4/0xb0
>
> Something is changed between 4.17.12 and 4.18, after bisecting the problem I
> got the following first bad commit:
>
> commit 88078d98d1bb085d72af8437707279e203524fa5
> Author: Eric Dumazet <edumazet@google.com>
> Date: Wed Apr 18 11:43:15 2018 -0700
>
> net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
>
> After working on IP defragmentation lately, I found that some large
> packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
> zero paddings on the last (small) fragment.
>
> While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
> to CHECKSUM_NONE, forcing a full csum validation, even if all prior
> fragments had CHECKSUM_COMPLETE set.
>
> We can instead compute the checksum of the part we are trimming,
> usually smaller than the part we keep.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
Thanks for bisecting !
This commit is known to expose some NIC/driver bugs.
Look at commit 12b03558cef6d655d0d394f5e98a6fd07c1f6c0f
("net: sungem: fix rx checksum support") for one driver needing a fix.
I assume SKY2_HW_NEW_LE is not set on your NIC ?
^ permalink raw reply
* [bpf-next PATCH v2 2/2] bpf: bpftool, add flag to allow non-compat map definitions
From: John Fastabend @ 2018-10-15 15:17 UTC (permalink / raw)
To: jakub.kicinski, ast, daniel; +Cc: netdev
In-Reply-To: <20181015151336.9258.37606.stgit@john-Precision-Tower-5810>
Multiple map definition structures exist and user may have non-zero
fields in their definition that are not recognized by bpftool and
libbpf. The normal behavior is to then fail loading the map. Although
this is a good default behavior users may still want to load the map
for debugging or other reasons. This patch adds a --mapcompat flag
that can be used to override the default behavior and allow loading
the map even when it has additional non-zero fields.
For now the only user is 'bpftool prog' we can switch over other
subcommands as needed. The library exposes an API that consumes
a flags field now but I kept the original API around also in case
users of the API don't want to expose this. The flags field is an
int in case we need more control over how the API call handles
errors/features/etc in the future.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
tools/bpf/bpftool/main.c | 7 ++++++-
tools/bpf/bpftool/main.h | 3 ++-
tools/bpf/bpftool/prog.c | 2 +-
tools/lib/bpf/bpf.h | 3 +++
tools/lib/bpf/libbpf.c | 27 ++++++++++++++++++---------
tools/lib/bpf/libbpf.h | 2 ++
6 files changed, 32 insertions(+), 12 deletions(-)
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index 79dc3f1..828dde3 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -55,6 +55,7 @@
bool pretty_output;
bool json_output;
bool show_pinned;
+int bpf_flags;
struct pinned_obj_table prog_table;
struct pinned_obj_table map_table;
@@ -341,6 +342,7 @@ int main(int argc, char **argv)
{ "pretty", no_argument, NULL, 'p' },
{ "version", no_argument, NULL, 'V' },
{ "bpffs", no_argument, NULL, 'f' },
+ { "mapcompat", no_argument, NULL, 'm' },
{ 0 }
};
int opt, ret;
@@ -355,7 +357,7 @@ int main(int argc, char **argv)
hash_init(map_table.table);
opterr = 0;
- while ((opt = getopt_long(argc, argv, "Vhpjf",
+ while ((opt = getopt_long(argc, argv, "Vhpjfm",
options, NULL)) >= 0) {
switch (opt) {
case 'V':
@@ -379,6 +381,9 @@ int main(int argc, char **argv)
case 'f':
show_pinned = true;
break;
+ case 'm':
+ bpf_flags = MAPS_RELAX_COMPAT;
+ break;
default:
p_err("unrecognized option '%s'", argv[optind - 1]);
if (json_output)
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 40492cd..91fd697 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -74,7 +74,7 @@
#define HELP_SPEC_PROGRAM \
"PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }"
#define HELP_SPEC_OPTIONS \
- "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} }"
+ "OPTIONS := { {-j|--json} [{-p|--pretty}] | {-f|--bpffs} | {-m|--mapcompat}"
#define HELP_SPEC_MAP \
"MAP := { id MAP_ID | pinned FILE }"
@@ -89,6 +89,7 @@ enum bpf_obj_type {
extern json_writer_t *json_wtr;
extern bool json_output;
extern bool show_pinned;
+extern int bpf_flags;
extern struct pinned_obj_table prog_table;
extern struct pinned_obj_table map_table;
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 99ab42c..3350289 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -908,7 +908,7 @@ static int do_load(int argc, char **argv)
}
}
- obj = bpf_object__open_xattr(&attr);
+ obj = __bpf_object__open_xattr(&attr, bpf_flags);
if (IS_ERR_OR_NULL(obj)) {
p_err("failed to open object file");
goto err_free_reuse_maps;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 87520a8..69a4d40 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -69,6 +69,9 @@ struct bpf_load_program_attr {
__u32 prog_ifindex;
};
+/* Flags to direct loading requirements */
+#define MAPS_RELAX_COMPAT 0x01
+
/* Recommend log buffer size */
#define BPF_LOG_BUF_SIZE (256 * 1024)
int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 176cf55..bd71efc 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -562,8 +562,9 @@ static int compare_bpf_map(const void *_a, const void *_b)
}
static int
-bpf_object__init_maps(struct bpf_object *obj)
+bpf_object__init_maps(struct bpf_object *obj, int flags)
{
+ bool strict = !(flags & MAPS_RELAX_COMPAT);
int i, map_idx, map_def_sz, nr_maps = 0;
Elf_Scn *scn;
Elf_Data *data;
@@ -685,7 +686,8 @@ static int compare_bpf_map(const void *_a, const void *_b)
"has unrecognized, non-zero "
"options\n",
obj->path, map_name);
- return -EINVAL;
+ if (strict)
+ return -EINVAL;
}
}
memcpy(&obj->maps[map_idx].def, def,
@@ -716,7 +718,7 @@ static bool section_have_execinstr(struct bpf_object *obj, int idx)
return false;
}
-static int bpf_object__elf_collect(struct bpf_object *obj)
+static int bpf_object__elf_collect(struct bpf_object *obj, int flags)
{
Elf *elf = obj->efile.elf;
GElf_Ehdr *ep = &obj->efile.ehdr;
@@ -843,7 +845,7 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
return LIBBPF_ERRNO__FORMAT;
}
if (obj->efile.maps_shndx >= 0) {
- err = bpf_object__init_maps(obj);
+ err = bpf_object__init_maps(obj, flags);
if (err)
goto out;
}
@@ -1515,7 +1517,7 @@ static int bpf_object__validate(struct bpf_object *obj, bool needs_kver)
static struct bpf_object *
__bpf_object__open(const char *path, void *obj_buf, size_t obj_buf_sz,
- bool needs_kver)
+ bool needs_kver, int flags)
{
struct bpf_object *obj;
int err;
@@ -1531,7 +1533,7 @@ static int bpf_object__validate(struct bpf_object *obj, bool needs_kver)
CHECK_ERR(bpf_object__elf_init(obj), err, out);
CHECK_ERR(bpf_object__check_endianness(obj), err, out);
- CHECK_ERR(bpf_object__elf_collect(obj), err, out);
+ CHECK_ERR(bpf_object__elf_collect(obj, flags), err, out);
CHECK_ERR(bpf_object__collect_reloc(obj), err, out);
CHECK_ERR(bpf_object__validate(obj, needs_kver), err, out);
@@ -1542,7 +1544,8 @@ static int bpf_object__validate(struct bpf_object *obj, bool needs_kver)
return ERR_PTR(err);
}
-struct bpf_object *bpf_object__open_xattr(struct bpf_object_open_attr *attr)
+struct bpf_object *__bpf_object__open_xattr(struct bpf_object_open_attr *attr,
+ int flags)
{
/* param validation */
if (!attr->file)
@@ -1551,7 +1554,13 @@ struct bpf_object *bpf_object__open_xattr(struct bpf_object_open_attr *attr)
pr_debug("loading %s\n", attr->file);
return __bpf_object__open(attr->file, NULL, 0,
- bpf_prog_type__needs_kver(attr->prog_type));
+ bpf_prog_type__needs_kver(attr->prog_type),
+ flags);
+}
+
+struct bpf_object *bpf_object__open_xattr(struct bpf_object_open_attr *attr)
+{
+ return __bpf_object__open_xattr(attr, 0);
}
struct bpf_object *bpf_object__open(const char *path)
@@ -1584,7 +1593,7 @@ struct bpf_object *bpf_object__open_buffer(void *obj_buf,
pr_debug("loading object '%s' from buffer\n",
name);
- return __bpf_object__open(name, obj_buf, obj_buf_sz, true);
+ return __bpf_object__open(name, obj_buf, obj_buf_sz, true, true);
}
int bpf_object__unload(struct bpf_object *obj)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 8af8d36..7e9c801 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -61,6 +61,8 @@ struct bpf_object_open_attr {
struct bpf_object *bpf_object__open(const char *path);
struct bpf_object *bpf_object__open_xattr(struct bpf_object_open_attr *attr);
+struct bpf_object *__bpf_object__open_xattr(struct bpf_object_open_attr *attr,
+ int flags);
struct bpf_object *bpf_object__open_buffer(void *obj_buf,
size_t obj_buf_sz,
const char *name);
^ permalink raw reply related
* [bpf-next PATCH v2 1/2] bpf: bpftool, add support for attaching programs to maps
From: John Fastabend @ 2018-10-15 15:17 UTC (permalink / raw)
To: jakub.kicinski, ast, daniel; +Cc: netdev
In-Reply-To: <20181015151336.9258.37606.stgit@john-Precision-Tower-5810>
Sock map/hash introduce support for attaching programs to maps. To
date I have been doing this with custom tooling but this is less than
ideal as we shift to using bpftool as the single CLI for our BPF uses.
This patch adds new sub commands 'attach' and 'detach' to the 'prog'
command to attach programs to maps and then detach them.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
tools/bpf/bpftool/Documentation/bpftool-prog.rst | 11 ++
tools/bpf/bpftool/Documentation/bpftool.rst | 2
tools/bpf/bpftool/bash-completion/bpftool | 19 ++++
tools/bpf/bpftool/prog.c | 99 ++++++++++++++++++++++
4 files changed, 128 insertions(+), 3 deletions(-)
diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index 64156a1..12c8030 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -25,6 +25,8 @@ MAP COMMANDS
| **bpftool** **prog dump jited** *PROG* [{**file** *FILE* | **opcodes**}]
| **bpftool** **prog pin** *PROG* *FILE*
| **bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
+| **bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP*
+| **bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP*
| **bpftool** **prog help**
|
| *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
@@ -37,6 +39,7 @@ MAP COMMANDS
| **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
| **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6**
| }
+| *ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** }
DESCRIPTION
@@ -90,6 +93,14 @@ DESCRIPTION
Note: *FILE* must be located in *bpffs* mount.
+ **bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP*
+ Attach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
+ to the map *MAP*.
+
+ **bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP*
+ Detach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
+ from the map *MAP*.
+
**bpftool prog help**
Print short help message.
diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst
index 8dda77d..25c0872 100644
--- a/tools/bpf/bpftool/Documentation/bpftool.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool.rst
@@ -26,7 +26,7 @@ SYNOPSIS
| **pin** | **event_pipe** | **help** }
*PROG-COMMANDS* := { **show** | **list** | **dump jited** | **dump xlated** | **pin**
- | **load** | **help** }
+ | **load** | **attach** | **detach** | **help** }
*CGROUP-COMMANDS* := { **show** | **list** | **attach** | **detach** | **help** }
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index df1060b..0826519 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -292,6 +292,23 @@ _bpftool()
fi
return 0
;;
+ attach|detach)
+ if [[ ${#words[@]} == 7 ]]; then
+ COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+ return 0
+ fi
+
+ if [[ ${#words[@]} == 6 ]]; then
+ COMPREPLY=( $( compgen -W "msg_verdict skb_verdict skb_parse" -- "$cur" ) )
+ return 0
+ fi
+
+ if [[ $prev == "$command" ]]; then
+ COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
+ return 0
+ fi
+ return 0
+ ;;
load)
local obj
@@ -347,7 +364,7 @@ _bpftool()
;;
*)
[[ $prev == $object ]] && \
- COMPREPLY=( $( compgen -W 'dump help pin load \
+ COMPREPLY=( $( compgen -W 'dump help pin attach detach load \
show list' -- "$cur" ) )
;;
esac
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index b1cd3bc..99ab42c 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -77,6 +77,26 @@
[BPF_PROG_TYPE_FLOW_DISSECTOR] = "flow_dissector",
};
+static const char * const attach_type_strings[] = {
+ [BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
+ [BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
+ [BPF_SK_MSG_VERDICT] = "msg_verdict",
+ [__MAX_BPF_ATTACH_TYPE] = NULL,
+};
+
+enum bpf_attach_type parse_attach_type(const char *str)
+{
+ enum bpf_attach_type type;
+
+ for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) {
+ if (attach_type_strings[type] &&
+ is_prefix(str, attach_type_strings[type]))
+ return type;
+ }
+
+ return __MAX_BPF_ATTACH_TYPE;
+}
+
static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
{
struct timespec real_time_ts, boot_time_ts;
@@ -697,6 +717,77 @@ int map_replace_compar(const void *p1, const void *p2)
return a->idx - b->idx;
}
+static int do_attach(int argc, char **argv)
+{
+ enum bpf_attach_type attach_type;
+ int err, mapfd, progfd;
+
+ if (!REQ_ARGS(5)) {
+ p_err("too few parameters for map attach");
+ return -EINVAL;
+ }
+
+ progfd = prog_parse_fd(&argc, &argv);
+ if (progfd < 0)
+ return progfd;
+
+ attach_type = parse_attach_type(*argv);
+ if (attach_type == __MAX_BPF_ATTACH_TYPE) {
+ p_err("invalid attach type");
+ return -EINVAL;
+ }
+ NEXT_ARG();
+
+ mapfd = map_parse_fd(&argc, &argv);
+ if (mapfd < 0)
+ return mapfd;
+
+ err = bpf_prog_attach(progfd, mapfd, attach_type, 0);
+ if (err) {
+ p_err("failed prog attach to map");
+ return -EINVAL;
+ }
+
+ if (json_output)
+ jsonw_null(json_wtr);
+ return 0;
+}
+
+static int do_detach(int argc, char **argv)
+{
+ enum bpf_attach_type attach_type;
+ int err, mapfd, progfd;
+
+ if (!REQ_ARGS(5)) {
+ p_err("too few parameters for map detach");
+ return -EINVAL;
+ }
+
+ progfd = prog_parse_fd(&argc, &argv);
+ if (progfd < 0)
+ return progfd;
+
+ attach_type = parse_attach_type(*argv);
+ if (attach_type == __MAX_BPF_ATTACH_TYPE) {
+ p_err("invalid attach type");
+ return -EINVAL;
+ }
+ NEXT_ARG();
+
+ mapfd = map_parse_fd(&argc, &argv);
+ if (mapfd < 0)
+ return mapfd;
+
+ err = bpf_prog_detach2(progfd, mapfd, attach_type);
+ if (err) {
+ p_err("failed prog detach from map");
+ return -EINVAL;
+ }
+
+ if (json_output)
+ jsonw_null(json_wtr);
+ return 0;
+}
static int do_load(int argc, char **argv)
{
enum bpf_attach_type expected_attach_type;
@@ -942,6 +1033,8 @@ static int do_help(int argc, char **argv)
" %s %s pin PROG FILE\n"
" %s %s load OBJ FILE [type TYPE] [dev NAME] \\\n"
" [map { idx IDX | name NAME } MAP]\n"
+ " %s %s attach PROG ATTACH_TYPE MAP\n"
+ " %s %s detach PROG ATTACH_TYPE MAP\n"
" %s %s help\n"
"\n"
" " HELP_SPEC_MAP "\n"
@@ -953,10 +1046,12 @@ static int do_help(int argc, char **argv)
" cgroup/bind4 | cgroup/bind6 | cgroup/post_bind4 |\n"
" cgroup/post_bind6 | cgroup/connect4 | cgroup/connect6 |\n"
" cgroup/sendmsg4 | cgroup/sendmsg6 }\n"
+ " ATTACH_TYPE := { msg_verdict | skb_verdict | skb_parse }\n"
" " HELP_SPEC_OPTIONS "\n"
"",
bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
- bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
+ bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+ bin_name, argv[-2], bin_name, argv[-2]);
return 0;
}
@@ -968,6 +1063,8 @@ static int do_help(int argc, char **argv)
{ "dump", do_dump },
{ "pin", do_pin },
{ "load", do_load },
+ { "attach", do_attach },
+ { "detach", do_detach },
{ 0 }
};
^ permalink raw reply related
* [bpf-next PATCH v2 0/2] bpftool support for sockmap use cases
From: John Fastabend @ 2018-10-15 15:17 UTC (permalink / raw)
To: jakub.kicinski, ast, daniel; +Cc: netdev
The first patch adds support for attaching programs to maps. This is
needed to support sock{map|hash} use from bpftool. Currently, I carry
around custom code to do this so doing it using standard bpftool will
be great.
The second patch adds a compat mode to ignore non-zero entries in
the map def. This allows using bpftool with maps that have a extra
fields that the user knows can be ignored. This is needed to work
correctly with maps being loaded by other tools or directly via
syscalls.
---
John Fastabend (2):
bpf: bpftool, add support for attaching programs to maps
bpf: bpftool, add flag to allow non-compat map definitions
tools/bpf/bpftool/Documentation/bpftool-prog.rst | 11 ++
tools/bpf/bpftool/Documentation/bpftool.rst | 2
tools/bpf/bpftool/bash-completion/bpftool | 19 ++++
tools/bpf/bpftool/main.c | 7 +-
tools/bpf/bpftool/main.h | 3 -
tools/bpf/bpftool/prog.c | 101 ++++++++++++++++++++++
tools/lib/bpf/bpf.h | 3 +
tools/lib/bpf/libbpf.c | 27 ++++--
tools/lib/bpf/libbpf.h | 2
9 files changed, 160 insertions(+), 15 deletions(-)
^ permalink raw reply
* Fw: [Bug 201423] New: eth0: hw csum failure
From: Stephen Hemminger @ 2018-10-15 15:15 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
Begin forwarded message:
Date: Sun, 14 Oct 2018 10:42:48 +0000
From: bugzilla-daemon@bugzilla.kernel.org
To: stephen@networkplumber.org
Subject: [Bug 201423] New: eth0: hw csum failure
https://bugzilla.kernel.org/show_bug.cgi?id=201423
Bug ID: 201423
Summary: eth0: hw csum failure
Product: Networking
Version: 2.5
Kernel Version: 4.19.0-rc7
Hardware: Intel
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Other
Assignee: stephen@networkplumber.org
Reporter: rossi.f@inwind.it
Regression: No
I have a P6T DELUXE V2 motherboard and using the sky2 driver for the ethernet
ports. I get the following error message:
[ 433.727397] eth0: hw csum failure
[ 433.727406] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
[ 433.727406] Hardware name: System manufacturer System Product Name/P6T
DELUXE V2, BIOS 1202 12/22/2010
[ 433.727407] Call Trace:
[ 433.727409] <IRQ>
[ 433.727415] dump_stack+0x46/0x5b
[ 433.727419] __skb_checksum_complete+0xb0/0xc0
[ 433.727423] tcp_v4_rcv+0x528/0xb60
[ 433.727426] ? ipt_do_table+0x2d0/0x400
[ 433.727429] ip_local_deliver_finish+0x5a/0x110
[ 433.727430] ip_local_deliver+0xe1/0xf0
[ 433.727431] ? ip_sublist_rcv_finish+0x60/0x60
[ 433.727432] ip_rcv+0xca/0xe0
[ 433.727434] ? ip_rcv_finish_core.isra.0+0x300/0x300
[ 433.727436] __netif_receive_skb_one_core+0x4b/0x70
[ 433.727438] netif_receive_skb_internal+0x4e/0x130
[ 433.727439] napi_gro_receive+0x6a/0x80
[ 433.727442] sky2_poll+0x707/0xd20
[ 433.727446] ? rcu_check_callbacks+0x1b4/0x900
[ 433.727447] net_rx_action+0x237/0x380
[ 433.727449] __do_softirq+0xdc/0x1e0
[ 433.727452] irq_exit+0xa9/0xb0
[ 433.727453] do_IRQ+0x45/0xc0
[ 433.727455] common_interrupt+0xf/0xf
[ 433.727456] </IRQ>
[ 433.727459] RIP: 0010:cpuidle_enter_state+0x124/0x200
[ 433.727461] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f
ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1
4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
[ 433.727462] RSP: 0000:ffffc900000a3e98 EFLAGS: 00000282 ORIG_RAX:
ffffffffffffffde
[ 433.727463] RAX: ffff880237b1f280 RBX: 0000000000000004 RCX:
000000000000001f
[ 433.727464] RDX: 20c49ba5e353f7cf RSI: 000000002fe419c1 RDI:
0000000000000000
[ 433.727465] RBP: ffff880237b263a0 R08: 0000000000000714 R09:
000000650512105d
[ 433.727465] R10: 00000000ffffffff R11: 0000000000000342 R12:
00000064fc2a8b1c
[ 433.727466] R13: 00000064fc25b35f R14: 0000000000000004 R15:
ffffffff8204af20
[ 433.727468] ? cpuidle_enter_state+0x119/0x200
[ 433.727471] do_idle+0x1bf/0x200
[ 433.727473] cpu_startup_entry+0x6a/0x70
[ 433.727475] start_secondary+0x17f/0x1c0
[ 433.727476] secondary_startup_64+0xa4/0xb0
[ 441.662954] eth0: hw csum failure
[ 441.662959] CPU: 4 PID: 4347 Comm: radeon_cs:0 Not tainted 4.19.0-rc7 #19
[ 441.662960] Hardware name: System manufacturer System Product Name/P6T
DELUXE V2, BIOS 1202 12/22/2010
[ 441.662960] Call Trace:
[ 441.662963] <IRQ>
[ 441.662968] dump_stack+0x46/0x5b
[ 441.662972] __skb_checksum_complete+0xb0/0xc0
[ 441.662975] tcp_v4_rcv+0x528/0xb60
[ 441.662979] ? ipt_do_table+0x2d0/0x400
[ 441.662981] ip_local_deliver_finish+0x5a/0x110
[ 441.662983] ip_local_deliver+0xe1/0xf0
[ 441.662985] ? ip_sublist_rcv_finish+0x60/0x60
[ 441.662986] ip_rcv+0xca/0xe0
[ 441.662988] ? ip_rcv_finish_core.isra.0+0x300/0x300
[ 441.662990] __netif_receive_skb_one_core+0x4b/0x70
[ 441.662993] netif_receive_skb_internal+0x4e/0x130
[ 441.662994] napi_gro_receive+0x6a/0x80
[ 441.662998] sky2_poll+0x707/0xd20
[ 441.663000] net_rx_action+0x237/0x380
[ 441.663002] __do_softirq+0xdc/0x1e0
[ 441.663005] irq_exit+0xa9/0xb0
[ 441.663007] do_IRQ+0x45/0xc0
[ 441.663009] common_interrupt+0xf/0xf
[ 441.663010] </IRQ>
[ 441.663012] RIP: 0010:merge+0x22/0xb0
[ 441.663014] Code: c3 31 c0 c3 90 90 90 90 41 56 41 55 41 54 55 48 89 d5 53
48 89 cb 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 <48> 85 c9
74 70 48 85 d2 74 6b 49 89 fd 49 89 f6 49 89 e4 eb 14 48
[ 441.663015] RSP: 0018:ffffc9000090b988 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffffde
[ 441.663017] RAX: 0000000000000000 RBX: ffff88021ab2d408 RCX:
ffff88021ab2d408
[ 441.663018] RDX: ffff88021ab2d388 RSI: ffffffffa021c440 RDI:
0000000000000000
[ 441.663019] RBP: ffff88021ab2d388 R08: 0000000000005ecf R09:
0000000000008500
[ 441.663020] R10: ffffea000877ec00 R11: ffff880236803500 R12:
ffffffffa021c440
[ 441.663021] R13: ffff88021ab2d448 R14: 0000000000000004 R15:
ffffc9000090b9e0
[ 441.663048] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
[ 441.663063] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
[ 441.663065] ? merge+0x57/0xb0
[ 441.663080] ? radeon_irq_kms_set_irq_n_enabled+0x120/0x120 [radeon]
[ 441.663082] list_sort+0x8b/0x230
[ 441.663094] radeon_cs_parser_fini+0xdf/0x110 [radeon]
[ 441.663110] radeon_cs_ioctl+0x2a4/0x710 [radeon]
[ 441.663113] ? __switch_to_asm+0x34/0x70
[ 441.663114] ? __switch_to_asm+0x40/0x70
[ 441.663130] ? radeon_cs_parser_init+0x20/0x20 [radeon]
[ 441.663141] drm_ioctl_kernel+0xa3/0xe0 [drm]
[ 441.663149] drm_ioctl+0x2e2/0x380 [drm]
[ 441.663164] ? radeon_cs_parser_init+0x20/0x20 [radeon]
[ 441.663168] ? page_add_new_anon_rmap+0x42/0x70
[ 441.663171] do_vfs_ioctl+0x9a/0x600
[ 441.663173] ksys_ioctl+0x35/0x60
[ 441.663175] __x64_sys_ioctl+0x11/0x20
[ 441.663177] do_syscall_64+0x3d/0xf0
[ 441.663179] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 441.663180] RIP: 0033:0x7f9377377f37
[ 441.663182] Code: 00 00 00 75 0c 48 c7 c0 ff ff ff ff 48 83 c4 18 c3 e8 ad
db 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 10 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 21 4f 2c 00 f7 d8 64 89 01 48
[ 441.663183] RSP: 002b:00007f92c3130d28 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 441.663185] RAX: ffffffffffffffda RBX: 0000564498327ec0 RCX:
00007f9377377f37
[ 441.663186] RDX: 0000564498337ec8 RSI: 00000000c0206466 RDI:
0000000000000010
[ 441.663186] RBP: 0000564498337ec8 R08: 0000000000000000 R09:
0000000000000000
[ 441.663187] R10: 0000000000000000 R11: 0000000000000246 R12:
00000000c0206466
[ 441.663188] R13: 0000000000000010 R14: 0000000000000000 R15:
0000564497a38120
[ 462.833418] eth0: hw csum failure
[ 462.833428] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.19.0-rc7 #19
[ 462.833429] Hardware name: System manufacturer System Product Name/P6T
DELUXE V2, BIOS 1202 12/22/2010
[ 462.833429] Call Trace:
[ 462.833432] <IRQ>
[ 462.833438] dump_stack+0x46/0x5b
[ 462.833442] __skb_checksum_complete+0xb0/0xc0
[ 462.833446] tcp_v4_rcv+0x528/0xb60
[ 462.833449] ? ipt_do_table+0x2d0/0x400
[ 462.833452] ip_local_deliver_finish+0x5a/0x110
[ 462.833454] ip_local_deliver+0xe1/0xf0
[ 462.833455] ? ip_sublist_rcv_finish+0x60/0x60
[ 462.833457] ip_rcv+0xca/0xe0
[ 462.833459] ? ip_rcv_finish_core.isra.0+0x300/0x300
[ 462.833461] __netif_receive_skb_one_core+0x4b/0x70
[ 462.833464] netif_receive_skb_internal+0x4e/0x130
[ 462.833466] napi_gro_receive+0x6a/0x80
[ 462.833469] sky2_poll+0x707/0xd20
[ 462.833471] net_rx_action+0x237/0x380
[ 462.833474] __do_softirq+0xdc/0x1e0
[ 462.833477] irq_exit+0xa9/0xb0
[ 462.833479] do_IRQ+0x45/0xc0
[ 462.833481] common_interrupt+0xf/0xf
[ 462.833482] </IRQ>
[ 462.833486] RIP: 0010:cpuidle_enter_state+0x124/0x200
[ 462.833488] Code: 53 60 89 c3 e8 dd 90 ad ff 65 8b 3d 96 58 a7 7e e8 d1 8f
ad ff 31 ff 49 89 c4 e8 27 99 ad ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1
4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
[ 462.833489] RSP: 0018:ffffc900000a3e98 EFLAGS: 00000282 ORIG_RAX:
ffffffffffffffde
[ 462.833491] RAX: ffff880237b1f280 RBX: 0000000000000004 RCX:
000000000000001f
[ 462.833492] RDX: 20c49ba5e353f7cf RSI: 000000002fe419c1 RDI:
0000000000000000
[ 462.833493] RBP: ffff880237b263a0 R08: 0000000000000000 R09:
0000000000000000
[ 462.833494] R10: 00000000ffffffff R11: 0000000000000273 R12:
0000006bc3052131
[ 462.833495] R13: 0000006bc2f99f57 R14: 0000000000000004 R15:
ffffffff8204af20
[ 462.833498] ? cpuidle_enter_state+0x119/0x200
[ 462.833503] do_idle+0x1bf/0x200
[ 462.833506] cpu_startup_entry+0x6a/0x70
[ 462.833510] start_secondary+0x17f/0x1c0
[ 462.833513] secondary_startup_64+0xa4/0xb0
Something is changed between 4.17.12 and 4.18, after bisecting the problem I
got the following first bad commit:
commit 88078d98d1bb085d72af8437707279e203524fa5
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Apr 18 11:43:15 2018 -0700
net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.
While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.
We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply
* [PATCH] Bluetooth: Fix locking in bt_accept_enqueue() for BH context
From: Matthias Kaehlcke @ 2018-10-15 22:39 UTC (permalink / raw)
To: Marcel Holtmann, Johan Hedberg, David S . Miller, Dean Jenkins
Cc: linux-bluetooth, netdev, linux-kernel, Konstantin Khlebnikov,
Balakrishna Godavarthi, Douglas Anderson, Dmitry Grinberg,
Matthias Kaehlcke
With commit e16337622016 ("Bluetooth: Handle bt_accept_enqueue() socket
atomically") lock_sock[_nested]() is used to acquire the socket lock
before manipulating the socket. lock_sock[_nested]() may block, which
is problematic since bt_accept_enqueue() can be called in bottom half
context (e.g. from rfcomm_connect_ind()).
The socket API provides bh_lock_sock[_nested]() to acquire the socket
lock in bottom half context. Check the context in bt_accept_enqueue()
and use the appropriate locking mechanism for the context.
Fixes: e16337622016 ("Bluetooth: Handle bt_accept_enqueue() socket atomically")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
---
Not sure if this is the correct solution, it's certainly not elegant and
checkpatch.pl complains that in_atomic() shouldn't be used outside of
core kernel code. I'm open to other suggestions :)
net/bluetooth/af_bluetooth.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/net/bluetooth/af_bluetooth.c b/net/bluetooth/af_bluetooth.c
index deacc52d7ff1..0f0540dbb44a 100644
--- a/net/bluetooth/af_bluetooth.c
+++ b/net/bluetooth/af_bluetooth.c
@@ -159,10 +159,20 @@ void bt_accept_enqueue(struct sock *parent, struct sock *sk)
BT_DBG("parent %p, sk %p", parent, sk);
sock_hold(sk);
- lock_sock_nested(sk, SINGLE_DEPTH_NESTING);
+
+ if (in_atomic())
+ bh_lock_sock_nested(sk);
+ else
+ lock_sock_nested(sk, SINGLE_DEPTH_NESTING);
+
list_add_tail(&bt_sk(sk)->accept_q, &bt_sk(parent)->accept_q);
bt_sk(sk)->parent = parent;
- release_sock(sk);
+
+ if (in_atomic())
+ bh_unlock_sock(sk);
+ else
+ release_sock(sk);
+
parent->sk_ack_backlog++;
}
EXPORT_SYMBOL(bt_accept_enqueue);
--
2.19.1.331.ge82ca0e54c-goog
^ permalink raw reply related
* Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast
From: Eric Dumazet @ 2018-10-15 14:50 UTC (permalink / raw)
To: Gasper Zejn; +Cc: Eric Dumazet, Kevin Yang, netdev
In-Reply-To: <252ea882-bcd2-b205-7d68-541e88b5d617@gmail.com>
On Mon, Oct 15, 2018 at 3:26 AM Gasper Zejn <zelo.zejn@gmail.com> wrote:
>
>
> I've tried to isolate the issue as best I could. There seems to be an
> issue if the TCP socket has keepalive set and send queue is not empty
> and the route goes away.
>
> https://github.com/zejn/bbr_pfifo_interrupts_issue
>
> Hope this helps,
> Gasper
This is awesome Gasper, I will take a look thanks.
Note that we are about to send a patch series (targeting net-next) to
polish the EDT patch series that was merged last month for linux-4.20.
TCP internal pacing is going to be much better performance-wise.
^ permalink raw reply
* Re: net/wan: hostess_sv11 + z85230 problems
From: Randy Dunlap @ 2018-10-15 22:24 UTC (permalink / raw)
To: Krzysztof Hałasa; +Cc: netdev@vger.kernel.org, LKML, Alan Cox
In-Reply-To: <m34ldnd3ju.fsf@t19.piap.pl>
On 10/15/18 1:20 AM, Krzysztof Hałasa wrote:
> Hi,
>
> Randy Dunlap <rdunlap@infradead.org> writes:
>
>> kernel 4.19-rc7, on i386, with NO wan/hdlc/hostess/z85230 hardware:
>>
>> modprobe hostess_sv11 + autoload of z85230 give:
>
> BTW Hostess SV11 is apparently an ISA card, with all those problems.
Yeah.
>> [ 3162.511877] Call Trace:
...
>
> Not sure about z8530 internals (driver and hw), but I guess the sv11
> driver should initialize the hw first, and only then request_irq().
> Perhaps there should be no "default address" either? The user would
> have to provide the hardware parameters explicitly.
>
> How about this (totally untested):
> Fix the Hostess SV11 driver trying to use the hardware before its
> existence is detected.
>
> Signed-off-by: Krzysztof Halasa <khalasa@piap.pl>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
or you can just rm it, like Alan suggested.
thanks.
--
~Randy
^ permalink raw reply
* [PATCH net] rxrpc: Fix a missing rxrpc_put_peer() in the error_report handler
From: David Howells @ 2018-10-15 21:37 UTC (permalink / raw)
To: netdev; +Cc: dhowells, linux-afs, linux-kernel
Fix a missing call to rxrpc_put_peer() on the main path through the
rxrpc_error_report() function. This manifests itself as a ref leak
whenever an ICMP packet or other error comes in.
In commit f334430316e7, the hand-off of the ref to a work item was removed
and was not replaced with a put.
Fixes: f334430316e7 ("rxrpc: Fix error distribution")
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/rxrpc/peer_event.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/rxrpc/peer_event.c b/net/rxrpc/peer_event.c
index 05b51bdbdd41..bd2fa3b7caa7 100644
--- a/net/rxrpc/peer_event.c
+++ b/net/rxrpc/peer_event.c
@@ -195,6 +195,7 @@ void rxrpc_error_report(struct sock *sk)
rxrpc_store_error(peer, serr);
rcu_read_unlock();
rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+ rxrpc_put_peer(peer);
_leave("");
}
^ permalink raw reply related
* Re: [PATCH net 1/2] geneve, vxlan: Don't check skb_dst() twice
From: Nicolas Dichtel @ 2018-10-15 12:24 UTC (permalink / raw)
To: Stefano Brivio; +Cc: David S. Miller, Xin Long, Sabrina Dubroca, netdev
In-Reply-To: <20181015130830.1c177301@redhat.com>
Le 15/10/2018 à 13:08, Stefano Brivio a écrit :
> On Mon, 15 Oct 2018 12:19:41 +0200
> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>
>> Le 12/10/2018 à 23:53, Stefano Brivio a écrit :
>>> Commit f15ca723c1eb ("net: don't call update_pmtu unconditionally") avoids
>>> that we try updating PMTU for a non-existent destination, but didn't clean
>>> up cases where the check was already explicit. Drop those redundant checks.
>> Yes, I leave them to avoid calculating the new mtu value when not needed. We are
>> in the xmit path.
>
> Before 2/2 of this series, though, we call skb_dst_update_pmtu() (and
> in turn dst->ops->update_pmtu()) for *every* packet with a dst, which
Not if dst is of type md_dst_ops.
> I'd dare saying is by far the most common case. Besides, 2/2 needs
> anyway to calculate the MTU to fix a bug.
>
> So I think this is a vast improvement overall.
Fair point.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox