* [PATCH iproute2] man ip-link: Fix indentation for 'ip link show' options
From: Vadim Kochan @ 2014-12-02 20:39 UTC
To: netdev; +Cc: Vadim Kochan
BEFORE:
The show command has additional formatting options:
-s, -stats, -statistics
output more statistics about packet usage.
-d, -details
output more detailed information.
-h, -human, -human-readble
output statistics with human readable values number followed by suffix
-iec print human readable rates in IEC units (ie. 1K = 1024).
AFTER:
The show command has additional formatting options:
-s, -stats, -statistics
output more statistics about packet usage.
-d, -details
output more detailed information.
-h, -human, -human-readble
output statistics with human readable values number followed by suffix
-iec print human readable rates in IEC units (ie. 1K = 1024).
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
---
man/man8/ip-link.8.in | 3 +++
1 file changed, 3 insertions(+)
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 9d4e3da..cdd0a42 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -688,8 +688,10 @@ only display running interfaces.
.I DEVICE
specifies the master device which enslaves devices to show.
+.TP
The show command has additional formatting options:
+.RS
.TP
.BR "\-s" , " \-stats", " \-statistics"
output more statistics about packet usage.
@@ -705,6 +707,7 @@ output statistics with human readable values number followed by suffix
.TP
.BR "\-iec"
print human readable rates in IEC units (ie. 1K = 1024).
+.RE
.SS ip link help - display help
--
2.1.3
* linux-next Problems with VPN tunnel - no packets sent
From: Valdis Kletnieks @ 2014-12-02 20:41 UTC
To: Herbert Xu, davem, Jason Wang; +Cc: netdev, linux-kernel
Recent linux-next has broken my Juniper VPN client. The tunnel gets created,
routes get added, but trying to actually send packets across results in packets
just disappearing. 'ifconfig' consistently reports exactly 1 packet sent (even
after a 'ping' command or similar should have sent multiple packets).
tun0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST> mtu 1400
inet 172.27.1.40 netmask 255.255.255.255 destination 172.27.1.40
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 500 (UNSPEC)
RX packets 1 bytes 355 (355.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1 bytes 61 (61.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Still broken in next-20141201, and bisection fingers this commit:
commit e0b46d0ee9c240c7430a47e9b0365674d4a04522
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri Nov 7 21:22:23 2014 +0800
tun: Use iovec iterators
This patch removes the use of skb_copy_datagram_const_iovec in
favour of the iovec iterator-based skb_copy_datagram_iter.
This commit is in the kernel, and does *not* fix the problem:
commit 8c847d254146d32c86574a1b16923ff91bb784dd
Author: Jason Wang <jasowang@redhat.com>
Date: Thu Nov 13 16:54:14 2014 +0800
tun: fix issues of iovec iterators using in tun_put_user()
So there are apparently additional issues that Jason didn't address. I tried to
revert Herbert's patch for testing, but there are at least 5 or 6 other patches
that need reverting first, so I abandoned that unless it becomes necessary...
What's the best way to proceed?
* Re: [PATCH 1/1] net: dsa: replacing the hard-coded sized array "dsa_switch" by dynamic one
From: Florian Fainelli @ 2014-12-02 20:40 UTC
To: Andrey Volkov, netdev
In-Reply-To: <547DD1C6.2090304@nexvision.fr>
On 02/12/14 06:50, Andrey Volkov wrote:
> Hello,
>
> While developing one of our devices (with a large number of onboard switches, more than 6),
> I ran into this ancient, I hope, restriction in the 'struct dsa_switch_tree' definition.
> This simple patch removes the restriction and makes dsa_switch_tree more scalable for
> the "usual" 1-2 switch configuration too.
Sounds reasonable to me, you probably want to resubmit and trim the
"Hello" from your commit message.
>
> P.S. I plan to fix the hardcoded number of ports too, but that is not as easy as with the number of switches.
> If anyone has objections or suggestions, I'll be happy to discuss them.
I think the number of ports in a switch is something that should come
from the switch driver, and eventually intersected with what the
platform configuration has provided.
The difficulty is with sparse port number allocation, because you
still want to allocate e.g. 6 ports even though only Port 0 and 5 are used. I
don't think we want to introduce a logical to physical mapping; that
would be too error prone.
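
To make the sparse case concrete, here is a minimal userspace sketch, not kernel code. It assumes the driver and the platform configuration each describe their ports as a bitmask; `struct port_state`, `usable_ports()`, `ports_to_allocate()` and `alloc_ports()` are hypothetical names made up for this illustration. The array is sized by the highest usable port number, so the array index stays equal to the physical port number and no logical to physical mapping is needed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-port state; this is not a kernel structure. */
struct port_state {
	int enabled;
};

/* Ports that the driver supports AND the platform configuration enables. */
uint32_t usable_ports(uint32_t driver_mask, uint32_t platform_mask)
{
	return driver_mask & platform_mask;
}

/* Slots needed so that slot index == physical port number
 * (highest set bit + 1, or 0 for an empty mask). */
unsigned int ports_to_allocate(uint32_t mask)
{
	unsigned int n = 0;

	while (mask) {
		n++;
		mask >>= 1;
	}
	return n;
}

/* Allocate one slot per port number up to the highest usable port.
 * Unused port numbers in between get a slot with enabled == 0. */
struct port_state *alloc_ports(uint32_t mask, unsigned int *nr)
{
	unsigned int i, n;
	struct port_state *ports;

	assert(nr != NULL);
	*nr = 0;

	n = ports_to_allocate(mask);
	if (n == 0)
		return NULL;

	ports = calloc(n, sizeof(*ports));
	if (ports == NULL)
		return NULL;

	for (i = 0; i < n; i++)
		ports[i].enabled = (int)((mask >> i) & 1u);

	*nr = n;
	return ports;
}
```

With only ports 0 and 5 in use (mask 0x21), this allocates six slots and leaves slots 1 to 4 disabled, trading a few unused entries for direct indexing by port number.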
>
> Signed-off-by: Andrey Volkov <andrey.volkov@nexvision.fr>
> ---
> include/net/dsa.h | 3 +--
> net/dsa/dsa.c | 7 +++----
> 2 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index ed3c34b..733db2e 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -28,7 +28,6 @@ enum dsa_tag_protocol {
> DSA_TAG_PROTO_BRCM,
> };
>
> -#define DSA_MAX_SWITCHES 4
> #define DSA_MAX_PORTS 12
>
> struct dsa_chip_data {
> @@ -117,7 +116,7 @@ struct dsa_switch_tree {
> /*
> * Data for the individual switch chips.
> */
> - struct dsa_switch *ds[DSA_MAX_SWITCHES];
> + struct dsa_switch *ds[];
> };
>
> struct dsa_switch {
> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> index 322c778..c081a19 100644
> --- a/net/dsa/dsa.c
> +++ b/net/dsa/dsa.c
> @@ -604,8 +604,6 @@ static int dsa_of_probe(struct platform_device *pdev)
> pdev->dev.platform_data = pd;
> pd->netdev = &ethernet_dev->dev;
> pd->nr_chips = of_get_child_count(np);
> - if (pd->nr_chips > DSA_MAX_SWITCHES)
> - pd->nr_chips = DSA_MAX_SWITCHES;
>
> pd->chip = kcalloc(pd->nr_chips, sizeof(struct dsa_chip_data),
> GFP_KERNEL);
> @@ -717,7 +715,7 @@ static int dsa_probe(struct platform_device *pdev)
> pd = pdev->dev.platform_data;
> }
>
> - if (pd == NULL || pd->netdev == NULL)
> + if (pd == NULL || pd->netdev == NULL || pd->nr_chips == 0)
> return -EINVAL;
>
> dev = dev_to_net_device(pd->netdev);
> @@ -732,7 +730,8 @@ static int dsa_probe(struct platform_device *pdev)
> goto out;
> }
>
> - dst = kzalloc(sizeof(*dst), GFP_KERNEL);
> + dst = kzalloc(sizeof(*dst) +
> + sizeof(struct dsa_switch *) * pd->nr_chips, GFP_KERNEL);
> if (dst == NULL) {
> dev_put(dev);
> ret = -ENOMEM;
>
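
For readers unfamiliar with the technique in the quoted patch, the following is a small userspace sketch of allocating a structure that ends in a flexible array member. It uses `calloc()` in place of `kzalloc()`, and `struct sw`, `struct sw_tree`, `sw_tree_alloc()` and `sw_tree_set()` are stand-in names invented here, not the kernel's definitions. Storing `nr_chips` inside the tree and the explicit overflow check are assumptions of this sketch, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for struct dsa_switch. */
struct sw {
	int index;
};

/* Stand-in for struct dsa_switch_tree. */
struct sw_tree {
	unsigned int nr_chips;	/* assumption: the count is kept in the tree */
	struct sw *ds[];	/* flexible array member, must be last */
};

/* Allocate a zeroed tree with room for nr_chips switch pointers. */
struct sw_tree *sw_tree_alloc(unsigned int nr_chips)
{
	struct sw_tree *dst;
	size_t size;

	/* Mirrors the patch's rejection of nr_chips == 0. */
	if (nr_chips == 0)
		return NULL;

	/* Guard the size computation against overflow. */
	if (nr_chips > (SIZE_MAX - sizeof(*dst)) / sizeof(dst->ds[0]))
		return NULL;

	size = sizeof(*dst) + sizeof(dst->ds[0]) * nr_chips;
	dst = calloc(1, size);
	if (dst != NULL)
		dst->nr_chips = nr_chips;
	return dst;
}

/* Store a switch pointer, checking the index against the allocated count. */
void sw_tree_set(struct sw_tree *dst, unsigned int i, struct sw *s)
{
	assert(i < dst->nr_chips);
	dst->ds[i] = s;
}
```

Because the array lives in the same allocation as the header, a single `free()` releases everything, and the fixed compile-time limit on the number of switches is no longer needed.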
^ permalink raw reply
* Re: [PATCH net-next] tcp: Add TCP tracer
From: Martin Lau @ 2014-12-02 20:40 UTC
To: davem; +Cc: netdev
In-Reply-To: <1417552662-16398-1-git-send-email-kafai@fb.com>
Please ignore this patch; it is not completely ready and was sent out by
mistake.
On Tue, Dec 02, 2014 at 12:37:42PM -0800, Martin KaFai Lau wrote:
> Define probes and register them to the TCP tracepoints. The probes
> collect the data defined in struct tcp_sk_trace and record them to
> the tracing's ring_buffer.
> ---
> include/uapi/linux/tcp_trace.h | 9 +-
> kernel/trace/tcp_trace.c | 448 +++++++++++++++++++++++++++++++++++++++++
> kernel/trace/trace.h | 1 +
> 3 files changed, 451 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/tcp_trace.h b/include/uapi/linux/tcp_trace.h
> index 2644f7f..d913a3c 100644
> --- a/include/uapi/linux/tcp_trace.h
> +++ b/include/uapi/linux/tcp_trace.h
> @@ -22,11 +22,11 @@ struct tcp_stats {
> __u32 other_segs_retrans;
> __u32 other_octets_retrans;
> __u32 loss_segs_retrans;
> - __u32 loss_octects_retrans;
> + __u32 loss_octets_retrans;
> __u32 segs_in;
> __u32 data_segs_in;
> - __u64 rtt_sample_us;
> __u64 data_octets_in;
> + __u64 rtt_sample_us;
> __u64 max_rtt_us;
> __u64 min_rtt_us;
> __u64 sum_rtt_us;
> @@ -64,9 +64,4 @@ struct tcp_trace_stats {
> struct tcp_stats stats;
> } __packed;
>
> -typedef struct tcp_trace_basic tcp_trace_establish;
> -typedef struct tcp_trace_basic tcp_trace_retrans;
> -typedef struct tcp_trace_stats tcp_trace_periodic;
> -typedef struct tcp_trace_stats tcp_trace_close;
> -
> #endif /* UAPI_TCP_TRACE_H */
> diff --git a/kernel/trace/tcp_trace.c b/kernel/trace/tcp_trace.c
> index 9d09fd0..376580b 100644
> --- a/kernel/trace/tcp_trace.c
> +++ b/kernel/trace/tcp_trace.c
> @@ -1,9 +1,27 @@
> #include <net/tcp_trace.h>
> +#include <net/tcp.h>
> +#include <trace/events/tcp.h>
> #include <linux/tcp.h>
> +#include <linux/ipv6.h>
> +#include <linux/ftrace_event.h>
> +#include <linux/jiffies.h>
> #include <uapi/linux/tcp_trace.h>
>
> +#include "trace_output.h"
> +
> +#define REPORT_INTERVAL_MS 2000
> +
> +static struct trace_array *tcp_tr;
> static bool tcp_trace_enabled __read_mostly;
>
> +static struct trace_print_flags tcp_trace_event_names[] = {
> + { TCP_TRACE_EVENT_ESTABLISHED, "established" },
> + { TCP_TRACE_EVENT_PERIODIC, "periodic" },
> + { TCP_TRACE_EVENT_RETRANS, "retrans" },
> + { TCP_TRACE_EVENT_RETRANS_LOSS, "retrans_loss" },
> + { TCP_TRACE_EVENT_CLOSE, "close" }
> +};
> +
> struct tcp_sk_trace {
> struct tcp_stats stats;
> unsigned long start_ts;
> @@ -35,3 +53,433 @@ void tcp_sk_trace_destruct(struct sock *sk)
> {
> kfree(tcp_sk(sk)->trace);
> }
> +
> +static void tcp_trace_init(struct tcp_trace *tr,
> + enum tcp_trace_events trev,
> + struct sock *sk)
> +{
> + tr->event = trev;
> + if (sk->sk_family == AF_INET) {
> + tr->ipv6 = 0;
> + tr->local_addr[0] = inet_sk(sk)->inet_saddr;
> + tr->remote_addr[0] = inet_sk(sk)->inet_daddr;
> + } else {
> + BUG_ON(sk->sk_family != AF_INET6);
> + tr->ipv6 = 1;
> + memcpy(tr->local_addr, inet6_sk(sk)->saddr.s6_addr32,
> + sizeof(tr->local_addr));
> + memcpy(tr->remote_addr, sk->sk_v6_daddr.s6_addr32,
> + sizeof(tr->remote_addr));
> + }
> + tr->local_port = inet_sk(sk)->inet_sport;
> + tr->remote_port = inet_sk(sk)->inet_dport;
> +}
> +
> +static void tcp_trace_basic_init(struct tcp_trace_basic *trb,
> + enum tcp_trace_events trev,
> + struct sock *sk)
> +{
> + struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
> + tcp_trace_init((struct tcp_trace *)trb, trev, sk);
> + trb->snd_cwnd = tcp_sk(sk)->snd_cwnd * tcp_sk(sk)->mss_cache;
> + trb->mss = tcp_sk(sk)->mss_cache;
> + trb->ssthresh = tcp_current_ssthresh(sk);
> + trb->srtt_us = tcp_sk(sk)->srtt_us >> 3;
> + trb->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
> + trb->life_ms = jiffies_to_msecs(jiffies - sktr->start_ts);
> +}
> +
> +static void tcp_trace_basic_add(enum tcp_trace_events trev, struct sock *sk)
> +{
> + struct ring_buffer *buffer;
> + int pc;
> + struct ring_buffer_event *event;
> + struct tcp_trace_basic *trb;
> + struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
> +
> + if (!sktr)
> + return;
> +
> + tracing_record_cmdline(current);
> + buffer = tcp_tr->trace_buffer.buffer;
> + pc = preempt_count();
> + event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
> + sizeof(*trb), 0, pc);
> + if (!event)
> + return;
> + trb = ring_buffer_event_data(event);
> + tcp_trace_basic_init(trb, trev, sk);
> + trace_buffer_unlock_commit(buffer, event, 0, pc);
> +}
> +
> +static void tcp_trace_stats_init(struct tcp_trace_stats *trs,
> + enum tcp_trace_events trev,
> + struct sock *sk)
> +{
> + struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
> +
> + tcp_trace_basic_init((struct tcp_trace_basic *)trs, trev, sk);
> + memcpy(&trs->stats, &sktr->stats, sizeof(sktr->stats));
> +}
> +
> +static void tcp_trace_stats_add(enum tcp_trace_events trev, struct sock *sk)
> +{
> + struct ring_buffer *buffer;
> + int pc;
> + struct ring_buffer_event *event;
> + struct tcp_trace_stats *trs;
> + struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
> +
> + if (!sktr)
> + return;
> +
> + tracing_record_cmdline(current);
> + buffer = tcp_tr->trace_buffer.buffer;
> + pc = preempt_count();
> + event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
> + sizeof(*trs), 0, pc);
> + if (!event)
> + return;
> + trs = ring_buffer_event_data(event);
> +
> + tcp_trace_stats_init(trs, trev, sk);
> +
> + trace_buffer_unlock_commit(buffer, event, 0, pc);
> +}
> +
> +static void tcp_trace_established(void *ignore, struct sock *sk)
> +{
> + tcp_trace_basic_add(TCP_TRACE_EVENT_ESTABLISHED, sk);
> +}
> +
> +static void tcp_trace_transmit_skb(void *ignore, struct sock *sk,
> + struct sk_buff *skb)
> +{
> + int pcount;
> + struct tcp_sk_trace *sktr;
> + struct tcp_skb_cb *tcb;
> + unsigned int data_len;
> + bool retrans = false;
> +
> + sktr = tcp_sk(sk)->trace;
> + if (!sktr)
> + return;
> +
> + tcb = TCP_SKB_CB(skb);
> + pcount = tcp_skb_pcount(skb);
> + data_len = tcb->end_seq - tcb->seq;
> +
> + sktr->stats.segs_out += pcount;
> +
> + if (!data_len)
> + goto out;
> +
> + sktr->stats.data_segs_out += pcount;
> + sktr->stats.data_octets_out += data_len;
> +
> + if (before(tcb->seq, tcp_sk(sk)->snd_nxt)) {
> + enum tcp_trace_events trev;
> + retrans = true;
> + if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
> + sktr->stats.loss_segs_retrans += pcount;
> + sktr->stats.loss_octets_retrans += data_len;
> + trev = TCP_TRACE_EVENT_RETRANS_LOSS;
> + } else {
> + sktr->stats.other_segs_retrans += pcount;
> + sktr->stats.other_octets_retrans += data_len;
> + trev = TCP_TRACE_EVENT_RETRANS;
> + }
> + tcp_trace_stats_add(trev, sk);
> + return;
> + }
> +
> +out:
> + if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
> + REPORT_INTERVAL_MS) {
> + sktr->last_ts = jiffies;
> + tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
> + }
> +}
> +
> +static void tcp_trace_rcv_established(void *ignore, struct sock *sk,
> + struct sk_buff *skb)
> +{
> + struct tcp_sk_trace *sktr;
> + unsigned int data_len;
> + struct tcphdr *th;
> +
> + sktr = tcp_sk(sk)->trace;
> + if (!sktr)
> + return;
> +
> + th = tcp_hdr(skb);
> + WARN_ON_ONCE(skb->len < th->doff << 2);
> +
> + sktr->stats.segs_in++;
> + data_len = skb->len - (th->doff << 2);
> + if (data_len) {
> + if (TCP_SKB_CB(skb)->ack_seq == tcp_sk(sk)->snd_una)
> + sktr->stats.dup_acks_in++;
> + } else {
> + sktr->stats.data_segs_in++;
> + sktr->stats.data_segs_in += data_len;
> + }
> +
> + if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
> + REPORT_INTERVAL_MS) {
> + sktr->last_ts = jiffies;
> + tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
> + }
> +}
> +
> +static void tcp_trace_close(void *ignore, struct sock *sk)
> +{
> + struct tcp_sk_trace *sktr;
> + sktr = tcp_sk(sk)->trace;
> + if (!sktr)
> + return;
> +
> + tcp_trace_stats_add(TCP_TRACE_EVENT_CLOSE, sk);
> +}
> +
> +static void tcp_trace_ooo_rcv(void *ignore, struct sock *sk)
> +{
> + struct tcp_sk_trace *sktr;
> +
> + sktr = tcp_sk(sk)->trace;
> + if (!sktr)
> + return;
> +
> + sktr->stats.ooo_in++;
> +}
> +
> +static void tcp_trace_sacks_rcv(void *ignore, struct sock *sk, int num_sacks)
> +{
> + struct tcp_sk_trace *sktr;
> +
> + sktr = tcp_sk(sk)->trace;
> + if (!sktr)
> + return;
> +
> + sktr->stats.sacks_in++;
> + sktr->stats.sack_blks_in += num_sacks;
> +}
> +
> +void tcp_trace_rtt_sample(void *ignore, struct sock *sk,
> + long rtt_sample_us)
> +{
> + struct tcp_sk_trace *sktr;
> + u32 rto_ms;
> +
> + sktr = tcp_sk(sk)->trace;
> + if (!sktr)
> + return;
> +
> + rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
> +
> + sktr->stats.rtt_sample_us = rtt_sample_us;
> + sktr->stats.max_rtt_us = max_t(u64, sktr->stats.max_rtt_us, rtt_sample_us);
> + sktr->stats.min_rtt_us = min_t(u64, sktr->stats.min_rtt_us, rtt_sample_us);
> +
> + sktr->stats.count_rtt++;
> + sktr->stats.sum_rtt_us += rtt_sample_us;
> +
> + sktr->stats.max_rto_ms = max_t(u32, sktr->stats.max_rto_ms, rto_ms);
> + sktr->stats.min_rto_ms = min_t(u32, sktr->stats.min_rto_ms, rto_ms);
> +}
> +
> +static enum print_line_t
> +tcp_trace_print(struct trace_iterator *iter)
> +{
> + struct trace_seq *s = &iter->seq;
> + struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
> + struct tcp_trace_basic *trb;
> + struct tcp_stats *stats;
> + const char *last_seq_bptr, *cur_seq_bptr;
> + int ret = 0;
> +
> + union {
> + struct sockaddr_in v4;
> + struct sockaddr_in6 v6;
> + } local_sa, remote_sa;
> +
> + local_sa.v4.sin_port = tr->local_port;
> + remote_sa.v4.sin_port = tr->remote_port;
> + if (tr->ipv6) {
> + local_sa.v6.sin6_family = AF_INET6;
> + remote_sa.v6.sin6_family = AF_INET6;
> + memcpy(local_sa.v6.sin6_addr.s6_addr, tr->local_addr, 4);
> + memcpy(remote_sa.v6.sin6_addr.s6_addr, tr->remote_addr, 4);
> + } else {
> + local_sa.v4.sin_family = AF_INET;
> + remote_sa.v4.sin_family =AF_INET;
> + local_sa.v4.sin_addr.s_addr = tr->local_addr[0];
> + remote_sa.v4.sin_addr.s_addr = tr->remote_addr[0];
> + }
> +
> + last_seq_bptr = ftrace_print_symbols_seq(s, tr->event,
> + tcp_trace_event_names);
> + cur_seq_bptr = trace_seq_buffer_ptr(s);
> + if (last_seq_bptr == cur_seq_bptr)
> + goto out;
> +
> + trb = (struct tcp_trace_basic *)tr;
> + ret = trace_seq_printf(s,
> + " %pISpc %pISpc snd_cwnd=%u mss=%u ssthresh=%u"
> + " srtt_us=%llu rto_ms=%u life_ms=%u",
> + &local_sa, &remote_sa,
> + trb->snd_cwnd, trb->mss, trb->ssthresh,
> + trb->srtt_us, trb->rto_ms, trb->life_ms);
> +
> + if (tr->event == TCP_TRACE_EVENT_ESTABLISHED || ret == 0)
> + goto out;
> +
> + stats = &(((struct tcp_trace_stats *)tr)->stats);
> + ret = trace_seq_printf(s,
> + " segs_out=%u data_segs_out=%u data_octets_out=%llu"
> + " other_segs_retrans=%u other_octets_retrans=%u"
> + " loss_segs_retrans=%u loss_octets_retrans=%u"
> + " segs_in=%u data_segs_in=%u data_octets_in=%llu"
> + " max_rtt_us=%llu min_rtt_us=%llu"
> + " count_rtt=%u sum_rtt_us=%llu"
> + " rtt_sample_us=%llu"
> + " max_rto_ms=%u min_rto_ms=%u"
> + " dup_acks_in=%u sacks_in=%u"
> + " sack_blks_in=%u ooo_in=%u",
> + stats->segs_out, stats->data_segs_out, stats->data_octets_out,
> + stats->other_segs_retrans, stats->other_octets_retrans,
> + stats->loss_segs_retrans, stats->loss_octets_retrans,
> + stats->segs_in, stats->data_segs_in, stats->data_octets_in,
> + stats->max_rtt_us, stats->min_rtt_us,
> + stats->count_rtt, stats->sum_rtt_us,
> + stats->rtt_sample_us,
> + stats->max_rto_ms, stats->min_rto_ms,
> + stats->dup_acks_in, stats->sacks_in,
> + stats->sack_blks_in, stats->ooo_in);
> +
> +out:
> + if (ret)
> + ret = trace_seq_putc(s, '\n');
> +
> + return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +static enum print_line_t
> +tcp_trace_print_binary(struct trace_iterator *iter)
> +{
> + int ret;
> + struct trace_seq *s = &iter->seq;
> + struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
> + u32 magic = TCP_TRACE_MAGIC_VERSION;
> +
> + ret = trace_seq_putmem(s, &magic, sizeof(magic));
> + if (!ret)
> + goto out;
> +
> + if (tr->event == TCP_TRACE_EVENT_ESTABLISHED)
> + ret = trace_seq_putmem(s, tr + sizeof(magic),
> + sizeof(struct tcp_trace_basic));
> + else
> + ret = trace_seq_putmem(s, tr + sizeof(magic),
> + sizeof(struct tcp_trace_stats));
> +
> +out:
> + return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +static enum print_line_t
> +tcp_tracer_print_line(struct trace_iterator *iter)
> +{
> + return (trace_flags & TRACE_ITER_BIN) ?
> + tcp_trace_print_binary(iter) :
> + tcp_trace_print(iter);
> +}
> +
> +static void tcp_register_tracepoints(void)
> +{
> + int ret;
> +
> + ret = register_trace_tcp_established(tcp_trace_established, NULL);
> + WARN_ON(ret);
> + ret = register_trace_tcp_close(tcp_trace_close, NULL);
> + WARN_ON(ret);
> + ret = register_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
> + WARN_ON(ret);
> + ret = register_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
> + WARN_ON(ret);
> + ret = register_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
> + WARN_ON(ret);
> + ret = register_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
> + WARN_ON(ret);
> + ret = register_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
> + WARN_ON(ret);
> +}
> +
> +static void tcp_unregister_tracepoints(void)
> +{
> + unregister_trace_tcp_established(tcp_trace_established, NULL);
> + unregister_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
> + unregister_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
> + unregister_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
> + unregister_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
> + unregister_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
> +
> + tracepoint_synchronize_unregister();
> +}
> +
> +static void tcp_tracer_start(struct trace_array *tr)
> +{
> + tcp_register_tracepoints();
> + tcp_trace_enabled = true;
> +}
> +
> +static void tcp_tracer_stop(struct trace_array *tr)
> +{
> + tcp_unregister_tracepoints();
> + tcp_trace_enabled = false;
> +}
> +
> +static void tcp_tracer_reset(struct trace_array *tr)
> +{
> + tcp_tracer_stop(tr);
> +}
> +
> +static int tcp_tracer_init(struct trace_array *tr)
> +{
> + tcp_tr = tr;
> + tcp_tracer_start(tr);
> + return 0;
> +}
> +
> +static struct tracer tcp_tracer __read_mostly = {
> + .name = "tcp",
> + .init = tcp_tracer_init,
> + .reset = tcp_tracer_reset,
> + .start = tcp_tracer_start,
> + .stop = tcp_tracer_stop,
> + .print_line = tcp_tracer_print_line,
> +};
> +
> +static struct trace_event_functions tcp_trace_event_funcs;
> +
> +static struct trace_event tcp_trace_event = {
> + .type = TRACE_TCP,
> + .funcs = &tcp_trace_event_funcs,
> +};
> +
> +static int __init init_tcp_tracer(void)
> +{
> + if (!register_ftrace_event(&tcp_trace_event)) {
> + pr_warning("Cannot register TCP trace event\n");
> + return 1;
> + }
> +
> + if (register_tracer(&tcp_tracer) != 0) {
> + pr_warning("Cannot register TCP tracer\n");
> + unregister_ftrace_event(&tcp_trace_event);
> + return 1;
> + }
> + return 0;
> +}
> +
> +device_initcall(init_tcp_tracer);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 385391f..5dc5962 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -37,6 +37,7 @@ enum trace_type {
> TRACE_USER_STACK,
> TRACE_BLK,
> TRACE_BPUTS,
> + TRACE_TCP,
>
> __TRACE_LAST_TYPE,
> };
> --
> 1.8.1
>
* [PATCH net-next] tcp: Add TCP tracer
From: Martin KaFai Lau @ 2014-12-02 20:37 UTC
To: davem; +Cc: netdev
In-Reply-To: <1413837765-5446-1-git-send-email-kafai@fb.com>
Define probes and register them to the TCP tracepoints. The probes
collect the data defined in struct tcp_sk_trace and record them to
the tracing's ring_buffer.
---
include/uapi/linux/tcp_trace.h | 9 +-
kernel/trace/tcp_trace.c | 448 +++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 1 +
3 files changed, 451 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/tcp_trace.h b/include/uapi/linux/tcp_trace.h
index 2644f7f..d913a3c 100644
--- a/include/uapi/linux/tcp_trace.h
+++ b/include/uapi/linux/tcp_trace.h
@@ -22,11 +22,11 @@ struct tcp_stats {
__u32 other_segs_retrans;
__u32 other_octets_retrans;
__u32 loss_segs_retrans;
- __u32 loss_octects_retrans;
+ __u32 loss_octets_retrans;
__u32 segs_in;
__u32 data_segs_in;
- __u64 rtt_sample_us;
__u64 data_octets_in;
+ __u64 rtt_sample_us;
__u64 max_rtt_us;
__u64 min_rtt_us;
__u64 sum_rtt_us;
@@ -64,9 +64,4 @@ struct tcp_trace_stats {
struct tcp_stats stats;
} __packed;
-typedef struct tcp_trace_basic tcp_trace_establish;
-typedef struct tcp_trace_basic tcp_trace_retrans;
-typedef struct tcp_trace_stats tcp_trace_periodic;
-typedef struct tcp_trace_stats tcp_trace_close;
-
#endif /* UAPI_TCP_TRACE_H */
diff --git a/kernel/trace/tcp_trace.c b/kernel/trace/tcp_trace.c
index 9d09fd0..376580b 100644
--- a/kernel/trace/tcp_trace.c
+++ b/kernel/trace/tcp_trace.c
@@ -1,9 +1,27 @@
#include <net/tcp_trace.h>
+#include <net/tcp.h>
+#include <trace/events/tcp.h>
#include <linux/tcp.h>
+#include <linux/ipv6.h>
+#include <linux/ftrace_event.h>
+#include <linux/jiffies.h>
#include <uapi/linux/tcp_trace.h>
+#include "trace_output.h"
+
+#define REPORT_INTERVAL_MS 2000
+
+static struct trace_array *tcp_tr;
static bool tcp_trace_enabled __read_mostly;
+static struct trace_print_flags tcp_trace_event_names[] = {
+ { TCP_TRACE_EVENT_ESTABLISHED, "established" },
+ { TCP_TRACE_EVENT_PERIODIC, "periodic" },
+ { TCP_TRACE_EVENT_RETRANS, "retrans" },
+ { TCP_TRACE_EVENT_RETRANS_LOSS, "retrans_loss" },
+ { TCP_TRACE_EVENT_CLOSE, "close" }
+};
+
struct tcp_sk_trace {
struct tcp_stats stats;
unsigned long start_ts;
@@ -35,3 +53,433 @@ void tcp_sk_trace_destruct(struct sock *sk)
{
kfree(tcp_sk(sk)->trace);
}
+
+static void tcp_trace_init(struct tcp_trace *tr,
+ enum tcp_trace_events trev,
+ struct sock *sk)
+{
+ tr->event = trev;
+ if (sk->sk_family == AF_INET) {
+ tr->ipv6 = 0;
+ tr->local_addr[0] = inet_sk(sk)->inet_saddr;
+ tr->remote_addr[0] = inet_sk(sk)->inet_daddr;
+ } else {
+ BUG_ON(sk->sk_family != AF_INET6);
+ tr->ipv6 = 1;
+ memcpy(tr->local_addr, inet6_sk(sk)->saddr.s6_addr32,
+ sizeof(tr->local_addr));
+ memcpy(tr->remote_addr, sk->sk_v6_daddr.s6_addr32,
+ sizeof(tr->remote_addr));
+ }
+ tr->local_port = inet_sk(sk)->inet_sport;
+ tr->remote_port = inet_sk(sk)->inet_dport;
+}
+
+static void tcp_trace_basic_init(struct tcp_trace_basic *trb,
+ enum tcp_trace_events trev,
+ struct sock *sk)
+{
+ struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+ tcp_trace_init((struct tcp_trace *)trb, trev, sk);
+ trb->snd_cwnd = tcp_sk(sk)->snd_cwnd * tcp_sk(sk)->mss_cache;
+ trb->mss = tcp_sk(sk)->mss_cache;
+ trb->ssthresh = tcp_current_ssthresh(sk);
+ trb->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+ trb->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+ trb->life_ms = jiffies_to_msecs(jiffies - sktr->start_ts);
+}
+
+static void tcp_trace_basic_add(enum tcp_trace_events trev, struct sock *sk)
+{
+ struct ring_buffer *buffer;
+ int pc;
+ struct ring_buffer_event *event;
+ struct tcp_trace_basic *trb;
+ struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+ if (!sktr)
+ return;
+
+ tracing_record_cmdline(current);
+ buffer = tcp_tr->trace_buffer.buffer;
+ pc = preempt_count();
+ event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
+ sizeof(*trb), 0, pc);
+ if (!event)
+ return;
+ trb = ring_buffer_event_data(event);
+ tcp_trace_basic_init(trb, trev, sk);
+ trace_buffer_unlock_commit(buffer, event, 0, pc);
+}
+
+static void tcp_trace_stats_init(struct tcp_trace_stats *trs,
+ enum tcp_trace_events trev,
+ struct sock *sk)
+{
+ struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+ tcp_trace_basic_init((struct tcp_trace_basic *)trs, trev, sk);
+ memcpy(&trs->stats, &sktr->stats, sizeof(sktr->stats));
+}
+
+static void tcp_trace_stats_add(enum tcp_trace_events trev, struct sock *sk)
+{
+ struct ring_buffer *buffer;
+ int pc;
+ struct ring_buffer_event *event;
+ struct tcp_trace_stats *trs;
+ struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+ if (!sktr)
+ return;
+
+ tracing_record_cmdline(current);
+ buffer = tcp_tr->trace_buffer.buffer;
+ pc = preempt_count();
+ event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
+ sizeof(*trs), 0, pc);
+ if (!event)
+ return;
+ trs = ring_buffer_event_data(event);
+
+ tcp_trace_stats_init(trs, trev, sk);
+
+ trace_buffer_unlock_commit(buffer, event, 0, pc);
+}
+
+static void tcp_trace_established(void *ignore, struct sock *sk)
+{
+ tcp_trace_basic_add(TCP_TRACE_EVENT_ESTABLISHED, sk);
+}
+
+static void tcp_trace_transmit_skb(void *ignore, struct sock *sk,
+ struct sk_buff *skb)
+{
+ int pcount;
+ struct tcp_sk_trace *sktr;
+ struct tcp_skb_cb *tcb;
+ unsigned int data_len;
+ bool retrans = false;
+
+ sktr = tcp_sk(sk)->trace;
+ if (!sktr)
+ return;
+
+ tcb = TCP_SKB_CB(skb);
+ pcount = tcp_skb_pcount(skb);
+ data_len = tcb->end_seq - tcb->seq;
+
+ sktr->stats.segs_out += pcount;
+
+ if (!data_len)
+ goto out;
+
+ sktr->stats.data_segs_out += pcount;
+ sktr->stats.data_octets_out += data_len;
+
+ if (before(tcb->seq, tcp_sk(sk)->snd_nxt)) {
+ enum tcp_trace_events trev;
+ retrans = true;
+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
+ sktr->stats.loss_segs_retrans += pcount;
+ sktr->stats.loss_octets_retrans += data_len;
+ trev = TCP_TRACE_EVENT_RETRANS_LOSS;
+ } else {
+ sktr->stats.other_segs_retrans += pcount;
+ sktr->stats.other_octets_retrans += data_len;
+ trev = TCP_TRACE_EVENT_RETRANS;
+ }
+ tcp_trace_stats_add(trev, sk);
+ return;
+ }
+
+out:
+ if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
+ REPORT_INTERVAL_MS) {
+ sktr->last_ts = jiffies;
+ tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
+ }
+}
+
+static void tcp_trace_rcv_established(void *ignore, struct sock *sk,
+ struct sk_buff *skb)
+{
+ struct tcp_sk_trace *sktr;
+ unsigned int data_len;
+ struct tcphdr *th;
+
+ sktr = tcp_sk(sk)->trace;
+ if (!sktr)
+ return;
+
+ th = tcp_hdr(skb);
+ WARN_ON_ONCE(skb->len < th->doff << 2);
+
+ sktr->stats.segs_in++;
+ data_len = skb->len - (th->doff << 2);
+ if (data_len) {
+ if (TCP_SKB_CB(skb)->ack_seq == tcp_sk(sk)->snd_una)
+ sktr->stats.dup_acks_in++;
+ } else {
+ sktr->stats.data_segs_in++;
+ sktr->stats.data_segs_in += data_len;
+ }
+
+ if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
+ REPORT_INTERVAL_MS) {
+ sktr->last_ts = jiffies;
+ tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
+ }
+}
+
+static void tcp_trace_close(void *ignore, struct sock *sk)
+{
+ struct tcp_sk_trace *sktr;
+ sktr = tcp_sk(sk)->trace;
+ if (!sktr)
+ return;
+
+ tcp_trace_stats_add(TCP_TRACE_EVENT_CLOSE, sk);
+}
+
+static void tcp_trace_ooo_rcv(void *ignore, struct sock *sk)
+{
+ struct tcp_sk_trace *sktr;
+
+ sktr = tcp_sk(sk)->trace;
+ if (!sktr)
+ return;
+
+ sktr->stats.ooo_in++;
+}
+
+static void tcp_trace_sacks_rcv(void *ignore, struct sock *sk, int num_sacks)
+{
+ struct tcp_sk_trace *sktr;
+
+ sktr = tcp_sk(sk)->trace;
+ if (!sktr)
+ return;
+
+ sktr->stats.sacks_in++;
+ sktr->stats.sack_blks_in += num_sacks;
+}
+
+void tcp_trace_rtt_sample(void *ignore, struct sock *sk,
+ long rtt_sample_us)
+{
+ struct tcp_sk_trace *sktr;
+ u32 rto_ms;
+
+ sktr = tcp_sk(sk)->trace;
+ if (!sktr)
+ return;
+
+ rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+
+ sktr->stats.rtt_sample_us = rtt_sample_us;
+ sktr->stats.max_rtt_us = max_t(u64, sktr->stats.max_rtt_us, rtt_sample_us);
+ sktr->stats.min_rtt_us = min_t(u64, sktr->stats.min_rtt_us, rtt_sample_us);
+
+ sktr->stats.count_rtt++;
+ sktr->stats.sum_rtt_us += rtt_sample_us;
+
+ sktr->stats.max_rto_ms = max_t(u32, sktr->stats.max_rto_ms, rto_ms);
+ sktr->stats.min_rto_ms = min_t(u32, sktr->stats.min_rto_ms, rto_ms);
+}
+
+static enum print_line_t
+tcp_trace_print(struct trace_iterator *iter)
+{
+ struct trace_seq *s = &iter->seq;
+ struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
+ struct tcp_trace_basic *trb;
+ struct tcp_stats *stats;
+ const char *last_seq_bptr, *cur_seq_bptr;
+ int ret = 0;
+
+ union {
+ struct sockaddr_in v4;
+ struct sockaddr_in6 v6;
+ } local_sa, remote_sa;
+
+ local_sa.v4.sin_port = tr->local_port;
+ remote_sa.v4.sin_port = tr->remote_port;
+ if (tr->ipv6) {
+ local_sa.v6.sin6_family = AF_INET6;
+ remote_sa.v6.sin6_family = AF_INET6;
+ memcpy(local_sa.v6.sin6_addr.s6_addr, tr->local_addr, 4);
+ memcpy(remote_sa.v6.sin6_addr.s6_addr, tr->remote_addr, 4);
+ } else {
+ local_sa.v4.sin_family = AF_INET;
+ remote_sa.v4.sin_family =AF_INET;
+ local_sa.v4.sin_addr.s_addr = tr->local_addr[0];
+ remote_sa.v4.sin_addr.s_addr = tr->remote_addr[0];
+ }
+
+ last_seq_bptr = ftrace_print_symbols_seq(s, tr->event,
+ tcp_trace_event_names);
+ cur_seq_bptr = trace_seq_buffer_ptr(s);
+ if (last_seq_bptr == cur_seq_bptr)
+ goto out;
+
+ trb = (struct tcp_trace_basic *)tr;
+ ret = trace_seq_printf(s,
+ " %pISpc %pISpc snd_cwnd=%u mss=%u ssthresh=%u"
+ " srtt_us=%llu rto_ms=%u life_ms=%u",
+ &local_sa, &remote_sa,
+ trb->snd_cwnd, trb->mss, trb->ssthresh,
+ trb->srtt_us, trb->rto_ms, trb->life_ms);
+
+ if (tr->event == TCP_TRACE_EVENT_ESTABLISHED || ret == 0)
+ goto out;
+
+ stats = &(((struct tcp_trace_stats *)tr)->stats);
+ ret = trace_seq_printf(s,
+ " segs_out=%u data_segs_out=%u data_octets_out=%llu"
+ " other_segs_retrans=%u other_octets_retrans=%u"
+ " loss_segs_retrans=%u loss_octets_retrans=%u"
+ " segs_in=%u data_segs_in=%u data_octets_in=%llu"
+ " max_rtt_us=%llu min_rtt_us=%llu"
+ " count_rtt=%u sum_rtt_us=%llu"
+ " rtt_sample_us=%llu"
+ " max_rto_ms=%u min_rto_ms=%u"
+ " dup_acks_in=%u sacks_in=%u"
+ " sack_blks_in=%u ooo_in=%u",
+ stats->segs_out, stats->data_segs_out, stats->data_octets_out,
+ stats->other_segs_retrans, stats->other_octets_retrans,
+ stats->loss_segs_retrans, stats->loss_octets_retrans,
+ stats->segs_in, stats->data_segs_in, stats->data_octets_in,
+ stats->max_rtt_us, stats->min_rtt_us,
+ stats->count_rtt, stats->sum_rtt_us,
+ stats->rtt_sample_us,
+ stats->max_rto_ms, stats->min_rto_ms,
+ stats->dup_acks_in, stats->sacks_in,
+ stats->sack_blks_in, stats->ooo_in);
+
+out:
+ if (ret)
+ ret = trace_seq_putc(s, '\n');
+
+ return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE;
+}
+
+static enum print_line_t
+tcp_trace_print_binary(struct trace_iterator *iter)
+{
+ int ret;
+ struct trace_seq *s = &iter->seq;
+ struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
+ u32 magic = TCP_TRACE_MAGIC_VERSION;
+
+ ret = trace_seq_putmem(s, &magic, sizeof(magic));
+ if (!ret)
+ goto out;
+
+ if (tr->event == TCP_TRACE_EVENT_ESTABLISHED)
+ ret = trace_seq_putmem(s, tr + sizeof(magic),
+ sizeof(struct tcp_trace_basic));
+ else
+ ret = trace_seq_putmem(s, tr + sizeof(magic),
+ sizeof(struct tcp_trace_stats));
+
+out:
+ return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE;
+}
+
+static enum print_line_t
+tcp_tracer_print_line(struct trace_iterator *iter)
+{
+ return (trace_flags & TRACE_ITER_BIN) ?
+ tcp_trace_print_binary(iter) :
+ tcp_trace_print(iter);
+}
+
+static void tcp_register_tracepoints(void)
+{
+ int ret;
+
+ ret = register_trace_tcp_established(tcp_trace_established, NULL);
+ WARN_ON(ret);
+ ret = register_trace_tcp_close(tcp_trace_close, NULL);
+ WARN_ON(ret);
+ ret = register_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
+ WARN_ON(ret);
+ ret = register_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+ WARN_ON(ret);
+ ret = register_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+ WARN_ON(ret);
+ ret = register_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+ WARN_ON(ret);
+ ret = register_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
+ WARN_ON(ret);
+}
+
+static void tcp_unregister_tracepoints(void)
+{
+ unregister_trace_tcp_established(tcp_trace_established, NULL);
+ unregister_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
+ unregister_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+ unregister_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+ unregister_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+ unregister_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
+
+ tracepoint_synchronize_unregister();
+}
+
+static void tcp_tracer_start(struct trace_array *tr)
+{
+ tcp_register_tracepoints();
+ tcp_trace_enabled = true;
+}
+
+static void tcp_tracer_stop(struct trace_array *tr)
+{
+ tcp_unregister_tracepoints();
+ tcp_trace_enabled = false;
+}
+
+static void tcp_tracer_reset(struct trace_array *tr)
+{
+ tcp_tracer_stop(tr);
+}
+
+static int tcp_tracer_init(struct trace_array *tr)
+{
+ tcp_tr = tr;
+ tcp_tracer_start(tr);
+ return 0;
+}
+
+static struct tracer tcp_tracer __read_mostly = {
+ .name = "tcp",
+ .init = tcp_tracer_init,
+ .reset = tcp_tracer_reset,
+ .start = tcp_tracer_start,
+ .stop = tcp_tracer_stop,
+ .print_line = tcp_tracer_print_line,
+};
+
+static struct trace_event_functions tcp_trace_event_funcs;
+
+static struct trace_event tcp_trace_event = {
+ .type = TRACE_TCP,
+ .funcs = &tcp_trace_event_funcs,
+};
+
+static int __init init_tcp_tracer(void)
+{
+ if (!register_ftrace_event(&tcp_trace_event)) {
+ pr_warning("Cannot register TCP trace event\n");
+ return 1;
+ }
+
+ if (register_tracer(&tcp_tracer) != 0) {
+ pr_warning("Cannot register TCP tracer\n");
+ unregister_ftrace_event(&tcp_trace_event);
+ return 1;
+ }
+ return 0;
+}
+
+device_initcall(init_tcp_tracer);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 385391f..5dc5962 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -37,6 +37,7 @@ enum trace_type {
TRACE_USER_STACK,
TRACE_BLK,
TRACE_BPUTS,
+ TRACE_TCP,
__TRACE_LAST_TYPE,
};
--
1.8.1
* [PATCH net-next 6/6] sunvnet: add TSO support
From: David L Stevens @ 2014-12-02 20:31 UTC
To: David Miller; +Cc: netdev, Sowmini Varadhan
This patch adds TSO support for the sunvnet driver.
Signed-off-by: David L Stevens <david.stevens@oracle.com>
---
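Reader's note, not part of the patch: the LSO length negotiation added to
handle_attr_info() can be sketched in user space roughly as follows. This is
a minimal sketch under stated assumptions: `struct lso_state` and
`lso_negotiate()` are hypothetical names standing in for the tso/tsolen
fields of struct vnet_port, and the VIO version check is folded into the
`peer_capable` argument. Only the two limits are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VNET_MINTSO 2048	/* minimum TSO length, from the patch */
#define VNET_MAXTSO 65535	/* maximum TSO length, from the patch */

/* Hypothetical stand-in for the tso/tsolen fields added to struct vnet_port. */
struct lso_state {
	bool tso;
	uint16_t tsolen;
};

/*
 * Merge the local LSO state with what the peer advertised.
 * Pass peer_capable = false for peers older than VIO v1.7.
 * Returns the maximum LSO length to echo back, or 0 when LSO ends up off.
 */
uint16_t lso_negotiate(struct lso_state *st, bool peer_capable,
		       uint16_t peer_maxlen)
{
	assert(st != NULL);

	st->tso = st->tso && peer_capable;
	if (!st->tso)
		return 0;

	if (!st->tsolen)
		st->tsolen = VNET_MAXTSO;
	if (peer_maxlen < st->tsolen)
		st->tsolen = peer_maxlen;
	if (st->tsolen < VNET_MINTSO) {
		/* Peer's limit is below the protocol minimum: disable LSO. */
		st->tso = false;
		st->tsolen = 0;
	}
	return st->tsolen;
}
```

With a fresh state (tso on, tsolen 0), a peer advertising 16384 yields a
negotiated length of 16384, while a peer advertising 1024 turns LSO off
because 1024 is below VNET_MINTSO.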
drivers/net/ethernet/sun/sunvnet.c | 95 +++++++++++++++++++++++++++++++++---
drivers/net/ethernet/sun/sunvnet.h | 9 +++-
2 files changed, 95 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index d19b358..b883add 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -120,8 +120,15 @@ static int vnet_send_attr(struct vio_driver_state *vio)
pkt.mtu = framelen + VLAN_HLEN;
}
- pkt.plnk_updt = PHYSLINK_UPDATE_NONE;
pkt.cflags = 0;
+ if (vio_version_after_eq(vio, 1, 7) && port->tso) {
+ pkt.cflags |= VNET_LSO_IPV4_CAPAB;
+ if (!port->tsolen)
+ port->tsolen = VNET_MAXTSO;
+ pkt.ipv4_lso_maxlen = port->tsolen;
+ }
+
+ pkt.plnk_updt = PHYSLINK_UPDATE_NONE;
viodbg(HS, "SEND NET ATTR xmode[0x%x] atype[0x%x] addr[%llx] "
"ackfreq[%u] plnk_updt[0x%02x] opts[0x%02x] mtu[%llu] "
@@ -175,6 +182,26 @@ static int handle_attr_info(struct vio_driver_state *vio,
}
port->rmtu = localmtu;
+ /* LSO negotiation */
+ if (vio_version_after_eq(vio, 1, 7))
+ port->tso &= !!(pkt->cflags & VNET_LSO_IPV4_CAPAB);
+ else
+ port->tso = false;
+ if (port->tso) {
+ if (!port->tsolen)
+ port->tsolen = VNET_MAXTSO;
+ port->tsolen = min(port->tsolen, pkt->ipv4_lso_maxlen);
+ if (port->tsolen < VNET_MINTSO) {
+ port->tso = false;
+ port->tsolen = 0;
+ pkt->cflags &= ~VNET_LSO_IPV4_CAPAB;
+ }
+ pkt->ipv4_lso_maxlen = port->tsolen;
+ } else {
+ pkt->cflags &= ~VNET_LSO_IPV4_CAPAB;
+ pkt->ipv4_lso_maxlen = 0;
+ }
+
/* for version >= 1.6, ACK packet mode we support */
if (vio_version_after_eq(vio, 1, 6)) {
pkt->xfer_mode = VIO_NEW_DRING_MODE;
@@ -721,6 +748,8 @@ ldc_ctrl:
if (event == LDC_EVENT_RESET) {
port->rmtu = 0;
+ port->tso = true;
+ port->tsolen = 0;
vio_port_up(vio);
}
port->rx_event = 0;
@@ -1131,10 +1160,36 @@ static int vnet_handle_offloads(struct vnet_port *port, struct sk_buff *skb)
struct net_device *dev = port->vp->dev;
struct vio_dring_state *dr = &port->vio.drings[VIO_DRIVER_TX_RING];
struct sk_buff *segs;
- int maclen;
+ int maclen, datalen;
int status;
+ int gso_size, gso_type, gso_segs;
+ int hlen = skb_transport_header(skb) - skb_mac_header(skb);
+ int proto = IPPROTO_IP;
+
+ if (skb->protocol == htons(ETH_P_IP))
+ proto = ip_hdr(skb)->protocol;
+ else if (skb->protocol == htons(ETH_P_IPV6))
+ proto = ipv6_hdr(skb)->nexthdr;
+
+ if (proto == IPPROTO_TCP)
+ hlen += tcp_hdr(skb)->doff * 4;
+ else if (proto == IPPROTO_UDP)
+ hlen += sizeof(struct udphdr);
+ else {
+ pr_err("vnet_handle_offloads GSO with unknown transport "
+ "protocol %d tproto %d\n", skb->protocol, proto);
+ hlen = 128; /* XXX */
+ }
+ datalen = port->tsolen - hlen;
+
+ gso_size = skb_shinfo(skb)->gso_size;
+ gso_type = skb_shinfo(skb)->gso_type;
+ gso_segs = skb_shinfo(skb)->gso_segs;
+
+ if (port->tso && gso_size < datalen)
+ gso_segs = DIV_ROUND_UP(skb->len - hlen, datalen);
- if (unlikely(vnet_tx_dring_avail(dr) < skb_shinfo(skb)->gso_segs)) {
+ if (unlikely(vnet_tx_dring_avail(dr) < gso_segs)) {
struct netdev_queue *txq;
txq = netdev_get_tx_queue(dev, port->q_index);
@@ -1147,7 +1202,19 @@ static int vnet_handle_offloads(struct vnet_port *port, struct sk_buff *skb)
maclen = skb_network_header(skb) - skb_mac_header(skb);
skb_pull(skb, maclen);
- segs = skb_gso_segment(skb, dev->features & ~NETIF_F_TSO);
+ if (port->tso && gso_size < datalen) {
+ /* segment to TSO size */
+ skb_shinfo(skb)->gso_size = datalen;
+ skb_shinfo(skb)->gso_segs = gso_segs;
+
+ segs = skb_gso_segment(skb, dev->features & ~NETIF_F_TSO);
+
+ /* restore gso_size & gso_segs */
+ skb_shinfo(skb)->gso_size = gso_size;
+ skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len - hlen,
+ gso_size);
+ } else
+ segs = skb_gso_segment(skb, dev->features & ~NETIF_F_TSO);
if (IS_ERR(segs)) {
dev->stats.tx_dropped++;
return NETDEV_TX_OK;
@@ -1162,6 +1229,13 @@ static int vnet_handle_offloads(struct vnet_port *port, struct sk_buff *skb)
segs = segs->next;
curr->next = NULL;
+ if (port->tso && curr->len > dev->mtu) {
+ skb_shinfo(curr)->gso_size = gso_size;
+ skb_shinfo(curr)->gso_type = gso_type;
+ skb_shinfo(curr)->gso_segs =
+ DIV_ROUND_UP(curr->len - hlen, gso_size);
+ } else
+ skb_shinfo(curr)->gso_size = 0;
skb_push(curr, maclen);
skb_reset_mac_header(curr);
@@ -1203,13 +1277,13 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
goto out_dropped;
}
- if (skb_is_gso(skb)) {
+ if (skb_is_gso(skb) && skb->len > port->tsolen) {
err = vnet_handle_offloads(port, skb);
rcu_read_unlock();
return err;
}
- if (skb->len > port->rmtu) {
+ if (!skb_is_gso(skb) && skb->len > port->rmtu) {
unsigned long localmtu = port->rmtu - ETH_HLEN;
if (vio_version_after_eq(&port->vio, 1, 3))
@@ -1306,6 +1380,11 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
struct vio_net_dext *dext = vio_net_ext(d);
memset(dext, 0, sizeof(*dext));
+ if (skb_is_gso(port->tx_bufs[txi].skb)) {
+ dext->ipv4_lso_mss = skb_shinfo(port->tx_bufs[txi].skb)
+ ->gso_size;
+ dext->flags |= VNET_PKT_IPV4_LSO;
+ }
if (vio_version_after_eq(&port->vio, 1, 8) &&
!port->switch_port) {
dext->flags |= VNET_PKT_HCK_IPV4_HDRCKSUM_OK;
@@ -1712,7 +1791,7 @@ static struct vnet *vnet_new(const u64 *local_mac)
dev->ethtool_ops = &vnet_ethtool_ops;
dev->watchdog_timeo = VNET_TX_TIMEOUT;
- dev->hw_features = NETIF_F_GSO | NETIF_F_GSO_SOFTWARE |
+ dev->hw_features = NETIF_F_TSO | NETIF_F_GSO | NETIF_F_GSO_SOFTWARE |
NETIF_F_HW_CSUM | NETIF_F_SG;
dev->features = dev->hw_features;
@@ -1892,6 +1971,8 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
if (mdesc_get_property(hp, vdev->mp, "switch-port", NULL) != NULL)
switch_port = 1;
port->switch_port = switch_port;
+ port->tso = true;
+ port->tsolen = 0;
spin_lock_irqsave(&vp->lock, flags);
if (switch_port)
diff --git a/drivers/net/ethernet/sun/sunvnet.h b/drivers/net/ethernet/sun/sunvnet.h
index cd5d343..01ca781 100644
--- a/drivers/net/ethernet/sun/sunvnet.h
+++ b/drivers/net/ethernet/sun/sunvnet.h
@@ -20,6 +20,9 @@
#define VNET_TX_RING_SIZE 512
#define VNET_TX_WAKEUP_THRESH(dr) ((dr)->pending / 4)
+#define VNET_MINTSO 2048 /* VIO protocol's minimum TSO len */
+#define VNET_MAXTSO 65535 /* VIO protocol's maximum TSO len */
+
/* VNET packets are sent in buffers with the first 6 bytes skipped
* so that after the ethernet header the IPv4/IPv6 headers are aligned
* properly.
@@ -40,8 +43,9 @@ struct vnet_port {
struct hlist_node hash;
u8 raddr[ETH_ALEN];
- u8 switch_port;
- u8 __pad;
+ unsigned switch_port:1;
+ unsigned tso:1;
+ unsigned __pad:14;
struct vnet *vp;
@@ -56,6 +60,7 @@ struct vnet_port {
struct timer_list clean_timer;
u64 rmtu;
+ u16 tsolen;
struct napi_struct napi;
u32 napi_stop_idx;
--
1.7.1
* [PATCH net-next 5/6] sunvnet: add GSO support
From: David L Stevens @ 2014-12-02 20:31 UTC
To: David Miller; +Cc: netdev, Sowmini Varadhan
This patch adds GSO support to the sunvnet driver.
Signed-off-by: David L Stevens <david.stevens@oracle.com>
---
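Reader's note, not part of the patch: the segment loop in
vnet_handle_offloads() detaches each segment from the list, transmits until
the first failure, and frees whatever was not handed off. A minimal
user-space sketch of that ownership pattern follows. `struct seg`,
`xmit_segments()`, `xmit_limited()` and `make_list()` are hypothetical names
used for illustration only; the real code works on struct sk_buff and tests
the status against NETDEV_TX_MASK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff: only the list link and an id. */
struct seg {
	struct seg *next;
	int id;
};

/* Hypothetical transmit hook: returns 0 on success, nonzero on failure. */
typedef int (*xmit_fn)(struct seg *s, void *ctx);

/*
 * Walk a segment list, detaching each element before handing it on.
 * After the first failure the remaining segments are freed, not sent.
 * Returns the status of the last attempted send.
 */
int xmit_segments(struct seg *segs, xmit_fn xmit, void *ctx, int *freed)
{
	int status = 0;

	while (segs) {
		struct seg *curr = segs;

		segs = segs->next;
		curr->next = NULL;

		if (!status)
			status = xmit(curr, ctx);	/* hook owns curr on success */
		if (status) {
			free(curr);			/* failed or never attempted */
			(*freed)++;
		}
	}
	return status;
}

/* Example hook: accepts segments while a budget remains, then fails. */
int xmit_limited(struct seg *s, void *ctx)
{
	int *budget = ctx;

	if (*budget == 0)
		return 1;
	(*budget)--;
	free(s);	/* a real driver would queue the buffer instead */
	return 0;
}

/* Build a list of n segments with ids 1..n. */
struct seg *make_list(int n)
{
	struct seg *head = NULL;

	for (int i = n; i > 0; i--) {
		struct seg *s = malloc(sizeof(*s));

		if (!s)
			break;
		s->id = i;
		s->next = head;
		head = s;
	}
	return head;
}
```

The point of detaching `curr` before the call is that the transmit path may
consume or free the buffer, so the loop must not touch `curr->next`
afterwards.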
drivers/net/ethernet/sun/sunvnet.c | 73 +++++++++++++++++++++++++++++++++++-
1 files changed, 72 insertions(+), 1 deletions(-)
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index b172eda..d19b358 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -1102,6 +1102,10 @@ static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, int ncookies)
return NULL;
}
(void)skb_put(nskb, skb->len);
+ if (skb_is_gso(skb)) {
+ skb_shinfo(nskb)->gso_size = skb_shinfo(skb)->gso_size;
+ skb_shinfo(nskb)->gso_type = skb_shinfo(skb)->gso_type;
+ }
dev_kfree_skb(skb);
skb = nskb;
}
@@ -1120,6 +1124,66 @@ vnet_select_queue(struct net_device *dev, struct sk_buff *skb,
return port->q_index;
}
+static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev);
+
+static int vnet_handle_offloads(struct vnet_port *port, struct sk_buff *skb)
+{
+ struct net_device *dev = port->vp->dev;
+ struct vio_dring_state *dr = &port->vio.drings[VIO_DRIVER_TX_RING];
+ struct sk_buff *segs;
+ int maclen;
+ int status;
+
+ if (unlikely(vnet_tx_dring_avail(dr) < skb_shinfo(skb)->gso_segs)) {
+ struct netdev_queue *txq;
+
+ txq = netdev_get_tx_queue(dev, port->q_index);
+ netif_tx_stop_queue(txq);
+ if (vnet_tx_dring_avail(dr) < skb_shinfo(skb)->gso_segs)
+ return NETDEV_TX_BUSY;
+ netif_tx_wake_queue(txq);
+ }
+
+ maclen = skb_network_header(skb) - skb_mac_header(skb);
+ skb_pull(skb, maclen);
+
+ segs = skb_gso_segment(skb, dev->features & ~NETIF_F_TSO);
+ if (IS_ERR(segs)) {
+ dev->stats.tx_dropped++;
+ return NETDEV_TX_OK;
+ }
+
+ skb_push(skb, maclen);
+ skb_reset_mac_header(skb);
+
+ status = 0;
+ while (segs) {
+ struct sk_buff *curr = segs;
+
+ segs = segs->next;
+ curr->next = NULL;
+
+ skb_push(curr, maclen);
+ skb_reset_mac_header(curr);
+ memcpy(skb_mac_header(curr), skb_mac_header(skb),
+ maclen);
+ curr->csum_start = skb_transport_header(curr) - curr->head;
+ if (ip_hdr(curr)->protocol == IPPROTO_TCP)
+ curr->csum_offset = offsetof(struct tcphdr, check);
+ else if (ip_hdr(curr)->protocol == IPPROTO_UDP)
+ curr->csum_offset = offsetof(struct udphdr, check);
+
+ if (!(status & NETDEV_TX_MASK))
+ status = vnet_start_xmit(curr, dev);
+ if (status & NETDEV_TX_MASK)
+ dev_kfree_skb_any(curr);
+ }
+
+ if (!(status & NETDEV_TX_MASK))
+ dev_kfree_skb_any(skb);
+ return status;
+}
+
static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct vnet *vp = netdev_priv(dev);
@@ -1139,6 +1203,12 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
goto out_dropped;
}
+ if (skb_is_gso(skb)) {
+ err = vnet_handle_offloads(port, skb);
+ rcu_read_unlock();
+ return err;
+ }
+
if (skb->len > port->rmtu) {
unsigned long localmtu = port->rmtu - ETH_HLEN;
@@ -1642,7 +1712,8 @@ static struct vnet *vnet_new(const u64 *local_mac)
dev->ethtool_ops = &vnet_ethtool_ops;
dev->watchdog_timeo = VNET_TX_TIMEOUT;
- dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG;
+ dev->hw_features = NETIF_F_GSO | NETIF_F_GSO_SOFTWARE |
+ NETIF_F_HW_CSUM | NETIF_F_SG;
dev->features = dev->hw_features;
err = register_netdev(dev);
--
1.7.1
* [PATCH net-next 4/6] sunvnet: add checksum offload support
From: David L Stevens @ 2014-12-02 20:31 UTC
To: David Miller; +Cc: netdev, Sowmini Varadhan
This patch adds support for sender-side checksum offloading.
Signed-off-by: David L Stevens <david.stevens@oracle.com>
---
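Reader's note, not part of the patch: the copy path in vnet_skb_shape()
zeroes the checksum field, sums the transport segment while copying, and
finishes with csum_tcpudp_magic() over the IPv4 pseudo-header. The same
arithmetic can be sketched in user space as below. This is an illustrative
sketch, not the kernel implementation: `csum_add_bytes()`, `csum_fold()` and
`l4_checksum()` are hypothetical helper names, addresses are passed as raw
byte arrays, and the 32-bit accumulator is only adequate for segments up to
64 KB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add big-endian 16-bit words to a running one's-complement sum (RFC 1071). */
uint32_t csum_add_bytes(uint32_t sum, const uint8_t *p, size_t len)
{
	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)
		sum += (uint32_t)p[0] << 8;	/* odd trailing byte, zero padded */
	return sum;
}

/* Fold the carries back in and return the complemented 16-bit checksum. */
uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * TCP/UDP checksum over the IPv4 pseudo-header plus the transport segment.
 * The caller zeroes the checksum field in seg first, as the patch does.
 */
uint16_t l4_checksum(const uint8_t saddr[4], const uint8_t daddr[4],
		     uint8_t proto, const uint8_t *seg, uint16_t len)
{
	uint8_t pseudo[12];
	uint32_t sum;

	memcpy(pseudo, saddr, 4);
	memcpy(pseudo + 4, daddr, 4);
	pseudo[8] = 0;
	pseudo[9] = proto;
	pseudo[10] = (uint8_t)(len >> 8);
	pseudo[11] = (uint8_t)(len & 0xff);

	sum = csum_add_bytes(0, pseudo, sizeof(pseudo));
	sum = csum_add_bytes(sum, seg, len);
	return csum_fold(sum);
}
```

A quick self-check is to store the result in the segment's checksum field
(big-endian) and run `l4_checksum()` again: a correctly checksummed segment
sums to zero.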
drivers/net/ethernet/sun/sunvnet.c | 37 +++++++++++++++++++++++++++++++++--
1 files changed, 34 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index b6a5336..b172eda 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -375,6 +375,8 @@ static int vnet_rx_one(struct vnet_port *port, struct vio_net_desc *desc)
}
}
+ skb->ip_summed = port->switch_port ? CHECKSUM_NONE : CHECKSUM_PARTIAL;
+
dev->stats.rx_packets++;
dev->stats.rx_bytes += len;
napi_gro_receive(&port->napi, skb);
@@ -1047,7 +1049,8 @@ static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, int ncookies)
if (((unsigned long)skb->data & 7) != VNET_PACKET_SKIP ||
skb_tailroom(skb) < pad ||
skb_headroom(skb) < VNET_PACKET_SKIP || docopy) {
- int offset;
+ int start = 0, offset;
+ __wsum csum;
len = skb->len > ETH_ZLEN ? skb->len : ETH_ZLEN;
nskb = alloc_and_align_skb(skb->dev, len);
@@ -1065,10 +1068,35 @@ static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, int ncookies)
offset = skb_transport_header(skb) - skb->data;
skb_set_transport_header(nskb, offset);
+ offset = 0;
nskb->csum_offset = skb->csum_offset;
nskb->ip_summed = skb->ip_summed;
- if (skb_copy_bits(skb, 0, nskb->data, skb->len)) {
+ if (skb->ip_summed == CHECKSUM_PARTIAL)
+ start = skb_checksum_start_offset(skb);
+ if (start) {
+ struct iphdr *iph = ip_hdr(nskb);
+ int offset = start + nskb->csum_offset;
+
+ if (skb_copy_bits(skb, 0, nskb->data, start)) {
+ dev_kfree_skb(nskb);
+ dev_kfree_skb(skb);
+ return NULL;
+ }
+ *(__sum16 *)(skb->data + offset) = 0;
+ csum = skb_copy_and_csum_bits(skb, start,
+ nskb->data + start,
+ skb->len - start, 0);
+ if (iph->protocol == IPPROTO_TCP ||
+ iph->protocol == IPPROTO_UDP) {
+ csum = csum_tcpudp_magic(iph->saddr, iph->daddr,
+ skb->len - start,
+ iph->protocol, csum);
+ }
+ *(__sum16 *)(nskb->data + offset) = csum;
+
+ nskb->ip_summed = CHECKSUM_NONE;
+ } else if (skb_copy_bits(skb, 0, nskb->data, skb->len)) {
dev_kfree_skb(nskb);
dev_kfree_skb(skb);
return NULL;
@@ -1150,6 +1178,9 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
goto out_dropped;
}
+ if (skb->ip_summed == CHECKSUM_PARTIAL)
+ vnet_fullcsum(skb);
+
dr = &port->vio.drings[VIO_DRIVER_TX_RING];
i = skb_get_queue_mapping(skb);
txq = netdev_get_tx_queue(dev, i);
@@ -1611,7 +1642,7 @@ static struct vnet *vnet_new(const u64 *local_mac)
dev->ethtool_ops = &vnet_ethtool_ops;
dev->watchdog_timeo = VNET_TX_TIMEOUT;
- dev->hw_features = NETIF_F_SG;
+ dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG;
dev->features = dev->hw_features;
err = register_netdev(dev);
--
1.7.1
* [PATCH net-next 3/6] sunvnet: add scatter/gather support
From: David L Stevens @ 2014-12-02 20:31 UTC
To: David Miller; +Cc: netdev, Sowmini Varadhan
This patch adds scatter/gather support to the sunvnet driver.
Signed-off-by: David L Stevens <david.stevens@oracle.com>
---
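Reader's note, not part of the patch: two pieces of arithmetic in this patch
are easy to misread, so here is a small user-space sketch of them. The
padding expression `blen += 8 - (blen & 7)` always adds between 1 and 8
bytes, so a length that is already a multiple of 8 still grows by 8, unlike
a conventional round-up. The `docopy` test forces a linear copy when there
are too many fragments for the available cookies or when any fragment offset
is not 8-byte aligned. Function names here are hypothetical, and the values
of ETH_ZLEN (60) and VNET_PACKET_SKIP (6) are assumed from the kernel
headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ETH_ZLEN 60		/* minimum Ethernet frame length without FCS */
#define VNET_PACKET_SKIP 6	/* leading bytes skipped for header alignment */

/* The patch's padding expression: adds 1..8 bytes, never 0. */
size_t vnet_pad8(size_t len)
{
	return len + (8 - (len & 7));
}

/* Conventional round-up to a multiple of 8, shown for comparison only. */
size_t round_up8(size_t len)
{
	return (len + 7) & ~(size_t)7;
}

/* Length mapped for the linear (header) part of the buffer. */
size_t head_map_len(size_t headlen)
{
	if (headlen < ETH_ZLEN)
		headlen = ETH_ZLEN;
	return vnet_pad8(headlen + VNET_PACKET_SKIP);
}

/* True when the buffer must be copied into a freshly aligned one. */
bool needs_copy(const unsigned int *frag_offsets, int nr_frags, int ncookies)
{
	bool docopy = nr_frags >= ncookies;

	for (int i = 0; i < nr_frags; i++)
		docopy |= (frag_offsets[i] & 7) != 0;
	return docopy;
}
```

For example, a 14-byte header is padded to the 60-byte minimum, skipped by 6
to 66, and mapped as 72 bytes; a 66-byte header reaches 72 after the skip and
is mapped as 80 bytes, where `round_up8(72)` would have left it at 72.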
drivers/net/ethernet/sun/sunvnet.c | 97 ++++++++++++++++++++++++++++--------
1 files changed, 76 insertions(+), 21 deletions(-)
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 7a8da56..b6a5336 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -15,6 +15,7 @@
#include <linux/ethtool.h>
#include <linux/etherdevice.h>
#include <linux/mutex.h>
+#include <linux/highmem.h>
#include <linux/if_vlan.h>
#if IS_ENABLED(CONFIG_IPV6)
@@ -978,11 +979,54 @@ static void vnet_clean_timer_expire(unsigned long port0)
del_timer(&port->clean_timer);
}
-static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, void **pstart,
- int *plen)
+static inline int vnet_skb_map(struct ldc_channel *lp, struct sk_buff *skb,
+ struct ldc_trans_cookie *cookies, int ncookies,
+ unsigned int map_perm)
+{
+ int i, nc, err, blen;
+
+ /* header */
+ blen = skb_headlen(skb);
+ if (blen < ETH_ZLEN)
+ blen = ETH_ZLEN;
+ blen += VNET_PACKET_SKIP;
+ blen += 8 - (blen & 7);
+
+ err = ldc_map_single(lp, skb->data-VNET_PACKET_SKIP, blen, cookies,
+ ncookies, map_perm);
+ if (err < 0)
+ return err;
+ nc = err;
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+ u8 *vaddr;
+
+ if (nc < ncookies) {
+ vaddr = kmap_atomic(skb_frag_page(f));
+ blen = skb_frag_size(f);
+ blen += 8 - (blen & 7);
+ err = ldc_map_single(lp, vaddr + f->page_offset,
+ blen, cookies + nc, ncookies - nc,
+ map_perm);
+ kunmap_atomic(vaddr);
+ } else {
+ err = -EMSGSIZE;
+ }
+
+ if (err < 0) {
+ ldc_unmap(lp, cookies, nc);
+ return err;
+ }
+ nc += err;
+ }
+ return nc;
+}
+
+static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, int ncookies)
{
struct sk_buff *nskb;
- int len, pad;
+ int i, len, pad, docopy;
len = skb->len;
pad = 0;
@@ -992,14 +1036,25 @@ static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, void **pstart,
}
len += VNET_PACKET_SKIP;
pad += 8 - (len & 7);
- len += 8 - (len & 7);
+ /* make sure we have enough cookies and alignment in every frag */
+ docopy = skb_shinfo(skb)->nr_frags >= ncookies;
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+
+ docopy |= f->page_offset & 7;
+ }
if (((unsigned long)skb->data & 7) != VNET_PACKET_SKIP ||
skb_tailroom(skb) < pad ||
- skb_headroom(skb) < VNET_PACKET_SKIP) {
+ skb_headroom(skb) < VNET_PACKET_SKIP || docopy) {
int offset;
- nskb = alloc_and_align_skb(skb->dev, skb->len);
+ len = skb->len > ETH_ZLEN ? skb->len : ETH_ZLEN;
+ nskb = alloc_and_align_skb(skb->dev, len);
+ if (nskb == NULL) {
+ dev_kfree_skb(skb);
+ return NULL;
+ }
skb_reserve(nskb, VNET_PACKET_SKIP);
nskb->protocol = skb->protocol;
@@ -1022,9 +1077,6 @@ static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, void **pstart,
dev_kfree_skb(skb);
skb = nskb;
}
-
- *pstart = skb->data - VNET_PACKET_SKIP;
- *plen = len;
return skb;
}
@@ -1049,15 +1101,9 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
unsigned int len;
struct sk_buff *freeskbs = NULL;
int i, err, txi;
- void *start = NULL;
- int nlen = 0;
unsigned pending = 0;
struct netdev_queue *txq;
- skb = vnet_skb_shape(skb, &start, &nlen);
- if (unlikely(!skb))
- goto out_dropped;
-
rcu_read_lock();
port = __tx_port_find(vp, skb);
if (unlikely(!port)) {
@@ -1097,6 +1143,13 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
goto out_dropped;
}
+ skb = vnet_skb_shape(skb, 2);
+
+ if (unlikely(!skb)) {
+ rcu_read_unlock();
+ goto out_dropped;
+ }
+
dr = &port->vio.drings[VIO_DRIVER_TX_RING];
i = skb_get_queue_mapping(skb);
txq = netdev_get_tx_queue(dev, i);
@@ -1124,16 +1177,15 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
if (len < ETH_ZLEN)
len = ETH_ZLEN;
- port->tx_bufs[txi].skb = skb;
- skb = NULL;
-
- err = ldc_map_single(port->vio.lp, start, nlen,
- port->tx_bufs[txi].cookies, VNET_MAXCOOKIES,
- (LDC_MAP_SHADOW | LDC_MAP_DIRECT | LDC_MAP_RW));
+ err = vnet_skb_map(port->vio.lp, skb, port->tx_bufs[txi].cookies, 2,
+ (LDC_MAP_SHADOW | LDC_MAP_DIRECT | LDC_MAP_RW));
if (err < 0) {
netdev_info(dev, "tx buffer map error %d\n", err);
goto out_dropped;
}
+
+ port->tx_bufs[txi].skb = skb;
+ skb = NULL;
port->tx_bufs[txi].ncookies = err;
/* We don't rely on the ACKs to free the skb in vnet_start_xmit(),
@@ -1559,6 +1611,9 @@ static struct vnet *vnet_new(const u64 *local_mac)
dev->ethtool_ops = &vnet_ethtool_ops;
dev->watchdog_timeo = VNET_TX_TIMEOUT;
+ dev->hw_features = NETIF_F_SG;
+ dev->features = dev->hw_features;
+
err = register_netdev(dev);
if (err) {
pr_err("Cannot register net device, aborting\n");
--
1.7.1
* [PATCH net-next 2/6] sunvnet: add VIO v1.7 and v1.8 support
From: David L Stevens @ 2014-12-02 20:31 UTC
To: David Miller; +Cc: netdev, Sowmini Varadhan
This patch adds support for VIO v1.7 (extended descriptor format)
and v1.8 (receive-side checksumming) to the sunvnet driver.
Signed-off-by: David L Stevens <david.stevens@oracle.com>
---
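Reader's note, not part of the patch: with VIO v1.7 and later, each TX ring
entry carries an extension block after its two cookies, so the entry size
depends on the negotiated version. A minimal user-space sketch of that size
calculation follows. The structs are simplified stand-ins whose layouts are
assumptions (the real definitions live in arch/sparc/include/asm/vio.h and
the LDC headers), and `version_after_eq()` and `tx_entry_size()` are
hypothetical names modelled on the calls in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; field layouts are assumptions for illustration. */
struct cookie {
	uint64_t addr;
	uint64_t size;
};

struct desc {
	uint8_t hdr[8];
	uint32_t size;
	uint32_t ncookies;
};

struct dext {
	uint8_t flags;
	uint8_t hashval;
	uint16_t lso_mss;
	uint32_t resv;
};

#define TX_RING_SIZE 512	/* VNET_TX_RING_SIZE in sunvnet.h */
#define TX_COOKIES 2		/* cookies reserved per descriptor */

/* Compare (major, minor) pairs as a single ordered value. */
bool version_after_eq(uint16_t have_major, uint16_t have_minor,
		      uint16_t want_major, uint16_t want_minor)
{
	uint32_t have = ((uint32_t)have_major << 16) | have_minor;
	uint32_t want = ((uint32_t)want_major << 16) | want_minor;

	return have >= want;
}

/* Size of one TX ring entry for the negotiated protocol version. */
size_t tx_entry_size(uint16_t major, uint16_t minor)
{
	size_t elen = sizeof(struct desc) + TX_COOKIES * sizeof(struct cookie);

	if (version_after_eq(major, minor, 1, 7))
		elen += sizeof(struct dext);	/* extended descriptor */
	return elen;
}

/* Byte offset of the extension block within an entry (after the cookies). */
size_t dext_offset(void)
{
	return sizeof(struct desc) + TX_COOKIES * sizeof(struct cookie);
}

/* Total ring allocation for the negotiated version. */
size_t tx_ring_len(uint16_t major, uint16_t minor)
{
	return TX_RING_SIZE * tx_entry_size(major, minor);
}
```

This is also why patch 1/6 moves ring allocation to after version
negotiation: the entry size is not known until the version is.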
arch/sparc/include/asm/vio.h | 19 +++++++
drivers/net/ethernet/sun/sunvnet.c | 102 ++++++++++++++++++++++++++++++++----
2 files changed, 111 insertions(+), 10 deletions(-)
diff --git a/arch/sparc/include/asm/vio.h b/arch/sparc/include/asm/vio.h
index d758c8d..fb124fe 100644
--- a/arch/sparc/include/asm/vio.h
+++ b/arch/sparc/include/asm/vio.h
@@ -247,6 +247,25 @@ struct vio_net_desc {
struct ldc_trans_cookie cookies[0];
};
+struct vio_net_dext {
+ u8 flags;
+#define VNET_PKT_HASH 0x01
+#define VNET_PKT_HCK_IPV4_HDRCKSUM 0x02
+#define VNET_PKT_HCK_FULLCKSUM 0x04
+#define VNET_PKT_IPV4_LSO 0x08
+#define VNET_PKT_HCK_IPV4_HDRCKSUM_OK 0x10
+#define VNET_PKT_HCK_FULLCKSUM_OK 0x20
+
+ u8 vnet_hashval;
+ u16 ipv4_lso_mss;
+ u32 resv3;
+};
+
+static inline struct vio_net_dext *vio_net_ext(struct vio_net_desc *desc)
+{
+ return (struct vio_net_dext *)&desc->cookies[2];
+}
+
#define VIO_MAX_RING_COOKIES 24
struct vio_dring_state {
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 62823fa..7a8da56 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -21,6 +21,7 @@
#include <linux/icmpv6.h>
#endif
+#include <net/ip.h>
#include <net/icmp.h>
#include <net/route.h>
@@ -51,6 +52,8 @@ static int __vnet_tx_trigger(struct vnet_port *port, u32 start);
/* Ordered from largest major to lowest */
static struct vio_version vnet_versions[] = {
+ { .major = 1, .minor = 8 },
+ { .major = 1, .minor = 7 },
{ .major = 1, .minor = 6 },
{ .major = 1, .minor = 0 },
};
@@ -282,10 +285,42 @@ static struct sk_buff *alloc_and_align_skb(struct net_device *dev,
return skb;
}
-static int vnet_rx_one(struct vnet_port *port, unsigned int len,
- struct ldc_trans_cookie *cookies, int ncookies)
+static inline void vnet_fullcsum(struct sk_buff *skb)
+{
+ struct iphdr *iph = ip_hdr(skb);
+ int offset = skb_transport_offset(skb);
+
+ if (skb->protocol != htons(ETH_P_IP))
+ return;
+ if (iph->protocol != IPPROTO_TCP &&
+ iph->protocol != IPPROTO_UDP)
+ return;
+ skb->ip_summed = CHECKSUM_NONE;
+ skb->csum_level = 1;
+ skb->csum = 0;
+ if (iph->protocol == IPPROTO_TCP) {
+ struct tcphdr *ptcp = tcp_hdr(skb);
+
+ ptcp->check = 0;
+ skb->csum = skb_checksum(skb, offset, skb->len - offset, 0);
+ ptcp->check = csum_tcpudp_magic(iph->saddr, iph->daddr,
+ skb->len - offset, IPPROTO_TCP,
+ skb->csum);
+ } else if (iph->protocol == IPPROTO_UDP) {
+ struct udphdr *pudp = udp_hdr(skb);
+
+ pudp->check = 0;
+ skb->csum = skb_checksum(skb, offset, skb->len - offset, 0);
+ pudp->check = csum_tcpudp_magic(iph->saddr, iph->daddr,
+ skb->len - offset, IPPROTO_UDP,
+ skb->csum);
+ }
+}
+
+static int vnet_rx_one(struct vnet_port *port, struct vio_net_desc *desc)
{
struct net_device *dev = port->vp->dev;
+ unsigned int len = desc->size;
unsigned int copy_len;
struct sk_buff *skb;
int err;
@@ -307,7 +342,7 @@ static int vnet_rx_one(struct vnet_port *port, unsigned int len,
skb_put(skb, copy_len);
err = ldc_copy(port->vio.lp, LDC_COPY_IN,
skb->data, copy_len, 0,
- cookies, ncookies);
+ desc->cookies, desc->ncookies);
if (unlikely(err < 0)) {
dev->stats.rx_frame_errors++;
goto out_free_skb;
@@ -317,6 +352,28 @@ static int vnet_rx_one(struct vnet_port *port, unsigned int len,
skb_trim(skb, len);
skb->protocol = eth_type_trans(skb, dev);
+ if (vio_version_after_eq(&port->vio, 1, 8)) {
+ struct vio_net_dext *dext = vio_net_ext(desc);
+
+ if (dext->flags & VNET_PKT_HCK_IPV4_HDRCKSUM) {
+ if (skb->protocol == htons(ETH_P_IP)) {
+ struct iphdr *iph = (struct iphdr *)skb->data;
+
+ iph->check = 0;
+ ip_send_check(iph);
+ }
+ }
+ if ((dext->flags & VNET_PKT_HCK_FULLCKSUM) &&
+ skb->ip_summed == CHECKSUM_NONE)
+ vnet_fullcsum(skb);
+ if (dext->flags & VNET_PKT_HCK_IPV4_HDRCKSUM_OK) {
+ skb->ip_summed = CHECKSUM_PARTIAL;
+ skb->csum_level = 0;
+ if (dext->flags & VNET_PKT_HCK_FULLCKSUM_OK)
+ skb->csum_level = 1;
+ }
+ }
+
dev->stats.rx_packets++;
dev->stats.rx_bytes += len;
napi_gro_receive(&port->napi, skb);
@@ -451,7 +508,7 @@ static int vnet_walk_rx_one(struct vnet_port *port,
desc->cookies[0].cookie_addr,
desc->cookies[0].cookie_size);
- err = vnet_rx_one(port, desc->size, desc->cookies, desc->ncookies);
+ err = vnet_rx_one(port, desc);
if (err == -ECONNRESET)
return err;
desc->hdr.state = VIO_DESC_DONE;
@@ -940,8 +997,22 @@ static inline struct sk_buff *vnet_skb_shape(struct sk_buff *skb, void **pstart,
if (((unsigned long)skb->data & 7) != VNET_PACKET_SKIP ||
skb_tailroom(skb) < pad ||
skb_headroom(skb) < VNET_PACKET_SKIP) {
+ int offset;
+
nskb = alloc_and_align_skb(skb->dev, skb->len);
skb_reserve(nskb, VNET_PACKET_SKIP);
+
+ nskb->protocol = skb->protocol;
+ offset = skb_mac_header(skb) - skb->data;
+ skb_set_mac_header(nskb, offset);
+ offset = skb_network_header(skb) - skb->data;
+ skb_set_network_header(nskb, offset);
+ offset = skb_transport_header(skb) - skb->data;
+ skb_set_transport_header(nskb, offset);
+
+ nskb->csum_offset = skb->csum_offset;
+ nskb->ip_summed = skb->ip_summed;
+
if (skb_copy_bits(skb, 0, nskb->data, skb->len)) {
dev_kfree_skb(nskb);
dev_kfree_skb(skb);
@@ -1078,6 +1149,16 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
d->ncookies = port->tx_bufs[txi].ncookies;
for (i = 0; i < d->ncookies; i++)
d->cookies[i] = port->tx_bufs[txi].cookies[i];
+ if (vio_version_after_eq(&port->vio, 1, 7)) {
+ struct vio_net_dext *dext = vio_net_ext(d);
+
+ memset(dext, 0, sizeof(*dext));
+ if (vio_version_after_eq(&port->vio, 1, 8) &&
+ !port->switch_port) {
+ dext->flags |= VNET_PKT_HCK_IPV4_HDRCKSUM_OK;
+ dext->flags |= VNET_PKT_HCK_FULLCKSUM_OK;
+ }
+ }
/* This has to be a non-SMP write barrier because we are writing
* to memory which is shared with the peer LDOM.
@@ -1370,15 +1451,17 @@ static void vnet_port_free_tx_bufs(struct vnet_port *port)
static int vnet_port_alloc_tx_ring(struct vnet_port *port)
{
struct vio_dring_state *dr;
- unsigned long len;
+ unsigned long len, elen;
int i, err, ncookies;
void *dring;
dr = &port->vio.drings[VIO_DRIVER_TX_RING];
- len = (VNET_TX_RING_SIZE *
- (sizeof(struct vio_net_desc) +
- (sizeof(struct ldc_trans_cookie) * 2)));
+ elen = sizeof(struct vio_net_desc) +
+ sizeof(struct ldc_trans_cookie) * 2;
+ if (vio_version_after_eq(&port->vio, 1, 7))
+ elen += sizeof(struct vio_net_dext);
+ len = VNET_TX_RING_SIZE * elen;
ncookies = VIO_MAX_RING_COOKIES;
dring = ldc_alloc_exp_dring(port->vio.lp, len,
@@ -1392,8 +1475,7 @@ static int vnet_port_alloc_tx_ring(struct vnet_port *port)
}
dr->base = dring;
- dr->entry_size = (sizeof(struct vio_net_desc) +
- (sizeof(struct ldc_trans_cookie) * 2));
+ dr->entry_size = elen;
dr->num_entries = VNET_TX_RING_SIZE;
dr->prod = dr->cons = 0;
port->start_cons = true; /* need an initial trigger */
--
1.7.1
* [PATCH net-next 1/6] sunvnet: rename vnet_port_alloc_tx_bufs and move after version negotiation
From: David L Stevens @ 2014-12-02 20:30 UTC
To: David Miller; +Cc: netdev, Sowmini Varadhan
This patch changes the name of vnet_port_alloc_tx_bufs to
vnet_port_alloc_tx_ring, since there are no buffer allocations after
transmit zero copy support was added. This patch also moves the ring
allocation to after VIO version negotiation to allow for
different-sized descriptors in later VIO versions.
Signed-off-by: David L Stevens <david.stevens@oracle.com>
---
drivers/net/ethernet/sun/sunvnet.c | 18 ++++++++----------
1 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index a556eba..62823fa 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -73,13 +73,19 @@ static int vnet_handle_unknown(struct vnet_port *port, void *arg)
return -ECONNRESET;
}
+static int vnet_port_alloc_tx_ring(struct vnet_port *port);
+
static int vnet_send_attr(struct vio_driver_state *vio)
{
struct vnet_port *port = to_vnet_port(vio);
struct net_device *dev = port->vp->dev;
struct vio_net_attr_info pkt;
int framelen = ETH_FRAME_LEN;
- int i;
+ int i, err;
+
+ err = vnet_port_alloc_tx_ring(to_vnet_port(vio));
+ if (err)
+ return err;
memset(&pkt, 0, sizeof(pkt));
pkt.tag.type = VIO_TYPE_CTRL;
@@ -1361,7 +1367,7 @@ static void vnet_port_free_tx_bufs(struct vnet_port *port)
}
}
-static int vnet_port_alloc_tx_bufs(struct vnet_port *port)
+static int vnet_port_alloc_tx_ring(struct vnet_port *port)
{
struct vio_dring_state *dr;
unsigned long len;
@@ -1640,10 +1646,6 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
netif_napi_add(port->vp->dev, &port->napi, vnet_poll, NAPI_POLL_WEIGHT);
- err = vnet_port_alloc_tx_bufs(port);
- if (err)
- goto err_out_free_ldc;
-
INIT_HLIST_NODE(&port->hash);
INIT_LIST_HEAD(&port->list);
@@ -1677,10 +1679,6 @@ static int vnet_port_probe(struct vio_dev *vdev, const struct vio_device_id *id)
return 0;
-err_out_free_ldc:
- netif_napi_del(&port->napi);
- vio_ldc_free(&port->vio);
-
err_out_free_port:
kfree(port);
--
1.7.1
^ permalink raw reply related
* [PATCH net-next 0/6] sunvnet: add SG, HW_CSUM, GSO, and TSO support
From: David L Stevens @ 2014-12-02 20:30 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Sowmini Varadhan
This patch set adds everything needed for TSO support in sunvnet. On my
test hardware, this increases the single-stream TCP throughput for the
default 1500-byte MTU Linux-Linux from ~2Gbps to 10Gbps and Linux-Solaris
from ~2Gbps to 6Gbps.
David L Stevens (6):
sunvnet: rename vnet_port_alloc_tx_bufs and move after version
negotiation
sunvnet: add VIO v1.7 and v1.8 support
sunvnet: add scatter/gather support
sunvnet: add checksum offload support
sunvnet: add GSO support
sunvnet: add TSO support
arch/sparc/include/asm/vio.h | 19 ++
drivers/net/ethernet/sun/sunvnet.c | 406 ++++++++++++++++++++++++++++++++----
drivers/net/ethernet/sun/sunvnet.h | 9 +-
3 files changed, 388 insertions(+), 46 deletions(-)
^ permalink raw reply
* Re: net-PA Semi: Deletion of unnecessary checks before the function call "pci_dev_put"
From: Luis R. Rodriguez @ 2014-12-02 20:18 UTC (permalink / raw)
To: Dan Carpenter
Cc: Johannes Berg, Julia Lawall, SF Markus Elfring, Lino Sanfilippo,
Olof Johansson, netdev-u79uwXL29TY76Z2rM5mHXA,
backports-u79uwXL29TY76Z2rM5mHXA, LKML,
kernel-janitors-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141202183509.GI4994@mwanda>
On Tue, Dec 02, 2014 at 09:35:09PM +0300, Dan Carpenter wrote:
> On Tue, Dec 02, 2014 at 05:53:28PM +0100, Johannes Berg wrote:
> > On Mon, 2014-12-01 at 21:34 +0100, Julia Lawall wrote:
> >
> > > > So this kind of evolution is no problem for the (automated) backports
> > > > using the backports project - although it can be difficult to detect
> > > > such a thing is needed.
> > >
> > > That is exactly the problem...
> >
> > I'm not convinced though that it should stop such progress in mainline.
>
> Is it progress?
I like to think of progress as using tools to help fix code where we know
it can be made simpler, with a small amendment: if you can extend the tools
to also vet for safety for backports to avoid crashes, even better.
So it's a small evolution but we can do better, which is the point you and
Julia are making.
> These patches make the code look simpler by
> hiding the NULL check inside a function call. Calling pci_dev_put(NULL)
> doesn't make sense. Just because a sanity check exists doesn't mean we
> should do insane things.
It'd crash the system if the function call didn't have the check in place,
but having the code in question call pci_dev_put(NULL) is also ludicrous.
Either way, in this case I think we shouldn't go beyond analyzing the
function call and whether the error check was present before, as it is a real
case that has introduced crashes before, which Julia wanted to flag.
> It's easy enough to store which functions have a sanity check in a
> database,
This is easy but it adds complexities which I'd prefer to keep on
some other people's workstations. For the developer I think we should
strive to only have: a) git b) Coccinelle c) smatch.
> but to remember all that as a human being trying to read the
> code is impossible.
Agreed. The problem statement presented by Julia is part of the effort
of addressing the "how do we evolve faster" problem on Linux kernel development,
what you describe adds to the mix of the complexities, and while Oleg does
note that part of this is academic there are those of us who are making things
which are academic immediately practical and a reality for Linux. This is also
how we evolve faster :)
> If we really wanted to make this code cleaner we would introduce more
> error labels with better names.
Can you describe a bit more what you mean here? If we had a label *in code*
on the caller, perhaps a comment, I can see tool-wise how it'd remove the
requirement for a database for immediate analysis for safety here, ie,
we hunt for a label on the code; but other than that it's unclear what
you mean here.
If you folks agree with my simplification of the tool analysis for safety, can
we devise a tag for whitelisting this check for a series of routines?
Where would we put it, in the kernel or a tools package? If in the kernel
we could end up sharing it, so I think that'd be better. Perhaps scripts/safety/ ?
Maybe use a header that describes the safety check that is vetted by the rule
present, followed by a list of routines vetted?
Then the Cocci file can preload this and a rule that wants this paranoid check
can include this db file for safety ?
The safety here would require vetting through history in git that the routine
has a check in place throughout the routine's history up to a certain point.
I propose we only care up to what kernels are listed on kernel.org as supported.
Luis
^ permalink raw reply
* Re: [PATCH] SSB / B44: fix WOL for BCM4401
From: Michael Büsch @ 2014-12-02 20:12 UTC (permalink / raw)
To: Andrey Skvortsov
Cc: Rafael J. Wysocki, Gary.Zambrano, netdev, linux-kernel, b43-dev,
Rafał Miłecki, Larry Finger
In-Reply-To: <20141202200129.GA4580@crion89>
On Tue, 2 Dec 2014 23:01:29 +0300
Andrey Skvortsov <andrej.skvortzov@gmail.com> wrote:
> On Mon, Dec 01, 2014 at 10:10:23PM +0100, Michael Büsch wrote:
> > On Mon, 1 Dec 2014 23:46:38 +0300
> > Andrey Skvortsov <andrej.skvortzov@gmail.com> wrote:
> >
> > > Wake On Lan was not working on laptop DELL Vostro 1500.
> > > If WOL was turned on, BCM4401 was powered up in suspend mode. LEDs blinked.
> > > But the laptop could not be woken up with the Magic Packet. The reason for
> > > that was that PCIE was not enabled as a system wakeup source and
> > > therefore the host PCI bridge was not powered up in suspend mode.
> > > PCIE was not enabled in suspend by PM because no child devices were
> > > registered as wakeup source during suspend process.
> > > On laptop BCM4401 is connected through the SSB bus, that is connected to the
> > > PCI-Express bus. SSB and B44 did not use standard PM wakeup functions
> > > and did not forward wakeup settings to their parents.
> > > To fix that B44 driver enables PM wakeup and registers new wakeup source
> > > using device_set_wakeup_enable(). Wakeup is automatically reported to the parent SSB
> > > bus via power.wakeup_path. SSB bus enables wakeup for the parent PCI bridge, if there is any
> > > child devices with enabled wakeup functionality. All other steps are
> > > done by PM core code.
> >
> > Thanks, this looks good.
> > I assume you tested this (I currently don't have a device to test this).
>
> Sure, I've tested it. WOL from suspend is working and after resume from hibernate Ethernet is working too.
That sounds good, indeed.
I'd still prefer, if someone with b43 (wireless) would test it, too.
--
Michael
^ permalink raw reply
* Re: [PATCH] SSB / B44: fix WOL for BCM4401
From: Andrey Skvortsov @ 2014-12-02 20:01 UTC (permalink / raw)
To: Michael Büsch
Cc: Rafael J. Wysocki, Gary.Zambrano, netdev, linux-kernel, b43-dev,
Rafał Miłecki, Larry Finger
In-Reply-To: <20141201221023.79ffb40d@wiggum>
On Mon, Dec 01, 2014 at 10:10:23PM +0100, Michael Büsch wrote:
> On Mon, 1 Dec 2014 23:46:38 +0300
> Andrey Skvortsov <andrej.skvortzov@gmail.com> wrote:
>
> > Wake On Lan was not working on laptop DELL Vostro 1500.
> > If WOL was turned on, BCM4401 was powered up in suspend mode. LEDs blinked.
> > But the laptop could not be woken up with the Magic Packet. The reason for
> > that was that PCIE was not enabled as a system wakeup source and
> > therefore the host PCI bridge was not powered up in suspend mode.
> > PCIE was not enabled in suspend by PM because no child devices were
> > registered as wakeup source during suspend process.
> > On laptop BCM4401 is connected through the SSB bus, that is connected to the
> > PCI-Express bus. SSB and B44 did not use standard PM wakeup functions
> > and did not forward wakeup settings to their parents.
> > To fix that B44 driver enables PM wakeup and registers new wakeup source
> > using device_set_wakeup_enable(). Wakeup is automatically reported to the parent SSB
> > bus via power.wakeup_path. SSB bus enables wakeup for the parent PCI bridge, if there is any
> > child devices with enabled wakeup functionality. All other steps are
> > done by PM core code.
>
> Thanks, this looks good.
> I assume you tested this (I currently don't have a device to test this).
Sure, I've tested it. WOL from suspend is working and after resume from hibernate Ethernet is working too.
> Larry, Rafał, any other b43 user:
> Can you please test whether this doesn't cause regressions for suspend/resume on b43?
> (Patch is attached as reference)
>
>
> > Signed-off-by: Andrey Skvortsov <Andrej.Skvortzov@gmail.com>
> > ---
> > drivers/net/ethernet/broadcom/b44.c | 2 ++
> > drivers/ssb/pcihost_wrapper.c | 33 ++++++++++++++++++++++-----------
> > 2 files changed, 24 insertions(+), 11 deletions(-)
--
Best regards,
Andrey Skvortsov
PGP Key ID: 0x57A3AEAD
^ permalink raw reply
* Re: [PATCH net-next] rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
From: Alexei Starovoitov @ 2014-12-02 19:53 UTC (permalink / raw)
To: Thomas Graf
Cc: Mahesh Bandewar, netdev, David Miller, Eric Dumazet, Roopa Prabhu,
Toshiaki Makita
In-Reply-To: <20141202100746.GA13717@casper.infradead.org>
On Tue, Dec 2, 2014 at 2:07 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/01/14 at 09:54pm, Mahesh Bandewar wrote:
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -2220,8 +2220,16 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
>> return skb->len;
>> }
>>
>> -void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
>> - gfp_t flags)
>> +void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev, gfp_t flags)
>> +{
>> + struct net *net = dev_net(dev);
>> +
>> + rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, flags);
>> +}
>> +EXPORT_SYMBOL(rtmsg_ifinfo_send);
>> +
>> +struct sk_buff *rtmsg_ifinfo(int type, struct net_device *dev,
>> + unsigned int change, gfp_t flags, bool fill_only)
>> {
>> struct net *net = dev_net(dev);
>> struct sk_buff *skb;
>> @@ -2239,11 +2247,15 @@ void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
>> kfree_skb(skb);
>> goto errout;
>> }
>> + if (fill_only)
>> + return skb;
>> +
>> rtnl_notify(skb, net, 0, RTNLGRP_LINK, NULL, flags);
>> - return;
>> + return NULL;
>> errout:
>> if (err < 0)
>> rtnl_set_sk_err(net, RTNLGRP_LINK, err);
>> + return NULL;
>> }
>
> I think it would be cleaner to introduce a new function, for example
> rtmsg_ifinfo_build_skb() which is called from rtmsg_ifinfo(). The
> single caller that requires delayed sending can use the build skb
> function directly and then send it off.
+1
that would make patch much smaller.
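
To make the suggestion concrete, here is a minimal userspace sketch of the build/send split. It is not the kernel code: struct fake_skb and the three function names are hypothetical stand-ins, and the mock leaves freeing to the caller, unlike a real notification path.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff; only tracks what we need. */
struct fake_skb {
	int type;	/* message type, e.g. a DELLINK-style notification */
	int sent;	/* set to 1 once the message has been "sent" */
};

/* Build step only: allocate and fill the message, do not send it. */
static struct fake_skb *ifinfo_build_skb(int type)
{
	struct fake_skb *skb = calloc(1, sizeof(*skb));

	if (skb)
		skb->type = type;
	return skb;
}

/* Send step only: deliver a message that was built earlier. */
static void ifinfo_send(struct fake_skb *skb)
{
	skb->sent = 1;
}

/* Convenience wrapper for the common case: build and send at once. */
static struct fake_skb *ifinfo(int type)
{
	struct fake_skb *skb = ifinfo_build_skb(type);

	if (skb)
		ifinfo_send(skb);
	return skb;
}
```

With this shape, the one caller that needs delayed delivery calls the build function first, does its teardown work, and then calls the send function, while every other caller keeps using the wrapper unchanged.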
^ permalink raw reply
* [net PATCH] fib_trie: Fix /proc/net/fib_trie when CONFIG_IP_MULTIPLE_TABLES is not defined
From: Alexander Duyck @ 2014-12-02 18:58 UTC (permalink / raw)
To: netdev; +Cc: davem
In recent testing I had disabled CONFIG_IP_MULTIPLE_TABLES and as a result
when I ran "cat /proc/net/fib_trie" the main trie was displayed multiple
times. I found that the problem line of code was in the function
fib_trie_seq_next. Specifically the line below caused the indexes to go in
the opposite direction of our traversal:
h = tb->tb_id & (FIB_TABLE_HASHSZ - 1);
The issue was that the RT tables are defined such that RT_TABLE_LOCAL is ID
255, while it is located at TABLE_LOCAL_INDEX of 0, and RT_TABLE_MAIN is 254
with a TABLE_MAIN_INDEX of 1. This means that the above line will return 1
for the local table and 0 for main. The result is that fib_trie_seq_next
will return NULL at the end of the local table, fib_trie_seq_start will
return the start of the main table, and then fib_trie_seq_next will loop on
main forever as h will always return 0.
The fix for this is to reverse the ordering of the two tables. It has the
advantage of making it so that the tables now print in the same order
regardless of if multiple tables are enabled or not. In order to make the
definition consistent with the multiple tables case I simply masked the two
RT_TABLE_XXX values by (FIB_TABLE_HASHSZ - 1). This way the two table
layouts should always stay consistent.
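
The index mismatch can be checked with a few lines of standalone C. This is only a sketch: it assumes FIB_TABLE_HASHSZ is 2 when CONFIG_IP_MULTIPLE_TABLES is not defined, and the OLD_/NEW_ macro names and the hash_index() and check_layout() helpers are made up for illustration.

```c
#include <assert.h>

/* Table IDs as described above. */
#define RT_TABLE_MAIN		254
#define RT_TABLE_LOCAL		255

/* Assumption: hash size used when multiple tables are disabled. */
#define FIB_TABLE_HASHSZ	2

/* Fixed slots before the fix. */
#define OLD_TABLE_LOCAL_INDEX	0
#define OLD_TABLE_MAIN_INDEX	1

/* Slots after the fix: derived from the table IDs. */
#define NEW_TABLE_LOCAL_INDEX	(RT_TABLE_LOCAL & (FIB_TABLE_HASHSZ - 1))
#define NEW_TABLE_MAIN_INDEX	(RT_TABLE_MAIN & (FIB_TABLE_HASHSZ - 1))

/* Hypothetical helper mirroring the masking done in fib_trie_seq_next(). */
static unsigned int hash_index(unsigned int tb_id)
{
	return tb_id & (FIB_TABLE_HASHSZ - 1);
}

/* After the fix, the slot a table lives in matches its computed index. */
static void check_layout(void)
{
	assert(hash_index(RT_TABLE_LOCAL) == NEW_TABLE_LOCAL_INDEX);
	assert(hash_index(RT_TABLE_MAIN) == NEW_TABLE_MAIN_INDEX);
}
```

With the old fixed slots, hash_index(RT_TABLE_LOCAL) is 1 while the local table sits in slot 0, which is the disagreement that made the iterator revisit the main table.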
Fixes: 93456b6 ("[IPV4]: Unify access to the routing tables")
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
include/net/ip_fib.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index dc9d2a2..09a819e 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -201,8 +201,8 @@ void fib_free_table(struct fib_table *tb);
#ifndef CONFIG_IP_MULTIPLE_TABLES
-#define TABLE_LOCAL_INDEX 0
-#define TABLE_MAIN_INDEX 1
+#define TABLE_LOCAL_INDEX (RT_TABLE_LOCAL & (FIB_TABLE_HASHSZ - 1))
+#define TABLE_MAIN_INDEX (RT_TABLE_MAIN & (FIB_TABLE_HASHSZ - 1))
static inline struct fib_table *fib_get_table(struct net *net, u32 id)
{
^ permalink raw reply related
* Re: net-PA Semi: Deletion of unnecessary checks before the function call "pci_dev_put"
From: Luis R. Rodriguez @ 2014-12-02 18:45 UTC (permalink / raw)
To: Johannes Berg
Cc: Julia Lawall, SF Markus Elfring, Lino Sanfilippo, Olof Johansson,
netdev@vger.kernel.org, backports@vger.kernel.org, LKML,
kernel-janitors@vger.kernel.org
In-Reply-To: <1417539208.1841.1.camel@sipsolutions.net>
On Tue, Dec 2, 2014 at 11:53 AM, Johannes Berg
<johannes@sipsolutions.net> wrote:
> On Mon, 2014-12-01 at 21:34 +0100, Julia Lawall wrote:
>
>> > So this kind of evolution is no problem for the (automated) backports
>> > using the backports project - although it can be difficult to detect
>> > such a thing is needed.
>>
>> That is exactly the problem...
>
> I'm not convinced though that it should stop such progress in mainline.
I believe this case requires a bit more explanation of why it was
brought up. The "form" of change this patch has is of the
type that can crash systems if the NULL pointer check on the caller
implementation was only added later. We might be able to grammatically
check for this situation in the future if we had a white list / black
list / kernel revision where the NULL check was added but for now we
don't have that and as such care is just required on the developer in
consideration for backports.
It should be up to the maintainer to appreciate the gains of doing
something differently to make it easier for backporting. I obviously
think it's a good thing to consider, it's extra work though, so only if
the maintainer has some appreciation for backporting would this make
sense.
In this particular case I've reviewed Julia's concern and I've
determined that the patch is safe up to at least v2.6.12-rc2 (which is
where our git history begins on Linus' tree), this is because the
check for NULL has been there since then:
git show 1da177e drivers/pci/pci-driver.c
+void pci_dev_put(struct pci_dev *dev)
+{
+ if (dev)
+ put_device(&dev->dev);
+}
So this type of wide collateral evolution should not cause panics.
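
For readers following along, the kind of change under discussion can be reproduced outside the kernel. The sketch below uses mock types: struct device, struct pci_dev and put_device() here are simplified stand-ins, and release_old() / release_new() are hypothetical callers, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct device {
	int refcount;
};

struct pci_dev {
	struct device dev;
};

/* Mock: drop one reference; must never be called with a NULL device. */
static void put_device(struct device *dev)
{
	assert(dev->refcount > 0);
	dev->refcount--;
}

/* Same shape as the function quoted above: NULL is tolerated here. */
static void pci_dev_put(struct pci_dev *dev)
{
	if (dev)
		put_device(&dev->dev);
}

/* Caller before the cleanup: redundant check at the call site. */
static void release_old(struct pci_dev *pdev)
{
	if (pdev)
		pci_dev_put(pdev);
}

/* Caller after the cleanup: relies on the check inside pci_dev_put(). */
static void release_new(struct pci_dev *pdev)
{
	pci_dev_put(pdev);
}
```

Both callers behave identically only as long as the callee keeps its own NULL check. On a tree where that check were missing, release_new(NULL) would dereference NULL, which is the backport hazard being vetted here.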
Because of this:
Acked-by: Luis R. Rodriguez <mcgrof@suse.com>
But note -- I still think it's only good for us to vet these, if we
can't why not? If the maintainer doesn't give a shit that's different,
but if there are folks out there willing to help with vetting then
well, why not :)
PS. Including something like historical vetting as I did above on the
commit log should help folks.
Luis
^ permalink raw reply
* [net_test_tools] udpflood: Add IPv6 support
From: Martin KaFai Lau @ 2014-12-02 18:41 UTC (permalink / raw)
To: davem; +Cc: netdev
In-Reply-To: <1413837765-5446-1-git-send-email-kafai@fb.com>
This patch:
1. Add IPv6 support
2. Print timing for every 65536 fib insert operations to observe
the gc effect (mostly for IPv6 fib).
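
The periodic timing report from item 2 can be sketched in isolation as follows. The helper names is_report_point() and format_elapsed() are hypothetical and not part of udpflood.c, and the fractional part is zero-padded here so that 1005 ms reads as 1.005.

```c
#include <stdio.h>
#include <sys/time.h>

/* Millisecond difference between two timestamps. */
static long get_diff_ms(const struct timeval *now,
			const struct timeval *start)
{
	long start_ms = start->tv_sec * 1000 + (start->tv_usec / 1000);
	long now_ms = now->tv_sec * 1000 + (now->tv_usec / 1000);

	return now_ms - start_ms;
}

/* True for every 65536th iteration, except iteration 0. */
static int is_report_point(int i)
{
	return i && (i & 0xFFFF) == 0;
}

/* Format milliseconds as seconds with a zero-padded fractional part. */
static void format_elapsed(long diff_ms, char *buf, size_t len)
{
	snprintf(buf, len, "%ld.%03ld", diff_ms / 1000, diff_ms % 1000);
}
```

In the send loop, a caller would check is_report_point(i), take a fresh gettimeofday() sample, print the formatted difference, and then keep the new sample as the start of the next interval.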
---
udpflood.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 101 insertions(+), 24 deletions(-)
diff --git a/udpflood.c b/udpflood.c
index 6e658f7..5855012 100644
--- a/udpflood.c
+++ b/udpflood.c
@@ -6,7 +6,9 @@
#include <string.h>
#include <errno.h>
#include <unistd.h>
+#include <stdint.h>
+#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
@@ -15,57 +17,121 @@
#define _GNU_SOURCE
#include <getopt.h>
+static int debug = 0;
+
+typedef union sa_u {
+ struct sockaddr_in a4;
+ struct sockaddr_in6 a6;
+} sa_u;
+
static int usage(void)
{
printf("usage: udpflood [ -l count ] [ -m message_size ] [ -c num_ip_addrs ] IP_ADDRESS\n");
return -1;
}
-static int send_packets(in_addr_t start_addr, in_addr_t end_addr,
- int port, int count, int msg_sz)
+static uint32_t get_last32h(const sa_u *sa)
+{
+ if (sa->a4.sin_family == PF_INET)
+ return ntohl(sa->a4.sin_addr.s_addr);
+ else
+ return ntohl(sa->a6.sin6_addr.s6_addr32[3]);
+}
+
+static void set_last32h(sa_u *sa, uint32_t last32h)
+{
+ if (sa->a4.sin_family == PF_INET)
+ sa->a4.sin_addr.s_addr = htonl(last32h);
+ else
+ sa->a6.sin6_addr.s6_addr32[3] = htonl(last32h);
+}
+
+static void print_sa(const sa_u *sa, const char *msg)
+{
+ char buf[sizeof("xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx")];
+
+ if (!debug)
+ return;
+
+ switch (sa->a4.sin_family) {
+ case PF_INET:
+ inet_ntop(PF_INET, &(sa->a4.sin_addr.s_addr), buf,
+ sizeof(buf));
+ break;
+ case PF_INET6:
+ inet_ntop(PF_INET6, sa->a6.sin6_addr.s6_addr, buf, sizeof(buf));
+ break;
+ }
+
+ printf("%s: %s\n", msg, buf);
+}
+
+static long get_diff_ms(const struct timeval *now,
+ const struct timeval *start)
+{
+ long start_ms, now_ms;
+ start_ms = start->tv_sec * 1000 + (start->tv_usec / 1000);
+ now_ms = now->tv_sec * 1000 + (now->tv_usec / 1000);
+ return now_ms - start_ms;
+}
+
+static int send_packets(const sa_u *start_sa, size_t num_addrs, int count,
+ int msg_sz)
{
char *msg = malloc(msg_sz);
- struct sockaddr_in saddr;
- in_addr_t addr;
+ sa_u cur_sa;
+ uint32_t start_addr32h, end_addr32h, cur_addr32h;
int fd, i, err;
+ struct timeval last, now;
if (!msg)
return -ENOMEM;
memset(msg, 0, msg_sz);
- addr = start_addr;
-
- memset(&saddr, 0, sizeof(saddr));
- saddr.sin_family = AF_INET;
- saddr.sin_port = port;
- saddr.sin_addr.s_addr = addr;
+ memcpy(&cur_sa, start_sa, sizeof(cur_sa));
+ cur_addr32h = start_addr32h = get_last32h(&cur_sa);
+ end_addr32h = start_addr32h + num_addrs;
- fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ fd = socket(cur_sa.a4.sin_family, SOCK_DGRAM, IPPROTO_IP);
if (fd < 0) {
perror("socket");
err = fd;
goto out_nofd;
}
- err = connect(fd, (struct sockaddr *) &saddr, sizeof(saddr));
+ err = connect(fd, (struct sockaddr *) &cur_sa, sizeof(cur_sa));
if (err < 0) {
perror("connect");
- close(fd);
goto out;
}
+ print_sa(start_sa, "start_addr");
+ gettimeofday(&last, NULL);
for (i = 0; i < count; i++) {
- saddr.sin_addr.s_addr = addr;
-
+ print_sa(&cur_sa, "sendto");
err = sendto(fd, msg, msg_sz, 0,
- (struct sockaddr *) &saddr, sizeof(saddr));
+ (struct sockaddr *) &cur_sa, sizeof(cur_sa));
if (err < 0) {
perror("sendto");
goto out;
}
- if (++addr >= end_addr)
- addr = start_addr;
+ if (++cur_addr32h >= end_addr32h)
+ cur_addr32h = start_addr32h;
+ set_last32h(&cur_sa, cur_addr32h);
+
+ /*
+ * print timing info for every 65536 fib inserts to
+ * observe the gc effect (mostly for IPv6 fib).
+ */
+ if (i && (i & 0xFFFF) == 0) {
+ long diff_ms;
+ gettimeofday(&now, NULL);
+ diff_ms = get_diff_ms(&now, &last);
+ printf("%d %ld.%03ld\n", i >> 16,
+ diff_ms / 1000, diff_ms % 1000);
+ memcpy(&last, &now, sizeof(last));
+ }
}
err = 0;
@@ -79,14 +145,14 @@ out_nofd:
int main(int argc, char **argv, char **envp)
{
int port, msg_sz, count, num_addrs, ret;
- in_addr_t start_addr, end_addr;
+ sa_u start_sa;
port = 6000;
msg_sz = 32;
count = 10000000;
num_addrs = 1;
- while ((ret = getopt(argc, argv, "l:s:p:c:")) >= 0) {
+ while ((ret = getopt(argc, argv, "dl:s:p:c:")) >= 0) {
switch (ret) {
case 'l':
sscanf(optarg, "%d", &count);
@@ -100,18 +166,29 @@ int main(int argc, char **argv, char **envp)
case 'c':
sscanf(optarg, "%d", &num_addrs);
break;
+ case 'd':
+ debug = 1;
+ break;
case '?':
return usage();
}
}
+ if (num_addrs < 1 || count < 1)
+ return usage();
+
if (!argv[optind])
return usage();
- start_addr = inet_addr(argv[optind]);
- if (start_addr == INADDR_NONE)
+ memset(&start_sa, 0, sizeof(start_sa));
+ start_sa.a4.sin_port = htons(port);
+ if (inet_pton(PF_INET, argv[optind], &start_sa.a4.sin_addr))
+ start_sa.a4.sin_family = PF_INET;
+ else if (inet_pton(PF_INET6, argv[optind],
+ start_sa.a6.sin6_addr.s6_addr))
+ start_sa.a6.sin6_family = PF_INET6;
+ else
return usage();
- end_addr = start_addr + num_addrs;
- return send_packets(start_addr, end_addr, port, count, msg_sz);
+ return send_packets(&start_sa, num_addrs, count, msg_sz);
}
--
1.8.1
^ permalink raw reply related
* Re: [PATCH] net: mvneta: fix Tx interrupt delay
From: Eric Dumazet @ 2014-12-02 18:39 UTC (permalink / raw)
To: Ezequiel Garcia
Cc: Willy Tarreau, netdev, Maggie Mae Roxas, Thomas Petazzoni,
Gregory CLEMENT
In-Reply-To: <547DF2EA.2020908@free-electrons.com>
On Tue, 2014-12-02 at 14:12 -0300, Ezequiel Garcia wrote:
> Eric,
>
> On 12/02/2014 09:18 AM, Eric Dumazet wrote:
> [..]
> >
> > I am surprised TCP even worked correctly with this problem.
> >
> > I highly suggest BQL for this driver, now this issue is fixed.
> >
>
> Implementing BQL for the mvneta driver was something I wanted to do a
> while ago, but you explained that these kind drivers (i.e. those with
> software TSO) didn't need BQL, because the latency that resulted from
> the ring was too small.
>
> Quoting (http://www.spinics.net/lists/netdev/msg284439.html):
> ""
> Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
> descriptors.
>
> So I do not think BQL is really needed, because a 512 slots TX ring wont
> add a big latency : About 5 ms max.
>
> BQL is generally nice for NIC supporting hardware TSO, where a 64KB TSO
> packet consumes 3 or 4 descriptors.
>
> Also note that TCP Small Queues should limit TX ring occupancy of a
> single bulk flow anyway.
> ""
>
> Maybe I misunderstood something?
This was indeed the case, but we recently added xmit_more support, and
it uses BQL information. So you might add BQL anyway, if xmit_more
support is useful for this hardware.
^ permalink raw reply
* Re: net-PA Semi: Deletion of unnecessary checks before the function call "pci_dev_put"
From: Dan Carpenter @ 2014-12-02 18:35 UTC (permalink / raw)
To: Johannes Berg
Cc: Julia Lawall, SF Markus Elfring, Lino Sanfilippo, Olof Johansson,
netdev-u79uwXL29TY76Z2rM5mHXA, backports-u79uwXL29TY76Z2rM5mHXA,
LKML, kernel-janitors-u79uwXL29TY76Z2rM5mHXA, Luis R. Rodriguez
In-Reply-To: <1417539208.1841.1.camel-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
On Tue, Dec 02, 2014 at 05:53:28PM +0100, Johannes Berg wrote:
> On Mon, 2014-12-01 at 21:34 +0100, Julia Lawall wrote:
>
> > > So this kind of evolution is no problem for the (automated) backports
> > > using the backports project - although it can be difficult to detect
> > > such a thing is needed.
> >
> > That is exactly the problem...
>
> I'm not convinced though that it should stop such progress in mainline.
Is it progress? These patches make the code look simpler by
hiding the NULL check inside a function call. Calling pci_dev_put(NULL)
doesn't make sense. Just because a sanity check exists doesn't mean we
should do insane things.
It's easy enough to store which functions have a sanity check in a
database, but to remember all that as a human being trying to read the
code is impossible.
If we really wanted to make this code cleaner we would introduce more
error labels with better names.
regards,
dan carpenter
^ permalink raw reply
* Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
From: Jesse Gross @ 2014-12-02 18:26 UTC (permalink / raw)
To: Tom Herbert
Cc: Joe Stringer, netdev, Shannon Nelson, Brandeburg, Jesse,
Jeff Kirsher, linux.nics, Linux Kernel Mailing List
In-Reply-To: <CA+mtBx9RwkJ9b84d_OkCtOunFBsuqw276=5E+Qhoqq75utCR4w@mail.gmail.com>
On Mon, Dec 1, 2014 at 4:09 PM, Tom Herbert <therbert@google.com> wrote:
> On Mon, Dec 1, 2014 at 3:53 PM, Jesse Gross <jesse@nicira.com> wrote:
>> On Mon, Dec 1, 2014 at 3:47 PM, Tom Herbert <therbert@google.com> wrote:
>>> On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer <joestringer@nicira.com> wrote:
>>>> On 21 November 2014 at 09:59, Joe Stringer <joestringer@nicira.com> wrote:
>>>>> On 20 November 2014 16:19, Jesse Gross <jesse@nicira.com> wrote:
>>>>>> I don't know if we need to have the check at all for IPIP though -
>>>>>> after all the driver doesn't expose support for it all (actually it
>>>>>> doesn't expose GRE either). This raises kind of an interesting
>>>>>> question about the checks though - it's pretty easy to add support to
>>>>>> the driver for a new GSO type (and I imagine that people will be
>>>>>> adding GRE soon) and forget to update the check.
>>>>>
>>>>> If the check is more conservative, then testing would show that it's
>>>>> not working and lead people to figure out why (and update the check).
>>>>
>>>> More concretely, one suggestion would be something like following at
>>>> the start of each gso_check():
>>>>
>>>> + const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE |
>>>> + SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL;
>>>> +
>>>> + if (skb_shinfo(skb)->gso_type & ~supported)
>>>> + return false;
>>>
>>> This should already be handled by net_gso_ok.
>>
>> My original point wasn't so much that this isn't handled at the moment
>> but that it's easy to add a supported GSO type but then forget to
>> update this check - i.e. if a driver already supports UDP_TUNNEL and
>> adds support for GRE with the same constraints. It seems not entirely
>> ideal that this function is acting as a blacklist rather than a
>> whitelist.
>
> Agreed, it would be nice to have all the checking logic in one place.
> If all the drivers end up implementing ndo_gso_check then we could
> potentially get rid of the GSO types as features. This probably
> wouldn't be a bad thing since we already know that the features
> mechanism doesn't scale (for instance there's no way to indicate that
> certain combinations of GSO types are supported by a device).
This crossed my mind and I agree that it's pretty clear that the
features mechanism isn't scaling very well. Presumably, the logical
extension of this is that each driver would have a function that looks
at a packet and returns a set of offload operations that it can
support rather than exposing a set of protocols. However, it seems
like it would probably result in a bunch of duplicate code in each
driver.
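
The whitelist-style check quoted earlier in this thread boils down to one mask test. A minimal standalone sketch follows; the GSO_* bit values and the gso_type_ok() helper are hypothetical and do not match the kernel's SKB_GSO_* definitions.

```c
#include <stdbool.h>

/* Hypothetical GSO type bits, for illustration only. */
enum {
	GSO_TCPV4	= 1 << 0,
	GSO_TCPV6	= 1 << 1,
	GSO_UDP_TUNNEL	= 1 << 2,
	GSO_GRE		= 1 << 3,
};

/*
 * Whitelist check: accept the packet only if every GSO type bit it
 * carries is in the set the driver has explicitly declared.
 */
static bool gso_type_ok(unsigned int gso_type, unsigned int supported)
{
	return (gso_type & ~supported) == 0;
}
```

A driver that declares GSO_TCPV4 | GSO_TCPV6 | GSO_UDP_TUNNEL would then reject a GRE packet by default until someone adds GSO_GRE to the mask, which is the conservative failure mode argued for above.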
^ permalink raw reply
* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Jesse Gross @ 2014-12-02 18:12 UTC (permalink / raw)
To: Thomas Graf
Cc: Michael S. Tsirkin, Du, Fan, Jason Wang, netdev@vger.kernel.org,
davem@davemloft.net, fw@strlen.de, dev@openvswitch.org,
Pravin Shelar
In-Reply-To: <20141202174158.GB9457@casper.infradead.org>
On Tue, Dec 2, 2014 at 9:41 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>> > > What about containers or any other virtualization environment that
>> > > doesn't use Virtio?
>> >
>> > The host can dictate the MTU in that case for both veth or OVS
>> > internal which would be primary container plumbing techniques.
>>
>> It typically can't do this easily for VMs with emulated devices:
>> real ethernet uses a fixed MTU.
>>
>> IMHO it's confusing to suggest MTU as a fix for this bug, it's
>> an unrelated optimization.
>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>
> PMTU discovery only resolves the issue if an actual IP stack is
> running inside the VM. This may not be the case at all.
It's also only really a correct thing to do if the ICMP packet is
coming from an L3 node. If you are doing straight bridging then you
have to resort to hacks like OVS had before, which I agree are not
particularly desirable.
> I agree that exposing an MTU towards the guest is not applicable
> in all situations, in particular because it is difficult to decide
> what MTU to expose. It is a relatively elegant solution in a lot
> of virtualization host cases hooked up to an orchestration system
> though.
I also think this is the right thing to do as a common case
optimization and I know other platforms (such as Hyper-V) do it. It's
not a complete solution so we still need the original patch in this
thread to handle things transparently.
^ permalink raw reply
* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Jesse Gross @ 2014-12-02 18:06 UTC (permalink / raw)
To: Du, Fan, Jason Wang, netdev@vger.kernel.org, davem@davemloft.net,
fw@strlen.de
In-Reply-To: <20141202154425.GA5344@t520.home>
On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
> On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
>>
>>
>> >-----Original Message-----
>> >From: Jason Wang [mailto:jasowang@redhat.com]
>> >Sent: Friday, November 28, 2014 3:02 PM
>> >To: Du, Fan
>> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
>> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>> >
>> >
>> >
>> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> >> Test scenario: two KVM guests sitting in different hosts communicate
>> >> to each other with a vxlan tunnel.
>> >>
>> >> All interface MTU is default 1500 Bytes, from guest point of view, its
>> >> skb gso_size could be as bigger as 1448Bytes, however after guest skb
>> >> goes through vxlan encapuslation, individual segments length of a gso
>> >> packet could exceed physical NIC MTU 1500, which will be lost at
>> >> recevier side.
>> >>
>> >> So it's possible in virtualized environment, locally created skb len
>> >> after encapslation could be bigger than underlayer MTU. In such case,
>> >> it's reasonable to do GSO first, then fragment any packet bigger than
>> >> MTU as possible.
>> >>
>> >> +---------------+ TX RX +---------------+
>> >> | KVM Guest | -> ... -> | KVM Guest |
>> >> +-+-----------+-+ +-+-----------+-+
>> >> |Qemu/VirtIO| |Qemu/VirtIO|
>> >> +-----------+ +-----------+
>> >> | |
>> >> v tap0 tap0 v
>> >> +-----------+ +-----------+
>> >> | ovs bridge| | ovs bridge|
>> >> +-----------+ +-----------+
>> >> | vxlan vxlan |
>> >> v v
>> >> +-----------+ +-----------+
>> >> | NIC | <------> | NIC |
>> >> +-----------+ +-----------+
>> >>
>> >> Steps to reproduce:
>> >> 1. Using kernel builtin openvswitch module to setup ovs bridge.
>> >> 2. Runing iperf without -M, communication will stuck.
>> >
>> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
>> >believe.
>>
>> Problem here is host stack push local over-sized gso skb down to NIC, and perform GSO there
>> without any further ip segmentation.
>>
>> Reasonable behavior is do gso first at ip level, if gso-ed skb is bigger than MTU && df is set,
>> Then push ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust mtu.
>>
>> For PMTU to work, that's another issue I will try to address later on.
>>
>> >>
>> >>
>> >> Signed-off-by: Fan Du <fan.du@intel.com>
>> >> ---
>> >> net/ipv4/ip_output.c | 7 ++++---
>> >> 1 files changed, 4 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
>> >> bc6471d..558b5f8 100644
>> >> --- a/net/ipv4/ip_output.c
>> >> +++ b/net/ipv4/ip_output.c
>> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
>> >> *skb)
>> >> struct sk_buff *segs;
>> >> int ret = 0;
>> >>
>> >> - /* common case: locally created skb or seglen is <= mtu */
>> >> - if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> >> - skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> + /* Both locally created skb and forwarded skb could exceed
>> >> + * MTU size, so make a unified rule for them all.
>> >> + */
>> >> + if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> return ip_finish_output2(skb);
>
>
> Are you using kernel's vxlan device or openvswitch's vxlan device?
>
> Because for kernel's vxlan devices the MTU accounts for the header
> overhead so I believe your patch would work. However, the MTU is
> not visible for the ovs's vxlan devices, so that wouldn't work.
This is being called after the tunnel code, so the MTU that is being
looked at in all cases is the physical device's. Since the packet has
already been encapsulated, tunnel header overhead is already accounted
for in skb_gso_network_seglen() and this should be fine for both OVS
and non-OVS cases.
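
To make the arithmetic concrete, here is a minimal user-space sketch of the check being discussed. It is not kernel code: encap_network_seglen() and choose_tx_action() are hypothetical helpers, the header sizes assume VXLAN over an IPv4 underlay without IP options, and the three outcomes only model the behavior described earlier in this thread (send as-is, fragment, or signal ICMP_FRAG_NEEDED when DF is set).

```c
#include <stdbool.h>
#include <stddef.h>

/* Header sizes in bytes. Assumption: IPv4 underlay, no IP options. */
enum {
    OUTER_IPV4_HLEN = 20,
    OUTER_UDP_HLEN  = 8,
    VXLAN_HLEN      = 8,
    INNER_ETH_HLEN  = 14,
};

/* Outcomes as described in the thread; the names are illustrative only. */
enum tx_action {
    TX_SEND_AS_IS,               /* segment length fits the MTU */
    TX_SEGMENT_THEN_FRAGMENT,    /* too big, DF clear: fragment after GSO */
    TX_SEGMENT_THEN_FRAG_NEEDED, /* too big, DF set: ICMP_FRAG_NEEDED to sender */
};

/* Network-layer length of one segment after VXLAN encapsulation,
 * given the length of the inner IP packet. */
static size_t encap_network_seglen(size_t inner_ip_len)
{
    return OUTER_IPV4_HLEN + OUTER_UDP_HLEN + VXLAN_HLEN +
           INNER_ETH_HLEN + inner_ip_len;
}

/* Compare the encapsulated segment length against the physical MTU,
 * applying the same rule to local and forwarded traffic. */
static enum tx_action choose_tx_action(size_t seglen, size_t phys_mtu,
                                       bool df_set)
{
    if (seglen <= phys_mtu)
        return TX_SEND_AS_IS;
    return df_set ? TX_SEGMENT_THEN_FRAG_NEEDED : TX_SEGMENT_THEN_FRAGMENT;
}
```

With a guest MTU of 1500, encap_network_seglen(1500) is 1550, which no longer fits a 1500-byte physical MTU; this is the case the reproduction steps hit.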
^ permalink raw reply
* Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
From: Thomas Graf @ 2014-12-02 17:41 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Du, Fan, 'Jason Wang', netdev@vger.kernel.org,
davem@davemloft.net, fw@strlen.de, dev, jesse, pshelar
In-Reply-To: <20141202173401.GB4126@redhat.com>
On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> > > What about containers or any other virtualization environment that
> > > doesn't use Virtio?
> >
> > The host can dictate the MTU in that case for both veth or OVS
> > internal which would be primary container plumbing techniques.
>
> It typically can't do this easily for VMs with emulated devices:
> real ethernet uses a fixed MTU.
>
> IMHO it's confusing to suggest MTU as a fix for this bug, it's
> an unrelated optimization.
> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
PMTU discovery only resolves the issue if an actual IP stack is
running inside the VM. This may not be the case at all.
I agree that exposing an MTU towards the guest is not applicable
in all situations, in particular because it is difficult to decide
what MTU to expose. It is a relatively elegant solution in a lot
of virtualization host cases hooked up to an orchestration system
though.
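
As a rough illustration of what "exposing an MTU towards the guest" would involve, the sketch below derives a guest MTU from the physical MTU. It assumes a VXLAN tunnel with no extra headers (no VLAN tag, no IP options); vxlan_overhead() and guest_mtu() are hypothetical names, not an existing API, and how an orchestration system would actually push the value to the guest is outside this sketch.

```c
#include <stddef.h>

enum underlay { UNDERLAY_IPV4, UNDERLAY_IPV6 };

/* Bytes added at the network layer by VXLAN encapsulation:
 * outer IP + outer UDP (8) + VXLAN (8) + inner Ethernet (14). */
static size_t vxlan_overhead(enum underlay u)
{
    size_t outer_ip = (u == UNDERLAY_IPV6) ? 40 : 20;
    return outer_ip + 8 + 8 + 14;
}

/* Largest guest MTU whose encapsulated packets still fit phys_mtu.
 * Returns 0 if the physical MTU cannot even hold the tunnel headers. */
static size_t guest_mtu(size_t phys_mtu, enum underlay u)
{
    size_t overhead = vxlan_overhead(u);
    return phys_mtu > overhead ? phys_mtu - overhead : 0;
}
```

For a 1500-byte physical MTU this gives 1450 over IPv4 and 1430 over IPv6.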
^ permalink raw reply