* [RFC PATCH net-next 4/4 V4] try to fix performance regression
From: Weiping Pan @ 2012-12-12 14:29 UTC (permalink / raw)
To: davem; +Cc: brutus, netdev, Weiping Pan
In-Reply-To: <117a10f9575d95d6a9ea4602ea7376e2b6d5ccd1.1355320533.git.wpan@redhat.com>
1 do not share tail skb between sender and receiver
2 reduce the use of sock->sk_lock.slock
--------------------------------------------------------------------------
TCP friends performance results start
BASE means normal tcp with friends DISABLED.
AF_UNIX means sockets for local interprocess communication, for reference.
FRIENDS means tcp with friends ENABLED.
I set -s 51882 -m 16384 -M 87380 for all three kinds of sockets by default.
The first percentage number is FRIENDS/BASE.
The second percentage number is FRIENDS/AF_UNIX.
We set -i 10,2 -I 95,20 to stabilize the statistics.
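For reference, the percentage columns appear to be plain ratios truncated (not rounded) to a whole percent. A minimal sketch of that arithmetic, assuming truncation; the helper names ratio_percent and print_row are hypothetical and not part of netperf or of this patch:

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of two throughput figures as a whole percentage.
 * Assumption: the tables truncate instead of rounding. */
int ratio_percent(double value, double reference)
{
	assert(reference > 0.0);
	return (int)(value * 100.0 / reference);
}

/* Print one row in the "BASE AF_UNIX FRIENDS pct pct" layout,
 * where the percentages are FRIENDS/BASE and FRIENDS/AF_UNIX. */
void print_row(double base, double af_unix, double friends)
{
	printf("%.2f %.2f %.2f %d%% %d%%\n", base, af_unix, friends,
	       ratio_percent(friends, base),
	       ratio_percent(friends, af_unix));
}
```

Calling print_row(7952.97, 10864.86, 13440.08) reproduces the layout of the TCP_STREAM row.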
BASE AF_UNIX FRIENDS TCP_STREAM
7952.97 10864.86 13440.08 168% 123%
BASE AF_UNIX FRIENDS TCP_MAERTS
6743.78 - 13809.97 204% -%
BASE AF_UNIX FRIENDS TCP_SENDFILE
11758 - 18483 157% -%
TCP_SENDFILE does not work with -i 10,2 -I 95,20 (strange), so I use the average.
MS BASE AF_UNIX FRIENDS TCP_STREAM_MS
1 10.70 5.40 4.02 37% 74%
2 28.01 9.67 7.97 28% 82%
4 55.53 19.78 16.48 29% 83%
8 115.40 38.22 33.51 29% 87%
16 227.31 81.06 67.70 29% 83%
32 446.20 166.59 129.31 28% 77%
64 849.04 336.77 259.43 30% 77%
128 1440.50 661.88 530.43 36% 80%
256 2404.70 1279.67 1029.15 42% 80%
512 4331.53 2501.30 1942.21 44% 77%
1024 6819.78 4622.37 4128.10 60% 89%
2048 10544.60 6348.81 6349.59 60% 100%
4096 12830.41 8324.43 7984.43 62% 95%
8192 13462.65 8355.49 11079.37 82% 132%
16384 9960.87 10840.13 13037.81 130% 120%
32768 8749.31 11372.15 15087.08 172% 132%
65536 7580.27 12150.23 14971.42 197% 123%
131072 6727.74 11451.34 13604.78 202% 118%
262144 7673.14 11613.10 11436.97 149% 98%
524288 7366.17 11675.95 11559.43 156% 99%
1048576 6608.57 11883.01 10103.20 152% 85%
MS means Message Size in bytes, that is -m -M for netperf
RR BASE AF_UNIX FRIENDS TCP_RR_RR
1 19716.88 34451.39 34574.12 175% 100%
2 19836.74 34297.00 34671.29 174% 101%
4 19874.71 34456.48 34552.13 173% 100%
8 18882.93 34123.00 34661.48 183% 101%
16 19179.09 34358.47 34599.16 180% 100%
32 20140.08 34326.35 34616.30 171% 100%
64 19473.39 34382.05 34583.10 177% 100%
128 19699.62 34012.03 34566.14 175% 101%
256 19740.44 34529.71 34624.07 175% 100%
512 18929.46 33673.06 33932.83 179% 100%
1024 18738.98 33724.78 33313.44 177% 98%
2048 17315.61 32982.24 32361.39 186% 98%
4096 16585.81 31345.85 31073.32 187% 99%
8192 11933.16 27851.10 27166.94 227% 97%
16384 9717.19 21746.12 22583.40 232% 103%
32768 7044.35 12927.23 16253.26 230% 125%
65536 5038.96 8945.74 7982.61 158% 89%
131072 2860.64 4981.78 4417.16 154% 88%
262144 1633.45 2765.27 2739.36 167% 99%
524288 796.68 1429.79 1445.21 181% 101%
1048576 379.78 - 730.05 192% -%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf
RR BASE AF_UNIX FRIENDS TCP_CRR_RR
1 5531.49 - 5861.86 105% -%
2 5506.13 - 5845.53 106% -%
4 5523.27 - 5853.43 105% -%
8 5503.73 - 5836.44 106% -%
16 5516.23 - 5842.29 105% -%
32 5557.37 - 5858.29 105% -%
64 5517.51 - 5892.64 106% -%
128 5504.18 - 5841.44 106% -%
256 5512.82 - 5842.60 105% -%
512 5496.36 - 5837.72 106% -%
1024 5465.24 - 5827.99 106% -%
2048 5550.15 - 5812.88 104% -%
4096 5292.75 - 5824.45 110% -%
8192 4917.06 - 5705.12 116% -%
16384 4278.63 - 5318.39 124% -%
32768 3611.86 - 4930.30 136% -%
65536 77.35 - 3847.43 4974% -%
131072 47.65 - 2811.58 5900% -%
262144 805.13 - 4.88 0% -%
524288 583.08 - 4.78 0% -%
1048576 369.52 - 5.02 1% -%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf -H 127.0.0.1
TCP friends performance results end
--------------------------------------------------------------------------
Performance analysis:
1 Friends shows better performance than loopback in TCP_RR, TCP_MAERTS and
TCP_SENDFILE, same in TCP_CRR_RR.
2 In TCP_STREAM, Friends shows much worse performance (about 30% of loopback)
if the message size is small, and it is also worse (about 80%) than AF_UNIX.
3 Compared with last performance report, Friends shows worse performance in
TCP_RR.
Friends VS AF_UNIX
I think the lock usage is very similar this time.
Maybe lock contention is not the bottleneck?
Friends VS loopback
I have reduced lock contention as much as possible,
but it still shows bad performance.
Maybe lock contention is not the bottleneck?
Signed-off-by: Weiping Pan <wpan@redhat.com>
---
include/net/tcp.h | 10 --
net/ipv4/tcp.c | 327 ++++++++++++++++++++++-------------------------------
2 files changed, 136 insertions(+), 201 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5f82770..80a8ec9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -688,15 +688,6 @@ void tcp_send_window_probe(struct sock *sk);
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80
-/* If skb_get_friend() != NULL, TCP friends per packet state.
- */
-struct friend_skb_parm {
- bool tail_inuse; /* In use by skb_get_friend() send while */
- /* on sk_receive_queue for tail put */
-};
-
-#define TCP_FRIEND_CB(tcb) (&(tcb)->header.hf)
-
/* This is what the send packet queuing engine uses to pass
* TCP per-packet control information to the transmission code.
* We also store the host-order sequence numbers in here too.
@@ -709,7 +700,6 @@ struct tcp_skb_cb {
#if IS_ENABLED(CONFIG_IPV6)
struct inet6_skb_parm h6;
#endif
- struct friend_skb_parm hf;
} header; /* For incoming frames */
__u32 seq; /* Starting sequence number */
__u32 end_seq; /* SEQ + FIN + SYN + datalen */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e9d82e0..f008d60 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -336,25 +336,24 @@ static inline int tcp_friend_validate(struct sock *sk, struct sock **friendp,
return 1;
}
-static inline int tcp_friend_send_lock(struct sock *friend)
+static inline int tcp_friend_get_state(struct sock *friend)
{
int err = 0;
spin_lock_bh(&friend->sk_lock.slock);
- if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN)) {
- spin_unlock_bh(&friend->sk_lock.slock);
+ if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN))
err = -ECONNRESET;
- }
+ spin_unlock_bh(&friend->sk_lock.slock);
return err;
}
-static inline void tcp_friend_recv_lock(struct sock *friend)
+static inline void tcp_friend_state_lock(struct sock *friend)
{
spin_lock_bh(&friend->sk_lock.slock);
}
-static void tcp_friend_unlock(struct sock *friend)
+static inline void tcp_friend_state_unlock(struct sock *friend)
{
spin_unlock_bh(&friend->sk_lock.slock);
}
@@ -639,71 +638,32 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
}
EXPORT_SYMBOL(tcp_ioctl);
-/*
- * Friend receive_queue tail skb space? If true, set tail_inuse.
- * Else if RCV_SHUTDOWN, return *copy = -ECONNRESET.
- */
-static inline struct sk_buff *tcp_friend_tail(struct sock *friend, int *copy)
-{
- struct sk_buff *skb = NULL;
- int sz = 0;
-
- if (skb_peek_tail(&friend->sk_receive_queue)) {
- sz = tcp_friend_send_lock(friend);
- if (!sz) {
- skb = skb_peek_tail(&friend->sk_receive_queue);
- if (skb && skb->friend) {
- if (!*copy)
- sz = skb_tailroom(skb);
- else {
- sz = *copy - skb->len;
- if (sz < 0)
- sz = 0;
- }
- if (sz > 0)
- TCP_FRIEND_CB(TCP_SKB_CB(skb))->
- tail_inuse = true;
- }
- tcp_friend_unlock(friend);
- }
- }
-
- *copy = sz;
- return skb;
-}
-
-static inline void tcp_friend_seq(struct sock *sk, int copy, int charge)
-{
- struct sock *friend = sk->sk_friend;
- struct tcp_sock *tp = tcp_sk(friend);
-
- if (charge) {
- sk_mem_charge(friend, charge);
- atomic_add(charge, &friend->sk_rmem_alloc);
- }
- tp->rcv_nxt += copy;
- tp->rcv_wup += copy;
- tcp_friend_unlock(friend);
-
- tp = tcp_sk(sk);
- tp->snd_nxt += copy;
- tp->pushed_seq += copy;
- tp->snd_una += copy;
- tp->snd_up += copy;
-}
-
static inline bool tcp_friend_push(struct sock *sk, struct sk_buff *skb)
{
- struct sock *friend = sk->sk_friend;
- int wait = false;
+ struct sock *friend = sk->sk_friend;
+ struct tcp_sock *tp = NULL;
+ int wait = false;
+
+ tcp_friend_state_lock(friend);
skb_set_owner_r(skb, friend);
- __skb_queue_tail(&friend->sk_receive_queue, skb);
if (!sk_rmem_schedule(friend, skb, skb->truesize))
wait = true;
+ __skb_queue_tail(&friend->sk_receive_queue, skb);
+
+ tcp_friend_state_unlock(friend);
- tcp_friend_seq(sk, skb->len, 0);
- if (skb == skb_peek(&friend->sk_receive_queue))
+ tp = tcp_sk(friend);
+ tp->rcv_nxt += skb->len;
+ tp->rcv_wup += skb->len;
+
+ tp = tcp_sk(sk);
+ tp->snd_nxt += skb->len;
+ tp->pushed_seq += skb->len;
+ tp->snd_una += skb->len;
+ tp->snd_up += skb->len;
+
+ if (skb_queue_len(&friend->sk_receive_queue) == 1)
friend->sk_data_ready(friend, 0);
return wait;
@@ -728,7 +688,6 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
tcb->seq = tcb->end_seq = tp->write_seq;
if (sk->sk_friend) {
skb->friend = sk;
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
return;
}
skb->csum = 0;
@@ -1048,8 +1007,17 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;
+ if (friend) {
+ err = tcp_friend_get_state(friend);
+ if (err) {
+ sk->sk_err = -err;
+ err = -EPIPE;
+ goto out_err;
+ }
+ }
+
while (psize > 0) {
- struct sk_buff *skb;
+ struct sk_buff *skb = NULL;
struct tcp_skb_cb *tcb;
struct page *page = pages[poffset / PAGE_SIZE];
int copy, i;
@@ -1059,12 +1027,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
if (friend) {
copy = size_goal;
- skb = tcp_friend_tail(friend, &copy);
- if (copy < 0) {
- sk->sk_err = -copy;
- err = -EPIPE;
- goto out_err;
- }
+ if (skb)
+ copy = copy - skb->len;
+ else
+ copy = 0;
} else if (!tcp_send_head(sk)) {
skb = NULL;
copy = 0;
@@ -1078,9 +1044,17 @@ new_segment:
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
- if (friend)
+ if (friend) {
+ if (skb) {
+ if (tcp_friend_push(sk, skb))
+ goto wait_for_sndbuf;
+ }
+
+ /*
+ * new skb
+ */
skb = tcp_friend_alloc_skb(sk, 0);
- else
+ } else
skb = sk_stream_alloc_skb(sk, 0,
sk->sk_allocation);
if (!skb)
@@ -1097,10 +1071,7 @@ new_segment:
i = skb_shinfo(skb)->nr_frags;
can_coalesce = skb_can_coalesce(skb, i, page, offset);
if (!can_coalesce && i >= MAX_SKB_FRAGS) {
- if (friend) {
- if (TCP_FRIEND_CB(tcb)->tail_inuse)
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
- } else
+ if (!friend)
tcp_mark_push(tp, skb);
goto new_segment;
}
@@ -1124,20 +1095,9 @@ new_segment:
psize -= copy;
if (friend) {
- err = tcp_friend_send_lock(friend);
- if (err) {
- sk->sk_err = -err;
- err = -EPIPE;
- goto out_err;
- }
tcb->end_seq += copy;
- if (TCP_FRIEND_CB(tcb)->tail_inuse) {
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
- tcp_friend_seq(sk, copy, copy);
- } else {
- if (tcp_friend_push(sk, skb))
- goto wait_for_sndbuf;
- }
+ if (tcp_friend_push(sk, skb))
+ goto wait_for_sndbuf;
if (!psize)
goto out;
continue;
@@ -1172,6 +1132,18 @@ wait_for_memory:
if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
goto do_error;
+ if (friend) {
+ if (skb) {
+ tcp_friend_state_lock(friend);
+ if (!sk_rmem_schedule(friend, skb, skb->truesize)) {
+ tcp_friend_state_unlock(friend);
+ goto wait_for_sndbuf;
+ }
+ tcp_friend_state_unlock(friend);
+ skb = NULL;
+ }
+ }
+
if (!friend)
mss_now = tcp_send_mss(sk, &size_goal, flags);
}
@@ -1266,7 +1238,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
struct iovec *iov;
struct sock *friend = sk->sk_friend;
struct tcp_sock *tp = tcp_sk(sk);
- struct sk_buff *skb;
+ struct sk_buff *skb = NULL;
struct tcp_skb_cb *tcb;
int iovlen, flags, err, copied = 0;
int mss_now = 0, size_goal = size, copied_syn = 0, offset = 0;
@@ -1330,6 +1302,15 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
sg = !!(sk->sk_route_caps & NETIF_F_SG);
+ if (friend) {
+ err = tcp_friend_get_state(friend);
+ if (err) {
+ sk->sk_err = -err;
+ err = -EPIPE;
+ goto out_err;
+ }
+ }
+
while (--iovlen >= 0) {
size_t seglen = iov->iov_len;
unsigned char __user *from = iov->iov_base;
@@ -1350,12 +1331,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
int max = size_goal;
if (friend) {
- skb = tcp_friend_tail(friend, &copy);
- if (copy < 0) {
- sk->sk_err = -copy;
- err = -EPIPE;
- goto out_err;
- }
+ if (skb)
+ copy = skb_availroom(skb);
+ else
+ copy = 0;
} else {
skb = tcp_write_queue_tail(sk);
if (tcp_send_head(sk)) {
@@ -1370,9 +1349,21 @@ new_segment:
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
- if (friend)
+ if (friend) {
+ if (skb) {
+ /*
+ * Friend push old skb
+ */
+
+ if (tcp_friend_push(sk, skb))
+ goto wait_for_sndbuf;
+ }
+
+ /*
+ * new skb
+ */
skb = tcp_friend_alloc_skb(sk, max);
- else {
+ } else {
/* Allocate new segment. If the
* interface is SG, allocate skb
* fitting to single page.
@@ -1455,32 +1446,23 @@ new_segment:
copied += copy;
seglen -= copy;
- if (friend) {
- err = tcp_friend_send_lock(friend);
- if (err) {
- sk->sk_err = -err;
- err = -EPIPE;
- goto out_err;
- }
- tcb->end_seq += copy;
- if (TCP_FRIEND_CB(tcb)->tail_inuse) {
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
- tcp_friend_seq(sk, copy, 0);
- } else {
- if (tcp_friend_push(sk, skb))
- goto wait_for_sndbuf;
- }
- continue;
- }
-
tcb->end_seq += copy;
+
skb_shinfo(skb)->gso_segs = 0;
if (copied == copy)
tcb->tcp_flags &= ~TCPHDR_PSH;
- if (seglen == 0 && iovlen == 0)
+ if (seglen == 0 && iovlen == 0) {
+ if (friend && skb) {
+ if (tcp_friend_push(sk, skb))
+ goto wait_for_sndbuf;
+ }
goto out;
+ }
+
+ if (friend)
+ continue;
if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
continue;
@@ -1501,6 +1483,17 @@ wait_for_memory:
if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
goto do_error;
+ if (friend) {
+ if (skb) {
+ tcp_friend_state_lock(friend);
+ if (!sk_rmem_schedule(friend, skb, skb->truesize)) {
+ tcp_friend_state_unlock(friend);
+ goto wait_for_sndbuf;
+ }
+ tcp_friend_state_unlock(friend);
+ skb = NULL;
+ }
+ }
if (!friend)
mss_now = tcp_send_mss(sk, &size_goal, flags);
}
@@ -1514,10 +1507,7 @@ out:
do_fault:
if (skb->friend) {
- if (TCP_FRIEND_CB(tcb)->tail_inuse)
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
- else
- __kfree_skb(skb);
+ __kfree_skb(skb);
} else if (!skb->len) {
tcp_unlink_write_queue(skb, sk);
/* It is the one place in all of TCP, except connection
@@ -1787,8 +1777,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
err = tcp_friend_validate(sk, &friend, &timeo);
if (err < 0)
return err;
- if (friend)
- tcp_friend_recv_lock(sk);
while ((skb = tcp_recv_skb(sk, seq, &offset, &len)) != NULL) {
if (len > 0) {
@@ -1803,9 +1791,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
break;
}
- if (friend)
- tcp_friend_unlock(sk);
-
used = recv_actor(desc, skb, offset, len);
if (used < 0) {
if (!copied)
@@ -1817,21 +1802,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
offset += used;
}
- if (friend)
- tcp_friend_recv_lock(sk);
- if (skb->friend) {
- len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
- if (len > 0) {
- /*
- * Friend did an skb_put() while we
- * were away so process the same skb.
- */
- if (!desc->count)
- break;
- tp->copied_seq = seq;
- goto again;
- }
- } else {
+ if (!skb->friend) {
/*
* If recv_actor drops the lock (e.g. TCP
* splice receive) the skb pointer might be
@@ -1844,19 +1815,25 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
break;
}
}
+
if (!skb->friend && tcp_hdr(skb)->fin) {
sk_eat_skb(sk, skb, false);
++seq;
break;
}
if (skb->friend) {
- if (!TCP_FRIEND_CB(TCP_SKB_CB(skb))->tail_inuse) {
- __skb_unlink(skb, &sk->sk_receive_queue);
- __kfree_skb(skb);
- tcp_friend_write_space(sk);
+ len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
+ if (len > 0) {
+ if (!desc->count)
+ break;
+ tp->copied_seq = seq;
+ goto again;
}
- tcp_friend_unlock(sk);
- tcp_friend_recv_lock(sk);
+ tcp_friend_state_lock(sk);
+ __skb_unlink(skb, &sk->sk_receive_queue);
+ __kfree_skb(skb);
+ tcp_friend_state_unlock(sk);
+ tcp_friend_write_space(sk);
} else
sk_eat_skb(sk, skb, 0);
if (!desc->count)
@@ -1866,7 +1843,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
tp->copied_seq = seq;
if (friend) {
- tcp_friend_unlock(sk);
tcp_friend_write_space(sk);
} else {
tcp_rcv_space_adjust(sk);
@@ -1903,7 +1879,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
bool copied_early = false;
struct sk_buff *skb;
u32 urg_hole = 0;
- bool locked = false;
lock_sock(sk);
@@ -1991,11 +1966,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
* slock, end_seq updated, so we can only use the bytes
* from *seq to end_seq!
*/
- if (friend && !locked) {
- tcp_friend_recv_lock(sk);
- locked = true;
- }
-
skb_queue_walk(&sk->sk_receive_queue, skb) {
tcb = TCP_SKB_CB(skb);
offset = *seq - tcb->seq;
@@ -2003,20 +1973,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
if (skb->friend) {
used = (u32)(tcb->end_seq - *seq);
if (used > 0) {
- tcp_friend_unlock(sk);
- locked = false;
/* Can use it all */
goto found_ok_skb;
}
/* No data to copyout */
if (flags & MSG_PEEK)
continue;
- if (!TCP_FRIEND_CB(tcb)->tail_inuse)
- goto unlink;
- break;
+ goto unlink;
}
- tcp_friend_unlock(sk);
- locked = false;
}
/* Now that we have two receive queues this
@@ -2043,11 +2007,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
/* Well, if we have backlog, try to process it now yet. */
- if (friend && locked) {
- tcp_friend_unlock(sk);
- locked = false;
- }
-
if (copied >= target && !sk->sk_backlog.tail)
break;
@@ -2262,17 +2221,7 @@ do_prequeue:
len -= used;
offset += used;
- tcp_rcv_space_adjust(sk);
-
-skip_copy:
- if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
- tp->urg_data = 0;
- tcp_fast_path_check(sk);
- }
-
if (skb->friend) {
- tcp_friend_recv_lock(sk);
- locked = true;
used = (u32)(tcb->end_seq - *seq);
if (used) {
/*
@@ -2280,29 +2229,28 @@ skip_copy:
* so if more to do process the same skb.
*/
if (len > 0) {
- tcp_friend_unlock(sk);
- locked = false;
goto found_ok_skb;
}
continue;
}
- if (TCP_FRIEND_CB(tcb)->tail_inuse) {
- /* Give sendmsg a chance */
- tcp_friend_unlock(sk);
- locked = false;
- continue;
- }
if (!(flags & MSG_PEEK)) {
unlink:
+ tcp_friend_state_lock(sk);
__skb_unlink(skb, &sk->sk_receive_queue);
__kfree_skb(skb);
- tcp_friend_unlock(sk);
- locked = false;
+ tcp_friend_state_unlock(sk);
tcp_friend_write_space(sk);
}
continue;
}
+ tcp_rcv_space_adjust(sk);
+skip_copy:
+ if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
+ tp->urg_data = 0;
+ tcp_fast_path_check(sk);
+ }
+
if (offset < skb->len)
continue;
else if (tcp_hdr(skb)->fin)
@@ -2323,9 +2271,6 @@ skip_copy:
break;
} while (len > 0);
- if (friend && locked)
- tcp_friend_unlock(sk);
-
if (user_recv) {
if (!skb_queue_empty(&tp->ucopy.prequeue)) {
int chunk;
--
1.7.4.4
^ permalink raw reply related
* RE: [RFC PATCH net-next 4/4 V4] try to fix performance regression
From: David Laight @ 2012-12-12 14:57 UTC (permalink / raw)
To: Weiping Pan, davem; +Cc: brutus, netdev
In-Reply-To: <5e333588f6cb48cc3464b2263dcaa734b952e4c1.1355320534.git.wpan@redhat.com>
> MS BASE AF_UNIX FRIENDS TCP_STREAM_MS
> 1 10.70 5.40 4.02 37% 74%
> 2 28.01 9.67 7.97 28% 82%
> 4 55.53 19.78 16.48 29% 83%
> 8 115.40 38.22 33.51 29% 87%
> 16 227.31 81.06 67.70 29% 83%
> 32 446.20 166.59 129.31 28% 77%
> 64 849.04 336.77 259.43 30% 77%
> 128 1440.50 661.88 530.43 36% 80%
> 256 2404.70 1279.67 1029.15 42% 80%
> 512 4331.53 2501.30 1942.21 44% 77%
> 1024 6819.78 4622.37 4128.10 60% 89%
> 2048 10544.60 6348.81 6349.59 60% 100%
> 4096 12830.41 8324.43 7984.43 62% 95%
> 8192 13462.65 8355.49 11079.37 82% 132%
> 16384 9960.87 10840.13 13037.81 130% 120%
> 32768 8749.31 11372.15 15087.08 172% 132%
> 65536 7580.27 12150.23 14971.42 197% 123%
> 131072 6727.74 11451.34 13604.78 202% 118%
> 262144 7673.14 11613.10 11436.97 149% 98%
> 524288 7366.17 11675.95 11559.43 156% 99%
> 1048576 6608.57 11883.01 10103.20 152% 85%
> MS means Message Size in bytes, that is -m -M for netperf
If I read that table correctly, it seems to imply that
something goes badly wrong for 'normal' TCP loopback
connections when the read/write size exceeds 8k.
Putting effort into fixing that would appear to be
more worthwhile than the 'friends' code.
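To make that reading concrete, here is a small sketch that scans the BASE column of the quoted table for its peak. The struct and function names are hypothetical; the figures are copied from the rows quoted above (2048 through 131072 only).

```c
#include <assert.h>
#include <stddef.h>

struct stream_sample {
	unsigned int msg_size;   /* netperf -m/-M message size in bytes */
	double base_throughput;  /* BASE column of the quoted table */
};

/* Rows copied from the quoted TCP_STREAM table (BASE column only). */
const struct stream_sample base_samples[] = {
	{ 2048, 10544.60 }, { 4096, 12830.41 }, { 8192, 13462.65 },
	{ 16384, 9960.87 }, { 32768, 8749.31 }, { 65536, 7580.27 },
	{ 131072, 6727.74 },
};

/* Return the message size at which throughput is highest. */
unsigned int peak_msg_size(const struct stream_sample *s, size_t n)
{
	size_t i, best = 0;

	assert(s != NULL && n > 0);
	for (i = 1; i < n; i++)
		if (s[i].base_throughput > s[best].base_throughput)
			best = i;
	return s[best].msg_size;
}
```

On these rows the peak is at 8192 bytes, and every larger size in the sample is slower.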
David
^ permalink raw reply
* [RFC] net : add tx timestamp to packet mmap.
From: Paul Chavent @ 2012-12-12 15:29 UTC (permalink / raw)
To: davem, edumazet, daniel.borkmann, xemul, ebiederm, netdev; +Cc: Paul Chavent
This patch allows generating tx timestamps for packets sent through the packet mmap interface.
Currently, you can't get tx timestamps with the sample code below.
I wonder whether my current implementation is good. If not, how should I get the timestamps?
Wouldn't it be a good idea to put the timestamps in the ring buffer frame before giving it back to the user?
Thanks for your comments.
/* BEGIN OF SAMPLE CODE */
struct timespec ts = {0,0};
struct sockaddr from_addr;
static uint8_t tmp_data[256];
struct iovec msg_iov = {tmp_data, sizeof(tmp_data)};
static uint8_t cmsg_buff[256];
struct msghdr msghdr = {&from_addr, sizeof(from_addr),
&msg_iov, 1,
cmsg_buff, sizeof(cmsg_buff),
0};
ssize_t err = recvmsg(itf->sock_fd, &msghdr, MSG_ERRQUEUE);
if(err < 0)
{
perror("recvmsg failed");
return -1;
}
struct cmsghdr *cmsg;
for(cmsg = CMSG_FIRSTHDR(&msghdr); cmsg != NULL; cmsg = CMSG_NXTHDR(&msghdr, cmsg))
{
if(cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_TIMESTAMPING)
{
ts = *(struct timespec *)CMSG_DATA(cmsg);
fprintf(stderr, "SCM_TIMESTAMPING available\n");
}
else if (cmsg->cmsg_level == SOL_PACKET && cmsg->cmsg_type == PACKET_TX_TIMESTAMP)
{
ts = *(struct timespec *)CMSG_DATA(cmsg);
fprintf(stderr, "PACKET_TX_TIMESTAMP available\n");
}
}
/* END OF SAMPLE CODE */
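The sample above only reads the error queue; it presumably relies on tx timestamping having been requested on the socket beforehand. A minimal sketch of that step, assuming the standard SO_TIMESTAMPING socket option with software timestamps only (the helper names are hypothetical and the hardware flags are deliberately left out):

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Flags for software tx timestamps: generate them on transmit and
 * report them to user space. */
int tx_timestamping_flags(void)
{
	return SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;
}

/* Request tx timestamps on an already opened socket.
 * Returns 0 on success, -1 with errno set on failure. */
int enable_tx_timestamping(int sock_fd)
{
	int flags = tx_timestamping_flags();

	assert(sock_fd >= 0);
	return setsockopt(sock_fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```

In the sample this would be called once on itf->sock_fd before sending, so that the later recvmsg() with MSG_ERRQUEUE has something to return.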
Signed-off-by: Paul Chavent <paul.chavent@onera.fr>
---
net/packet/af_packet.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e639645..948748b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1857,6 +1857,10 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
void *data;
int err;
+ err = sock_tx_timestamp(&po->sk, &skb_shinfo(skb)->tx_flags);
+ if (err < 0)
+ return err;
+
ph.raw = frame;
skb->protocol = proto;
--
1.7.12.1
^ permalink raw reply related
* Network namespace bugs in L2TP
From: Tom Parkin @ 2012-12-12 15:51 UTC (permalink / raw)
To: ebiederm; +Cc: netdev
[-- Attachment #1: Type: text/plain, Size: 3545 bytes --]
Hi Eric,
I'm following up on this thread from late October in which you
pointed out some network namespace bugs in L2TP:
http://www.spinics.net/lists/netdev/msg214776.html
I use L2TP, and I'd like to help fix these bugs. But I'm not very
conversant with network namespaces, and so I'm struggling to fully
appreciate the issues you pointed out previously. Could you give me a
hand getting to grips with this?
So far I've tested L2TP within network namespaces, using both iproute2
to create sessions between two namespaces on the same host, and an
L2TP daemon running in a namespace to create sessions between two
hosts. In both cases I've done a bit of trivial ping and iperf
testing using Ethernet pseudowires.
To make this work I've had to add a couple of trivial patches (see
below).
There are two things I'm uncertain about:
1. Why do we need to change the namespace of the socket created in
l2tp_tunnel_sock_create? So far as I can tell, sock_create
defaults to the namespace of the calling process. Is the issue
here that this code may run from a work queue or similar?
2. You mentioned the need to keep track of sockets allocated within a
namespace in order to be able to clean them up when the namespace
is deleted. Should we be keeping a list of sockets we create and
then destroying them in the namespace pernet_ops exit function?
Thanks,
Tom
From b9c095fdf32c895b79a5954020c4745fe5518141 Mon Sep 17 00:00:00 2001
From: Tom Parkin <tparkin@katalix.com>
Date: Tue, 11 Dec 2012 13:03:48 +0000
Subject: [PATCH 1/2] l2tp: set netnsok flag for netlink messages
The L2TP netlink code can run in namespaces. Set the netnsok flag in
genl_family to true to reflect that fact.
---
net/l2tp/l2tp_netlink.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index bbba3a1..c1bab22 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -37,6 +37,7 @@ static struct genl_family l2tp_nl_family = {
.version = L2TP_GENL_VERSION,
.hdrsize = 0,
.maxattr = L2TP_ATTR_MAX,
+ .netnsok = true,
};
/* Accessed under genl lock */
--
1.7.9.5
From 13e9b0ddc48a16b384ffbf5ff64e6413cfa612f5 Mon Sep 17 00:00:00 2001
From: Tom Parkin <tparkin@katalix.com>
Date: Wed, 12 Dec 2012 12:50:54 +0000
Subject: [PATCH 2/2] l2tp: prevent tunnel creation on netns mismatch
l2tp_tunnel_create is passed a pointer to the network namespace for the
tunnel, along with an optional file descriptor for the tunnel which may
be passed in from userspace via netlink.
In the case where the file descriptor is defined, ensure that the namespace
associated with that socket matches the namespace explicitly passed to
l2tp_tunnel_create.
---
net/l2tp/l2tp_core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 1a9f372..f8d200b 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1528,6 +1528,13 @@ int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32
tunnel_id, fd, err);
goto err;
}
+
+ /* Reject namespace mismatches */
+ if (!net_eq(sock_net(sock->sk), net)) {
+ pr_err("tunl %hu: netns mismatch\n", tunnel_id);
+ err = -EBADF; /* TODO -- what value? */
+ goto err;
+ }
}
sk = sock->sk;
--
1.7.9.5
--
Tom Parkin
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]
^ permalink raw reply related
* [PATCHv2 iproute2] add DOVE extensions for iproute2
From: David L Stevens @ 2012-12-12 16:10 UTC (permalink / raw)
To: David Miller, Stephen Hemminger; +Cc: netdev
This patch adds a new flag to iproute2 for vxlan devices to enable
DOVE features. It also adds support for L2 and L3 switch lookup miss
netlink messages to "ip monitor".
Changes since v1:
- split "dove" flag into separate feature flags:
- "proxy" for ARP reduction
- "rsc" for route short circuiting
- "l2miss" for L2 switch miss notifications
- "l3miss" for L3 switch miss notifications
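As an illustration of the split flags, here is a simplified stand-alone sketch of parsing the "[no]flag" keywords. It is not the code in this patch: it uses exact string comparison instead of iproute2's abbreviation matching, and struct vxlan_flags and parse_vxlan_flag are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the four feature flags listed above. */
struct vxlan_flags {
	unsigned char proxy, rsc, l2miss, l3miss;
};

/* Parse one keyword such as "proxy" or "nol2miss".
 * Returns 0 if the keyword was recognized, -1 otherwise. */
int parse_vxlan_flag(struct vxlan_flags *f, const char *arg)
{
	unsigned char on = 1;

	assert(f != NULL && arg != NULL);
	/* A leading "no" turns the feature off. */
	if (strncmp(arg, "no", 2) == 0) {
		on = 0;
		arg += 2;
	}
	if (strcmp(arg, "proxy") == 0)
		f->proxy = on;
	else if (strcmp(arg, "rsc") == 0)
		f->rsc = on;
	else if (strcmp(arg, "l2miss") == 0)
		f->l2miss = on;
	else if (strcmp(arg, "l3miss") == 0)
		f->l3miss = on;
	else
		return -1;
	return 0;
}
```

All four flags default to off, matching the initial values in the patch, so "proxy l2miss" enables only those two features.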
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 012d95a..a163702 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -283,6 +283,10 @@ enum {
IFLA_VXLAN_AGEING,
IFLA_VXLAN_LIMIT,
IFLA_VXLAN_PORT_RANGE,
+ IFLA_VXLAN_PROXY,
+ IFLA_VXLAN_RSC,
+ IFLA_VXLAN_L2MISS,
+ IFLA_VXLAN_L3MISS,
__IFLA_VXLAN_MAX
};
#define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index ba5c4ab..f2e6bef 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -26,6 +26,8 @@ static void explain(void)
fprintf(stderr, "Usage: ... vxlan id VNI [ group ADDR ] [ local ADDR ]\n");
fprintf(stderr, " [ ttl TTL ] [ tos TOS ] [ dev PHYS_DEV ]\n");
fprintf(stderr, " [ port MIN MAX ] [ [no]learning ]\n");
+ fprintf(stderr, " [ [no]proxy ] [ [no]rsc ]\n");
+ fprintf(stderr, " [ [no]l2miss ] [ [no]l3miss ]\n");
fprintf(stderr, "\n");
fprintf(stderr, "Where: VNI := 0-16777215\n");
fprintf(stderr, " ADDR := { IP_ADDRESS | any }\n");
@@ -44,6 +46,10 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
__u8 tos = 0;
__u8 ttl = 0;
__u8 learning = 1;
+ __u8 proxy = 0;
+ __u8 rsc = 0;
+ __u8 l2miss = 0;
+ __u8 l3miss = 0;
__u8 noage = 0;
__u32 age = 0;
__u32 maxaddr = 0;
@@ -123,6 +129,22 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
learning = 0;
} else if (!matches(*argv, "learning")) {
learning = 1;
+ } else if (!matches(*argv, "noproxy")) {
+ proxy = 0;
+ } else if (!matches(*argv, "proxy")) {
+ proxy = 1;
+ } else if (!matches(*argv, "norsc")) {
+ rsc = 0;
+ } else if (!matches(*argv, "rsc")) {
+ rsc = 1;
+ } else if (!matches(*argv, "nol2miss")) {
+ l2miss = 0;
+ } else if (!matches(*argv, "l2miss")) {
+ l2miss = 1;
+ } else if (!matches(*argv, "nol3miss")) {
+ l3miss = 0;
+ } else if (!matches(*argv, "l3miss")) {
+ l3miss = 1;
} else if (matches(*argv, "help") == 0) {
explain();
return -1;
@@ -148,6 +170,10 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
addattr8(n, 1024, IFLA_VXLAN_LEARNING, learning);
+ addattr8(n, 1024, IFLA_VXLAN_PROXY, proxy);
+ addattr8(n, 1024, IFLA_VXLAN_RSC, rsc);
+ addattr8(n, 1024, IFLA_VXLAN_L2MISS, l2miss);
+ addattr8(n, 1024, IFLA_VXLAN_L3MISS, l3miss);
if (noage)
addattr32(n, 1024, IFLA_VXLAN_AGEING, 0);
else if (age)
@@ -213,6 +239,18 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
if (tb[IFLA_VXLAN_LEARNING] &&
!rta_getattr_u8(tb[IFLA_VXLAN_LEARNING]))
fputs("nolearning ", f);
+
+ if (tb[IFLA_VXLAN_PROXY] && rta_getattr_u8(tb[IFLA_VXLAN_PROXY]))
+ fputs("proxy ", f);
+
+ if (tb[IFLA_VXLAN_RSC] && rta_getattr_u8(tb[IFLA_VXLAN_RSC]))
+ fputs("rsc ", f);
+
+ if (tb[IFLA_VXLAN_L2MISS] && rta_getattr_u8(tb[IFLA_VXLAN_L2MISS]))
+ fputs("l2miss ", f);
+
+ if (tb[IFLA_VXLAN_L3MISS] && rta_getattr_u8(tb[IFLA_VXLAN_L3MISS]))
+ fputs("l3miss ", f);
if (tb[IFLA_VXLAN_TOS] &&
(tos = rta_getattr_u8(tb[IFLA_VXLAN_TOS]))) {
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 4b1d469..7a7cc88 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -67,7 +67,8 @@ int accept_msg(const struct sockaddr_nl *who,
print_addrlabel(who, n, arg);
return 0;
}
- if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH) {
+ if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH ||
+ n->nlmsg_type == RTM_GETNEIGH) {
if (prefix_banner)
fprintf(fp, "[NEIGH]");
print_neigh(who, n, arg);
diff --git a/ip/ipneigh.c b/ip/ipneigh.c
index 56e56b2..1b7600b 100644
--- a/ip/ipneigh.c
+++ b/ip/ipneigh.c
@@ -189,7 +189,8 @@ int print_neigh(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
struct rtattr * tb[NDA_MAX+1];
char abuf[256];
- if (n->nlmsg_type != RTM_NEWNEIGH && n->nlmsg_type != RTM_DELNEIGH) {
+ if (n->nlmsg_type != RTM_NEWNEIGH && n->nlmsg_type != RTM_DELNEIGH &&
+ n->nlmsg_type != RTM_GETNEIGH) {
fprintf(stderr, "Not RTM_NEWNEIGH: %08x %08x %08x\n",
n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
@@ -251,6 +252,8 @@ int print_neigh(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
if (n->nlmsg_type == RTM_DELNEIGH)
fprintf(fp, "delete ");
+ else if (n->nlmsg_type == RTM_GETNEIGH)
+ fprintf(fp, "miss ");
if (tb[NDA_DST]) {
fprintf(fp, "%s ",
format_host(r->ndm_family,
^ permalink raw reply
* Re: [patch net-next 0/4] net: allow to change carrier from userspace
From: Stephen Hemminger @ 2012-12-12 16:15 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, edumazet, bhutchings, mirqus, greearb, fbl
In-Reply-To: <1355309887-1081-1-git-send-email-jiri@resnulli.us>
On Wed, 12 Dec 2012 11:58:03 +0100
Jiri Pirko <jiri@resnulli.us> wrote:
> This is basically a repost of my previous patchset:
> "[patch net-next-2.6 0/2] net: allow to change carrier via sysfs" from Aug 30
>
> The way net-sysfs stores values changed and this patchset reflects it.
> Also, I exposed carrier via rtnetlink iface.
>
> So far, only dummy driver uses carrier change ndo. In very near future
> team driver will use that as well.
>
> Jiri Pirko (4):
> net: add change_carrier netdev op
> net: allow to change carrier via sysfs
> rtnl: expose carrier value with possibility to set it
> dummy: implement carrier change
>
> drivers/net/dummy.c | 10 ++++++++++
> include/linux/netdevice.h | 7 +++++++
> include/uapi/linux/if_link.h | 1 +
> net/core/dev.c | 19 +++++++++++++++++++
> net/core/net-sysfs.c | 15 ++++++++++++++-
> net/core/rtnetlink.c | 10 ++++++++++
> 6 files changed, 61 insertions(+), 1 deletion(-)
>
I needed to do the same thing for a project we are working on and discovered
that there already is a working documented interface for doing that via
operstate mode. Therefore I can't recommend that the additional complexity
of a new API for this is required.
^ permalink raw reply
* Re: [PATCH] iproute2: fix tc ematch manpage section
From: Stephen Hemminger @ 2012-12-12 16:16 UTC (permalink / raw)
To: Andreas Henriksson; +Cc: netdev
In-Reply-To: <20121212112348.GA6520@amd64.fatal.se>
On Wed, 12 Dec 2012 12:23:48 +0100
Andreas Henriksson <andreas@fatal.se> wrote:
> The debian package checking tool, lintian, spotted that the
> tc ematch manpage seems to have an error in the specified section.
>
> Signed-off-by: Andreas Henriksson <andreas@fatal.se>
>
> diff --git a/man/man8/tc-ematch.8 b/man/man8/tc-ematch.8
> index 2eafc29..957a22e 100644
> --- a/man/man8/tc-ematch.8
> +++ b/man/man8/tc-ematch.8
> @@ -1,4 +1,4 @@
> -.TH filter ematch "6 August 2012" iproute2 Linux
> +.TH ematch 8 "6 August 2012" iproute2 Linux
> .
> .SH NAME
> ematch \- extended matches for use with "basic" or "flow" filters
Applied, thanks.
^ permalink raw reply
* Re: [RFC PATCH net-next 4/4 V4] try to fix performance regression
From: Eric Dumazet @ 2012-12-12 16:25 UTC (permalink / raw)
To: Weiping Pan; +Cc: davem, brutus, netdev
In-Reply-To: <5e333588f6cb48cc3464b2263dcaa734b952e4c1.1355320534.git.wpan@redhat.com>
On Wed, 2012-12-12 at 22:29 +0800, Weiping Pan wrote:
>
> MS BASE AF_UNIX FRIENDS TCP_STREAM_MS
> 1 10.70 5.40 4.02 37% 74%
> 2 28.01 9.67 7.97 28% 82%
> 4 55.53 19.78 16.48 29% 83%
> 8 115.40 38.22 33.51 29% 87%
> 16 227.31 81.06 67.70 29% 83%
> 32 446.20 166.59 129.31 28% 77%
> 64 849.04 336.77 259.43 30% 77%
> 128 1440.50 661.88 530.43 36% 80%
> 256 2404.70 1279.67 1029.15 42% 80%
> 512 4331.53 2501.30 1942.21 44% 77%
> 1024 6819.78 4622.37 4128.10 60% 89%
> 2048 10544.60 6348.81 6349.59 60% 100%
> 4096 12830.41 8324.43 7984.43 62% 95%
> 8192 13462.65 8355.49 11079.37 82% 132%
> 16384 9960.87 10840.13 13037.81 130% 120%
> 32768 8749.31 11372.15 15087.08 172% 132%
> 65536 7580.27 12150.23 14971.42 197% 123%
> 131072 6727.74 11451.34 13604.78 202% 118%
> 262144 7673.14 11613.10 11436.97 149% 98%
> 524288 7366.17 11675.95 11559.43 156% 99%
> 1048576 6608.57 11883.01 10103.20 152% 85%
> MS means Message Size in bytes, that is -m -M for netperf
I can't reproduce your strange numbers here, they make no sense to me.
for s in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
65536 131072 262144 524288 1048576
do
./netperf -- -m $s -M $s | tail -n1
done
Results :
87380 16384 1 10.00 34.68
87380 16384 2 10.00 68.07
87380 16384 4 10.00 126.27
87380 16384 8 10.00 284.50
87380 16384 16 10.00 574.38
87380 16384 32 10.00 1091.74
87380 16384 64 10.00 2130.23
87380 16384 128 10.00 4001.83
87380 16384 256 10.00 7666.01
87380 16384 512 10.00 13425.81
87380 16384 1024 10.00 21146.43
87380 16384 2048 10.00 28551.42
87380 16384 4096 10.00 37878.95
87380 16384 8192 10.00 42507.23
87380 16384 16384 10.00 46782.53
87380 16384 32768 10.00 42410.97
87380 16384 65536 10.00 43053.09
87380 16384 131072 10.00 44504.20
87380 16384 262144 10.00 50211.74
87380 16384 524288 10.00 54004.23
87380 16384 1048576 10.00 53852.26
^ permalink raw reply
* Re: [PATCH] net: filter: return -EINVAL if BPF_S_ANC* operation is not supported
From: Daniel Borkmann @ 2012-12-12 16:25 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Ani Sinha
In-Reply-To: <1355314964.9139.173.camel@edumazet-glaptop>
On 12/12/2012 01:22 PM, Eric Dumazet wrote:
> On Wed, 2012-12-12 at 10:31 +0100, Daniel Borkmann wrote:
>> Currently, we return -EINVAL for malicious or wrong BPF filters.
>> However, this is not done for BPF_S_ANC* operations, which makes it
>> more difficult to detect if it's actually supported or not by the
>> BPF machine. Therefore, we should also return -EINVAL if K is within
>> the SKF_AD_OFF universe and the ancillary operation did not match.
>>
>> Cc: Ani Sinha <ani@aristanetworks.com>
>> Cc: Eric Dumazet <eric.dumazet@gmail.com>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>> ---
>> net/core/filter.c | 8 +++++++-
>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index c23543c..de9bed4 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -531,7 +531,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>> [BPF_JMP|BPF_JSET|BPF_K] = BPF_S_JMP_JSET_K,
>> [BPF_JMP|BPF_JSET|BPF_X] = BPF_S_JMP_JSET_X,
>> };
>> - int pc;
>> + int pc, anc_found;
>>
>> if (flen == 0 || flen > BPF_MAXINSNS)
>> return -EINVAL;
>> @@ -592,8 +592,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>> case BPF_S_LD_W_ABS:
>> case BPF_S_LD_H_ABS:
>> case BPF_S_LD_B_ABS:
>> + anc_found = 0;
>> #define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \
>> code = BPF_S_ANC_##CODE; \
>> + anc_found = 1; \
>> break
>> switch (ftest->k) {
>> ANCILLARY(PROTOCOL);
>> @@ -610,6 +612,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>> ANCILLARY(VLAN_TAG);
>> ANCILLARY(VLAN_TAG_PRESENT);
>> }
>> +
>> + /* ancillary operation unknown or unsupported */
>> + if (anc_found == 0 && ftest->k >= SKF_AD_OFF)
>> + return -EINVAL;
>> }
>> ftest->code = code;
>> }
>
> Several points :
>
> 1) This might break a userland filter that was previously working, by
> returning 0 when load_pointer() returns NULL.
>
> Specifying an offset bigger than skb->len is not _invalid_, it only
> makes a filter return 0, because load_pointer() returns NULL.
I think it will not break code that passed the sk_chk_filter() test and
ends up calling load_pointer() in such a circumstance. However, it will
"break" code that uses ...
{ BPF_LD | BPF_(W|H|B) | BPF_ABS, 0, 0, <K> },
... where <K> is in [0xfffff000, 0xffffffff] _and_ <K> is not an ancillary.
But ...
Assuming some old code has such an instruction where <K> is within
[0xfffff000, 0xffffffff] and it doesn't know about ancillary operations,
then it already gets unexpected/unwanted behavior (since we do not exit
the BPF machine with 0, as was probably the case before anc. ops, but
load something into the accumulator instead and continue with the next
instruction, for instance), right? Thus, following this argumentation, user
space code would already have been broken by the introduction of ancillary
operations into the BPF machine per se.
This is probably just an assumption, but code that does such a direct load,
e.g. "load word at packet offset 0xffffffff into accumulator" ("ld [0xffffffff]")
is quite broken, isn't it? Isn't the whole assumption of ancillary operations
that no-one intentionally issues things like "ld [0xffffffff]" and expects this
word to be loaded from the packet offset?
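To make the range argument concrete, here is a minimal sketch in plain C of
how the <K> of an absolute load falls into one of three classes. It is not the
kernel code: EX_SKF_AD_OFF, ex_known_anc and ex_classify_abs_load() are
hypothetical names, the base value assumes SKF_AD_OFF is -0x1000 seen as a
u32, and the table lists only three offsets that are assumed to correspond to
SKF_AD_PROTOCOL, SKF_AD_PKTTYPE and SKF_AD_IFINDEX.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: start of the ancillary range, i.e. (u32)(-0x1000). */
#define EX_SKF_AD_OFF 0xfffff000u

enum ex_load_kind {
	EX_LOAD_PACKET,      /* ordinary packet offset */
	EX_LOAD_ANCILLARY,   /* offset maps to a recognised ancillary op */
	EX_LOAD_UNKNOWN_ANC, /* inside the ancillary range, not recognised */
};

/* Illustrative subset of ancillary offsets known to this "kernel". */
static const uint32_t ex_known_anc[] = {
	0, /* assumed: PROTOCOL */
	4, /* assumed: PKTTYPE */
	8, /* assumed: IFINDEX */
};

/* Classify the K operand of a BPF_LD | BPF_{W,H,B} | BPF_ABS instruction. */
static enum ex_load_kind ex_classify_abs_load(uint32_t k)
{
	size_t i;

	if (k < EX_SKF_AD_OFF)
		return EX_LOAD_PACKET;

	for (i = 0; i < sizeof(ex_known_anc) / sizeof(ex_known_anc[0]); i++) {
		if (k == EX_SKF_AD_OFF + ex_known_anc[i])
			return EX_LOAD_ANCILLARY;
	}

	return EX_LOAD_UNKNOWN_ANC;
}
```

With the proposed check, only the third class would make sk_chk_filter()
return -EINVAL; the first two classes are handled as before.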
> 2) This won't help applications running on old kernels where your patch
> won't be applied, as already mentioned yesterday.
Agreed, but leaving old kernels aside, it would be nice if newer kernels
could validate that, so at least from kernel <xyz> onwards it could be
checked _for sure_ if anc.op <abc> is present and can be used.
> 3) Misses a "Reported-by" tag
>
> 4) anc_found is a boolean
3 + 4 agreed, sorry for that. I could do a v2 of the patch with 3 + 4 fixed
and resubmit it, if there's interest ...
> To be truly portable, userland should not rely on kernel doing a full
> validation of ancillaries.
^ permalink raw reply
* Re: [PATCH] tun: allow setting ethernet addresss while running
From: Stephen Hemminger @ 2012-12-12 16:38 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: davem, netdev, jasowang
In-Reply-To: <alpine.LNX.2.01.1212120427370.16297@nerf07.vanv.qr>
On Wed, 12 Dec 2012 04:27:54 +0100 (CET)
Jan Engelhardt <jengelh@inai.de> wrote:
> On Tuesday 2012-12-11 02:16, Stephen Hemminger wrote:
>
> >This is a pure software device, and ok with live address change.
> >--- a/drivers/net/tun.c
> >+++ b/drivers/net/tun.c
> >@@ -849,6 +849,7 @@ static void tun_net_init(struct net_device *dev)
> > /* Ethernet TAP Device */
> > ether_setup(dev);
> > dev->priv_flags &= ~IFF_TX_SKB_SHARING;
> >+ dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
> >
> > eth_hw_addr_random(dev);
>
> Would this possibly apply to L2TP devices as well?
L2TP does not allow changing mac address at all right now.
Only drivers that use eth_mac_addr can take advantage of the flag.
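As a rough illustration of why the flag only matters for eth_mac_addr users,
here is a small userspace model of the check involved. It is a sketch, not the
kernel function: struct ex_netdev, EX_IFF_LIVE_ADDR_CHANGE (including its bit
value) and ex_set_mac() are hypothetical stand-ins, and address validation is
left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag bit; the real IFF_LIVE_ADDR_CHANGE value differs. */
#define EX_IFF_LIVE_ADDR_CHANGE 0x1u

struct ex_netdev {
	unsigned int priv_flags; /* private driver flags */
	bool running;            /* models netif_running() */
	unsigned char addr[6];   /* current ethernet address */
};

/*
 * Model of a set-MAC handler: a running device refuses the change
 * unless the driver declared that live address changes are fine.
 */
static int ex_set_mac(struct ex_netdev *dev, const unsigned char *mac)
{
	if (dev->running && !(dev->priv_flags & EX_IFF_LIVE_ADDR_CHANGE))
		return -EBUSY;

	memcpy(dev->addr, mac, sizeof(dev->addr));
	return 0;
}
```

A driver with its own set-address handler never reaches a check like this, so
setting the flag there would have no effect.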
Looking around, here are the other places that could use it:
vxlan, xen-netfront?, gre, gre6, virtio_net?, hyperv?
Also, the following look buggy:
c2 allows changing the mac address but never tells the hardware?
isdn/hysdn_net.c allows setting the mac address but then resets it to the
card value in net_open
xpnet allows setting the address but it looks like it is fixed by hardware
ipddp allows setting an ethernet address but the protocol is not ethernet
^ permalink raw reply
* [PATCH net-next] uapi: add missing netconf.h to export list
From: Stephen Hemminger @ 2012-12-12 16:58 UTC (permalink / raw)
To: David Miller; +Cc: Nicolas Dichtel, netdev
In-Reply-To: <1355305907-7102-1-git-send-email-nicolas.dichtel@6wind.com>
Add netconf.h for use by iproute2.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/include/uapi/linux/Kbuild 2012-10-25 09:11:15.499273810 -0700
+++ b/include/uapi/linux/Kbuild 2012-12-12 08:56:36.130263710 -0800
@@ -258,6 +258,7 @@ header-y += neighbour.h
header-y += net.h
header-y += net_dropmon.h
header-y += net_tstamp.h
+header-y += netconf.h
header-y += netdevice.h
header-y += netfilter.h
header-y += netfilter_arp.h
^ permalink raw reply
* Re: [PATCH iproute2 1/3] ip: add support of netconf messages
From: Stephen Hemminger @ 2012-12-12 16:59 UTC (permalink / raw)
To: Nicolas Dichtel; +Cc: netdev
In-Reply-To: <1355305907-7102-1-git-send-email-nicolas.dichtel@6wind.com>
Ok, but the headers for all of iproute2 are supposed to come from
sanitized kernel headers from "make headers_install"
You missed that piece in the original patch.
^ permalink raw reply
* Re: [PATCH iproute2 1/3] ip: add support of netconf messages
From: Nicolas Dichtel @ 2012-12-12 17:03 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20121212085959.31d8cd79@nehalam.linuxnetplumber.net>
Le 12/12/2012 17:59, Stephen Hemminger a écrit :
> Ok, but the headers for all of iproute2 are supposed to come from
> sanitized kernel headers from "make headers_install"
>
> You missed that piece in the original patch.
>
Right! I will update the patch.
^ permalink raw reply
* Re: [PATCH net-next] uapi: add missing netconf.h to export list
From: Nicolas Dichtel @ 2012-12-12 17:04 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20121212085852.5b840314@nehalam.linuxnetplumber.net>
Le 12/12/2012 17:58, Stephen Hemminger a écrit :
> Add netconf.h for use by iproute2.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>
> --- a/include/uapi/linux/Kbuild 2012-10-25 09:11:15.499273810 -0700
> +++ b/include/uapi/linux/Kbuild 2012-12-12 08:56:36.130263710 -0800
> @@ -258,6 +258,7 @@ header-y += neighbour.h
> header-y += net.h
> header-y += net_dropmon.h
> header-y += net_tstamp.h
> +header-y += netconf.h
> header-y += netdevice.h
> header-y += netfilter.h
> header-y += netfilter_arp.h
>
^ permalink raw reply
* Re: [patch net-next 0/4] net: allow to change carrier from userspace
From: Jiri Pirko @ 2012-12-12 17:05 UTC (permalink / raw)
To: Stephen Hemminger
Cc: netdev, davem, edumazet, bhutchings, mirqus, greearb, fbl
In-Reply-To: <20121212081500.24085752@nehalam.linuxnetplumber.net>
Wed, Dec 12, 2012 at 05:15:00PM CET, shemminger@vyatta.com wrote:
>On Wed, 12 Dec 2012 11:58:03 +0100
>Jiri Pirko <jiri@resnulli.us> wrote:
>
>> This is basically a repost of my previous patchset:
>> "[patch net-next-2.6 0/2] net: allow to change carrier via sysfs" from Aug 30
>>
>> The way net-sysfs stores values changed and this patchset reflects it.
>> Also, I exposed carrier via rtnetlink iface.
>>
>> So far, only dummy driver uses carrier change ndo. In very near future
>> team driver will use that as well.
>>
>> Jiri Pirko (4):
>> net: add change_carrier netdev op
>> net: allow to change carrier via sysfs
>> rtnl: expose carrier value with possibility to set it
>> dummy: implement carrier change
>>
>> drivers/net/dummy.c | 10 ++++++++++
>> include/linux/netdevice.h | 7 +++++++
>> include/uapi/linux/if_link.h | 1 +
>> net/core/dev.c | 19 +++++++++++++++++++
>> net/core/net-sysfs.c | 15 ++++++++++++++-
>> net/core/rtnetlink.c | 10 ++++++++++
>> 6 files changed, 61 insertions(+), 1 deletion(-)
>>
>
>I needed to do the same thing for a project we are working on and discovered
>that there already is a working documented interface for doing that via
>operstate mode. Therefore I can't recommend that the additional complexity
>of a new API for this is required.
I might be missing something, but I'm unable to find how setting
operstate can affect the value returned by netif_carrier_ok().
^ permalink raw reply
* Re: [PATCH iproute2 3/3] ip: add support of 'ip link type [ipip|sit]'
From: Stephen Hemminger @ 2012-12-12 17:11 UTC (permalink / raw)
To: Nicolas Dichtel; +Cc: netdev
In-Reply-To: <1355305907-7102-3-git-send-email-nicolas.dichtel@6wind.com>
On Wed, 12 Dec 2012 10:51:47 +0100
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
> This patch allows to manage ip tunnels via the interface ip link.
> The syntax for parameters is the same that 'ip tunnel'.
>
> It also allows to display tunnels parameters with 'ip -details link' or
> 'ip -details monitor link'.
>
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> ---
> ip/Makefile | 3 +-
> ip/iplink.c | 2 +-
> ip/link_iptnl.c | 340 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 343 insertions(+), 2 deletions(-)
> create mode 100644 ip/link_iptnl.c
>
> diff --git a/ip/Makefile b/ip/Makefile
> index abf54bf..2b606d4 100644
> --- a/ip/Makefile
> +++ b/ip/Makefile
> @@ -4,7 +4,8 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
> ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
> iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
> iplink_macvlan.o iplink_macvtap.o ipl2tp.o link_vti.o \
> - iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o
> + iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
> + link_iptnl.o
>
> RTMONOBJ=rtmon.o
>
> diff --git a/ip/iplink.c b/ip/iplink.c
> index 8aac9fc..d73c705 100644
> --- a/ip/iplink.c
> +++ b/ip/iplink.c
> @@ -84,7 +84,7 @@ void iplink_usage(void)
> if (iplink_have_newlink()) {
> fprintf(stderr, "\n");
> fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | can |\n");
> - fprintf(stderr, " bridge | ipoib | ip6tnl }\n");
> + fprintf(stderr, " bridge | ipoib | ip6tnl | ipip | sit }\n");
> }
> exit(-1);
> }
> diff --git a/ip/link_iptnl.c b/ip/link_iptnl.c
> new file mode 100644
> index 0000000..238722d
> --- /dev/null
> +++ b/ip/link_iptnl.c
> @@ -0,0 +1,340 @@
> +/*
> + * link_iptnl.c ipip and sit driver module
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + *
> + * Authors: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> + *
> + */
> +
> +#include <string.h>
> +#include <net/if.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <arpa/inet.h>
> +
> +#include <linux/ip.h>
> +#include <linux/if_tunnel.h>
> +#include "rt_names.h"
> +#include "utils.h"
> +#include "ip_common.h"
> +#include "tunnel.h"
> +
> +static void usage(int sit) __attribute__((noreturn));
> +static void usage(int sit)
> +{
> + fprintf(stderr, "Usage: ip link { add | set | change | replace | del } NAME\n");
> + fprintf(stderr, " type { ipip | sit } [ remote ADDR ] [ local ADDR ]\n");
> + fprintf(stderr, " [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ]\n");
> + fprintf(stderr, " [ 6rd-prefix ADDR ] [ 6rd-relay_prefix ADDR ] [ 6rd-reset ]\n");
> + if (sit)
> + fprintf(stderr, " [ isatap ]\n");
> + fprintf(stderr, "\n");
> + fprintf(stderr, "Where: NAME := STRING\n");
> + fprintf(stderr, " ADDR := { IP_ADDRESS | any }\n");
> + fprintf(stderr, " TOS := { NUMBER | inherit }\n");
> + fprintf(stderr, " TTL := { 1..255 | inherit }\n");
> + exit(-1);
> +}
> +
> +static int iptunnel_parse_opt(struct link_util *lu, int argc, char **argv,
> + struct nlmsghdr *n)
> +{
> + struct {
> + struct nlmsghdr n;
> + struct ifinfomsg i;
> + char buf[2048];
> + } req;
> + struct ifinfomsg *ifi = (struct ifinfomsg *)(n + 1);
> + struct rtattr *tb[IFLA_MAX + 1];
> + struct rtattr *linkinfo[IFLA_INFO_MAX+1];
> + struct rtattr *iptuninfo[IFLA_IPTUN_MAX + 1];
> + int len;
> + __u32 link = 0;
> + __u32 laddr = 0;
> + __u32 raddr = 0;
> + __u8 ttl = 0;
> + __u8 tos = 0;
> + __u8 pmtudisc = 1;
> + __u16 iflags = 0;
> + struct in6_addr ip6rdprefix;
> + __u16 ip6rdprefixlen = 0;
> + __u32 ip6rdrelayprefix = 0;
> + __u16 ip6rdrelayprefixlen = 0;
> +
> + memset(&ip6rdprefix, 0, sizeof(ip6rdprefix));
> +
> + if (!(n->nlmsg_flags & NLM_F_CREATE)) {
> + memset(&req, 0, sizeof(req));
> +
> + req.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi));
> + req.n.nlmsg_flags = NLM_F_REQUEST;
> + req.n.nlmsg_type = RTM_GETLINK;
> + req.i.ifi_family = preferred_family;
> + req.i.ifi_index = ifi->ifi_index;
> +
> + if (rtnl_talk(&rth, &req.n, 0, 0, &req.n) < 0) {
> +get_failed:
> + fprintf(stderr,
> + "Failed to get existing tunnel info.\n");
> + return -1;
> + }
> +
> + len = req.n.nlmsg_len;
> + len -= NLMSG_LENGTH(sizeof(*ifi));
> + if (len < 0)
> + goto get_failed;
> +
> + parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
> +
> + if (!tb[IFLA_LINKINFO])
> + goto get_failed;
> +
> + parse_rtattr_nested(linkinfo, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
> +
> + if (!linkinfo[IFLA_INFO_DATA])
> + goto get_failed;
> +
> + parse_rtattr_nested(iptuninfo, IFLA_IPTUN_MAX,
> + linkinfo[IFLA_INFO_DATA]);
> +
> + if (iptuninfo[IFLA_IPTUN_LOCAL])
> + laddr = rta_getattr_u32(iptuninfo[IFLA_IPTUN_LOCAL]);
> +
> + if (iptuninfo[IFLA_IPTUN_REMOTE])
> + raddr = rta_getattr_u32(iptuninfo[IFLA_IPTUN_REMOTE]);
> +
> + if (iptuninfo[IFLA_IPTUN_TTL])
> + ttl = rta_getattr_u8(iptuninfo[IFLA_IPTUN_TTL]);
> +
> + if (iptuninfo[IFLA_IPTUN_TOS])
> + tos = rta_getattr_u8(iptuninfo[IFLA_IPTUN_TOS]);
> +
> + if (iptuninfo[IFLA_IPTUN_PMTUDISC])
> + pmtudisc =
> + rta_getattr_u8(iptuninfo[IFLA_IPTUN_PMTUDISC]);
> +
> + if (iptuninfo[IFLA_IPTUN_FLAGS])
> + iflags = rta_getattr_u16(iptuninfo[IFLA_IPTUN_FLAGS]);
> +
> + if (iptuninfo[IFLA_IPTUN_LINK])
> + link = rta_getattr_u32(iptuninfo[IFLA_IPTUN_LINK]);
> +
> + if (iptuninfo[IFLA_IPTUN_6RD_PREFIX])
> + memcpy(&ip6rdprefix,
> + RTA_DATA(iptuninfo[IFLA_IPTUN_6RD_PREFIX]),
> + sizeof(ip6rdprefix));
> +
> + if (iptuninfo[IFLA_IPTUN_6RD_PREFIXLEN])
> + ip6rdprefixlen =
> + rta_getattr_u16(iptuninfo[IFLA_IPTUN_6RD_PREFIXLEN]);
> +
> + if (iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIX])
> + ip6rdrelayprefix =
> + rta_getattr_u32(iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIX]);
> +
> + if (iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIXLEN])
> + ip6rdrelayprefixlen =
> + rta_getattr_u16(iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
> + }
> +
> + while (argc > 0) {
> + if (strcmp(*argv, "remote") == 0) {
> + NEXT_ARG();
> + if (strcmp(*argv, "any"))
> + raddr = get_addr32(*argv);
> + else
> + raddr = 0;
> + } else if (strcmp(*argv, "local") == 0) {
> + NEXT_ARG();
> + if (strcmp(*argv, "any"))
> + laddr = get_addr32(*argv);
> + else
> + laddr = 0;
> + } else if (matches(*argv, "dev") == 0) {
> + NEXT_ARG();
> + link = if_nametoindex(*argv);
> + if (link == 0)
> + invarg("\"dev\" is invalid", *argv);
> + } else if (strcmp(*argv, "ttl") == 0 ||
> + strcmp(*argv, "hoplimit") == 0) {
> + NEXT_ARG();
> + if (strcmp(*argv, "inherit") != 0) {
> + if (get_u8(&ttl, *argv, 0))
> + invarg("invalid TTL\n", *argv);
> + } else
> + ttl = 0;
> + } else if (strcmp(*argv, "tos") == 0 ||
> + strcmp(*argv, "tclass") == 0 ||
> + matches(*argv, "dsfield") == 0) {
> + __u32 uval;
> + NEXT_ARG();
> + if (strcmp(*argv, "inherit") != 0) {
> + if (rtnl_dsfield_a2n(&uval, *argv))
> + invarg("bad TOS value", *argv);
> + tos = uval;
> + } else
> + tos = 1;
> + } else if (strcmp(*argv, "nopmtudisc") == 0) {
> + pmtudisc = 0;
> + } else if (strcmp(*argv, "pmtudisc") == 0) {
> + pmtudisc = 1;
> + } else if (strcmp(lu->id, "sit") == 0 &&
> + strcmp(*argv, "isatap") == 0) {
> + iflags |= SIT_ISATAP;
> + } else if (strcmp(*argv, "6rd-prefix") == 0) {
> + inet_prefix prefix;
> + NEXT_ARG();
> + if (get_prefix(&prefix, *argv, AF_INET6))
> + invarg("invalid 6rd_prefix\n", *argv);
> + memcpy(&ip6rdprefix, prefix.data, 16);
> + ip6rdprefixlen = prefix.bitlen;
> + } else if (strcmp(*argv, "6rd-relay_prefix") == 0) {
> + inet_prefix prefix;
> + NEXT_ARG();
> + if (get_prefix(&prefix, *argv, AF_INET))
> + invarg("invalid 6rd-relay_prefix\n", *argv);
> + memcpy(&ip6rdrelayprefix, prefix.data, 4);
> + ip6rdrelayprefixlen = prefix.bitlen;
> + } else if (strcmp(*argv, "6rd-reset") == 0) {
> + inet_prefix prefix;
> + get_prefix(&prefix, "2002::", AF_INET6);
> + memcpy(&ip6rdprefix, prefix.data, 16);
> + ip6rdprefixlen = 16;
> + ip6rdrelayprefix = 0;
> + ip6rdrelayprefixlen = 0;
> + } else
> + usage(strcmp(lu->id, "sit") == 0);
> + argc--, argv++;
> + }
> +
> + if (ttl && pmtudisc == 0) {
> + fprintf(stderr, "ttl != 0 and nopmtudisc are incompatible\n");
> + exit(-1);
> + }
> +
> + addattr32(n, 1024, IFLA_IPTUN_LINK, link);
> + addattr32(n, 1024, IFLA_IPTUN_LOCAL, laddr);
> + addattr32(n, 1024, IFLA_IPTUN_REMOTE, raddr);
> + addattr8(n, 1024, IFLA_IPTUN_TTL, ttl);
> + addattr8(n, 1024, IFLA_IPTUN_TOS, tos);
> + addattr8(n, 1024, IFLA_IPTUN_PMTUDISC, pmtudisc);
> + if (strcmp(lu->id, "sit") == 0) {
> + addattr16(n, 1024, IFLA_IPTUN_FLAGS, iflags);
> + if (ip6rdprefixlen) {
> + addattr_l(n, 1024, IFLA_IPTUN_6RD_PREFIX,
> + &ip6rdprefix, sizeof(ip6rdprefix));
> + addattr16(n, 1024, IFLA_IPTUN_6RD_PREFIXLEN,
> + ip6rdprefixlen);
> + addattr32(n, 1024, IFLA_IPTUN_6RD_RELAY_PREFIX,
> + ip6rdrelayprefix);
> + addattr16(n, 1024, IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
> + ip6rdrelayprefixlen);
> + }
> + }
> +
> + return 0;
> +}
> +
> +static void iptunnel_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
> +{
> + char s1[1024];
> + char s2[64];
> + const char *local = "any";
> + const char *remote = "any";
> +
> + if (!tb)
> + return;
> +
> + if (tb[IFLA_IPTUN_REMOTE]) {
> + unsigned addr = rta_getattr_u32(tb[IFLA_IPTUN_REMOTE]);
> +
> + if (addr)
> + remote = format_host(AF_INET, 4, &addr, s1, sizeof(s1));
> + }
> +
> + fprintf(f, "remote %s ", remote);
> +
> + if (tb[IFLA_IPTUN_LOCAL]) {
> + unsigned addr = rta_getattr_u32(tb[IFLA_IPTUN_LOCAL]);
> +
> + if (addr)
> + local = format_host(AF_INET, 4, &addr, s1, sizeof(s1));
> + }
> +
> + fprintf(f, "local %s ", local);
> +
> + if (tb[IFLA_IPTUN_LINK] && rta_getattr_u32(tb[IFLA_IPTUN_LINK])) {
> + unsigned link = rta_getattr_u32(tb[IFLA_IPTUN_LINK]);
> + const char *n = if_indextoname(link, s2);
> +
> + if (n)
> + fprintf(f, "dev %s ", n);
> + else
> + fprintf(f, "dev %u ", link);
> + }
> +
> + if (tb[IFLA_IPTUN_TTL] && rta_getattr_u8(tb[IFLA_IPTUN_TTL]))
> + fprintf(f, "ttl %d ", rta_getattr_u8(tb[IFLA_IPTUN_TTL]));
> + else
> + fprintf(f, "ttl inherit ");
> +
> + if (tb[IFLA_IPTUN_TOS] && rta_getattr_u8(tb[IFLA_IPTUN_TOS])) {
> + int tos = rta_getattr_u8(tb[IFLA_IPTUN_TOS]);
> +
> + fputs("tos ", f);
> + if (tos == 1)
> + fputs("inherit ", f);
> + else
> + fprintf(f, "0x%x ", tos);
> + }
> +
> + if (tb[IFLA_IPTUN_PMTUDISC] && rta_getattr_u8(tb[IFLA_IPTUN_PMTUDISC]))
> + fprintf(f, "pmtudisc ");
> + else
> + fprintf(f, "nopmtudisc ");
> +
> + if (tb[IFLA_IPTUN_FLAGS]) {
> + __u16 iflags = rta_getattr_u16(tb[IFLA_IPTUN_FLAGS]);
> +
> + if (iflags & SIT_ISATAP)
> + fprintf(f, "isatap ");
> + }
> +
> + if (tb[IFLA_IPTUN_6RD_PREFIXLEN] &&
> + *(__u16 *)RTA_DATA(tb[IFLA_IPTUN_6RD_PREFIXLEN])) {
> + __u16 prefixlen = rta_getattr_u16(tb[IFLA_IPTUN_6RD_PREFIXLEN]);
> + __u16 relayprefixlen =
> + rta_getattr_u16(tb[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
> + __u32 relayprefix =
> + rta_getattr_u32(tb[IFLA_IPTUN_6RD_RELAY_PREFIX]);
> +
> + printf("6rd-prefix %s/%u ",
> + inet_ntop(AF_INET6, RTA_DATA(tb[IFLA_IPTUN_6RD_PREFIX]),
> + s1, sizeof(s1)),
> + prefixlen);
> + if (relayprefix) {
> + printf("6rd-relay_prefix %s/%u ",
> + format_host(AF_INET, 4, &relayprefix, s1,
> + sizeof(s1)),
> + relayprefixlen);
> + }
> + }
> +}
> +
> +struct link_util ipip_link_util = {
> + .id = "ipip",
> + .maxattr = IFLA_IPTUN_MAX,
> + .parse_opt = iptunnel_parse_opt,
> + .print_opt = iptunnel_print_opt,
> +};
> +
> +struct link_util sit_link_util = {
> + .id = "sit",
> + .maxattr = IFLA_IPTUN_MAX,
> + .parse_opt = iptunnel_parse_opt,
> + .print_opt = iptunnel_print_opt,
> +};
All applied with minor corrections to header files.
Could you please add man pages for this new functionality?
^ permalink raw reply
* Re: [PATCHv2 iproute2] add DOVE extensions for iproute2
From: Stephen Hemminger @ 2012-12-12 17:14 UTC (permalink / raw)
To: David L Stevens; +Cc: David Miller, netdev
In-Reply-To: <201212121612.qBCGAimS017147@lab1.dls>
On Wed, 12 Dec 2012 11:10:44 -0500
David L Stevens <dlstevens@us.ibm.com> wrote:
>
> This patch adds a new flag to iproute2 for vxlan devices to enable
> DOVE features. It also adds support for L2 and L3 switch lookup miss
> netlink messages to "ip monitor".
>
> Changes since v1:
> - split "dove" flag into separate feature flags:
> - "proxy" for ARP reduction
> - "rsc" for route short circuiting
> - "l2miss" for L2 switch miss notifications
> - "l3miss" for L3 switch miss notifications
>
> Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
>
> diff --git a/include/linux/if_link.h b/include/linux/if_link.h
> index 012d95a..a163702 100644
> --- a/include/linux/if_link.h
> +++ b/include/linux/if_link.h
> @@ -283,6 +283,10 @@ enum {
> IFLA_VXLAN_AGEING,
> IFLA_VXLAN_LIMIT,
> IFLA_VXLAN_PORT_RANGE,
> + IFLA_VXLAN_PROXY,
> + IFLA_VXLAN_RSC,
> + IFLA_VXLAN_L2MISS,
> + IFLA_VXLAN_L3MISS,
> __IFLA_VXLAN_MAX
> };
> #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
> diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
> index ba5c4ab..f2e6bef 100644
> --- a/ip/iplink_vxlan.c
> +++ b/ip/iplink_vxlan.c
> @@ -26,6 +26,8 @@ static void explain(void)
> fprintf(stderr, "Usage: ... vxlan id VNI [ group ADDR ] [ local ADDR ]\n");
> fprintf(stderr, " [ ttl TTL ] [ tos TOS ] [ dev PHYS_DEV ]\n");
> fprintf(stderr, " [ port MIN MAX ] [ [no]learning ]\n");
> + fprintf(stderr, " [ [no]proxy ] [ [no]rsc ]\n");
> + fprintf(stderr, " [ [no]l2miss ] [ [no]l3miss ]\n");
> fprintf(stderr, "\n");
> fprintf(stderr, "Where: VNI := 0-16777215\n");
> fprintf(stderr, " ADDR := { IP_ADDRESS | any }\n");
> @@ -44,6 +46,10 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
> __u8 tos = 0;
> __u8 ttl = 0;
> __u8 learning = 1;
> + __u8 proxy = 0;
> + __u8 rsc = 0;
> + __u8 l2miss = 0;
> + __u8 l3miss = 0;
> __u8 noage = 0;
> __u32 age = 0;
> __u32 maxaddr = 0;
> @@ -123,6 +129,22 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
> learning = 0;
> } else if (!matches(*argv, "learning")) {
> learning = 1;
> + } else if (!matches(*argv, "noproxy")) {
> + proxy = 0;
> + } else if (!matches(*argv, "proxy")) {
> + proxy = 1;
> + } else if (!matches(*argv, "norsc")) {
> + rsc = 0;
> + } else if (!matches(*argv, "rsc")) {
> + rsc = 1;
> + } else if (!matches(*argv, "nol2miss")) {
> + l2miss = 0;
> + } else if (!matches(*argv, "l2miss")) {
> + l2miss = 1;
> + } else if (!matches(*argv, "nol3miss")) {
> + l3miss = 0;
> + } else if (!matches(*argv, "l3miss")) {
> + l3miss = 1;
> } else if (matches(*argv, "help") == 0) {
> explain();
> return -1;
> @@ -148,6 +170,10 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
> addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
> addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
> addattr8(n, 1024, IFLA_VXLAN_LEARNING, learning);
> + addattr8(n, 1024, IFLA_VXLAN_PROXY, proxy);
> + addattr8(n, 1024, IFLA_VXLAN_RSC, rsc);
> + addattr8(n, 1024, IFLA_VXLAN_L2MISS, l2miss);
> + addattr8(n, 1024, IFLA_VXLAN_L3MISS, l3miss);
> if (noage)
> addattr32(n, 1024, IFLA_VXLAN_AGEING, 0);
> else if (age)
> @@ -213,6 +239,18 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
> if (tb[IFLA_VXLAN_LEARNING] &&
> !rta_getattr_u8(tb[IFLA_VXLAN_LEARNING]))
> fputs("nolearning ", f);
> +
> + if (tb[IFLA_VXLAN_PROXY] && rta_getattr_u8(tb[IFLA_VXLAN_PROXY]))
> + fputs("proxy ", f);
> +
> + if (tb[IFLA_VXLAN_RSC] && rta_getattr_u8(tb[IFLA_VXLAN_RSC]))
> + fputs("rsc ", f);
> +
> + if (tb[IFLA_VXLAN_L2MISS] && rta_getattr_u8(tb[IFLA_VXLAN_L2MISS]))
> + fputs("l2miss ", f);
> +
> + if (tb[IFLA_VXLAN_L3MISS] && rta_getattr_u8(tb[IFLA_VXLAN_L3MISS]))
> + fputs("l3miss ", f);
>
> if (tb[IFLA_VXLAN_TOS] &&
> (tos = rta_getattr_u8(tb[IFLA_VXLAN_TOS]))) {
> diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
> index 4b1d469..7a7cc88 100644
> --- a/ip/ipmonitor.c
> +++ b/ip/ipmonitor.c
> @@ -67,7 +67,8 @@ int accept_msg(const struct sockaddr_nl *who,
> print_addrlabel(who, n, arg);
> return 0;
> }
> - if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH) {
> + if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH ||
> + n->nlmsg_type == RTM_GETNEIGH) {
> if (prefix_banner)
> fprintf(fp, "[NEIGH]");
> print_neigh(who, n, arg);
> diff --git a/ip/ipneigh.c b/ip/ipneigh.c
> index 56e56b2..1b7600b 100644
> --- a/ip/ipneigh.c
> +++ b/ip/ipneigh.c
> @@ -189,7 +189,8 @@ int print_neigh(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
> struct rtattr * tb[NDA_MAX+1];
> char abuf[256];
>
> - if (n->nlmsg_type != RTM_NEWNEIGH && n->nlmsg_type != RTM_DELNEIGH) {
> + if (n->nlmsg_type != RTM_NEWNEIGH && n->nlmsg_type != RTM_DELNEIGH &&
> + n->nlmsg_type != RTM_GETNEIGH) {
> fprintf(stderr, "Not RTM_NEWNEIGH: %08x %08x %08x\n",
> n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
>
> @@ -251,6 +252,8 @@ int print_neigh(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
>
> if (n->nlmsg_type == RTM_DELNEIGH)
> fprintf(fp, "delete ");
> + else if (n->nlmsg_type == RTM_GETNEIGH)
> + fprintf(fp, "miss ");
> if (tb[NDA_DST]) {
> fprintf(fp, "%s ",
> format_host(r->ndm_family,
>
This patch doesn't apply cleanly against the current version in iproute2 git.
Not your fault, conflicts arose from earlier patches applied. Could you fix
the conflicts and resubmit please.
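For whoever rebases it, the "ip monitor" part of the quoted patch boils down
to the mapping sketched below. This is only an illustration: the EX_RTM_*
constants are local stand-ins assumed to carry the same values as the
rtnetlink message types, and ex_neigh_prefix() is a hypothetical helper, not a
function in iproute2.

```c
#include <stddef.h>

/* Local stand-ins, assumed to match RTM_{NEW,DEL,GET}NEIGH. */
enum {
	EX_RTM_NEWNEIGH = 28,
	EX_RTM_DELNEIGH = 29,
	EX_RTM_GETNEIGH = 30,
};

/*
 * Return the word printed before a neighbour entry, or NULL when the
 * message type is not a neighbour message and should be rejected.
 */
static const char *ex_neigh_prefix(int nlmsg_type)
{
	switch (nlmsg_type) {
	case EX_RTM_NEWNEIGH:
		return "";        /* plain entry, no prefix */
	case EX_RTM_DELNEIGH:
		return "delete ";
	case EX_RTM_GETNEIGH:
		return "miss ";   /* L2/L3 lookup miss notification */
	default:
		return NULL;
	}
}
```

Both accept_msg() and print_neigh() have to agree on the accepted types, which
is why the patch touches ipmonitor.c and ipneigh.c together.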
^ permalink raw reply
* Re: [PATCH net-next 1/2] net: ethtool: Add destination MAC address to flow steering API
From: Ben Hutchings @ 2012-12-12 17:17 UTC (permalink / raw)
To: Amir Vadai; +Cc: David S. Miller, netdev, Or Gerlitz, Yan Burman
In-Reply-To: <1355227436-18383-2-git-send-email-amirv@mellanox.com>
On Tue, 2012-12-11 at 14:03 +0200, Amir Vadai wrote:
> From: Yan Burman <yanb@mellanox.com>
>
> Add ability to specify destination MAC address for L3/L4 flow spec
> in order to be able to specify action for different VM's under vSwitch
> configuration. This change is transparent to older userspace.
>
> Signed-off-by: Yan Burman <yanb@mellanox.com>
> Signed-off-by: Amir Vadai <amirv@mellanox.com>
> ---
> include/uapi/linux/ethtool.h | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
> index d3eaaaf..be8c41e 100644
> --- a/include/uapi/linux/ethtool.h
> +++ b/include/uapi/linux/ethtool.h
> @@ -500,13 +500,15 @@ union ethtool_flow_union {
> struct ethtool_ah_espip4_spec esp_ip4_spec;
> struct ethtool_usrip4_spec usr_ip4_spec;
> struct ethhdr ether_spec;
> - __u8 hdata[60];
> + __u8 hdata[52];
> };
>
> struct ethtool_flow_ext {
> - __be16 vlan_etype;
> - __be16 vlan_tci;
> - __be32 data[2];
> + __u8 padding[2];
> + unsigned char h_dest[ETH_ALEN]; /* destination eth addr */
> + __be16 vlan_etype;
> + __be16 vlan_tci;
> + __be32 data[2];
> };
>
> /**
> @@ -1027,6 +1029,7 @@ enum ethtool_sfeatures_retval_bits {
> #define ETHER_FLOW 0x12 /* spec only (ether_spec) */
> /* Flag to enable additional fields in struct ethtool_rx_flow_spec */
> #define FLOW_EXT 0x80000000
> +#define FLOW_MAC_EXT 0x40000000
You'll need to document exactly which flags and fields are related.
Adding kernel-doc to struct ethtool_flow_ext is probably the best way to
do that.
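As a rough illustration of what such kernel-doc could look like, here is a user-space mirror of the proposed layout. The names flow_ext_sketch, combined_size, has_mac_ext and check_layout are made up for this sketch. The flag-to-field mapping in the comment (FLOW_MAC_EXT for h_dest, FLOW_EXT for the other fields) is an assumption read off the hunk above, not something the patch states. The size check also shows why shrinking hdata from 60 to 52 bytes keeps the combined layout at 72 bytes and leaves vlan_etype at the same overall offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6

/* Flag values copied from the quoted hunk. */
#define FLOW_EXT     0x80000000u
#define FLOW_MAC_EXT 0x40000000u

/**
 * struct flow_ext_sketch - hypothetical mirror of the proposed ethtool_flow_ext
 * @padding: reserved, keeps the following fields aligned
 * @h_dest: destination MAC address; assumed valid only with FLOW_MAC_EXT
 * @vlan_etype: VLAN EtherType; assumed valid only with FLOW_EXT
 * @vlan_tci: VLAN tag control information; assumed valid only with FLOW_EXT
 * @data: user-defined data; assumed valid only with FLOW_EXT
 */
struct flow_ext_sketch {
	uint8_t  padding[2];
	uint8_t  h_dest[ETH_ALEN];
	uint16_t vlan_etype;
	uint16_t vlan_tci;
	uint32_t data[2];
};

/* Bytes taken by the header union (hdata) plus the extension that follows. */
static size_t combined_size(size_t hdata_len, size_t ext_len)
{
	return hdata_len + ext_len;
}

/* Returns 1 when the given flow_type would make h_dest meaningful. */
static int has_mac_ext(uint32_t flow_type)
{
	return (flow_type & FLOW_MAC_EXT) != 0;
}

/* Old layout: hdata[60] + 12-byte ext. New layout: hdata[52] + 20-byte ext. */
static void check_layout(void)
{
	assert(sizeof(struct flow_ext_sketch) == 20);
	assert(combined_size(60, 12) ==
	       combined_size(52, sizeof(struct flow_ext_sketch)));
	/* vlan_etype stays at overall offset 60: 52 + 8 == 60 + 0 */
	assert(52 + offsetof(struct flow_ext_sketch, vlan_etype) == 60);
}
```

Spelling out the flag next to each field in the comment is the point here; the asserts only confirm that the arithmetic behind "transparent to older userspace" holds for this mirror.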
Ben.
> /* L3-L4 network traffic flow hash options */
> #define RXH_L2DA (1 << 1)
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* Re: [PATCH iproute2 3/3] ip: add support of 'ip link type [ipip|sit]'
From: Nicolas Dichtel @ 2012-12-12 17:20 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20121212091118.390b8a2b@nehalam.linuxnetplumber.net>
On 12/12/2012 18:11, Stephen Hemminger wrote:
> On Wed, 12 Dec 2012 10:51:47 +0100
> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>
>> This patch allows managing IP tunnels via the 'ip link' interface.
>> The parameter syntax is the same as for 'ip tunnel'.
>>
>> It also allows displaying tunnel parameters with 'ip -details link' or
>> 'ip -details monitor link'.
>>
>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> ---
>> ip/Makefile | 3 +-
>> ip/iplink.c | 2 +-
>> ip/link_iptnl.c | 340 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 343 insertions(+), 2 deletions(-)
>> create mode 100644 ip/link_iptnl.c
>>
>> diff --git a/ip/Makefile b/ip/Makefile
>> index abf54bf..2b606d4 100644
>> --- a/ip/Makefile
>> +++ b/ip/Makefile
>> @@ -4,7 +4,8 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
>> ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
>> iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
>> iplink_macvlan.o iplink_macvtap.o ipl2tp.o link_vti.o \
>> - iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o
>> + iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
>> + link_iptnl.o
>>
>> RTMONOBJ=rtmon.o
>>
>> diff --git a/ip/iplink.c b/ip/iplink.c
>> index 8aac9fc..d73c705 100644
>> --- a/ip/iplink.c
>> +++ b/ip/iplink.c
>> @@ -84,7 +84,7 @@ void iplink_usage(void)
>> if (iplink_have_newlink()) {
>> fprintf(stderr, "\n");
>> fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | can |\n");
>> - fprintf(stderr, " bridge | ipoib | ip6tnl }\n");
>> + fprintf(stderr, " bridge | ipoib | ip6tnl | ipip | sit }\n");
>> }
>> exit(-1);
>> }
>> diff --git a/ip/link_iptnl.c b/ip/link_iptnl.c
>> new file mode 100644
>> index 0000000..238722d
>> --- /dev/null
>> +++ b/ip/link_iptnl.c
>> @@ -0,0 +1,340 @@
>> +/*
>> + * link_iptnl.c ipip and sit driver module
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * as published by the Free Software Foundation; either version
>> + * 2 of the License, or (at your option) any later version.
>> + *
>> + * Authors: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> + *
>> + */
>> +
>> +#include <string.h>
>> +#include <net/if.h>
>> +#include <sys/types.h>
>> +#include <sys/socket.h>
>> +#include <arpa/inet.h>
>> +
>> +#include <linux/ip.h>
>> +#include <linux/if_tunnel.h>
>> +#include "rt_names.h"
>> +#include "utils.h"
>> +#include "ip_common.h"
>> +#include "tunnel.h"
>> +
>> +static void usage(int sit) __attribute__((noreturn));
>> +static void usage(int sit)
>> +{
>> + fprintf(stderr, "Usage: ip link { add | set | change | replace | del } NAME\n");
>> + fprintf(stderr, " type { ipip | sit } [ remote ADDR ] [ local ADDR ]\n");
>> + fprintf(stderr, " [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ]\n");
>> + fprintf(stderr, " [ 6rd-prefix ADDR ] [ 6rd-relay_prefix ADDR ] [ 6rd-reset ]\n");
>> + if (sit)
>> + fprintf(stderr, " [ isatap ]\n");
>> + fprintf(stderr, "\n");
>> + fprintf(stderr, "Where: NAME := STRING\n");
>> + fprintf(stderr, " ADDR := { IP_ADDRESS | any }\n");
>> + fprintf(stderr, " TOS := { NUMBER | inherit }\n");
>> + fprintf(stderr, " TTL := { 1..255 | inherit }\n");
>> + exit(-1);
>> +}
>> +
>> +static int iptunnel_parse_opt(struct link_util *lu, int argc, char **argv,
>> + struct nlmsghdr *n)
>> +{
>> + struct {
>> + struct nlmsghdr n;
>> + struct ifinfomsg i;
>> + char buf[2048];
>> + } req;
>> + struct ifinfomsg *ifi = (struct ifinfomsg *)(n + 1);
>> + struct rtattr *tb[IFLA_MAX + 1];
>> + struct rtattr *linkinfo[IFLA_INFO_MAX+1];
>> + struct rtattr *iptuninfo[IFLA_IPTUN_MAX + 1];
>> + int len;
>> + __u32 link = 0;
>> + __u32 laddr = 0;
>> + __u32 raddr = 0;
>> + __u8 ttl = 0;
>> + __u8 tos = 0;
>> + __u8 pmtudisc = 1;
>> + __u16 iflags = 0;
>> + struct in6_addr ip6rdprefix;
>> + __u16 ip6rdprefixlen = 0;
>> + __u32 ip6rdrelayprefix = 0;
>> + __u16 ip6rdrelayprefixlen = 0;
>> +
>> + memset(&ip6rdprefix, 0, sizeof(ip6rdprefix));
>> +
>> + if (!(n->nlmsg_flags & NLM_F_CREATE)) {
>> + memset(&req, 0, sizeof(req));
>> +
>> + req.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi));
>> + req.n.nlmsg_flags = NLM_F_REQUEST;
>> + req.n.nlmsg_type = RTM_GETLINK;
>> + req.i.ifi_family = preferred_family;
>> + req.i.ifi_index = ifi->ifi_index;
>> +
>> + if (rtnl_talk(&rth, &req.n, 0, 0, &req.n) < 0) {
>> +get_failed:
>> + fprintf(stderr,
>> + "Failed to get existing tunnel info.\n");
>> + return -1;
>> + }
>> +
>> + len = req.n.nlmsg_len;
>> + len -= NLMSG_LENGTH(sizeof(*ifi));
>> + if (len < 0)
>> + goto get_failed;
>> +
>> + parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
>> +
>> + if (!tb[IFLA_LINKINFO])
>> + goto get_failed;
>> +
>> + parse_rtattr_nested(linkinfo, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
>> +
>> + if (!linkinfo[IFLA_INFO_DATA])
>> + goto get_failed;
>> +
>> + parse_rtattr_nested(iptuninfo, IFLA_IPTUN_MAX,
>> + linkinfo[IFLA_INFO_DATA]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_LOCAL])
>> + laddr = rta_getattr_u32(iptuninfo[IFLA_IPTUN_LOCAL]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_REMOTE])
>> + raddr = rta_getattr_u32(iptuninfo[IFLA_IPTUN_REMOTE]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_TTL])
>> + ttl = rta_getattr_u8(iptuninfo[IFLA_IPTUN_TTL]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_TOS])
>> + tos = rta_getattr_u8(iptuninfo[IFLA_IPTUN_TOS]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_PMTUDISC])
>> + pmtudisc =
>> + rta_getattr_u8(iptuninfo[IFLA_IPTUN_PMTUDISC]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_FLAGS])
>> + iflags = rta_getattr_u16(iptuninfo[IFLA_IPTUN_FLAGS]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_LINK])
>> + link = rta_getattr_u32(iptuninfo[IFLA_IPTUN_LINK]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_6RD_PREFIX])
>> + memcpy(&ip6rdprefix,
>> + RTA_DATA(iptuninfo[IFLA_IPTUN_6RD_PREFIX]),
>> + sizeof(ip6rdprefix));
>> +
>> + if (iptuninfo[IFLA_IPTUN_6RD_PREFIXLEN])
>> + ip6rdprefixlen =
>> + rta_getattr_u16(iptuninfo[IFLA_IPTUN_6RD_PREFIXLEN]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIX])
>> + ip6rdrelayprefix =
>> + rta_getattr_u32(iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIX]);
>> +
>> + if (iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIXLEN])
>> + ip6rdrelayprefixlen =
>> + rta_getattr_u16(iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
>> + }
>> +
>> + while (argc > 0) {
>> + if (strcmp(*argv, "remote") == 0) {
>> + NEXT_ARG();
>> + if (strcmp(*argv, "any"))
>> + raddr = get_addr32(*argv);
>> + else
>> + raddr = 0;
>> + } else if (strcmp(*argv, "local") == 0) {
>> + NEXT_ARG();
>> + if (strcmp(*argv, "any"))
>> + laddr = get_addr32(*argv);
>> + else
>> + laddr = 0;
>> + } else if (matches(*argv, "dev") == 0) {
>> + NEXT_ARG();
>> + link = if_nametoindex(*argv);
>> + if (link == 0)
>> + invarg("\"dev\" is invalid", *argv);
>> + } else if (strcmp(*argv, "ttl") == 0 ||
>> + strcmp(*argv, "hoplimit") == 0) {
>> + NEXT_ARG();
>> + if (strcmp(*argv, "inherit") != 0) {
>> + if (get_u8(&ttl, *argv, 0))
>> + invarg("invalid TTL\n", *argv);
>> + } else
>> + ttl = 0;
>> + } else if (strcmp(*argv, "tos") == 0 ||
>> + strcmp(*argv, "tclass") == 0 ||
>> + matches(*argv, "dsfield") == 0) {
>> + __u32 uval;
>> + NEXT_ARG();
>> + if (strcmp(*argv, "inherit") != 0) {
>> + if (rtnl_dsfield_a2n(&uval, *argv))
>> + invarg("bad TOS value", *argv);
>> + tos = uval;
>> + } else
>> + tos = 1;
>> + } else if (strcmp(*argv, "nopmtudisc") == 0) {
>> + pmtudisc = 0;
>> + } else if (strcmp(*argv, "pmtudisc") == 0) {
>> + pmtudisc = 1;
>> + } else if (strcmp(lu->id, "sit") == 0 &&
>> + strcmp(*argv, "isatap") == 0) {
>> + iflags |= SIT_ISATAP;
>> + } else if (strcmp(*argv, "6rd-prefix") == 0) {
>> + inet_prefix prefix;
>> + NEXT_ARG();
>> + if (get_prefix(&prefix, *argv, AF_INET6))
>> + invarg("invalid 6rd_prefix\n", *argv);
>> + memcpy(&ip6rdprefix, prefix.data, 16);
>> + ip6rdprefixlen = prefix.bitlen;
>> + } else if (strcmp(*argv, "6rd-relay_prefix") == 0) {
>> + inet_prefix prefix;
>> + NEXT_ARG();
>> + if (get_prefix(&prefix, *argv, AF_INET))
>> + invarg("invalid 6rd-relay_prefix\n", *argv);
>> + memcpy(&ip6rdrelayprefix, prefix.data, 4);
>> + ip6rdrelayprefixlen = prefix.bitlen;
>> + } else if (strcmp(*argv, "6rd-reset") == 0) {
>> + inet_prefix prefix;
>> + get_prefix(&prefix, "2002::", AF_INET6);
>> + memcpy(&ip6rdprefix, prefix.data, 16);
>> + ip6rdprefixlen = 16;
>> + ip6rdrelayprefix = 0;
>> + ip6rdrelayprefixlen = 0;
>> + } else
>> + usage(strcmp(lu->id, "sit") == 0);
>> + argc--, argv++;
>> + }
>> +
>> + if (ttl && pmtudisc == 0) {
>> + fprintf(stderr, "ttl != 0 and nopmtudisc are incompatible\n");
>> + exit(-1);
>> + }
>> +
>> + addattr32(n, 1024, IFLA_IPTUN_LINK, link);
>> + addattr32(n, 1024, IFLA_IPTUN_LOCAL, laddr);
>> + addattr32(n, 1024, IFLA_IPTUN_REMOTE, raddr);
>> + addattr8(n, 1024, IFLA_IPTUN_TTL, ttl);
>> + addattr8(n, 1024, IFLA_IPTUN_TOS, tos);
>> + addattr8(n, 1024, IFLA_IPTUN_PMTUDISC, pmtudisc);
>> + if (strcmp(lu->id, "sit") == 0) {
>> + addattr16(n, 1024, IFLA_IPTUN_FLAGS, iflags);
>> + if (ip6rdprefixlen) {
>> + addattr_l(n, 1024, IFLA_IPTUN_6RD_PREFIX,
>> + &ip6rdprefix, sizeof(ip6rdprefix));
>> + addattr16(n, 1024, IFLA_IPTUN_6RD_PREFIXLEN,
>> + ip6rdprefixlen);
>> + addattr32(n, 1024, IFLA_IPTUN_6RD_RELAY_PREFIX,
>> + ip6rdrelayprefix);
>> + addattr16(n, 1024, IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
>> + ip6rdrelayprefixlen);
>> + }
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static void iptunnel_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
>> +{
>> + char s1[1024];
>> + char s2[64];
>> + const char *local = "any";
>> + const char *remote = "any";
>> +
>> + if (!tb)
>> + return;
>> +
>> + if (tb[IFLA_IPTUN_REMOTE]) {
>> + unsigned addr = rta_getattr_u32(tb[IFLA_IPTUN_REMOTE]);
>> +
>> + if (addr)
>> + remote = format_host(AF_INET, 4, &addr, s1, sizeof(s1));
>> + }
>> +
>> + fprintf(f, "remote %s ", remote);
>> +
>> + if (tb[IFLA_IPTUN_LOCAL]) {
>> + unsigned addr = rta_getattr_u32(tb[IFLA_IPTUN_LOCAL]);
>> +
>> + if (addr)
>> + local = format_host(AF_INET, 4, &addr, s1, sizeof(s1));
>> + }
>> +
>> + fprintf(f, "local %s ", local);
>> +
>> + if (tb[IFLA_IPTUN_LINK] && rta_getattr_u32(tb[IFLA_IPTUN_LINK])) {
>> + unsigned link = rta_getattr_u32(tb[IFLA_IPTUN_LINK]);
>> + const char *n = if_indextoname(link, s2);
>> +
>> + if (n)
>> + fprintf(f, "dev %s ", n);
>> + else
>> + fprintf(f, "dev %u ", link);
>> + }
>> +
>> + if (tb[IFLA_IPTUN_TTL] && rta_getattr_u8(tb[IFLA_IPTUN_TTL]))
>> + fprintf(f, "ttl %d ", rta_getattr_u8(tb[IFLA_IPTUN_TTL]));
>> + else
>> + fprintf(f, "ttl inherit ");
>> +
>> + if (tb[IFLA_IPTUN_TOS] && rta_getattr_u8(tb[IFLA_IPTUN_TOS])) {
>> + int tos = rta_getattr_u8(tb[IFLA_IPTUN_TOS]);
>> +
>> + fputs("tos ", f);
>> + if (tos == 1)
>> + fputs("inherit ", f);
>> + else
>> + fprintf(f, "0x%x ", tos);
>> + }
>> +
>> + if (tb[IFLA_IPTUN_PMTUDISC] && rta_getattr_u8(tb[IFLA_IPTUN_PMTUDISC]))
>> + fprintf(f, "pmtudisc ");
>> + else
>> + fprintf(f, "nopmtudisc ");
>> +
>> + if (tb[IFLA_IPTUN_FLAGS]) {
>> + __u16 iflags = rta_getattr_u16(tb[IFLA_IPTUN_FLAGS]);
>> +
>> + if (iflags & SIT_ISATAP)
>> + fprintf(f, "isatap ");
>> + }
>> +
>> + if (tb[IFLA_IPTUN_6RD_PREFIXLEN] &&
>> + *(__u16 *)RTA_DATA(tb[IFLA_IPTUN_6RD_PREFIXLEN])) {
>> + __u16 prefixlen = rta_getattr_u16(tb[IFLA_IPTUN_6RD_PREFIXLEN]);
>> + __u16 relayprefixlen =
>> + rta_getattr_u16(tb[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
>> + __u32 relayprefix =
>> + rta_getattr_u32(tb[IFLA_IPTUN_6RD_RELAY_PREFIX]);
>> +
>> + printf("6rd-prefix %s/%u ",
>> + inet_ntop(AF_INET6, RTA_DATA(tb[IFLA_IPTUN_6RD_PREFIX]),
>> + s1, sizeof(s1)),
>> + prefixlen);
>> + if (relayprefix) {
>> + printf("6rd-relay_prefix %s/%u ",
>> + format_host(AF_INET, 4, &relayprefix, s1,
>> + sizeof(s1)),
>> + relayprefixlen);
>> + }
>> + }
>> +}
>> +
>> +struct link_util ipip_link_util = {
>> + .id = "ipip",
>> + .maxattr = IFLA_IPTUN_MAX,
>> + .parse_opt = iptunnel_parse_opt,
>> + .print_opt = iptunnel_print_opt,
>> +};
>> +
>> +struct link_util sit_link_util = {
>> + .id = "sit",
>> + .maxattr = IFLA_IPTUN_MAX,
>> + .parse_opt = iptunnel_parse_opt,
>> + .print_opt = iptunnel_print_opt,
>> +};
>
> All applied with minor corrections to header files.
>
> Could you please add man pages for this new functionality?
>
Ok.
* [RFC PATCH net-next 1/5] netns: allocate a unique id to identify a netns
From: Nicolas Dichtel @ 2012-12-12 17:24 UTC (permalink / raw)
To: netdev; +Cc: davem, ebiederm, aatteka, Nicolas Dichtel
In-Reply-To: <1355333081-4018-1-git-send-email-nicolas.dichtel@6wind.com>
This patch simply adds a field nsindex, which will contain a unique index.
The goal is to prepare for monitoring netns activity with rtnetlink and to
ease netns management by userland apps.
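The allocation pattern can be sketched in user space as follows. This is only a toy stand-in for the kernel IDA used in the patch below: toy_ida, toy_ida_get_above and toy_ida_remove are invented names, the pool is limited to 64 IDs, and there is no locking or memory preallocation. It shows the two properties the index relies on: the smallest free ID at or above a start value is handed out, and an ID becomes reusable once it is removed.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_IDS 64

/* Bit i set means ID i is in use. */
struct toy_ida {
	uint64_t used;
};

/* Return the smallest free ID >= start, or -1 when the pool is exhausted. */
static int toy_ida_get_above(struct toy_ida *ida, int start)
{
	if (start < 0)
		start = 0;

	for (int id = start; id < TOY_MAX_IDS; id++) {
		if (!(ida->used & (UINT64_C(1) << id))) {
			ida->used |= UINT64_C(1) << id;
			return id;
		}
	}
	return -1;
}

/* Release an ID so that a later allocation can hand it out again. */
static void toy_ida_remove(struct toy_ida *ida, int id)
{
	if (id >= 0 && id < TOY_MAX_IDS)
		ida->used &= ~(UINT64_C(1) << id);
}
```

Starting at 1 keeps 0 free to mean "no index", which is presumably why the patch passes 1 as the starting ID.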
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
include/net/net_namespace.h | 1 +
net/core/net_namespace.c | 16 ++++++++++++++++
2 files changed, 17 insertions(+)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index c5a43f5..5db7a1b 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -55,6 +55,7 @@ struct net {
struct list_head exit_list; /* Use only net_mutex */
struct user_namespace *user_ns; /* Owning user namespace */
+ int nsindex; /* index to identify this ns */
struct proc_dir_entry *proc_net;
struct proc_dir_entry *proc_net_stat;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6456439..f5267e4 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -27,6 +27,7 @@ static DEFINE_MUTEX(net_mutex);
LIST_HEAD(net_namespace_list);
EXPORT_SYMBOL_GPL(net_namespace_list);
+static DEFINE_IDA(net_namespace_ids);
struct net init_net = {
.dev_base_head = LIST_HEAD_INIT(init_net.dev_base_head),
@@ -157,6 +158,15 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
atomic_set(&net->passive, 1);
net->dev_base_seq = 1;
net->user_ns = user_ns;
+again:
+ error = ida_get_new_above(&net_namespace_ids, 1, &net->nsindex);
+ if (error < 0) {
+ if (error == -EAGAIN) {
+ ida_pre_get(&net_namespace_ids, GFP_KERNEL);
+ goto again;
+ }
+ return error;
+ }
#ifdef NETNS_REFCNT_DEBUG
atomic_set(&net->use_count, 0);
@@ -171,6 +181,7 @@ out:
return error;
out_undo:
+ ida_remove(&net_namespace_ids, net->nsindex);
/* Walk through the list backwards calling the exit functions
* for the pernet modules whose init functions did not fail.
*/
@@ -297,6 +308,11 @@ static void cleanup_net(struct work_struct *work)
*/
synchronize_rcu();
+ list_for_each_entry(net, &net_exit_list, exit_list) {
+ /* Free the index */
+ ida_remove(&net_namespace_ids, net->nsindex);
+ }
+
/* Run all of the network namespace exit methods */
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_exit_list(ops, &net_exit_list);
--
1.8.0.1
* [RFC PATCH net-next 3/5] dev/netns: allow to get netns from nsindex in rtnl msg
From: Nicolas Dichtel @ 2012-12-12 17:24 UTC (permalink / raw)
To: netdev; +Cc: davem, ebiederm, aatteka, Nicolas Dichtel
In-Reply-To: <1355333081-4018-1-git-send-email-nicolas.dichtel@6wind.com>
This patch allows moving a netdevice to another netns by specifying the target's nsindex.
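A lookup by index can be sketched in user space like this. The types and names (toy_ns, toy_ns_get_by_index) are invented for the illustration and use a plain singly linked list rather than the kernel's net_namespace_list. One detail worth keeping in mind: with the kernel's list iterators the cursor is not NULL after a full walk, so a lookup of this kind needs an explicit not-found result that the caller can test.

```c
#include <assert.h>
#include <stddef.h>

/* Toy namespace record: an index, a reference count, and a list link. */
struct toy_ns {
	int nsindex;
	int refcount;
	struct toy_ns *next;
};

/* Walk the list; take a reference and return the match, or NULL if none. */
static struct toy_ns *toy_ns_get_by_index(struct toy_ns *head, int nsindex)
{
	for (struct toy_ns *ns = head; ns != NULL; ns = ns->next) {
		if (ns->nsindex == nsindex) {
			ns->refcount++;
			return ns;
		}
	}
	return NULL;
}
```

The reference is taken only on a match, so a failed lookup leaves every count untouched.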
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
include/net/net_namespace.h | 1 +
include/uapi/linux/if_link.h | 1 +
net/core/net_namespace.c | 14 ++++++++++++++
net/core/rtnetlink.c | 7 ++++++-
4 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index c373f2e..68e7a36 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -151,6 +151,7 @@ extern struct list_head net_namespace_list;
extern struct net *get_net_ns_by_pid(pid_t pid);
extern struct net *get_net_ns_by_fd(int pid);
+extern struct net *get_net_ns_by_nsindex(int nsindex);
#ifdef CONFIG_NET_NS
extern void __put_net(struct net *net);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 60f3b6b..6720a47 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -142,6 +142,7 @@ enum {
#define IFLA_PROMISCUITY IFLA_PROMISCUITY
IFLA_NUM_TX_QUEUES,
IFLA_NUM_RX_QUEUES,
+ IFLA_NET_NS_INDEX,
__IFLA_MAX
};
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2ae22b0..18fc62f 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -399,6 +399,20 @@ struct net *get_net_ns_by_pid(pid_t pid)
}
EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
+struct net *get_net_ns_by_nsindex(int nsindex)
+{
+ struct net *net;
+
+ ASSERT_RTNL();
+ for_each_net(net)
+ if (net->nsindex == nsindex) {
+ get_net(net);
+ break;
+ }
+ return net;
+}
+EXPORT_SYMBOL_GPL(get_net_ns_by_nsindex);
+
static struct genl_family netns_nl_family = {
.id = GENL_ID_GENERATE,
.name = NETNS_GENL_NAME,
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1868625..e22954a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1115,6 +1115,7 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_LINKINFO] = { .type = NLA_NESTED },
[IFLA_NET_NS_PID] = { .type = NLA_U32 },
[IFLA_NET_NS_FD] = { .type = NLA_U32 },
+ [IFLA_NET_NS_INDEX] = { .type = NLA_U32 },
[IFLA_IFALIAS] = { .type = NLA_STRING, .len = IFALIASZ-1 },
[IFLA_VFINFO_LIST] = {. type = NLA_NESTED },
[IFLA_VF_PORTS] = { .type = NLA_NESTED },
@@ -1171,6 +1172,8 @@ struct net *rtnl_link_get_net(struct net *src_net, struct nlattr *tb[])
net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
else if (tb[IFLA_NET_NS_FD])
net = get_net_ns_by_fd(nla_get_u32(tb[IFLA_NET_NS_FD]));
+ else if (tb[IFLA_NET_NS_INDEX])
+ net = get_net_ns_by_nsindex(nla_get_u32(tb[IFLA_NET_NS_INDEX]));
else
net = get_net(src_net);
return net;
@@ -1310,7 +1313,9 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
int send_addr_notify = 0;
int err;
- if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD]) {
+ if (tb[IFLA_NET_NS_PID] ||
+ tb[IFLA_NET_NS_FD] ||
+ tb[IFLA_NET_NS_INDEX]) {
struct net *net = rtnl_link_get_net(dev_net(dev), tb);
if (IS_ERR(net)) {
err = PTR_ERR(net);
--
1.8.0.1
* [RFC PATCH net-next 2/5] netns: allow to dump netns with netlink
From: Nicolas Dichtel @ 2012-12-12 17:24 UTC (permalink / raw)
To: netdev; +Cc: davem, ebiederm, aatteka, Nicolas Dichtel
In-Reply-To: <1355333081-4018-1-git-send-email-nicolas.dichtel@6wind.com>
This patch adds basic netlink support for netns. The user can dump all
existing netns and get the nsindex associated with each one.
The user can also get the nsindex associated with a pid or fd.
Initializing the genetlink family for netns is a chicken-and-egg problem:
genetlink init is done after init_net is created, hence when init_net is
created, we cannot call genl_register_family_with_ops(). That is why I put the
init part in the genetlink module.
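The dump side follows the usual resumable pattern: each call skips what was already sent, fills the reply until it is full, and records where to continue. A minimal user-space sketch of that idea, with invented names (toy_dump, resume) and a plain int array standing in for the list of namespaces and the reply buffer:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy entries into out[] starting at position *resume, stopping when out[]
 * is full. *resume is updated so the next call continues where this one
 * stopped. Returns the number of entries written; 0 means the dump is done.
 */
static size_t toy_dump(const int *ids, size_t n_ids, size_t *resume,
		       int *out, size_t out_cap)
{
	size_t written = 0;
	size_t i = *resume;

	while (i < n_ids && written < out_cap)
		out[written++] = ids[i++];

	*resume = i;
	return written;
}
```

In the patch below, cb->args[0] plays the role of the resume position and the skb plays the role of the output buffer.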
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
include/net/net_namespace.h | 1 +
include/uapi/linux/netns.h | 27 ++++++++
net/core/net_namespace.c | 157 ++++++++++++++++++++++++++++++++++++++++++++
net/netlink/genetlink.c | 4 ++
4 files changed, 189 insertions(+)
create mode 100644 include/uapi/linux/netns.h
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 5db7a1b..c373f2e 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -306,6 +306,7 @@ extern int register_pernet_subsys(struct pernet_operations *);
extern void unregister_pernet_subsys(struct pernet_operations *);
extern int register_pernet_device(struct pernet_operations *);
extern void unregister_pernet_device(struct pernet_operations *);
+extern int netns_genl_register(void);
struct ctl_table;
struct ctl_table_header;
diff --git a/include/uapi/linux/netns.h b/include/uapi/linux/netns.h
new file mode 100644
index 0000000..e1c1da3
--- /dev/null
+++ b/include/uapi/linux/netns.h
@@ -0,0 +1,27 @@
+#ifndef _UAPI_LINUX_NETNS_H_
+#define _UAPI_LINUX_NETNS_H_
+
+/* Generic netlink messages */
+
+#define NETNS_GENL_NAME "netns"
+#define NETNS_GENL_VERSION 0x1
+
+/* Commands */
+enum {
+ NETNS_CMD_NOOP,
+ NETNS_CMD_GET,
+ __NETNS_CMD_MAX,
+};
+#define NETNS_CMD_MAX (__NETNS_CMD_MAX - 1)
+
+/* Attributes */
+enum {
+ NETNSA_NONE,
+ NETNSA_NSINDEX,
+ NETNSA_PID,
+ NETNSA_FD,
+ __NETNSA_MAX,
+};
+#define NETNSA_MAX (__NETNSA_MAX - 1)
+
+#endif /* _UAPI_LINUX_NETNS_H_ */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index f5267e4..2ae22b0 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -14,6 +14,8 @@
#include <linux/file.h>
#include <linux/export.h>
#include <linux/user_namespace.h>
+#include <linux/netns.h>
+#include <net/genetlink.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>
@@ -397,6 +399,161 @@ struct net *get_net_ns_by_pid(pid_t pid)
}
EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
+static struct genl_family netns_nl_family = {
+ .id = GENL_ID_GENERATE,
+ .name = NETNS_GENL_NAME,
+ .version = NETNS_GENL_VERSION,
+ .hdrsize = 0,
+ .maxattr = NETNSA_MAX,
+ .netnsok = true,
+};
+
+static struct nla_policy netns_nl_policy[NETNSA_MAX + 1] = {
+ [NETNSA_NONE] = { .type = NLA_UNSPEC, },
+ [NETNSA_NSINDEX] = { .type = NLA_U32, },
+ [NETNSA_PID] = { .type = NLA_U32 },
+ [NETNSA_FD] = { .type = NLA_U32 },
+};
+
+static int netns_nl_get_size(void)
+{
+ return nla_total_size(sizeof(u32)) /* NETNSA_NSINDEX */
+ ;
+}
+
+static int netns_nl_cmd_noop(struct sk_buff *skb, struct genl_info *info)
+{
+ struct sk_buff *msg;
+ void *hdr;
+ int ret = -ENOBUFS;
+
+ msg = genlmsg_new(netns_nl_get_size(), GFP_KERNEL);
+ if (!msg) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ hdr = genlmsg_put(msg, info->snd_portid, info->snd_seq,
+ &netns_nl_family, 0, NETNS_CMD_NOOP);
+ if (!hdr) {
+ ret = -EMSGSIZE;
+ goto err_out;
+ }
+
+ genlmsg_end(msg, hdr);
+
+ return genlmsg_unicast(genl_info_net(info), msg, info->snd_portid);
+
+err_out:
+ nlmsg_free(msg);
+
+out:
+ return ret;
+}
+
+static int netns_nl_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
+ int cmd, struct net *net)
+{
+ void *hdr;
+
+ hdr = genlmsg_put(skb, portid, seq, &netns_nl_family, flags, cmd);
+ if (!hdr)
+ return -EMSGSIZE;
+
+ if (nla_put_u32(skb, NETNSA_NSINDEX, net->nsindex))
+ goto nla_put_failure;
+
+ return genlmsg_end(skb, hdr);
+
+nla_put_failure:
+ genlmsg_cancel(skb, hdr);
+ return -EMSGSIZE;
+}
+
+static int netns_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
+{
+ struct net *net = genl_info_net(info);
+ struct sk_buff *msg;
+ int err = -ENOBUFS;
+
+ if (info->attrs[NETNSA_PID])
+ net = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+ else if (info->attrs[NETNSA_FD])
+ net = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+ else
+ get_net(net);
+
+ msg = genlmsg_new(netns_nl_get_size(), GFP_KERNEL);
+ if (!msg) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ err = netns_nl_fill(msg, info->snd_portid, info->snd_seq,
+ NLM_F_ACK, NETNS_CMD_GET, net);
+ if (err < 0)
+ goto err_out;
+
+ err = genlmsg_unicast(genl_info_net(info), msg, info->snd_portid);
+ goto out;
+
+err_out:
+ nlmsg_free(msg);
+
+out:
+ put_net(net);
+ return err;
+}
+
+static int netns_nl_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+ int i = 0, s_i = cb->args[0];
+ struct net *net;
+
+ rtnl_lock();
+ for_each_net(net) {
+ if (i < s_i) {
+ i++;
+ continue;
+ }
+
+ if (netns_nl_fill(skb, NETLINK_CB(cb->skb).portid,
+ cb->nlh->nlmsg_seq, NLM_F_MULTI,
+ NETNS_CMD_GET, net) <= 0)
+ goto out;
+
+ i++;
+ }
+
+out:
+ cb->args[0] = i;
+ rtnl_unlock();
+
+ return skb->len;
+}
+
+static struct genl_ops netns_nl_ops[] = {
+ {
+ .cmd = NETNS_CMD_NOOP,
+ .policy = netns_nl_policy,
+ .doit = netns_nl_cmd_noop,
+ .flags = GENL_ADMIN_PERM,
+ },
+ {
+ .cmd = NETNS_CMD_GET,
+ .policy = netns_nl_policy,
+ .doit = netns_nl_cmd_get,
+ .dumpit = netns_nl_cmd_dump,
+ .flags = GENL_ADMIN_PERM,
+ },
+};
+
+int netns_genl_register(void)
+{
+ return genl_register_family_with_ops(&netns_nl_family, netns_nl_ops,
+ ARRAY_SIZE(netns_nl_ops));
+}
+
static int __init net_ns_init(void)
{
struct net_generic *ng;
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index f2aabb6..6d25ddb 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -963,6 +963,10 @@ static int __init genl_init(void)
if (err < 0)
goto problem;
+ err = netns_genl_register();
+ if (err < 0)
+ goto problem;
+
return 0;
problem:
--
1.8.0.1
* [RFC PATCH net-next 4/5] netns: advertise netns activity with netlink
From: Nicolas Dichtel @ 2012-12-12 17:24 UTC (permalink / raw)
To: netdev; +Cc: davem, ebiederm, aatteka, Nicolas Dichtel
In-Reply-To: <1355333081-4018-1-git-send-email-nicolas.dichtel@6wind.com>
The goal of this patch is to send netlink messages when netns are created/deleted.
This is useful for a daemon that wants to manage all netns with only one running
instance.
Note that no event can be sent until the netns_nl_event_mcgrp group is
registered.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
include/uapi/linux/netns.h | 4 ++++
net/core/net_namespace.c | 38 +++++++++++++++++++++++++++++++++++++-
2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/netns.h b/include/uapi/linux/netns.h
index e1c1da3..e14d90b 100644
--- a/include/uapi/linux/netns.h
+++ b/include/uapi/linux/netns.h
@@ -6,10 +6,14 @@
#define NETNS_GENL_NAME "netns"
#define NETNS_GENL_VERSION 0x1
+#define NETNS_GENL_MCAST_EVENT_NAME "events"
+
/* Commands */
enum {
NETNS_CMD_NOOP,
NETNS_CMD_GET,
+ NETNS_CMD_NEW,
+ NETNS_CMD_DEL,
__NETNS_CMD_MAX,
};
#define NETNS_CMD_MAX (__NETNS_CMD_MAX - 1)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 18fc62f..da92ecb 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -40,6 +40,8 @@ EXPORT_SYMBOL(init_net);
static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
+static int netns_nl_event(struct net *net, int cmd);
+
static struct net_generic *net_alloc_generic(void)
{
struct net_generic *ng;
@@ -179,6 +181,7 @@ again:
if (error < 0)
goto out_undo;
}
+ netns_nl_event(net, NETNS_CMD_NEW);
out:
return error;
@@ -311,6 +314,7 @@ static void cleanup_net(struct work_struct *work)
synchronize_rcu();
list_for_each_entry(net, &net_exit_list, exit_list) {
+ netns_nl_event(net, NETNS_CMD_DEL);
/* Free the index */
ida_remove(&net_namespace_ids, net->nsindex);
}
@@ -413,6 +417,10 @@ struct net *get_net_ns_by_nsindex(int nsindex)
}
EXPORT_SYMBOL_GPL(get_net_ns_by_nsindex);
+static struct genl_multicast_group netns_nl_event_mcgrp = {
+ .name = NETNS_GENL_MCAST_EVENT_NAME,
+};
+
static struct genl_family netns_nl_family = {
.id = GENL_ID_GENERATE,
.name = NETNS_GENL_NAME,
@@ -562,10 +570,38 @@ static struct genl_ops netns_nl_ops[] = {
},
};
+static int netns_nl_event(struct net *net, int cmd)
+{
+ struct sk_buff *msg;
+ int err = -ENOBUFS;
+
+ /* Check that gennl infra is ready */
+ if (!netns_nl_event_mcgrp.id)
+ return -ENOENT;
+
+ msg = genlmsg_new(netns_nl_get_size(), GFP_ATOMIC);
+ if (!msg)
+ return -ENOMEM;
+
+ err = netns_nl_fill(msg, 0, 0, 0, cmd, net);
+ if (err < 0) {
+ nlmsg_free(msg);
+ return err;
+ }
+
+ return genlmsg_multicast(msg, 0, netns_nl_event_mcgrp.id, GFP_ATOMIC);
+}
+
int netns_genl_register(void)
{
- return genl_register_family_with_ops(&netns_nl_family, netns_nl_ops,
+ int err;
+
+ err = genl_register_family_with_ops(&netns_nl_family, netns_nl_ops,
ARRAY_SIZE(netns_nl_ops));
+ if (err < 0)
+ return err;
+
+ return genl_register_mc_group(&netns_nl_family, &netns_nl_event_mcgrp);
}
static int __init net_ns_init(void)
--
1.8.0.1
* [RFC PATCH net-next 5/5] net/sock: add support of SO_NETNS
From: Nicolas Dichtel @ 2012-12-12 17:24 UTC (permalink / raw)
To: netdev; +Cc: davem, ebiederm, aatteka, Nicolas Dichtel
In-Reply-To: <1355333081-4018-1-git-send-email-nicolas.dichtel@6wind.com>
This new setsockopt() option allows the user to change the netns of a socket. It
must be done early enough, before any bind(), etc.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
arch/alpha/include/asm/socket.h | 2 ++
arch/avr32/include/uapi/asm/socket.h | 2 ++
arch/frv/include/uapi/asm/socket.h | 2 ++
arch/h8300/include/asm/socket.h | 2 ++
arch/ia64/include/uapi/asm/socket.h | 2 ++
arch/m32r/include/asm/socket.h | 2 ++
arch/m68k/include/uapi/asm/socket.h | 2 ++
arch/mips/include/uapi/asm/socket.h | 2 ++
arch/mn10300/include/uapi/asm/socket.h | 2 ++
arch/parisc/include/uapi/asm/socket.h | 2 ++
arch/powerpc/include/uapi/asm/socket.h | 2 ++
arch/s390/include/uapi/asm/socket.h | 2 ++
arch/sparc/include/uapi/asm/socket.h | 2 ++
arch/xtensa/include/uapi/asm/socket.h | 2 ++
include/uapi/asm-generic/socket.h | 2 ++
net/core/sock.c | 28 ++++++++++++++++++++++++++++
16 files changed, 58 insertions(+)
diff --git a/arch/alpha/include/asm/socket.h b/arch/alpha/include/asm/socket.h
index 0087d05..13aa509 100644
--- a/arch/alpha/include/asm/socket.h
+++ b/arch/alpha/include/asm/socket.h
@@ -77,6 +77,8 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#ifdef __KERNEL__
/* O_NONBLOCK clashes with the bits used for socket types. Therefore we
* have to define SOCK_NONBLOCK to a different value here.
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 486df68..39cc927 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -70,4 +70,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* __ASM_AVR32_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 871f89b..ac7eef6 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -70,5 +70,7 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/h8300/include/asm/socket.h b/arch/h8300/include/asm/socket.h
index 90a2e57..4d2a4e8 100644
--- a/arch/h8300/include/asm/socket.h
+++ b/arch/h8300/include/asm/socket.h
@@ -70,4 +70,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 23d6759..ed4534b 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -79,4 +79,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/asm/socket.h b/arch/m32r/include/asm/socket.h
index 5e7088a..37d0eb0 100644
--- a/arch/m32r/include/asm/socket.h
+++ b/arch/m32r/include/asm/socket.h
@@ -70,4 +70,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/m68k/include/uapi/asm/socket.h b/arch/m68k/include/uapi/asm/socket.h
index 285da3b..e79aad8 100644
--- a/arch/m68k/include/uapi/asm/socket.h
+++ b/arch/m68k/include/uapi/asm/socket.h
@@ -70,4 +70,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 17307ab..356f943 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -90,5 +90,7 @@ To add: #define SO_REUSEPORT 0x0200 /* Allow local address and port reuse. */
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index af5366b..b899cf8 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -70,4 +70,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index d9ff473..8503329 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -69,6 +69,8 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 0x4024
+#define SO_NETNS 0x4025
+
/* O_NONBLOCK clashes with the bits used for socket types. Therefore we
* have to define SOCK_NONBLOCK to a different value here.
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index eb0b186..1a520ff 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -77,4 +77,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 436d07c..cbdda59 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -76,4 +76,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index c83a937..c1c2853 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -66,6 +66,8 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 0x0027
+#define SO_NETNS 0x0028
+
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 38079be..a8f956d 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -81,4 +81,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* _XTENSA_SOCKET_H */
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 2d32d07..08c108c 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -73,4 +73,6 @@
/* Instruct lower device to use last 4-bytes of skb data as FCS */
#define SO_NOFCS 43
+#define SO_NETNS 44
+
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index a692ef4..7ec288f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -895,6 +895,30 @@ set_rcvbuf:
sock_valbool_flag(sk, SOCK_NOFCS, valbool);
break;
+ case SO_NETNS:
+#ifdef CONFIG_NET_NS
+ if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+ ret = -EPERM;
+ else if (sk->sk_state != TCP_CLOSE)
+ ret = -EBUSY; /* Too late to change netns */
+ else {
+ struct net *net = get_net_ns_by_nsindex(val);
+
+ if (net) {
+ /* We cannot use sk_change_net() because sk
+ * will not be released with
+ * sk_release_kernel(). Let's do it manually.
+ */
+ put_net(sock_net(sk));
+ sock_net_set(sk, net);
+ } else
+ ret = -EINVAL;
+ }
+#else
+ ret = -EOPNOTSUPP;
+#endif
+ break;
+
default:
ret = -ENOPROTOOPT;
break;
@@ -1140,6 +1164,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
goto lenout;
+ case SO_NETNS:
+ v.val = sock_net(sk)->nsindex;
+ break;
+
default:
return -ENOPROTOOPT;
}
--
1.8.0.1
* [RFC PATCH net-next 0/5] Ease netns management by userland
From: Nicolas Dichtel @ 2012-12-12 17:24 UTC (permalink / raw)
To: netdev; +Cc: davem, ebiederm, aatteka
The goal of this series is to ease netns management by daemons. Some systems use
netns only to virtualize the network stack and don't want to multiply userland
daemons. These systems may have a lot of netns, up to 2000. We don't want to
launch an instance of each daemon (quagga, strongswan, conntrackd, ...) for
each netns because that would consume a lot of resources. Having one daemon that
manages all netns is more efficient (mainly if there are few objects to manage:
one or two routes per netns, for example).
Hence, one goal of this series is to allow a daemon to monitor netns
activity, so that it can open or close netlink sockets and allocate the
structures needed to manage these netns when they are created or deleted.
To help identify a netns, an index has been added to each netns.
A new setsockopt() option is also added, to help daemons open a socket in the
right netns. For now, a daemon that wants to open a socket in a specific netns
needs to call setns(fd, CLONE_NEWNET) with a netns fd (not so easy to find),
open the socket, and then call setns() again to go back to the initial netns.
Having this kind of setsockopt() will simplify operations. Obviously, this
setsockopt() should be done early enough (is the test on sk_state enough?). The
first target is netlink sockets, but it can be useful for other kinds of
sockets, which is why I add a generic socket option.
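For illustration only, here is a minimal userland sketch of the two approaches
compared above. It is not part of the series. The function names are
hypothetical, and SO_NETNS is defined locally with the value this RFC uses in
asm-generic/socket.h; it is not in mainline headers, so the second helper is
only meaningful on a kernel carrying this patch (on any other kernel that
option number may mean something else entirely).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: value proposed by this RFC for most architectures.
 * Not a mainline constant; only valid on a kernel with this patch. */
#ifndef SO_NETNS
#define SO_NETNS 44
#endif

/* Current approach: enter the target netns, create the socket, go back.
 * ns_path is a netns file supplied by the caller, e.g. /proc/<pid>/ns/net.
 * Returns the socket fd, or -1 with errno set. */
int socket_in_netns_setns(const char *ns_path, int domain, int type, int proto)
{
    int target, orig, sock = -1, saved_errno;

    target = open(ns_path, O_RDONLY | O_CLOEXEC);
    if (target < 0)
        return -1;

    /* Keep a handle on the initial netns so we can return to it. */
    orig = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
    if (orig < 0) {
        saved_errno = errno;
        close(target);
        errno = saved_errno;
        return -1;
    }

    if (setns(target, CLONE_NEWNET) == 0) {
        sock = socket(domain, type, proto);
        saved_errno = errno;
        /* If we cannot switch back, do not hand out the socket. */
        if (setns(orig, CLONE_NEWNET) != 0 && sock >= 0) {
            saved_errno = errno;
            close(sock);
            sock = -1;
        }
    } else {
        saved_errno = errno;
    }

    close(target);
    close(orig);
    if (sock < 0)
        errno = saved_errno;
    return sock;
}

/* Proposed approach: move an unconnected socket to the netns identified
 * by nsindex, before the socket is used. Returns 0, or -1 with errno set. */
int move_socket_to_netns(int fd, int nsindex)
{
    return setsockopt(fd, SOL_SOCKET, SO_NETNS, &nsindex, sizeof(nsindex));
}
```

With the proposed option, a daemon would call socket() as usual and then
move_socket_to_netns() once, instead of switching its own netns twice for
every socket it opens.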
As usual, the patch against iproute2 will be sent once the patches are included
and net-next merged. I can send it on demand.
arch/alpha/include/asm/socket.h | 2 +
arch/avr32/include/uapi/asm/socket.h | 2 +
arch/frv/include/uapi/asm/socket.h | 2 +
arch/h8300/include/asm/socket.h | 2 +
arch/ia64/include/uapi/asm/socket.h | 2 +
arch/m32r/include/asm/socket.h | 2 +
arch/m68k/include/uapi/asm/socket.h | 2 +
arch/mips/include/uapi/asm/socket.h | 2 +
arch/mn10300/include/uapi/asm/socket.h | 2 +
arch/parisc/include/uapi/asm/socket.h | 2 +
arch/powerpc/include/uapi/asm/socket.h | 2 +
arch/s390/include/uapi/asm/socket.h | 2 +
arch/sparc/include/uapi/asm/socket.h | 2 +
arch/xtensa/include/uapi/asm/socket.h | 2 +
include/net/net_namespace.h | 3 +
include/uapi/asm-generic/socket.h | 2 +
include/uapi/linux/if_link.h | 1 +
include/uapi/linux/netns.h | 31 +++++
net/core/net_namespace.c | 223 +++++++++++++++++++++++++++++++++
net/core/rtnetlink.c | 7 +-
net/core/sock.c | 28 +++++
net/netlink/genetlink.c | 4 +
22 files changed, 326 insertions(+), 1 deletion(-)
I do not pretend to be a netns expert, which is why I added RFC to the title ;-)
Comments are welcome.
Regards,
Nicolas