* [RFC PATCH net-next 4/4 V4] try to fix performance regression
From: Weiping Pan @ 2012-12-12 14:29 UTC
To: davem; +Cc: brutus, netdev, Weiping Pan
In-Reply-To: <117a10f9575d95d6a9ea4602ea7376e2b6d5ccd1.1355320533.git.wpan@redhat.com>
1 do not share tail skb between sender and receiver
2 reduce the use of sock->sk_lock.slock
--------------------------------------------------------------------------
TCP friends performance results start
BASE means normal tcp with friends DISABLED.
AF_UNIX means sockets for local interprocess communication, for reference.
FRIENDS means tcp with friends ENABLED.
I set -s 51882 -m 16384 -M 87380 for all the three kinds of sockets by default.
The first percentage number is FRIENDS/BASE.
The second percentage number is FRIENDS/AF_UNIX.
We set -i 10,2 -I 95,20 to stabilize the statistics.
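A minimal sketch of how the two percentage columns could be derived, assuming the ratios are truncated rather than rounded (the TCP_STREAM row suggests this: 13440.08/7952.97 is about 1.6899 and is shown as 168%). The helper names ratio_percent() and print_row() are hypothetical and are not part of netperf or of this patch.

```c
#include <stdio.h>

/* Hypothetical helper: `value` as a percentage of `reference`,
 * truncated toward zero (assumption: the tables truncate, not round). */
int ratio_percent(double value, double reference)
{
	return (int)(value * 100.0 / reference);
}

/* Print one row in the column order used by the tables:
 * BASE, AF_UNIX, FRIENDS, FRIENDS/BASE, FRIENDS/AF_UNIX. */
void print_row(const char *test, double base, double af_unix, double friends)
{
	printf("%s %.2f %.2f %.2f %d%% %d%%\n", test, base, af_unix, friends,
	       ratio_percent(friends, base), ratio_percent(friends, af_unix));
}
```

Calling print_row("TCP_STREAM", 7952.97, 10864.86, 13440.08) reproduces the two percentages of the first result row under that truncation assumption.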
BASE AF_UNIX FRIENDS TCP_STREAM
7952.97 10864.86 13440.08 168% 123%
BASE AF_UNIX FRIENDS TCP_MAERTS
6743.78 - 13809.97 204% -%
BASE AF_UNIX FRIENDS TCP_SENDFILE
11758 - 18483 157% -%
TCP_SENDFILE does not work with -i 10,2 -I 95,20 (strange), so I use the average.
MS BASE AF_UNIX FRIENDS TCP_STREAM_MS
1 10.70 5.40 4.02 37% 74%
2 28.01 9.67 7.97 28% 82%
4 55.53 19.78 16.48 29% 83%
8 115.40 38.22 33.51 29% 87%
16 227.31 81.06 67.70 29% 83%
32 446.20 166.59 129.31 28% 77%
64 849.04 336.77 259.43 30% 77%
128 1440.50 661.88 530.43 36% 80%
256 2404.70 1279.67 1029.15 42% 80%
512 4331.53 2501.30 1942.21 44% 77%
1024 6819.78 4622.37 4128.10 60% 89%
2048 10544.60 6348.81 6349.59 60% 100%
4096 12830.41 8324.43 7984.43 62% 95%
8192 13462.65 8355.49 11079.37 82% 132%
16384 9960.87 10840.13 13037.81 130% 120%
32768 8749.31 11372.15 15087.08 172% 132%
65536 7580.27 12150.23 14971.42 197% 123%
131072 6727.74 11451.34 13604.78 202% 118%
262144 7673.14 11613.10 11436.97 149% 98%
524288 7366.17 11675.95 11559.43 156% 99%
1048576 6608.57 11883.01 10103.20 152% 85%
MS means Message Size in bytes, that is -m -M for netperf
RR BASE AF_UNIX FRIENDS TCP_RR_RR
1 19716.88 34451.39 34574.12 175% 100%
2 19836.74 34297.00 34671.29 174% 101%
4 19874.71 34456.48 34552.13 173% 100%
8 18882.93 34123.00 34661.48 183% 101%
16 19179.09 34358.47 34599.16 180% 100%
32 20140.08 34326.35 34616.30 171% 100%
64 19473.39 34382.05 34583.10 177% 100%
128 19699.62 34012.03 34566.14 175% 101%
256 19740.44 34529.71 34624.07 175% 100%
512 18929.46 33673.06 33932.83 179% 100%
1024 18738.98 33724.78 33313.44 177% 98%
2048 17315.61 32982.24 32361.39 186% 98%
4096 16585.81 31345.85 31073.32 187% 99%
8192 11933.16 27851.10 27166.94 227% 97%
16384 9717.19 21746.12 22583.40 232% 103%
32768 7044.35 12927.23 16253.26 230% 125%
65536 5038.96 8945.74 7982.61 158% 89%
131072 2860.64 4981.78 4417.16 154% 88%
262144 1633.45 2765.27 2739.36 167% 99%
524288 796.68 1429.79 1445.21 181% 101%
1048576 379.78 - 730.05 192% -%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf
RR BASE AF_UNIX FRIENDS TCP_CRR_RR
1 5531.49 - 5861.86 105% -%
2 5506.13 - 5845.53 106% -%
4 5523.27 - 5853.43 105% -%
8 5503.73 - 5836.44 106% -%
16 5516.23 - 5842.29 105% -%
32 5557.37 - 5858.29 105% -%
64 5517.51 - 5892.64 106% -%
128 5504.18 - 5841.44 106% -%
256 5512.82 - 5842.60 105% -%
512 5496.36 - 5837.72 106% -%
1024 5465.24 - 5827.99 106% -%
2048 5550.15 - 5812.88 104% -%
4096 5292.75 - 5824.45 110% -%
8192 4917.06 - 5705.12 116% -%
16384 4278.63 - 5318.39 124% -%
32768 3611.86 - 4930.30 136% -%
65536 77.35 - 3847.43 4974% -%
131072 47.65 - 2811.58 5900% -%
262144 805.13 - 4.88 0% -%
524288 583.08 - 4.78 0% -%
1048576 369.52 - 5.02 1% -%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf -H 127.0.0.1
TCP friends performance results end
--------------------------------------------------------------------------
Performance analysis:
1 Friends shows better performance than loopback in TCP_RR, TCP_MAERTS and
TCP_SENDFILE, and about the same in TCP_CRR_RR.
2 In TCP_STREAM, Friends shows much worse performance (about 30% of loopback) if
the message size is small, and it shows worse performance (about 80%) than AF_UNIX.
3 Compared with the last performance report, Friends shows worse performance in
TCP_RR.
Friends VS AF_UNIX
I think the lock usage is quite similar this time.
Maybe the locking contention is not the bottleneck?
Friends VS loopback
I have reduced the locking contention as much as possible,
but it still shows bad performance.
Maybe the locking contention is not the bottleneck?
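For reference, the check-then-release pattern that tcp_friend_get_state() follows in the diff below can be sketched in userspace like this. It is only an illustration under stated assumptions: struct peer, peer_lock(), peer_unlock() and peer_get_state() are hypothetical names, and a C11 atomic_flag spinlock stands in for sk_lock.slock.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the peer socket; not the kernel's struct sock. */
struct peer {
	atomic_flag lock;   /* plays the role of sk_lock.slock */
	bool rcv_shutdown;  /* plays the role of RCV_SHUTDOWN in sk_shutdown */
};

void peer_lock(struct peer *p)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
		;
}

void peer_unlock(struct peer *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Take the lock only long enough to sample the state, then drop it.
 * Returns 0, or -ECONNRESET if the peer has stopped receiving. */
int peer_get_state(struct peer *p)
{
	int err = 0;

	peer_lock(p);
	if (p->rcv_shutdown)
		err = -ECONNRESET;
	peer_unlock(p);
	return err;
}
```

The result is only a snapshot: the peer may shut down right after the lock is dropped, which is the trade-off of holding the lock for a shorter time.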
Signed-off-by: Weiping Pan <wpan@redhat.com>
---
include/net/tcp.h | 10 --
net/ipv4/tcp.c | 327 ++++++++++++++++++++++-------------------------------
2 files changed, 136 insertions(+), 201 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5f82770..80a8ec9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -688,15 +688,6 @@ void tcp_send_window_probe(struct sock *sk);
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80
-/* If skb_get_friend() != NULL, TCP friends per packet state.
- */
-struct friend_skb_parm {
- bool tail_inuse; /* In use by skb_get_friend() send while */
- /* on sk_receive_queue for tail put */
-};
-
-#define TCP_FRIEND_CB(tcb) (&(tcb)->header.hf)
-
/* This is what the send packet queuing engine uses to pass
* TCP per-packet control information to the transmission code.
* We also store the host-order sequence numbers in here too.
@@ -709,7 +700,6 @@ struct tcp_skb_cb {
#if IS_ENABLED(CONFIG_IPV6)
struct inet6_skb_parm h6;
#endif
- struct friend_skb_parm hf;
} header; /* For incoming frames */
__u32 seq; /* Starting sequence number */
__u32 end_seq; /* SEQ + FIN + SYN + datalen */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e9d82e0..f008d60 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -336,25 +336,24 @@ static inline int tcp_friend_validate(struct sock *sk, struct sock **friendp,
return 1;
}
-static inline int tcp_friend_send_lock(struct sock *friend)
+static inline int tcp_friend_get_state(struct sock *friend)
{
int err = 0;
spin_lock_bh(&friend->sk_lock.slock);
- if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN)) {
- spin_unlock_bh(&friend->sk_lock.slock);
+ if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN))
err = -ECONNRESET;
- }
+ spin_unlock_bh(&friend->sk_lock.slock);
return err;
}
-static inline void tcp_friend_recv_lock(struct sock *friend)
+static inline void tcp_friend_state_lock(struct sock *friend)
{
spin_lock_bh(&friend->sk_lock.slock);
}
-static void tcp_friend_unlock(struct sock *friend)
+static inline void tcp_friend_state_unlock(struct sock *friend)
{
spin_unlock_bh(&friend->sk_lock.slock);
}
@@ -639,71 +638,32 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
}
EXPORT_SYMBOL(tcp_ioctl);
-/*
- * Friend receive_queue tail skb space? If true, set tail_inuse.
- * Else if RCV_SHUTDOWN, return *copy = -ECONNRESET.
- */
-static inline struct sk_buff *tcp_friend_tail(struct sock *friend, int *copy)
-{
- struct sk_buff *skb = NULL;
- int sz = 0;
-
- if (skb_peek_tail(&friend->sk_receive_queue)) {
- sz = tcp_friend_send_lock(friend);
- if (!sz) {
- skb = skb_peek_tail(&friend->sk_receive_queue);
- if (skb && skb->friend) {
- if (!*copy)
- sz = skb_tailroom(skb);
- else {
- sz = *copy - skb->len;
- if (sz < 0)
- sz = 0;
- }
- if (sz > 0)
- TCP_FRIEND_CB(TCP_SKB_CB(skb))->
- tail_inuse = true;
- }
- tcp_friend_unlock(friend);
- }
- }
-
- *copy = sz;
- return skb;
-}
-
-static inline void tcp_friend_seq(struct sock *sk, int copy, int charge)
-{
- struct sock *friend = sk->sk_friend;
- struct tcp_sock *tp = tcp_sk(friend);
-
- if (charge) {
- sk_mem_charge(friend, charge);
- atomic_add(charge, &friend->sk_rmem_alloc);
- }
- tp->rcv_nxt += copy;
- tp->rcv_wup += copy;
- tcp_friend_unlock(friend);
-
- tp = tcp_sk(sk);
- tp->snd_nxt += copy;
- tp->pushed_seq += copy;
- tp->snd_una += copy;
- tp->snd_up += copy;
-}
-
static inline bool tcp_friend_push(struct sock *sk, struct sk_buff *skb)
{
- struct sock *friend = sk->sk_friend;
- int wait = false;
+ struct sock *friend = sk->sk_friend;
+ struct tcp_sock *tp = NULL;
+ int wait = false;
+
+ tcp_friend_state_lock(friend);
skb_set_owner_r(skb, friend);
- __skb_queue_tail(&friend->sk_receive_queue, skb);
if (!sk_rmem_schedule(friend, skb, skb->truesize))
wait = true;
+ __skb_queue_tail(&friend->sk_receive_queue, skb);
+
+ tcp_friend_state_unlock(friend);
- tcp_friend_seq(sk, skb->len, 0);
- if (skb == skb_peek(&friend->sk_receive_queue))
+ tp = tcp_sk(friend);
+ tp->rcv_nxt += skb->len;
+ tp->rcv_wup += skb->len;
+
+ tp = tcp_sk(sk);
+ tp->snd_nxt += skb->len;
+ tp->pushed_seq += skb->len;
+ tp->snd_una += skb->len;
+ tp->snd_up += skb->len;
+
+ if (skb_queue_len(&friend->sk_receive_queue) == 1)
friend->sk_data_ready(friend, 0);
return wait;
@@ -728,7 +688,6 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
tcb->seq = tcb->end_seq = tp->write_seq;
if (sk->sk_friend) {
skb->friend = sk;
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
return;
}
skb->csum = 0;
@@ -1048,8 +1007,17 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;
+ if (friend) {
+ err = tcp_friend_get_state(friend);
+ if (err) {
+ sk->sk_err = -err;
+ err = -EPIPE;
+ goto out_err;
+ }
+ }
+
while (psize > 0) {
- struct sk_buff *skb;
+ struct sk_buff *skb = NULL;
struct tcp_skb_cb *tcb;
struct page *page = pages[poffset / PAGE_SIZE];
int copy, i;
@@ -1059,12 +1027,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
if (friend) {
copy = size_goal;
- skb = tcp_friend_tail(friend, &copy);
- if (copy < 0) {
- sk->sk_err = -copy;
- err = -EPIPE;
- goto out_err;
- }
+ if (skb)
+ copy = copy - skb->len;
+ else
+ copy = 0;
} else if (!tcp_send_head(sk)) {
skb = NULL;
copy = 0;
@@ -1078,9 +1044,17 @@ new_segment:
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
- if (friend)
+ if (friend) {
+ if (skb) {
+ if (tcp_friend_push(sk, skb))
+ goto wait_for_sndbuf;
+ }
+
+ /*
+ * new skb
+ */
skb = tcp_friend_alloc_skb(sk, 0);
- else
+ } else
skb = sk_stream_alloc_skb(sk, 0,
sk->sk_allocation);
if (!skb)
@@ -1097,10 +1071,7 @@ new_segment:
i = skb_shinfo(skb)->nr_frags;
can_coalesce = skb_can_coalesce(skb, i, page, offset);
if (!can_coalesce && i >= MAX_SKB_FRAGS) {
- if (friend) {
- if (TCP_FRIEND_CB(tcb)->tail_inuse)
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
- } else
+ if (!friend)
tcp_mark_push(tp, skb);
goto new_segment;
}
@@ -1124,20 +1095,9 @@ new_segment:
psize -= copy;
if (friend) {
- err = tcp_friend_send_lock(friend);
- if (err) {
- sk->sk_err = -err;
- err = -EPIPE;
- goto out_err;
- }
tcb->end_seq += copy;
- if (TCP_FRIEND_CB(tcb)->tail_inuse) {
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
- tcp_friend_seq(sk, copy, copy);
- } else {
- if (tcp_friend_push(sk, skb))
- goto wait_for_sndbuf;
- }
+ if (tcp_friend_push(sk, skb))
+ goto wait_for_sndbuf;
if (!psize)
goto out;
continue;
@@ -1172,6 +1132,18 @@ wait_for_memory:
if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
goto do_error;
+ if (friend) {
+ if (skb) {
+ tcp_friend_state_lock(friend);
+ if (!sk_rmem_schedule(friend, skb, skb->truesize)) {
+ tcp_friend_state_unlock(friend);
+ goto wait_for_sndbuf;
+ }
+ tcp_friend_state_unlock(friend);
+ skb = NULL;
+ }
+ }
+
if (!friend)
mss_now = tcp_send_mss(sk, &size_goal, flags);
}
@@ -1266,7 +1238,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
struct iovec *iov;
struct sock *friend = sk->sk_friend;
struct tcp_sock *tp = tcp_sk(sk);
- struct sk_buff *skb;
+ struct sk_buff *skb = NULL;
struct tcp_skb_cb *tcb;
int iovlen, flags, err, copied = 0;
int mss_now = 0, size_goal = size, copied_syn = 0, offset = 0;
@@ -1330,6 +1302,15 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
sg = !!(sk->sk_route_caps & NETIF_F_SG);
+ if (friend) {
+ err = tcp_friend_get_state(friend);
+ if (err) {
+ sk->sk_err = -err;
+ err = -EPIPE;
+ goto out_err;
+ }
+ }
+
while (--iovlen >= 0) {
size_t seglen = iov->iov_len;
unsigned char __user *from = iov->iov_base;
@@ -1350,12 +1331,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
int max = size_goal;
if (friend) {
- skb = tcp_friend_tail(friend, &copy);
- if (copy < 0) {
- sk->sk_err = -copy;
- err = -EPIPE;
- goto out_err;
- }
+ if (skb)
+ copy = skb_availroom(skb);
+ else
+ copy = 0;
} else {
skb = tcp_write_queue_tail(sk);
if (tcp_send_head(sk)) {
@@ -1370,9 +1349,21 @@ new_segment:
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
- if (friend)
+ if (friend) {
+ if (skb) {
+ /*
+ * Friend push old skb
+ */
+
+ if (tcp_friend_push(sk, skb))
+ goto wait_for_sndbuf;
+ }
+
+ /*
+ * new skb
+ */
skb = tcp_friend_alloc_skb(sk, max);
- else {
+ } else {
/* Allocate new segment. If the
* interface is SG, allocate skb
* fitting to single page.
@@ -1455,32 +1446,23 @@ new_segment:
copied += copy;
seglen -= copy;
- if (friend) {
- err = tcp_friend_send_lock(friend);
- if (err) {
- sk->sk_err = -err;
- err = -EPIPE;
- goto out_err;
- }
- tcb->end_seq += copy;
- if (TCP_FRIEND_CB(tcb)->tail_inuse) {
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
- tcp_friend_seq(sk, copy, 0);
- } else {
- if (tcp_friend_push(sk, skb))
- goto wait_for_sndbuf;
- }
- continue;
- }
-
tcb->end_seq += copy;
+
skb_shinfo(skb)->gso_segs = 0;
if (copied == copy)
tcb->tcp_flags &= ~TCPHDR_PSH;
- if (seglen == 0 && iovlen == 0)
+ if (seglen == 0 && iovlen == 0) {
+ if (friend && skb) {
+ if (tcp_friend_push(sk, skb))
+ goto wait_for_sndbuf;
+ }
goto out;
+ }
+
+ if (friend)
+ continue;
if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
continue;
@@ -1501,6 +1483,17 @@ wait_for_memory:
if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
goto do_error;
+ if (friend) {
+ if (skb) {
+ tcp_friend_state_lock(friend);
+ if (!sk_rmem_schedule(friend, skb, skb->truesize)) {
+ tcp_friend_state_unlock(friend);
+ goto wait_for_sndbuf;
+ }
+ tcp_friend_state_unlock(friend);
+ skb = NULL;
+ }
+ }
if (!friend)
mss_now = tcp_send_mss(sk, &size_goal, flags);
}
@@ -1514,10 +1507,7 @@ out:
do_fault:
if (skb->friend) {
- if (TCP_FRIEND_CB(tcb)->tail_inuse)
- TCP_FRIEND_CB(tcb)->tail_inuse = false;
- else
- __kfree_skb(skb);
+ __kfree_skb(skb);
} else if (!skb->len) {
tcp_unlink_write_queue(skb, sk);
/* It is the one place in all of TCP, except connection
@@ -1787,8 +1777,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
err = tcp_friend_validate(sk, &friend, &timeo);
if (err < 0)
return err;
- if (friend)
- tcp_friend_recv_lock(sk);
while ((skb = tcp_recv_skb(sk, seq, &offset, &len)) != NULL) {
if (len > 0) {
@@ -1803,9 +1791,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
break;
}
- if (friend)
- tcp_friend_unlock(sk);
-
used = recv_actor(desc, skb, offset, len);
if (used < 0) {
if (!copied)
@@ -1817,21 +1802,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
offset += used;
}
- if (friend)
- tcp_friend_recv_lock(sk);
- if (skb->friend) {
- len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
- if (len > 0) {
- /*
- * Friend did an skb_put() while we
- * were away so process the same skb.
- */
- if (!desc->count)
- break;
- tp->copied_seq = seq;
- goto again;
- }
- } else {
+ if (!skb->friend) {
/*
* If recv_actor drops the lock (e.g. TCP
* splice receive) the skb pointer might be
@@ -1844,19 +1815,25 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
break;
}
}
+
if (!skb->friend && tcp_hdr(skb)->fin) {
sk_eat_skb(sk, skb, false);
++seq;
break;
}
if (skb->friend) {
- if (!TCP_FRIEND_CB(TCP_SKB_CB(skb))->tail_inuse) {
- __skb_unlink(skb, &sk->sk_receive_queue);
- __kfree_skb(skb);
- tcp_friend_write_space(sk);
+ len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
+ if (len > 0) {
+ if (!desc->count)
+ break;
+ tp->copied_seq = seq;
+ goto again;
}
- tcp_friend_unlock(sk);
- tcp_friend_recv_lock(sk);
+ tcp_friend_state_lock(sk);
+ __skb_unlink(skb, &sk->sk_receive_queue);
+ __kfree_skb(skb);
+ tcp_friend_state_unlock(sk);
+ tcp_friend_write_space(sk);
} else
sk_eat_skb(sk, skb, 0);
if (!desc->count)
@@ -1866,7 +1843,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
tp->copied_seq = seq;
if (friend) {
- tcp_friend_unlock(sk);
tcp_friend_write_space(sk);
} else {
tcp_rcv_space_adjust(sk);
@@ -1903,7 +1879,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
bool copied_early = false;
struct sk_buff *skb;
u32 urg_hole = 0;
- bool locked = false;
lock_sock(sk);
@@ -1991,11 +1966,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
* slock, end_seq updated, so we can only use the bytes
* from *seq to end_seq!
*/
- if (friend && !locked) {
- tcp_friend_recv_lock(sk);
- locked = true;
- }
-
skb_queue_walk(&sk->sk_receive_queue, skb) {
tcb = TCP_SKB_CB(skb);
offset = *seq - tcb->seq;
@@ -2003,20 +1973,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
if (skb->friend) {
used = (u32)(tcb->end_seq - *seq);
if (used > 0) {
- tcp_friend_unlock(sk);
- locked = false;
/* Can use it all */
goto found_ok_skb;
}
/* No data to copyout */
if (flags & MSG_PEEK)
continue;
- if (!TCP_FRIEND_CB(tcb)->tail_inuse)
- goto unlink;
- break;
+ goto unlink;
}
- tcp_friend_unlock(sk);
- locked = false;
}
/* Now that we have two receive queues this
@@ -2043,11 +2007,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
/* Well, if we have backlog, try to process it now yet. */
- if (friend && locked) {
- tcp_friend_unlock(sk);
- locked = false;
- }
-
if (copied >= target && !sk->sk_backlog.tail)
break;
@@ -2262,17 +2221,7 @@ do_prequeue:
len -= used;
offset += used;
- tcp_rcv_space_adjust(sk);
-
-skip_copy:
- if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
- tp->urg_data = 0;
- tcp_fast_path_check(sk);
- }
-
if (skb->friend) {
- tcp_friend_recv_lock(sk);
- locked = true;
used = (u32)(tcb->end_seq - *seq);
if (used) {
/*
@@ -2280,29 +2229,28 @@ skip_copy:
* so if more to do process the same skb.
*/
if (len > 0) {
- tcp_friend_unlock(sk);
- locked = false;
goto found_ok_skb;
}
continue;
}
- if (TCP_FRIEND_CB(tcb)->tail_inuse) {
- /* Give sendmsg a chance */
- tcp_friend_unlock(sk);
- locked = false;
- continue;
- }
if (!(flags & MSG_PEEK)) {
unlink:
+ tcp_friend_state_lock(sk);
__skb_unlink(skb, &sk->sk_receive_queue);
__kfree_skb(skb);
- tcp_friend_unlock(sk);
- locked = false;
+ tcp_friend_state_unlock(sk);
tcp_friend_write_space(sk);
}
continue;
}
+ tcp_rcv_space_adjust(sk);
+skip_copy:
+ if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
+ tp->urg_data = 0;
+ tcp_fast_path_check(sk);
+ }
+
if (offset < skb->len)
continue;
else if (tcp_hdr(skb)->fin)
@@ -2323,9 +2271,6 @@ skip_copy:
break;
} while (len > 0);
- if (friend && locked)
- tcp_friend_unlock(sk);
-
if (user_recv) {
if (!skb_queue_empty(&tp->ucopy.prequeue)) {
int chunk;
--
1.7.4.4
* Re: [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections
From: Weiping Pan @ 2012-12-12 14:13 UTC
To: David Miller; +Cc: wpan, netdev, brutus
In-Reply-To: <20121210.160230.1883556145617090938.davem@davemloft.net>
On 12/11/2012 05:02 AM, David Miller wrote:
> From: Weiping Pan<wpan@redhat.com>
> Date: Wed, 5 Dec 2012 10:54:16 +0800
>
>> Friends VS AF_UNIX
>> Their call paths are almost the same, but AF_UNIX uses its own send/recv code
>> with proper locks,
>> so AF_UNIX's performance is much better than Friends.
Sorry, this statement is not correct.
In the TCP_STREAM case, if the message size is 16384, then AF_UNIX is much
better than Friends.
If the message size is smaller, then Friends shows performance equal to
AF_UNIX.
In TCP_RR, Friends shows performance equal to AF_UNIX, too.
> While I understand the other portions of your analysis, this one
> mystifies me.
>
> In both cases, the sender has to queue the SKB onto the receiver's
> queue. And in both cases, the sender takes the lock on that queue.
>
> So the locking contention really ought to be similar if not identical.
>
> The only difference is that AF_UNIX takes the unix_sk()->lock of the
> remote socket around these operations.
>
> If that is enough of a synchronizer to "fix" the contention or reduce
> it, then this would be very easy to test by adding a friend lock to
> tcp_sk().
I ran some experiments to reduce the use of the lock;
some performance results will follow.
thanks
Weiping Pan
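As an illustration of the point quoted above (the sender queues onto the receiver's queue under one lock), here is a minimal userspace sketch. All names (struct rx_queue, rx_queue_push(), rx_queue_pop()) are hypothetical, the atomic_flag spinlock is only a stand-in for the kernel's queue lock, and the "wake only on the first element" rule is an assumption modelled on the skb_queue_len() == 1 check in patch 4/4.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
};

/* Hypothetical receive queue guarded by a single spinlock. */
struct rx_queue {
	atomic_flag lock;
	struct node *head, *tail;
	size_t qlen;
};

static void q_lock(struct rx_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;
}

static void q_unlock(struct rx_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Sender side: append under the lock. Returns true when the queue went
 * from empty to one element, i.e. when the reader should be woken. */
bool rx_queue_push(struct rx_queue *q, struct node *n)
{
	bool wake;

	q_lock(q);
	n->next = NULL;
	if (q->tail)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
	wake = (++q->qlen == 1);
	q_unlock(q);
	return wake;
}

/* Receiver side: unlink the head under the same lock, NULL if empty. */
struct node *rx_queue_pop(struct rx_queue *q)
{
	struct node *n;

	q_lock(q);
	n = q->head;
	if (n) {
		q->head = n->next;
		if (!q->head)
			q->tail = NULL;
		q->qlen--;
	}
	q_unlock(q);
	return n;
}
```

Both sides contend on the same lock here, which is the situation described in the quoted text; a separate per-socket lock, as suggested, would be an additional synchronizer taken around these calls.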
* Re: [PATCH v2 1/1] net: ethernet: davinci_cpdma: Add boundary for rx and tx descriptors
From: Mugunthan V N @ 2012-12-12 13:38 UTC
To: David Miller; +Cc: erdnetdev, netdev, linux-arm-kernel, linux-omap, s.hauer
In-Reply-To: <20121211.135759.1010213285970148974.davem@davemloft.net>
On 12/12/2012 12:27 AM, David Miller wrote:
> From: Eric Dumazet <erdnetdev@gmail.com>
> Date: Tue, 11 Dec 2012 10:54:56 -0800
>
>> Suggested fix : add a TCQ_F_MQSLAVE flag to allow dequeue_skb() to test
>> the netif_xmit_frozen_or_stopped() status _before_ dequeuing a packet from
>> the qdisc.
> This sounds fine to me.
I will submit the next version with the suggested change.
Regards
Mugunthan V N
* Re: [PATCH] net: filter: return -EINVAL if BPF_S_ANC* operation is not supported
From: Eric Dumazet @ 2012-12-12 12:22 UTC
To: Daniel Borkmann; +Cc: David Miller, netdev, Ani Sinha
In-Reply-To: <1355304701-22228-1-git-send-email-dborkman@redhat.com>
On Wed, 2012-12-12 at 10:31 +0100, Daniel Borkmann wrote:
> Currently, we return -EINVAL for malicious or wrong BPF filters.
> However, this is not done for BPF_S_ANC* operations, which makes it
> more difficult to detect if it's actually supported or not by the
> BPF machine. Therefore, we should also return -EINVAL if K is within
> the SKF_AD_OFF universe and the ancillary operation did not match.
>
> Cc: Ani Sinha <ani@aristanetworks.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
> net/core/filter.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index c23543c..de9bed4 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -531,7 +531,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
> [BPF_JMP|BPF_JSET|BPF_K] = BPF_S_JMP_JSET_K,
> [BPF_JMP|BPF_JSET|BPF_X] = BPF_S_JMP_JSET_X,
> };
> - int pc;
> + int pc, anc_found;
>
> if (flen == 0 || flen > BPF_MAXINSNS)
> return -EINVAL;
> @@ -592,8 +592,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
> case BPF_S_LD_W_ABS:
> case BPF_S_LD_H_ABS:
> case BPF_S_LD_B_ABS:
> + anc_found = 0;
> #define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \
> code = BPF_S_ANC_##CODE; \
> + anc_found = 1; \
> break
> switch (ftest->k) {
> ANCILLARY(PROTOCOL);
> @@ -610,6 +612,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
> ANCILLARY(VLAN_TAG);
> ANCILLARY(VLAN_TAG_PRESENT);
> }
> +
> + /* ancillary operation unknown or unsupported */
> + if (anc_found == 0 && ftest->k >= SKF_AD_OFF)
> + return -EINVAL;
> }
> ftest->code = code;
> }
Several points :
1) This might break a userland filter that previously worked by
returning 0 when load_pointer() returns NULL.
Specifying an offset bigger than skb->len is not _invalid_, it only
makes a filter return 0, because load_pointer() returns NULL.
2) This won't help applications running on old kernels where your patch
won't be applied, as already mentioned yesterday.
3) Misses a "Reported-by" tag
4) anc_found is a boolean
To be truly portable, userland should not rely on the kernel doing a full
validation of ancillaries.
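To illustrate point 1, a minimal userspace sketch of the run-time behaviour: an out-of-range load is not an error, the filter simply returns 0 (packet not accepted). pkt_load() and filter_byte_equals() are hypothetical names; pkt_load() only imitates the "NULL when out of range" contract of load_pointer() and is not the kernel function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical analogue of load_pointer(): returns NULL when reading
 * `size` bytes at `off` would run past the end of the packet. */
const uint8_t *pkt_load(const uint8_t *pkt, size_t len, size_t off, size_t size)
{
	if (off > len || size > len - off)
		return NULL;
	return pkt + off;
}

/* A one-instruction "filter": accept the packet (non-zero return) only
 * if the byte at `off` equals `want`. A failed load yields 0, the same
 * value as "no match". */
unsigned int filter_byte_equals(const uint8_t *pkt, size_t len,
				size_t off, uint8_t want)
{
	const uint8_t *p = pkt_load(pkt, len, off, 1);

	if (!p)
		return 0;
	return *p == want ? 0xffff : 0;
}
```

A filter written this way keeps loading and running even when the offset is never valid, which is why rejecting it with -EINVAL at attach time could change behaviour for existing users.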
* [PATCH V1 net-next 0/4] Add destination MAC address to ethtool flow steering
From: Amir Vadai @ 2012-12-12 12:13 UTC
To: David S. Miller
Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman
From: Yan Burman <yanb@mellanox.com>
In vSwitch configurations it is often beneficial to create flow steering
rules for L3/L4 traffic based on VM port. This requires the destination MAC
address of that port to be present. Note that today the mlx4_en driver
adds its own MAC address to the flow spec, whereas when the new
ethtool flag suggested here is set it does not.
It may also be useful in macvlan devices.
These patches add kernel support for the new field (does not break old
userspace compatibility, so new ethtool will work on old kernels and
old ethtool will work with new kernels).
Also present here is the ethtool userspace patch.
See more details here http://marc.info/?t=134977576500003
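One way to read the compatibility claim, sketched from the struct change in the accompanying ethtool-copy.h patch: the flow union shrinks by 8 bytes (hdata[60] to hdata[52]) and the extension grows by 8 bytes, so the existing fields keep their offsets. The struct names old_pair and new_pair are hypothetical simplifications (only hdata and the extension are modelled), and the numbers assume a common ABI where uint32_t is 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6

/* Extension layout before the patch. */
struct old_ext {
	uint16_t vlan_etype;
	uint16_t vlan_tci;
	uint32_t data[2];
};

/* Extension layout after the patch: 8 new bytes in front. */
struct new_ext {
	uint8_t padding[2];
	unsigned char h_dest[ETH_ALEN]; /* destination eth addr */
	uint16_t vlan_etype;
	uint16_t vlan_tci;
	uint32_t data[2];
};

/* Simplified union + extension pair, as they sit next to each other
 * in the flow spec. */
struct old_pair {
	uint8_t hdata[60];
	struct old_ext ext;
};

struct new_pair {
	uint8_t hdata[52];
	struct new_ext ext;
};

/* The overall size and the position of the pre-existing fields match. */
_Static_assert(sizeof(struct old_pair) == sizeof(struct new_pair),
	       "total size must not change");
_Static_assert(offsetof(struct old_pair, ext.vlan_etype) ==
	       offsetof(struct new_pair, ext.vlan_etype),
	       "vlan_etype must stay at the same offset");
_Static_assert(offsetof(struct old_pair, ext.data) ==
	       offsetof(struct new_pair, ext.data),
	       "data must stay at the same offset");
```

Under these assumptions the new h_dest field lands in what used to be the last 8 bytes of hdata, which old userspace leaves unused for the IPv4 flow types.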
Changes from V0:
- Get rid of full_mac, zero_mac in favour of
is_zero_ether_addr and is_broadcast_ether_addr
Yan Burman (3):
net: ethtool: Add destination MAC address to flow steering API
net/mlx4_en: Use generic etherdevice.h functions.
net/mlx4_en: Add support for destination MAC in steering rules
drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 27 ++++++++++++++++---------
include/uapi/linux/ethtool.h | 11 ++++++----
2 files changed, 24 insertions(+), 14 deletions(-)
--
1.7.11.3
* [PATCH V1 net-next 3/3] net/mlx4_en: Add support for destination MAC in steering rules
From: Amir Vadai @ 2012-12-12 12:13 UTC
To: David S. Miller
Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman
In-Reply-To: <1355314400-14909-1-git-send-email-amirv@mellanox.com>
From: Yan Burman <yanb@mellanox.com>
Implement destination MAC rule extension for L3/L4 rules in
flow steering. Useful for vSwitch/macvlan configurations.
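A minimal userspace sketch of the two checks this patch adds to the driver: stripping both extension flags before switching on the flow type, and accepting a destination MAC only with an all-ones mask. The flag values are taken from the ethtool header; base_flow_type(), mac_is_all_ones() and validate_dst_mac() are hypothetical names, not driver functions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_ALEN      6
#define TCP_V4_FLOW   0x01u
#define ETHER_FLOW    0x12u
#define FLOW_EXT      0x80000000u
#define FLOW_MAC_EXT  0x40000000u

/* Flow type with both extension flags removed. */
uint32_t base_flow_type(uint32_t flow_type)
{
	return flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
}

/* True if every byte of the address is 0xff (same test that
 * is_broadcast_ether_addr() performs in the kernel). */
bool mac_is_all_ones(const unsigned char mac[ETH_ALEN])
{
	for (int i = 0; i < ETH_ALEN; i++)
		if (mac[i] != 0xff)
			return false;
	return true;
}

/* Returns 0 if the rule is acceptable, -EINVAL if a destination MAC
 * was requested with anything other than a full mask. */
int validate_dst_mac(uint32_t flow_type, const unsigned char mask[ETH_ALEN])
{
	if ((flow_type & FLOW_MAC_EXT) && !mac_is_all_ones(mask))
		return -EINVAL;
	return 0;
}
```

Rules without FLOW_MAC_EXT are not affected by the mask check, which matches the way the hunk below guards it.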
Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index cc7bb25..03447da 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -617,7 +617,13 @@ static int mlx4_en_validate_flow(struct net_device *dev,
if (cmd->fs.location >= MAX_NUM_OF_FS_RULES)
return -EINVAL;
- switch (cmd->fs.flow_type & ~FLOW_EXT) {
+ if (cmd->fs.flow_type & FLOW_MAC_EXT) {
+ /* dest mac mask must be ff:ff:ff:ff:ff:ff */
+ if (!is_broadcast_ether_addr(cmd->fs.m_ext.h_dest))
+ return -EINVAL;
+ }
+
+ switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
case TCP_V4_FLOW:
case UDP_V4_FLOW:
if (cmd->fs.m_u.tcp_ip4_spec.tos)
@@ -745,7 +751,6 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
struct list_head *rule_list_h)
{
int err;
- u64 mac;
__be64 be_mac;
struct ethhdr *eth_spec;
struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -760,12 +765,16 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
if (!spec_l2)
return -ENOMEM;
- mac = priv->mac & MLX4_MAC_MASK;
- be_mac = cpu_to_be64(mac << 16);
+ if (cmd->fs.flow_type & FLOW_MAC_EXT) {
+ memcpy(&be_mac, cmd->fs.h_ext.h_dest, ETH_ALEN);
+ } else {
+ u64 mac = priv->mac & MLX4_MAC_MASK;
+ be_mac = cpu_to_be64(mac << 16);
+ }
spec_l2->id = MLX4_NET_TRANS_RULE_ID_ETH;
memcpy(spec_l2->eth.dst_mac_msk, &mac_msk, ETH_ALEN);
- if ((cmd->fs.flow_type & ~FLOW_EXT) != ETHER_FLOW)
+ if ((cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) != ETHER_FLOW)
memcpy(spec_l2->eth.dst_mac, &be_mac, ETH_ALEN);
if ((cmd->fs.flow_type & FLOW_EXT) && cmd->fs.m_ext.vlan_tci) {
@@ -775,7 +784,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
list_add_tail(&spec_l2->list, rule_list_h);
- switch (cmd->fs.flow_type & ~FLOW_EXT) {
+ switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
case ETHER_FLOW:
eth_spec = &cmd->fs.h_u.ether_spec;
memcpy(&spec_l2->eth.dst_mac, eth_spec->h_dest, ETH_ALEN);
--
1.7.11.3
* [PATCH ETHTOOL] Added dst-mac parameter for L3/L4 flow spec rules. This is useful in vSwitch configurations.
From: Amir Vadai @ 2012-12-12 12:13 UTC
To: David S. Miller
Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman
In-Reply-To: <1355314400-14909-1-git-send-email-amirv@mellanox.com>
From: Yan Burman <yanb@mellanox.com>
Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
ethtool-copy.h | 11 +++++++----
ethtool.8.in | 6 ++++++
ethtool.c | 5 +++++
rxclass.c | 62 ++++++++++++++++++++++++++++++++++++++++------------------
4 files changed, 61 insertions(+), 23 deletions(-)
diff --git a/ethtool-copy.h b/ethtool-copy.h
index 4801eef..d352f20 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -500,13 +500,15 @@ union ethtool_flow_union {
struct ethtool_ah_espip4_spec esp_ip4_spec;
struct ethtool_usrip4_spec usr_ip4_spec;
struct ethhdr ether_spec;
- __u8 hdata[60];
+ __u8 hdata[52];
};
struct ethtool_flow_ext {
- __be16 vlan_etype;
- __be16 vlan_tci;
- __be32 data[2];
+ __u8 padding[2];
+ unsigned char h_dest[ETH_ALEN]; /* destination eth addr */
+ __be16 vlan_etype;
+ __be16 vlan_tci;
+ __be32 data[2];
};
/**
@@ -1027,6 +1029,7 @@ enum ethtool_sfeatures_retval_bits {
#define ETHER_FLOW 0x12 /* spec only (ether_spec) */
/* Flag to enable additional fields in struct ethtool_rx_flow_spec */
#define FLOW_EXT 0x80000000
+#define FLOW_MAC_EXT 0x40000000
/* L3-L4 network traffic flow hash options */
#define RXH_L2DA (1 << 1)
diff --git a/ethtool.8.in b/ethtool.8.in
index e701919..a52e484 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -268,6 +268,7 @@ ethtool \- query or control network driver and hardware settings
.BM vlan\-etype
.BM vlan
.BM user\-def
+.RB [ dst-mac \ \*(MA\ [ m \ \*(MA]]
.BN action
.BN loc
.RB |
@@ -739,6 +740,11 @@ Includes the VLAN tag and an optional mask.
.BI user\-def \ N \\fR\ [\\fPm \ N \\fR]\\fP
Includes 64-bits of user-specific data and an optional mask.
.TP
+.BR dst-mac \ \*(MA\ [ m \ \*(MA]
+Includes the destination MAC address, specified as 6 bytes in hexadecimal
+separated by colons, along with an optional mask.
+Valid for all IPv4 based flow-types.
+.TP
.BI action \ N
Specifies the Rx queue to send packets to, or some other action.
.TS
diff --git a/ethtool.c b/ethtool.c
index 345c21c..55bc082 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -3231,6 +3231,10 @@ static int flow_spec_to_ntuple(struct ethtool_rx_flow_spec *fsp,
if (fsp->location != RX_CLS_LOC_ANY)
return -1;
+ /* destination MAC address in L3/L4 rules is not supported by ntuple */
+ if (fsp->flow_type & FLOW_MAC_EXT)
+ return -1;
+
/* verify ring cookie can transfer to action */
if (fsp->ring_cookie > INT_MAX && fsp->ring_cookie < (u64)(-2))
return -1;
@@ -3814,6 +3818,7 @@ static const struct option {
" [ vlan-etype %x [m %x] ]\n"
" [ vlan %x [m %x] ]\n"
" [ user-def %x [m %x] ]\n"
+ " [ dst-mac %x:%x:%x:%x:%x:%x [m %x:%x:%x:%x:%x:%x] ]\n"
" [ action %d ]\n"
" [ loc %d]] |\n"
" delete %d\n" },
diff --git a/rxclass.c b/rxclass.c
index e1633a8..1564b62 100644
--- a/rxclass.c
+++ b/rxclass.c
@@ -41,26 +41,38 @@ static void rxclass_print_ipv4_rule(__be32 sip, __be32 sipm, __be32 dip,
static void rxclass_print_nfc_spec_ext(struct ethtool_rx_flow_spec *fsp)
{
- u64 data, datam;
- __u16 etype, etypem, tci, tcim;
+ if (fsp->flow_type & FLOW_EXT) {
+ u64 data, datam;
+ __u16 etype, etypem, tci, tcim;
+ etype = ntohs(fsp->h_ext.vlan_etype);
+ etypem = ntohs(~fsp->m_ext.vlan_etype);
+ tci = ntohs(fsp->h_ext.vlan_tci);
+ tcim = ntohs(~fsp->m_ext.vlan_tci);
+ data = (u64)ntohl(fsp->h_ext.data[0]) << 32;
+ data |= (u64)ntohl(fsp->h_ext.data[1]);
+ datam = (u64)ntohl(~fsp->m_ext.data[0]) << 32;
+ datam |= (u64)ntohl(~fsp->m_ext.data[1]);
- if (!(fsp->flow_type & FLOW_EXT))
- return;
+ fprintf(stdout,
+ "\tVLAN EtherType: 0x%x mask: 0x%x\n"
+ "\tVLAN: 0x%x mask: 0x%x\n"
+ "\tUser-defined: 0x%llx mask: 0x%llx\n",
+ etype, etypem, tci, tcim, data, datam);
+ }
- etype = ntohs(fsp->h_ext.vlan_etype);
- etypem = ntohs(~fsp->m_ext.vlan_etype);
- tci = ntohs(fsp->h_ext.vlan_tci);
- tcim = ntohs(~fsp->m_ext.vlan_tci);
- data = (u64)ntohl(fsp->h_ext.data[0]) << 32;
- data = (u64)ntohl(fsp->h_ext.data[1]);
- datam = (u64)ntohl(~fsp->m_ext.data[0]) << 32;
- datam |= (u64)ntohl(~fsp->m_ext.data[1]);
+ if (fsp->flow_type & FLOW_MAC_EXT) {
+ unsigned char *dmac, *dmacm;
- fprintf(stdout,
- "\tVLAN EtherType: 0x%x mask: 0x%x\n"
- "\tVLAN: 0x%x mask: 0x%x\n"
- "\tUser-defined: 0x%llx mask: 0x%llx\n",
- etype, etypem, tci, tcim, data, datam);
+ dmac = fsp->h_ext.h_dest;
+ dmacm = fsp->m_ext.h_dest;
+
+ fprintf(stdout,
+ "\tDest MAC addr: %02X:%02X:%02X:%02X:%02X:%02X"
+ " mask: %02X:%02X:%02X:%02X:%02X:%02X\n",
+ dmac[0], dmac[1], dmac[2], dmac[3], dmac[4],
+ dmac[5], dmacm[0], dmacm[1], dmacm[2], dmacm[3],
+ dmacm[4], dmacm[5]);
+ }
}
static void rxclass_print_nfc_rule(struct ethtool_rx_flow_spec *fsp)
@@ -70,7 +82,7 @@ static void rxclass_print_nfc_rule(struct ethtool_rx_flow_spec *fsp)
fprintf(stdout, "Filter: %d\n", fsp->location);
- flow_type = fsp->flow_type & ~FLOW_EXT;
+ flow_type = fsp->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
invert_flow_mask(fsp);
@@ -172,7 +184,7 @@ static void rxclass_print_nfc_rule(struct ethtool_rx_flow_spec *fsp)
static void rxclass_print_rule(struct ethtool_rx_flow_spec *fsp)
{
/* print the rule in this location */
- switch (fsp->flow_type & ~FLOW_EXT) {
+ switch (fsp->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
case TCP_V4_FLOW:
case UDP_V4_FLOW:
case SCTP_V4_FLOW:
@@ -533,6 +545,7 @@ typedef enum {
#define NTUPLE_FLAG_VLAN 0x100
#define NTUPLE_FLAG_UDEF 0x200
#define NTUPLE_FLAG_VETH 0x400
+#define NFC_FLAG_MAC_ADDR 0x800
struct rule_opts {
const char *name;
@@ -571,6 +584,9 @@ static const struct rule_opts rule_nfc_tcp_ip4[] = {
{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
offsetof(struct ethtool_rx_flow_spec, h_ext.data),
offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+ { "dst-mac", OPT_MAC, NFC_FLAG_MAC_ADDR,
+ offsetof(struct ethtool_rx_flow_spec, h_ext.h_dest),
+ offsetof(struct ethtool_rx_flow_spec, m_ext.h_dest) },
};
static const struct rule_opts rule_nfc_esp_ip4[] = {
@@ -599,6 +615,9 @@ static const struct rule_opts rule_nfc_esp_ip4[] = {
{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
offsetof(struct ethtool_rx_flow_spec, h_ext.data),
offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+ { "dst-mac", OPT_MAC, NFC_FLAG_MAC_ADDR,
+ offsetof(struct ethtool_rx_flow_spec, h_ext.h_dest),
+ offsetof(struct ethtool_rx_flow_spec, m_ext.h_dest) },
};
static const struct rule_opts rule_nfc_usr_ip4[] = {
@@ -639,6 +658,9 @@ static const struct rule_opts rule_nfc_usr_ip4[] = {
{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
offsetof(struct ethtool_rx_flow_spec, h_ext.data),
offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+ { "dst-mac", OPT_MAC, NFC_FLAG_MAC_ADDR,
+ offsetof(struct ethtool_rx_flow_spec, h_ext.h_dest),
+ offsetof(struct ethtool_rx_flow_spec, m_ext.h_dest) },
};
static const struct rule_opts rule_nfc_ether[] = {
@@ -1063,6 +1085,8 @@ int rxclass_parse_ruleopts(struct cmd_context *ctx,
fsp->h_u.usr_ip4_spec.ip_ver = ETH_RX_NFC_IP4;
if (flags & (NTUPLE_FLAG_VLAN | NTUPLE_FLAG_UDEF | NTUPLE_FLAG_VETH))
fsp->flow_type |= FLOW_EXT;
+ if (flags & NFC_FLAG_MAC_ADDR)
+ fsp->flow_type |= FLOW_MAC_EXT;
return 0;
--
1.7.11.3
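For reference, the dst-mac argument format described in the man page hunk (six hexadecimal bytes separated by colons) could be parsed roughly as follows. This is a minimal userspace sketch and not ethtool's actual OPT_MAC parser; the helper name parse_mac is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define ETH_ALEN 6

/* Hypothetical helper: parse "xx:xx:xx:xx:xx:xx" into six bytes.
 * Returns 0 on success, -1 on malformed input. */
static int parse_mac(const char *str, unsigned char mac[ETH_ALEN])
{
	unsigned int b[ETH_ALEN];
	char trailing;
	int i;

	assert(str != NULL && mac != NULL);

	/* The trailing %c only matches if junk follows the sixth byte. */
	if (sscanf(str, "%2x:%2x:%2x:%2x:%2x:%2x%c",
		   &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &trailing) != 6)
		return -1;

	for (i = 0; i < ETH_ALEN; i++)
		mac[i] = (unsigned char)b[i];
	return 0;
}
```

The same routine would serve for the optional mask, since the usage string gives it the same six-byte form.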
* [PATCH V1 net-next 1/3] net: ethtool: Add destination MAC address to flow steering API
From: Amir Vadai @ 2012-12-12 12:13 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman
In-Reply-To: <1355314400-14909-1-git-send-email-amirv@mellanox.com>
From: Yan Burman <yanb@mellanox.com>
Add the ability to specify a destination MAC address in L3/L4 flow specs,
so that actions can be specified for different VMs under a vSwitch
configuration. This change is transparent to older userspace.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
include/uapi/linux/ethtool.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index d3eaaaf..be8c41e 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -500,13 +500,15 @@ union ethtool_flow_union {
struct ethtool_ah_espip4_spec esp_ip4_spec;
struct ethtool_usrip4_spec usr_ip4_spec;
struct ethhdr ether_spec;
- __u8 hdata[60];
+ __u8 hdata[52];
};
struct ethtool_flow_ext {
- __be16 vlan_etype;
- __be16 vlan_tci;
- __be32 data[2];
+ __u8 padding[2];
+ unsigned char h_dest[ETH_ALEN]; /* destination eth addr */
+ __be16 vlan_etype;
+ __be16 vlan_tci;
+ __be32 data[2];
};
/**
@@ -1027,6 +1029,7 @@ enum ethtool_sfeatures_retval_bits {
#define ETHER_FLOW 0x12 /* spec only (ether_spec) */
/* Flag to enable additional fields in struct ethtool_rx_flow_spec */
#define FLOW_EXT 0x80000000
+#define FLOW_MAC_EXT 0x40000000
/* L3-L4 network traffic flow hash options */
#define RXH_L2DA (1 << 1)
--
1.7.11.3
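A note on why the layout change can be transparent: the 8 bytes added to struct ethtool_flow_ext are taken from hdata, so the fields that already existed keep their offsets. The sketch below checks this with simplified stand-in structs. The names old_pair, new_pair, layout_is_compatible and base_flow_type are hypothetical, and the real struct ethtool_rx_flow_spec has more members than shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN     6
#define FLOW_EXT     0x80000000u
#define FLOW_MAC_EXT 0x40000000u

/* Simplified stand-ins for the union + extension pair before the patch. */
struct old_ext {
	uint16_t vlan_etype;
	uint16_t vlan_tci;
	uint32_t data[2];
};
struct old_pair {
	uint8_t hdata[60];
	struct old_ext ext;
};

/* The same pair after the patch: 8 bytes move from hdata to the extension. */
struct new_ext {
	uint8_t padding[2];
	unsigned char h_dest[ETH_ALEN];
	uint16_t vlan_etype;
	uint16_t vlan_tci;
	uint32_t data[2];
};
struct new_pair {
	uint8_t hdata[52];
	struct new_ext ext;
};

/* The overall size must not change, or later members would shift. */
static_assert(sizeof(struct old_pair) == sizeof(struct new_pair),
	      "flow spec size changed");

/* Returns 1 if the pre-existing extension fields kept their offsets. */
static int layout_is_compatible(void)
{
	return offsetof(struct old_pair, ext.vlan_etype) ==
	       offsetof(struct new_pair, ext.vlan_etype) &&
	       offsetof(struct old_pair, ext.vlan_tci) ==
	       offsetof(struct new_pair, ext.vlan_tci) &&
	       offsetof(struct old_pair, ext.data) ==
	       offsetof(struct new_pair, ext.data);
}

/* Strip both extension flags to recover the base flow type. */
static uint32_t base_flow_type(uint32_t flow_type)
{
	return flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
}
```

Older userspace never sets FLOW_MAC_EXT, so a kernel that honours the flag only when it is set would ignore the new h_dest bytes for such callers.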
* [PATCH V1 net-next 2/3] net/mlx4_en: Use generic etherdevice.h functions.
From: Amir Vadai @ 2012-12-12 12:13 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman
In-Reply-To: <1355314400-14909-1-git-send-email-amirv@mellanox.com>
From: Yan Burman <yanb@mellanox.com>
Get rid of full_mac, zero_mac in favour of
is_zero_ether_addr and is_broadcast_ether_addr.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 4aaa7c3..cc7bb25 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -613,8 +613,6 @@ static int mlx4_en_validate_flow(struct net_device *dev,
struct ethtool_usrip4_spec *l3_mask;
struct ethtool_tcpip4_spec *l4_mask;
struct ethhdr *eth_mask;
- u64 full_mac = ~0ull;
- u64 zero_mac = 0;
if (cmd->fs.location >= MAX_NUM_OF_FS_RULES)
return -EINVAL;
@@ -644,11 +642,11 @@ static int mlx4_en_validate_flow(struct net_device *dev,
case ETHER_FLOW:
eth_mask = &cmd->fs.m_u.ether_spec;
/* source mac mask must not be set */
- if (memcmp(eth_mask->h_source, &zero_mac, ETH_ALEN))
+ if (!is_zero_ether_addr(eth_mask->h_source))
return -EINVAL;
/* dest mac mask must be ff:ff:ff:ff:ff:ff */
- if (memcmp(eth_mask->h_dest, &full_mac, ETH_ALEN))
+ if (!is_broadcast_ether_addr(eth_mask->h_dest))
return -EINVAL;
if (!all_zeros_or_all_ones(eth_mask->h_proto))
--
1.7.11.3
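For readers outside the kernel tree, the two checks the patch switches to behave roughly like the userspace approximations below. This is only a sketch: mac_is_zero and mac_is_broadcast are hypothetical names, and the kernel's own is_zero_ether_addr() and is_broadcast_ether_addr() in etherdevice.h are implemented differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ETH_ALEN 6

/* True if all six bytes are 0x00 (mask "must not be set"). */
static bool mac_is_zero(const unsigned char *addr)
{
	size_t i;

	assert(addr != NULL);
	for (i = 0; i < ETH_ALEN; i++)
		if (addr[i] != 0x00)
			return false;
	return true;
}

/* True if all six bytes are 0xff (mask ff:ff:ff:ff:ff:ff). */
static bool mac_is_broadcast(const unsigned char *addr)
{
	size_t i;

	assert(addr != NULL);
	for (i = 0; i < ETH_ALEN; i++)
		if (addr[i] != 0xff)
			return false;
	return true;
}
```

Compared with memcmp() against a u64 constant, the named helpers make the intent of each mask check explicit and remove the two local variables.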
* [PATCH] iproute2: fix tc ematch manpage section
From: Andreas Henriksson @ 2012-12-12 11:23 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev
The Debian package checking tool, lintian, spotted that the
tc ematch manpage has an error in the specified section.
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
diff --git a/man/man8/tc-ematch.8 b/man/man8/tc-ematch.8
index 2eafc29..957a22e 100644
--- a/man/man8/tc-ematch.8
+++ b/man/man8/tc-ematch.8
@@ -1,4 +1,4 @@
-.TH filter ematch "6 August 2012" iproute2 Linux
+.TH ematch 8 "6 August 2012" iproute2 Linux
.
.SH NAME
ematch \- extended matches for use with "basic" or "flow" filters
* Re: [PATCH] ipv6: fix the bug when propagating Redirect Message
From: Duan Jiong @ 2012-12-12 11:09 UTC (permalink / raw)
To: Steffen Klassert; +Cc: davem, netdev
In-Reply-To: <20121211134514.GE18940@secunet.com>
On 2012/12/11 21:45, Steffen Klassert wrote:
> On Tue, Dec 11, 2012 at 08:58:20PM +0800, Duan Jiong wrote:
>>
>> Just like you said, I tried to use ndisc_parse_options() instead
>> of the loop, but I found that skb->data can't be changed in
>> ndisc_parse_options() due to a lack of arguments. So I think it is
>> better to continue to use the loop. What do you think?
>>
>
> You can change the data pointer after ndisc_parse_options().
> Something like the (untested) patch below should do it.
>
> include/net/ndisc.h | 7 +++++++
> net/ipv6/ndisc.c | 20 ++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/include/net/ndisc.h b/include/net/ndisc.h
> index 980d263..c17bccd 100644
> --- a/include/net/ndisc.h
> +++ b/include/net/ndisc.h
> @@ -78,6 +78,13 @@ struct ra_msg {
> __be32 retrans_timer;
> };
>
> +struct rd_msg {
> + struct icmp6hdr icmph;
> + struct in6_addr target;
> + struct in6_addr dest;
> + __u8 opt[0];
> +};
> +
> struct nd_opt_hdr {
> __u8 nd_opt_type;
> __u8 nd_opt_len;
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index 2edce30..9afd23f 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -1333,6 +1333,12 @@ out:
>
> static void ndisc_redirect_rcv(struct sk_buff *skb)
> {
> + u8 *hdr;
> + struct ndisc_options ndopts;
> + struct rd_msg *msg = (struct rd_msg *) skb_transport_header(skb);
> + u32 ndoptlen = skb->tail - (skb->transport_header +
> + offsetof(struct rd_msg, opt));
> +
> #ifdef CONFIG_IPV6_NDISC_NODETYPE
> switch (skb->ndisc_nodetype) {
> case NDISC_NODETYPE_HOST:
> @@ -1349,6 +1355,20 @@ static void ndisc_redirect_rcv(struct sk_buff *skb)
> return;
> }
>
> + if (!ndisc_parse_options(msg->opt, ndoptlen, &ndopts)) {
> + ND_PRINTK(2, warn, "Redirect: invalid ND options\n");
> + return;
> + }
> +
> + if (!ndopts.nd_opts_rh)
> + return;
> +
> + hdr = (u8 *) ndopts.nd_opts_rh;
> + hdr += 8;
> +
> + if (!pskb_pull(skb, hdr - skb_transport_header(skb)))
> + return;
> +
> icmpv6_notify(skb, NDISC_REDIRECT, 0, 0);
> }
>
>
Thanks for your help. I will test it.
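The pointer arithmetic in the quoted patch (find the Redirected Header option, then skip its 8-byte header) can be tried out in userspace on a raw byte buffer. The sketch below is an illustration under assumptions, not the kernel code: redirected_packet_offset is a hypothetical name, and the buffer is assumed to start at the ICMPv6 header of the Redirect message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RD_FIXED_LEN        40 /* ICMPv6 header (8) + target (16) + dest (16) */
#define ND_OPT_REDIRECT_HDR 4  /* Redirected Header option type (RFC 4861) */
#define ND_OPT_RH_HDR_LEN   8  /* type + length + 6 reserved bytes */

/* Returns the offset of the embedded original packet, measured from the start
 * of the Redirect message, or 0 if no valid Redirected Header option exists. */
static size_t redirected_packet_offset(const uint8_t *msg, size_t len)
{
	size_t off = RD_FIXED_LEN;

	assert(msg != NULL);
	while (off + 2 <= len) {
		/* The option length field counts units of 8 bytes. */
		size_t optlen = (size_t)msg[off + 1] * 8;

		if (optlen == 0 || off + optlen > len)
			return 0; /* malformed option */
		if (msg[off] == ND_OPT_REDIRECT_HDR)
			return off + ND_OPT_RH_HDR_LEN;
		off += optlen;
	}
	return 0;
}
```

The returned offset corresponds to the length passed to pskb_pull() in the patch, that is, hdr minus the transport header.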
* [patch net-next 4/4] dummy: implement carrier change
From: Jiri Pirko @ 2012-12-12 10:58 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl
In-Reply-To: <1355309887-1081-1-git-send-email-jiri@resnulli.us>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
drivers/net/dummy.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index c260af5..42aa54a 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -100,6 +100,15 @@ static void dummy_dev_uninit(struct net_device *dev)
free_percpu(dev->dstats);
}
+static int dummy_change_carrier(struct net_device *dev, bool new_carrier)
+{
+ if (new_carrier)
+ netif_carrier_on(dev);
+ else
+ netif_carrier_off(dev);
+ return 0;
+}
+
static const struct net_device_ops dummy_netdev_ops = {
.ndo_init = dummy_dev_init,
.ndo_uninit = dummy_dev_uninit,
@@ -108,6 +117,7 @@ static const struct net_device_ops dummy_netdev_ops = {
.ndo_set_rx_mode = set_multicast_list,
.ndo_set_mac_address = eth_mac_addr,
.ndo_get_stats64 = dummy_get_stats64,
+ .ndo_change_carrier = dummy_change_carrier,
};
static void dummy_setup(struct net_device *dev)
--
1.8.0
* [patch net-next 3/4] rtnl: expose carrier value with possibility to set it
From: Jiri Pirko @ 2012-12-12 10:58 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl
In-Reply-To: <1355309887-1081-1-git-send-email-jiri@resnulli.us>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
include/uapi/linux/if_link.h | 1 +
net/core/rtnetlink.c | 10 ++++++++++
2 files changed, 11 insertions(+)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 60f3b6b..c4edfe1 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -142,6 +142,7 @@ enum {
#define IFLA_PROMISCUITY IFLA_PROMISCUITY
IFLA_NUM_TX_QUEUES,
IFLA_NUM_RX_QUEUES,
+ IFLA_CARRIER,
__IFLA_MAX
};
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1868625..2ef7a56 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -780,6 +780,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
+ nla_total_size(4) /* IFLA_MTU */
+ nla_total_size(4) /* IFLA_LINK */
+ nla_total_size(4) /* IFLA_MASTER */
+ + nla_total_size(1) /* IFLA_CARRIER */
+ nla_total_size(4) /* IFLA_PROMISCUITY */
+ nla_total_size(4) /* IFLA_NUM_TX_QUEUES */
+ nla_total_size(4) /* IFLA_NUM_RX_QUEUES */
@@ -909,6 +910,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
nla_put_u32(skb, IFLA_LINK, dev->iflink)) ||
(dev->master &&
nla_put_u32(skb, IFLA_MASTER, dev->master->ifindex)) ||
+ nla_put_u8(skb, IFLA_CARRIER, netif_carrier_ok(dev)) ||
(dev->qdisc &&
nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) ||
(dev->ifalias &&
@@ -1108,6 +1110,7 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_MTU] = { .type = NLA_U32 },
[IFLA_LINK] = { .type = NLA_U32 },
[IFLA_MASTER] = { .type = NLA_U32 },
+ [IFLA_CARRIER] = { .type = NLA_U8 },
[IFLA_TXQLEN] = { .type = NLA_U32 },
[IFLA_WEIGHT] = { .type = NLA_U32 },
[IFLA_OPERSTATE] = { .type = NLA_U8 },
@@ -1438,6 +1441,13 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
modified = 1;
}
+ if (tb[IFLA_CARRIER]) {
+ err = dev_change_carrier(dev, nla_get_u8(tb[IFLA_CARRIER]));
+ if (err)
+ goto errout;
+ modified = 1;
+ }
+
if (tb[IFLA_TXQLEN])
dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
--
1.8.0
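As a side note on the if_nlmsg_size() hunk: a netlink attribute occupies its 4-byte header plus the payload, rounded up to a multiple of 4, so a one-byte IFLA_CARRIER still reserves 8 bytes. The helpers below mirror that arithmetic in userspace; the sketch_ names are hypothetical and stand in for the kernel's nla_total_size().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NLA_HDRLEN 4 /* u16 length + u16 type */

/* Round len up to the 4-byte netlink attribute alignment. */
static size_t sketch_nla_align(size_t len)
{
	const size_t align = 4;

	return (len + align - 1) & ~(align - 1);
}

/* Space reserved for one attribute carrying `payload` bytes. */
static size_t sketch_nla_total_size(size_t payload)
{
	/* Guard against wrap-around in the additions below. */
	assert(payload <= SIZE_MAX - SKETCH_NLA_HDRLEN - 3);
	return sketch_nla_align(SKETCH_NLA_HDRLEN + payload);
}
```

By this arithmetic the u8 carrier attribute costs the same message space as the neighbouring u32 attributes.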
* [patch net-next 2/4] net: allow to change carrier via sysfs
From: Jiri Pirko @ 2012-12-12 10:58 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl
In-Reply-To: <1355309887-1081-1-git-send-email-jiri@resnulli.us>
Make carrier writable
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
net/core/net-sysfs.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 334efd5..7eda40a 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -126,6 +126,19 @@ static ssize_t show_broadcast(struct device *dev,
return -EINVAL;
}
+static int change_carrier(struct net_device *net, unsigned long new_carrier)
+{
+ if (!netif_running(net))
+ return -EINVAL;
+ return dev_change_carrier(net, (bool) new_carrier);
+}
+
+static ssize_t store_carrier(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ return netdev_store(dev, attr, buf, len, change_carrier);
+}
+
static ssize_t show_carrier(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -331,7 +344,7 @@ static struct device_attribute net_class_attributes[] = {
__ATTR(link_mode, S_IRUGO, show_link_mode, NULL),
__ATTR(address, S_IRUGO, show_address, NULL),
__ATTR(broadcast, S_IRUGO, show_broadcast, NULL),
- __ATTR(carrier, S_IRUGO, show_carrier, NULL),
+ __ATTR(carrier, S_IRUGO | S_IWUSR, show_carrier, store_carrier),
__ATTR(speed, S_IRUGO, show_speed, NULL),
__ATTR(duplex, S_IRUGO, show_duplex, NULL),
__ATTR(dormant, S_IRUGO, show_dormant, NULL),
--
1.8.0
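With this patch applied, the carrier attribute can be written like any other sysfs file. The helper below is a minimal userspace sketch; write_carrier is a hypothetical name, and the interface name in the usage note is only an example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Write "1" or "0" to a carrier attribute file. Returns 0 on success or a
 * negative errno value on failure. */
static int write_carrier(const char *path, int carrier_on)
{
	FILE *f;
	int err = 0;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fputs(carrier_on ? "1" : "0", f) == EOF)
		err = -errno;
	/* The data is buffered, so a store error may only show up on close. */
	if (fclose(f) == EOF && !err)
		err = -errno;
	return err;
}
```

A call such as write_carrier("/sys/class/net/dummy0/carrier", 0) would be expected to fail with -EINVAL while the interface is down, because change_carrier() checks netif_running(), and to need root because the attribute is S_IWUSR.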
* [patch net-next 1/4] net: add change_carrier netdev op
From: Jiri Pirko @ 2012-12-12 10:58 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl
In-Reply-To: <1355309887-1081-1-git-send-email-jiri@resnulli.us>
This allows a driver to register a change_carrier callback, which is
called whenever the user wants to change the carrier state. This is useful
for devices like dummy, gre, team and so on.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
include/linux/netdevice.h | 7 +++++++
net/core/dev.c | 19 +++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c6a14d4..e1a5c16 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -891,6 +891,9 @@ struct netdev_fcoe_hbainfo {
* int (*ndo_bridge_setlink)(struct net_device *dev, struct nlmsghdr *nlh)
* int (*ndo_bridge_getlink)(struct sk_buff *skb, u32 pid, u32 seq,
* struct net_device *dev)
+ *
+ * int (*ndo_change_carrier)(struct net_device *dev, bool new_carrier);
+ * Called to update device carrier.
*/
struct net_device_ops {
int (*ndo_init)(struct net_device *dev);
@@ -1008,6 +1011,8 @@ struct net_device_ops {
int (*ndo_bridge_getlink)(struct sk_buff *skb,
u32 pid, u32 seq,
struct net_device *dev);
+ int (*ndo_change_carrier)(struct net_device *dev,
+ bool new_carrier);
};
/*
@@ -2191,6 +2196,8 @@ extern int dev_set_mtu(struct net_device *, int);
extern void dev_set_group(struct net_device *, int);
extern int dev_set_mac_address(struct net_device *,
struct sockaddr *);
+extern int dev_change_carrier(struct net_device *,
+ bool new_carrier);
extern int dev_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev,
struct netdev_queue *txq);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4783850..cc6426b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5025,6 +5025,25 @@ int dev_set_mac_address(struct net_device *dev, struct sockaddr *sa)
}
EXPORT_SYMBOL(dev_set_mac_address);
+/**
+ * dev_change_carrier - Change device carrier
+ * @dev: device
+ * @new_carrier: new value
+ *
+ * Change device carrier
+ */
+int dev_change_carrier(struct net_device *dev, bool new_carrier)
+{
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ if (!ops->ndo_change_carrier)
+ return -EOPNOTSUPP;
+ if (!netif_device_present(dev))
+ return -ENODEV;
+ return ops->ndo_change_carrier(dev, new_carrier);
+}
+EXPORT_SYMBOL(dev_change_carrier);
+
/*
* Perform the SIOCxIFxxx calls, inside rcu_read_lock()
*/
--
1.8.0
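The dispatch pattern in dev_change_carrier() (optional callback, -EOPNOTSUPP when it is absent, -ENODEV when the device is gone) can be modelled in userspace as below. All mock_ names are hypothetical stand-ins for struct net_device, struct net_device_ops and the dummy driver callback; this is a sketch of the control flow only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock types standing in for struct net_device and struct net_device_ops. */
struct mock_dev;
struct mock_ops {
	int (*change_carrier)(struct mock_dev *dev, bool new_carrier);
};
struct mock_dev {
	const struct mock_ops *ops;
	bool present; /* models netif_device_present() */
	bool carrier; /* models the carrier state */
};

/* Same order of checks as dev_change_carrier() in the patch. */
static int mock_dev_change_carrier(struct mock_dev *dev, bool new_carrier)
{
	assert(dev != NULL && dev->ops != NULL);
	if (!dev->ops->change_carrier)
		return -EOPNOTSUPP;
	if (!dev->present)
		return -ENODEV;
	return dev->ops->change_carrier(dev, new_carrier);
}

/* Callback in the style of dummy_change_carrier(): just record the state. */
static int mock_dummy_change_carrier(struct mock_dev *dev, bool new_carrier)
{
	dev->carrier = new_carrier;
	return 0;
}
```

Drivers that do not fill in the callback keep their current behaviour, since callers simply get -EOPNOTSUPP.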
* [patch net-next 0/4] net: allow to change carrier from userspace
From: Jiri Pirko @ 2012-12-12 10:58 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl
This is basically a repost of my previous patchset:
"[patch net-next-2.6 0/2] net: allow to change carrier via sysfs" from Aug 30
The way net-sysfs stores values has changed, and this patchset reflects that.
I also exposed the carrier via the rtnetlink interface.
So far, only the dummy driver uses the carrier change ndo. The team driver
will use it as well in the very near future.
Jiri Pirko (4):
net: add change_carrier netdev op
net: allow to change carrier via sysfs
rtnl: expose carrier value with possibility to set it
dummy: implement carrier change
drivers/net/dummy.c | 10 ++++++++++
include/linux/netdevice.h | 7 +++++++
include/uapi/linux/if_link.h | 1 +
net/core/dev.c | 19 +++++++++++++++++++
net/core/net-sysfs.c | 15 ++++++++++++++-
net/core/rtnetlink.c | 10 ++++++++++
6 files changed, 61 insertions(+), 1 deletion(-)
--
1.8.0
* Re: [PATCH net-next 2/2] net/mlx4_en: Add support for destination MAC in steering rules
From: Amir Vadai @ 2012-12-12 10:07 UTC (permalink / raw)
To: Brian Haley; +Cc: David S. Miller, netdev, Or Gerlitz, Yan Burman
In-Reply-To: <50C75382.3040504@hp.com>
On 11/12/2012 17:38, Brian Haley wrote:
> On 12/11/2012 07:03 AM, Amir Vadai wrote:
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
>> @@ -619,7 +619,13 @@ static int mlx4_en_validate_flow(struct net_device *dev,
>> if (cmd->fs.location >= MAX_NUM_OF_FS_RULES)
>> return -EINVAL;
>>
>> - switch (cmd->fs.flow_type & ~FLOW_EXT) {
>> + if (cmd->fs.flow_type & FLOW_MAC_EXT) {
>> + /* dest mac mask must be ff:ff:ff:ff:ff:ff */
>> + if (memcmp(cmd->fs.m_ext.h_dest, &full_mac, ETH_ALEN))
>> + return -EINVAL;
>> + }
>
> etherdevice.h has is_broadcast_ether_addr() and is_zero_ether_addr() if you want
> to get rid of full_mac and zero_mac in this function.
>
> -Brian
>
Right, will send a V1 with this fix.
Amir.
* [PATCH iproute2 1/3] ip: add support of netconf messages
From: Nicolas Dichtel @ 2012-12-12 9:51 UTC (permalink / raw)
To: shemminger; +Cc: netdev, Nicolas Dichtel
Example of the output:
$ ip monitor netconf&
[1] 24901
$ echo 0 > /proc/sys/net/ipv6/conf/all/forwarding
ipv6 dev lo forwarding off
ipv6 dev eth0 forwarding off
ipv6 all forwarding off
$ echo 1 > /proc/sys/net/ipv4/conf/eth0/forwarding
ipv4 dev eth0 forwarding on
$ ip -6 netconf
ipv6 all forwarding on mc_forwarding 0
$ ip netconf show dev eth0
ipv4 dev eth0 forwarding on rp_filter off mc_forwarding 1
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
include/libnetlink.h | 1 +
include/linux/netconf.h | 24 ++++++
include/linux/rtnetlink.h | 9 +++
ip/Makefile | 2 +-
ip/ip.c | 1 +
ip/ip_common.h | 3 +
ip/ipmonitor.c | 16 ++++
ip/ipnetconf.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 238 insertions(+), 1 deletion(-)
create mode 100644 include/linux/netconf.h
create mode 100644 ip/ipnetconf.c
diff --git a/include/libnetlink.h b/include/libnetlink.h
index 81649af..4a6b878 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -8,6 +8,7 @@
#include <linux/if_link.h>
#include <linux/if_addr.h>
#include <linux/neighbour.h>
+#include <linux/netconf.h>
struct rtnl_handle
{
diff --git a/include/linux/netconf.h b/include/linux/netconf.h
new file mode 100644
index 0000000..64804a7
--- /dev/null
+++ b/include/linux/netconf.h
@@ -0,0 +1,24 @@
+#ifndef _UAPI_LINUX_NETCONF_H_
+#define _UAPI_LINUX_NETCONF_H_
+
+#include <linux/types.h>
+#include <linux/netlink.h>
+
+struct netconfmsg {
+ __u8 ncm_family;
+};
+
+enum {
+ NETCONFA_UNSPEC,
+ NETCONFA_IFINDEX,
+ NETCONFA_FORWARDING,
+ NETCONFA_RP_FILTER,
+ NETCONFA_MC_FORWARDING,
+ __NETCONFA_MAX
+};
+#define NETCONFA_MAX (__NETCONFA_MAX - 1)
+
+#define NETCONFA_IFINDEX_ALL -1
+#define NETCONFA_IFINDEX_DEFAULT -2
+
+#endif /* _UAPI_LINUX_NETCONF_H_ */
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 0e3e0c1..a30530e 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -120,6 +120,11 @@ enum {
RTM_SETDCB,
#define RTM_SETDCB RTM_SETDCB
+ RTM_NEWNETCONF = 80,
+#define RTM_NEWNETCONF RTM_NEWNETCONF
+ RTM_GETNETCONF = 82,
+#define RTM_GETNETCONF RTM_GETNETCONF
+
__RTM_MAX,
#define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
};
@@ -585,6 +590,10 @@ enum rtnetlink_groups {
#define RTNLGRP_PHONET_ROUTE RTNLGRP_PHONET_ROUTE
RTNLGRP_DCB,
#define RTNLGRP_DCB RTNLGRP_DCB
+ RTNLGRP_IPV4_NETCONF,
+#define RTNLGRP_IPV4_NETCONF RTNLGRP_IPV4_NETCONF
+ RTNLGRP_IPV6_NETCONF,
+#define RTNLGRP_IPV6_NETCONF RTNLGRP_IPV6_NETCONF
__RTNLGRP_MAX
};
#define RTNLGRP_MAX (__RTNLGRP_MAX - 1)
diff --git a/ip/Makefile b/ip/Makefile
index 1676f0f..4bc33d7 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -4,7 +4,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
iplink_macvlan.o iplink_macvtap.o ipl2tp.o link_vti.o \
- iplink_vxlan.o tcp_metrics.o iplink_ipoib.o
+ iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o
RTMONOBJ=rtmon.o
diff --git a/ip/ip.c b/ip/ip.c
index e0f7e60..632d271 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -85,6 +85,7 @@ static const struct cmd {
{ "mroute", do_multiroute },
{ "mrule", do_multirule },
{ "netns", do_netns },
+ { "netconf", do_ipnetconf },
{ "help", do_help },
{ 0 }
};
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 2fd66b7..a394669 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -25,6 +25,8 @@ extern int print_prefix(const struct sockaddr_nl *who,
struct nlmsghdr *n, void *arg);
extern int print_rule(const struct sockaddr_nl *who,
struct nlmsghdr *n, void *arg);
+extern int print_netconf(const struct sockaddr_nl *who,
+ struct nlmsghdr *n, void *arg);
extern int do_ipaddr(int argc, char **argv);
extern int do_ipaddrlabel(int argc, char **argv);
extern int do_iproute(int argc, char **argv);
@@ -43,6 +45,7 @@ extern int do_netns(int argc, char **argv);
extern int do_xfrm(int argc, char **argv);
extern int do_ipl2tp(int argc, char **argv);
extern int do_tcp_metrics(int argc, char **argv);
+extern int do_ipnetconf(int argc, char **argv);
static inline int rtm_get_table(struct rtmsg *r, struct rtattr **tb)
{
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 4b1d469..d87e58f 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -85,6 +85,12 @@ int accept_msg(const struct sockaddr_nl *who,
print_rule(who, n, arg);
return 0;
}
+ if (n->nlmsg_type == RTM_NEWNETCONF) {
+ if (prefix_banner)
+ fprintf(fp, "[NETCONF]");
+ print_netconf(who, n, arg);
+ return 0;
+ }
if (n->nlmsg_type == 15) {
char *tstr;
time_t secs = ((__u32*)NLMSG_DATA(n))[0];
@@ -118,6 +124,7 @@ int do_ipmonitor(int argc, char **argv)
int lroute=0;
int lprefix=0;
int lneigh=0;
+ int lnetconf=0;
rtnl_close(&rth);
ipaddr_reset_filter(1);
@@ -143,6 +150,9 @@ int do_ipmonitor(int argc, char **argv)
} else if (matches(*argv, "neigh") == 0) {
lneigh = 1;
groups = 0;
+ } else if (matches(*argv, "netconf") == 0) {
+ lnetconf = 1;
+ groups = 0;
} else if (strcmp(*argv, "all") == 0) {
groups = ~RTMGRP_TC;
prefix_banner=1;
@@ -176,6 +186,12 @@ int do_ipmonitor(int argc, char **argv)
if (lneigh) {
groups |= nl_mgrp(RTNLGRP_NEIGH);
}
+ if (lnetconf) {
+ if (!preferred_family || preferred_family == AF_INET)
+ groups |= nl_mgrp(RTNLGRP_IPV4_NETCONF);
+ if (!preferred_family || preferred_family == AF_INET6)
+ groups |= nl_mgrp(RTNLGRP_IPV6_NETCONF);
+ }
if (file) {
FILE *fp;
fp = fopen(file, "r");
diff --git a/ip/ipnetconf.c b/ip/ipnetconf.c
new file mode 100644
index 0000000..66d667b
--- /dev/null
+++ b/ip/ipnetconf.c
@@ -0,0 +1,183 @@
+/*
+ * ipnetconf.c "ip netconf".
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors: Nicolas Dichtel, <nicolas.dichtel@6wind.com>
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static struct
+{
+ int family;
+ int ifindex;
+} filter;
+
+static void usage(void) __attribute__((noreturn));
+
+static void usage(void)
+{
+ fprintf(stderr, "Usage: ip netconf show [ dev STRING ]\n");
+ exit(-1);
+}
+
+#define NETCONF_RTA(r) ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct netconfmsg))))
+
+int print_netconf(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
+{
+ FILE *fp = (FILE*)arg;
+ struct netconfmsg *ncm = NLMSG_DATA(n);
+ int len = n->nlmsg_len;
+ struct rtattr *tb[NETCONFA_MAX+1];
+
+ if (n->nlmsg_type == NLMSG_ERROR)
+ return -1;
+ if (n->nlmsg_type != RTM_NEWNETCONF) {
+ fprintf(stderr, "Not RTM_NEWNETCONF: %08x %08x %08x\n",
+ n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
+
+ return -1;
+ }
+ len -= NLMSG_SPACE(sizeof(*ncm));
+ if (len < 0) {
+ fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
+ return -1;
+ }
+
+ if (filter.family && filter.family != ncm->ncm_family)
+ return 0;
+
+ parse_rtattr(tb, NETCONFA_MAX, NETCONF_RTA(ncm),
+ NLMSG_PAYLOAD(n, sizeof(*ncm)));
+
+ switch (ncm->ncm_family) {
+ case AF_INET:
+ fprintf(fp, "ipv4 ");
+ break;
+ case AF_INET6:
+ fprintf(fp, "ipv6 ");
+ break;
+ default:
+ fprintf(fp, "unknown ");
+ break;
+ }
+
+ if (tb[NETCONFA_IFINDEX]) {
+ int *ifindex = (int *)RTA_DATA(tb[NETCONFA_IFINDEX]);
+
+ switch (*ifindex) {
+ case NETCONFA_IFINDEX_ALL:
+ fprintf(fp, "all ");
+ break;
+ case NETCONFA_IFINDEX_DEFAULT:
+ fprintf(fp, "default ");
+ break;
+ default:
+ fprintf(fp, "dev %s ", ll_index_to_name(*ifindex));
+ break;
+ }
+ }
+
+ if (tb[NETCONFA_FORWARDING])
+ fprintf(fp, "forwarding %s ",
+ *(int *)RTA_DATA(tb[NETCONFA_FORWARDING])?"on":"off");
+ if (tb[NETCONFA_RP_FILTER]) {
+ int rp_filter = *(int *)RTA_DATA(tb[NETCONFA_RP_FILTER]);
+
+ if (rp_filter == 0)
+ fprintf(fp, "rp_filter off ");
+ else if (rp_filter == 1)
+ fprintf(fp, "rp_filter strict ");
+ else if (rp_filter == 2)
+ fprintf(fp, "rp_filter loose ");
+ else
+ fprintf(fp, "rp_filter unknown mode ");
+ }
+ if (tb[NETCONFA_MC_FORWARDING])
+ fprintf(fp, "mc_forwarding %d ",
+ *(int *)RTA_DATA(tb[NETCONFA_MC_FORWARDING]));
+
+ fprintf(fp, "\n");
+ fflush(fp);
+ return 0;
+}
+
+void ipnetconf_reset_filter(void)
+{
+ memset(&filter, 0, sizeof(filter));
+}
+
+int do_show(int argc, char **argv)
+{
+ struct {
+ struct nlmsghdr n;
+ struct netconfmsg ncm;
+ char buf[1024];
+ } req;
+
+ ipnetconf_reset_filter();
+ filter.family = preferred_family;
+ if (filter.family == AF_UNSPEC)
+ filter.family = AF_INET;
+ filter.ifindex = NETCONFA_IFINDEX_ALL;
+
+ while (argc > 0) {
+ if (strcmp(*argv, "dev") == 0) {
+ NEXT_ARG();
+ filter.ifindex = ll_name_to_index(*argv);
+ if (filter.ifindex <= 0) {
+ fprintf(stderr, "Device \"%s\" does not exist.\n",
+ *argv);
+ return -1;
+ }
+ }
+ argv++; argc--;
+ }
+
+ ll_init_map(&rth);
+ memset(&req, 0, sizeof(req));
+ req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct netconfmsg));
+ req.n.nlmsg_flags = NLM_F_REQUEST|NLM_F_ACK;
+ req.n.nlmsg_type = RTM_GETNETCONF;
+ req.ncm.ncm_family = filter.family;
+ addattr_l(&req.n, sizeof(req), NETCONFA_IFINDEX, &filter.ifindex,
+ sizeof(filter.ifindex));
+
+ rtnl_send(&rth, &req.n, req.n.nlmsg_len);
+ rtnl_listen(&rth, print_netconf, stdout);
+
+ return 0;
+}
+
+int do_ipnetconf(int argc, char **argv)
+{
+ if (argc > 0) {
+ if (matches(*argv, "show") == 0 ||
+ matches(*argv, "lst") == 0 ||
+ matches(*argv, "list") == 0)
+ return do_show(argc-1, argv+1);
+ if (matches(*argv, "help") == 0)
+ usage();
+ } else
+ return do_show(0, NULL);
+
+ fprintf(stderr, "Command \"%s\" is unknown, try \"ip netconf help\".\n", *argv);
+ exit(-1);
+}
--
1.8.0.1
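The "all" and "dev eth0" labels in the example output come from the NETCONFA_IFINDEX attribute, where two negative sentinel values stand for "all" and "default". The sketch below mirrors that switch in isolation. format_netconf_target is a hypothetical helper, and it prints the raw index because the name lookup via ll_index_to_name() is not available outside iproute2.

```c
#include <assert.h>
#include <stdio.h>

#define NETCONFA_IFINDEX_ALL     -1
#define NETCONFA_IFINDEX_DEFAULT -2

/* Render the NETCONFA_IFINDEX value in the style of print_netconf().
 * Returns the snprintf() result, i.e. the length of the label. */
static int format_netconf_target(int ifindex, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	switch (ifindex) {
	case NETCONFA_IFINDEX_ALL:
		return snprintf(buf, len, "all");
	case NETCONFA_IFINDEX_DEFAULT:
		return snprintf(buf, len, "default");
	default:
		/* The real code resolves the index to an interface name. */
		return snprintf(buf, len, "dev #%d", ifindex);
	}
}
```

Note that do_show() initialises filter.ifindex to NETCONFA_IFINDEX_ALL, which is why a plain "ip netconf" reports the "all" entry.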
* [PATCH iproute2 2/3] ip: add support of 'ip link type ip6tnl'
From: Nicolas Dichtel @ 2012-12-12 9:51 UTC (permalink / raw)
To: shemminger; +Cc: netdev, Nicolas Dichtel
In-Reply-To: <1355305907-7102-1-git-send-email-nicolas.dichtel@6wind.com>
This patch allows managing ip6 tunnels via the 'ip link' interface.
The syntax for parameters is the same as for 'ip -6 tunnel'.
It also allows displaying tunnel parameters with 'ip -details link' or
'ip -details monitor link'.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
include/linux/if_tunnel.h | 20 +++
ip/Makefile | 2 +-
ip/iplink.c | 3 +-
ip/link_ip6tnl.c | 344 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 367 insertions(+), 2 deletions(-)
create mode 100644 ip/link_ip6tnl.c
diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 5db5942..aee73d0 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -37,6 +37,26 @@ struct ip_tunnel_parm {
struct iphdr iph;
};
+enum {
+ IFLA_IPTUN_UNSPEC,
+ IFLA_IPTUN_LINK,
+ IFLA_IPTUN_LOCAL,
+ IFLA_IPTUN_REMOTE,
+ IFLA_IPTUN_TTL,
+ IFLA_IPTUN_TOS,
+ IFLA_IPTUN_ENCAP_LIMIT,
+ IFLA_IPTUN_FLOWINFO,
+ IFLA_IPTUN_FLAGS,
+ IFLA_IPTUN_PROTO,
+ IFLA_IPTUN_PMTUDISC,
+ IFLA_IPTUN_6RD_PREFIX,
+ IFLA_IPTUN_6RD_RELAY_PREFIX,
+ IFLA_IPTUN_6RD_PREFIXLEN,
+ IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
+ __IFLA_IPTUN_MAX,
+};
+#define IFLA_IPTUN_MAX (__IFLA_IPTUN_MAX - 1)
+
/* SIT-mode i_flags */
#define SIT_ISATAP 0x0001
diff --git a/ip/Makefile b/ip/Makefile
index 4bc33d7..abf54bf 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -4,7 +4,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
iplink_macvlan.o iplink_macvtap.o ipl2tp.o link_vti.o \
- iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o
+ iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o
RTMONOBJ=rtmon.o
diff --git a/ip/iplink.c b/ip/iplink.c
index 7451aa0..8aac9fc 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -83,7 +83,8 @@ void iplink_usage(void)
if (iplink_have_newlink()) {
fprintf(stderr, "\n");
- fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | can | bridge | ipoib }\n");
+ fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | can |\n");
+ fprintf(stderr, " bridge | ipoib | ip6tnl }\n");
}
exit(-1);
}
diff --git a/ip/link_ip6tnl.c b/ip/link_ip6tnl.c
new file mode 100644
index 0000000..2947364
--- /dev/null
+++ b/ip/link_ip6tnl.c
@@ -0,0 +1,344 @@
+/*
+ * link_ip6tnl.c ip6tnl driver module
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors: Nicolas Dichtel <nicolas.dichtel@6wind.com>
+ *
+ */
+
+#include <string.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+
+#include <linux/ip.h>
+#include <linux/if_tunnel.h>
+#include <linux/ip6_tunnel.h>
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+#include "tunnel.h"
+
+#define IP6_FLOWINFO_TCLASS htonl(0x0FF00000)
+#define IP6_FLOWINFO_FLOWLABEL htonl(0x000FFFFF)
+
+#define DEFAULT_TNL_HOP_LIMIT (64)
+
+static void usage(void) __attribute__((noreturn));
+static void usage(void)
+{
+ fprintf(stderr, "Usage: ip link { add | set | change | replace | del } NAME\n");
+ fprintf(stderr, " type ip6tnl [ remote ADDR ] [ local ADDR ]\n");
+ fprintf(stderr, " [ dev PHYS_DEV ] [ encaplimit ELIM ]\n");
+ fprintf(stderr, " [ hoplimit HLIM ] [ tclass TCLASS ] [ flowlabel FLOWLABEL ]\n");
+ fprintf(stderr, " [ dscp inherit ] [ fwmark inherit ]\n");
+ fprintf(stderr, "\n");
+ fprintf(stderr, "Where: NAME := STRING\n");
+ fprintf(stderr, " ADDR := IPV6_ADDRESS\n");
+ fprintf(stderr, " ELIM := { none | 0..255 }(default=%d)\n",
+ IPV6_DEFAULT_TNL_ENCAP_LIMIT);
+ fprintf(stderr, " HLIM := 0..255 (default=%d)\n",
+ DEFAULT_TNL_HOP_LIMIT);
+ fprintf(stderr, " TCLASS := { 0x0..0xff | inherit }\n");
+ fprintf(stderr, " FLOWLABEL := { 0x0..0xfffff | inherit }\n");
+ exit(-1);
+}
+
+static int ip6tunnel_parse_opt(struct link_util *lu, int argc, char **argv,
+ struct nlmsghdr *n)
+{
+ struct {
+ struct nlmsghdr n;
+ struct ifinfomsg i;
+ char buf[2048];
+ } req;
+ struct ifinfomsg *ifi = (struct ifinfomsg *)(n + 1);
+ struct rtattr *tb[IFLA_MAX + 1];
+ struct rtattr *linkinfo[IFLA_INFO_MAX+1];
+ struct rtattr *iptuninfo[IFLA_IPTUN_MAX + 1];
+ int len;
+ struct in6_addr laddr;
+ struct in6_addr raddr;
+ __u8 hop_limit = DEFAULT_TNL_HOP_LIMIT;
+ __u8 encap_limit = IPV6_DEFAULT_TNL_ENCAP_LIMIT;
+ __u32 flowinfo = 0;
+ __u32 flags = 0;
+ __u32 link = 0;
+ __u8 proto = 0;
+
+ memset(&laddr, 0, sizeof(laddr));
+ memset(&raddr, 0, sizeof(raddr));
+
+ if (!(n->nlmsg_flags & NLM_F_CREATE)) {
+ memset(&req, 0, sizeof(req));
+
+ req.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi));
+ req.n.nlmsg_flags = NLM_F_REQUEST;
+ req.n.nlmsg_type = RTM_GETLINK;
+ req.i.ifi_family = preferred_family;
+ req.i.ifi_index = ifi->ifi_index;
+
+ if (rtnl_talk(&rth, &req.n, 0, 0, &req.n) < 0) {
+get_failed:
+ fprintf(stderr,
+ "Failed to get existing tunnel info.\n");
+ return -1;
+ }
+
+ len = req.n.nlmsg_len;
+ len -= NLMSG_LENGTH(sizeof(*ifi));
+ if (len < 0)
+ goto get_failed;
+
+ parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
+
+ if (!tb[IFLA_LINKINFO])
+ goto get_failed;
+
+ parse_rtattr_nested(linkinfo, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
+
+ if (!linkinfo[IFLA_INFO_DATA])
+ goto get_failed;
+
+ parse_rtattr_nested(iptuninfo, IFLA_IPTUN_MAX,
+ linkinfo[IFLA_INFO_DATA]);
+
+ if (iptuninfo[IFLA_IPTUN_LOCAL])
+ memcpy(&laddr, RTA_DATA(iptuninfo[IFLA_IPTUN_LOCAL]),
+ sizeof(laddr));
+
+ if (iptuninfo[IFLA_IPTUN_REMOTE])
+ memcpy(&raddr, RTA_DATA(iptuninfo[IFLA_IPTUN_REMOTE]),
+ sizeof(raddr));
+
+ if (iptuninfo[IFLA_IPTUN_TTL])
+ hop_limit = rta_getattr_u8(iptuninfo[IFLA_IPTUN_TTL]);
+
+ if (iptuninfo[IFLA_IPTUN_ENCAP_LIMIT])
+ encap_limit = rta_getattr_u8(iptuninfo[IFLA_IPTUN_ENCAP_LIMIT]);
+
+ if (iptuninfo[IFLA_IPTUN_FLOWINFO])
+ flowinfo = rta_getattr_u32(iptuninfo[IFLA_IPTUN_FLOWINFO]);
+
+ if (iptuninfo[IFLA_IPTUN_FLAGS])
+ flags = rta_getattr_u32(iptuninfo[IFLA_IPTUN_FLAGS]);
+
+ if (iptuninfo[IFLA_IPTUN_LINK])
+ link = rta_getattr_u32(iptuninfo[IFLA_IPTUN_LINK]);
+
+ if (iptuninfo[IFLA_IPTUN_PROTO])
+ proto = rta_getattr_u8(iptuninfo[IFLA_IPTUN_PROTO]);
+ }
+
+ while (argc > 0) {
+ if (matches(*argv, "mode") == 0) {
+ NEXT_ARG();
+ if (strcmp(*argv, "ipv6/ipv6") == 0 ||
+ strcmp(*argv, "ip6ip6") == 0)
+ proto = IPPROTO_IPV6;
+ else if (strcmp(*argv, "ip/ipv6") == 0 ||
+ strcmp(*argv, "ipv4/ipv6") == 0 ||
+ strcmp(*argv, "ipip6") == 0 ||
+ strcmp(*argv, "ip4ip6") == 0)
+ proto = IPPROTO_IPIP;
+ else if (strcmp(*argv, "any/ipv6") == 0 ||
+ strcmp(*argv, "any") == 0)
+ proto = 0;
+ else
+ invarg("Cannot guess tunnel mode.", *argv);
+ } else if (strcmp(*argv, "remote") == 0) {
+ inet_prefix addr;
+ NEXT_ARG();
+ get_prefix(&addr, *argv, preferred_family);
+ if (addr.family == AF_UNSPEC)
+ invarg("\"remote\" address family is AF_UNSPEC", *argv);
+ memcpy(&raddr, addr.data, addr.bytelen);
+ } else if (strcmp(*argv, "local") == 0) {
+ inet_prefix addr;
+ NEXT_ARG();
+ get_prefix(&addr, *argv, preferred_family);
+ if (addr.family == AF_UNSPEC)
+ invarg("\"local\" address family is AF_UNSPEC", *argv);
+ memcpy(&laddr, addr.data, addr.bytelen);
+ } else if (matches(*argv, "dev") == 0) {
+ NEXT_ARG();
+ link = if_nametoindex(*argv);
+ if (link == 0)
+ invarg("\"dev\" is invalid", *argv);
+ } else if (strcmp(*argv, "hoplimit") == 0 ||
+ strcmp(*argv, "ttl") == 0 ||
+ strcmp(*argv, "hlim") == 0) {
+ __u8 uval;
+ NEXT_ARG();
+ if (get_u8(&uval, *argv, 0))
+ invarg("invalid HLIM", *argv);
+ hop_limit = uval;
+ } else if (matches(*argv, "encaplimit") == 0) {
+ NEXT_ARG();
+ if (strcmp(*argv, "none") == 0) {
+ flags |= IP6_TNL_F_IGN_ENCAP_LIMIT;
+ } else {
+ __u8 uval;
+ if (get_u8(&uval, *argv, 0))
+ invarg("invalid ELIM", *argv);
+ encap_limit = uval;
+ flags &= ~IP6_TNL_F_IGN_ENCAP_LIMIT;
+ }
+ } else if (strcmp(*argv, "tclass") == 0 ||
+ strcmp(*argv, "tc") == 0 ||
+ strcmp(*argv, "tos") == 0 ||
+ matches(*argv, "dsfield") == 0) {
+ __u8 uval;
+ NEXT_ARG();
+ flowinfo &= ~IP6_FLOWINFO_TCLASS;
+ if (strcmp(*argv, "inherit") == 0)
+ flags |= IP6_TNL_F_USE_ORIG_TCLASS;
+ else {
+ if (get_u8(&uval, *argv, 16))
+ invarg("invalid TClass", *argv);
+ flowinfo |= htonl((__u32)uval << 20) & IP6_FLOWINFO_TCLASS;
+ flags &= ~IP6_TNL_F_USE_ORIG_TCLASS;
+ }
+ } else if (strcmp(*argv, "flowlabel") == 0 ||
+ strcmp(*argv, "fl") == 0) {
+ __u32 uval;
+ NEXT_ARG();
+ flowinfo &= ~IP6_FLOWINFO_FLOWLABEL;
+ if (strcmp(*argv, "inherit") == 0)
+ flags |= IP6_TNL_F_USE_ORIG_FLOWLABEL;
+ else {
+ if (get_u32(&uval, *argv, 16))
+ invarg("invalid Flowlabel", *argv);
+ if (uval > 0xFFFFF)
+ invarg("invalid Flowlabel", *argv);
+ flowinfo |= htonl(uval) & IP6_FLOWINFO_FLOWLABEL;
+ flags &= ~IP6_TNL_F_USE_ORIG_FLOWLABEL;
+ }
+ } else if (strcmp(*argv, "dscp") == 0) {
+ NEXT_ARG();
+ if (strcmp(*argv, "inherit") != 0)
+ invarg("not inherit", *argv);
+ flags |= IP6_TNL_F_RCV_DSCP_COPY;
+ } else if (strcmp(*argv, "fwmark") == 0) {
+ NEXT_ARG();
+ if (strcmp(*argv, "inherit") != 0)
+ invarg("not inherit", *argv);
+ flags |= IP6_TNL_F_USE_ORIG_FWMARK;
+ } else
+ usage();
+ argc--, argv++;
+ }
+
+ addattr8(n, 1024, IFLA_IPTUN_PROTO, proto);
+ addattr_l(n, 1024, IFLA_IPTUN_LOCAL, &laddr, sizeof(laddr));
+ addattr_l(n, 1024, IFLA_IPTUN_REMOTE, &raddr, sizeof(raddr));
+ addattr8(n, 1024, IFLA_IPTUN_TTL, hop_limit);
+ addattr8(n, 1024, IFLA_IPTUN_ENCAP_LIMIT, encap_limit);
+ addattr32(n, 1024, IFLA_IPTUN_FLOWINFO, flowinfo);
+ addattr32(n, 1024, IFLA_IPTUN_FLAGS, flags);
+ addattr32(n, 1024, IFLA_IPTUN_LINK, link);
+
+ return 0;
+}
+
+static void ip6tunnel_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+ char s1[256];
+ char s2[64];
+ int flags = 0;
+ __u32 flowinfo = 0;
+
+ if (!tb)
+ return;
+
+ if (tb[IFLA_IPTUN_FLAGS])
+ flags = rta_getattr_u32(tb[IFLA_IPTUN_FLAGS]);
+
+ if (tb[IFLA_IPTUN_FLOWINFO])
+ flowinfo = rta_getattr_u32(tb[IFLA_IPTUN_FLOWINFO]);
+
+ if (tb[IFLA_IPTUN_PROTO]) {
+ switch (rta_getattr_u8(tb[IFLA_IPTUN_PROTO])) {
+ case IPPROTO_IPIP:
+ fprintf(f, "ipip6 ");
+ break;
+ case IPPROTO_IPV6:
+ fprintf(f, "ip6ip6 ");
+ break;
+ case 0:
+ fprintf(f, "any ");
+ break;
+ }
+ }
+
+ if (tb[IFLA_IPTUN_REMOTE]) {
+ fprintf(f, "remote %s ",
+ rt_addr_n2a(AF_INET6,
+ RTA_PAYLOAD(tb[IFLA_IPTUN_REMOTE]),
+ RTA_DATA(tb[IFLA_IPTUN_REMOTE]),
+ s1, sizeof(s1)));
+ }
+
+ if (tb[IFLA_IPTUN_LOCAL]) {
+ fprintf(f, "local %s ",
+ rt_addr_n2a(AF_INET6,
+ RTA_PAYLOAD(tb[IFLA_IPTUN_LOCAL]),
+ RTA_DATA(tb[IFLA_IPTUN_LOCAL]),
+ s1, sizeof(s1)));
+ }
+
+ if (tb[IFLA_IPTUN_LINK] && rta_getattr_u32(tb[IFLA_IPTUN_LINK])) {
+ unsigned link = rta_getattr_u32(tb[IFLA_IPTUN_LINK]);
+ const char *n = if_indextoname(link, s2);
+
+ if (n)
+ fprintf(f, "dev %s ", n);
+ else
+ fprintf(f, "dev %u ", link);
+ }
+
+ if (flags & IP6_TNL_F_IGN_ENCAP_LIMIT)
+ fprintf(f, "encaplimit none ");
+ else if (tb[IFLA_IPTUN_ENCAP_LIMIT])
+ fprintf(f, "encaplimit %u ",
+ rta_getattr_u8(tb[IFLA_IPTUN_ENCAP_LIMIT]));
+
+ if (tb[IFLA_IPTUN_TTL])
+ fprintf(f, "hoplimit %u ", rta_getattr_u8(tb[IFLA_IPTUN_TTL]));
+
+ if (flags & IP6_TNL_F_USE_ORIG_TCLASS)
+ fprintf(f, "tclass inherit ");
+ else if (tb[IFLA_IPTUN_FLOWINFO]) {
+ __u32 val = ntohl(flowinfo & IP6_FLOWINFO_TCLASS);
+
+ fprintf(f, "tclass 0x%02x ", (__u8)(val >> 20));
+ }
+
+ if (flags & IP6_TNL_F_USE_ORIG_FLOWLABEL)
+ fprintf(f, "flowlabel inherit ");
+ else
+ fprintf(f, "flowlabel 0x%05x ", ntohl(flowinfo & IP6_FLOWINFO_FLOWLABEL));
+
+ fprintf(f, "(flowinfo 0x%08x) ", ntohl(flowinfo));
+
+ if (flags & IP6_TNL_F_RCV_DSCP_COPY)
+ fprintf(f, "dscp inherit ");
+
+ if (flags & IP6_TNL_F_MIP6_DEV)
+ fprintf(f, "mip6 ");
+
+ if (flags & IP6_TNL_F_USE_ORIG_FWMARK)
+ fprintf(f, "fwmark inherit ");
+}
+
+struct link_util ip6tnl_link_util = {
+ .id = "ip6tnl",
+ .maxattr = IFLA_IPTUN_MAX,
+ .parse_opt = ip6tunnel_parse_opt,
+ .print_opt = ip6tunnel_print_opt,
+};
--
1.8.0.1
* [PATCH iproute2 3/3] ip: add support of 'ip link type [ipip|sit]'
From: Nicolas Dichtel @ 2012-12-12 9:51 UTC (permalink / raw)
To: shemminger; +Cc: netdev, Nicolas Dichtel
In-Reply-To: <1355305907-7102-1-git-send-email-nicolas.dichtel@6wind.com>
This patch allows ip tunnels to be managed via 'ip link'.
The parameter syntax is the same as for 'ip tunnel'.
It also allows tunnel parameters to be displayed with 'ip -details link' or
'ip -details monitor link'.
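The patch below stores "ttl inherit" as 0 and rejects a non-zero ttl combined with nopmtudisc. The following is a minimal standalone sketch of that convention, not the code from the patch: the helper names encode_ttl() and tunnel_opts_valid() are hypothetical, and the numeric parsing uses strtoul() here instead of the iproute2 get_u8() helper.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: encode a "ttl" argument the way the patch does.
 * "inherit" is stored as 0; any other value must be a number in 0..255.
 * Returns 0 on success, -1 on a malformed or out-of-range value.
 */
static int encode_ttl(const char *arg, uint8_t *ttl)
{
	char *end;
	unsigned long v;

	if (strcmp(arg, "inherit") == 0) {
		*ttl = 0;
		return 0;
	}

	v = strtoul(arg, &end, 0);
	if (*arg == '\0' || *end != '\0' || v > 255)
		return -1;

	*ttl = (uint8_t)v;
	return 0;
}

/*
 * Hypothetical helper: mirror the sanity check done before the attributes
 * are added. A fixed (non-zero) ttl cannot be combined with nopmtudisc.
 * Returns 1 if the combination is acceptable, 0 otherwise.
 */
static int tunnel_opts_valid(uint8_t ttl, uint8_t pmtudisc)
{
	return !(ttl != 0 && pmtudisc == 0);
}
```

With these helpers, "ttl 64 nopmtudisc" is rejected while "ttl inherit nopmtudisc" is accepted, which matches the behaviour of the check at the end of the option parser in the patch.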
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
ip/Makefile | 3 +-
ip/iplink.c | 2 +-
ip/link_iptnl.c | 340 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 343 insertions(+), 2 deletions(-)
create mode 100644 ip/link_iptnl.c
diff --git a/ip/Makefile b/ip/Makefile
index abf54bf..2b606d4 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -4,7 +4,8 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
iplink_macvlan.o iplink_macvtap.o ipl2tp.o link_vti.o \
- iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o
+ iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
+ link_iptnl.o
RTMONOBJ=rtmon.o
diff --git a/ip/iplink.c b/ip/iplink.c
index 8aac9fc..d73c705 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -84,7 +84,7 @@ void iplink_usage(void)
if (iplink_have_newlink()) {
fprintf(stderr, "\n");
fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | can |\n");
- fprintf(stderr, " bridge | ipoib | ip6tnl }\n");
+ fprintf(stderr, " bridge | ipoib | ip6tnl | ipip | sit }\n");
}
exit(-1);
}
diff --git a/ip/link_iptnl.c b/ip/link_iptnl.c
new file mode 100644
index 0000000..238722d
--- /dev/null
+++ b/ip/link_iptnl.c
@@ -0,0 +1,340 @@
+/*
+ * link_iptnl.c ipip and sit driver module
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors: Nicolas Dichtel <nicolas.dichtel@6wind.com>
+ *
+ */
+
+#include <string.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+
+#include <linux/ip.h>
+#include <linux/if_tunnel.h>
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+#include "tunnel.h"
+
+static void usage(int sit) __attribute__((noreturn));
+static void usage(int sit)
+{
+ fprintf(stderr, "Usage: ip link { add | set | change | replace | del } NAME\n");
+ fprintf(stderr, " type { ipip | sit } [ remote ADDR ] [ local ADDR ]\n");
+ fprintf(stderr, " [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ]\n");
+ fprintf(stderr, " [ 6rd-prefix ADDR ] [ 6rd-relay_prefix ADDR ] [ 6rd-reset ]\n");
+ if (sit)
+ fprintf(stderr, " [ isatap ]\n");
+ fprintf(stderr, "\n");
+ fprintf(stderr, "Where: NAME := STRING\n");
+ fprintf(stderr, " ADDR := { IP_ADDRESS | any }\n");
+ fprintf(stderr, " TOS := { NUMBER | inherit }\n");
+ fprintf(stderr, " TTL := { 1..255 | inherit }\n");
+ exit(-1);
+}
+
+static int iptunnel_parse_opt(struct link_util *lu, int argc, char **argv,
+ struct nlmsghdr *n)
+{
+ struct {
+ struct nlmsghdr n;
+ struct ifinfomsg i;
+ char buf[2048];
+ } req;
+ struct ifinfomsg *ifi = (struct ifinfomsg *)(n + 1);
+ struct rtattr *tb[IFLA_MAX + 1];
+ struct rtattr *linkinfo[IFLA_INFO_MAX+1];
+ struct rtattr *iptuninfo[IFLA_IPTUN_MAX + 1];
+ int len;
+ __u32 link = 0;
+ __u32 laddr = 0;
+ __u32 raddr = 0;
+ __u8 ttl = 0;
+ __u8 tos = 0;
+ __u8 pmtudisc = 1;
+ __u16 iflags = 0;
+ struct in6_addr ip6rdprefix;
+ __u16 ip6rdprefixlen = 0;
+ __u32 ip6rdrelayprefix = 0;
+ __u16 ip6rdrelayprefixlen = 0;
+
+ memset(&ip6rdprefix, 0, sizeof(ip6rdprefix));
+
+ if (!(n->nlmsg_flags & NLM_F_CREATE)) {
+ memset(&req, 0, sizeof(req));
+
+ req.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi));
+ req.n.nlmsg_flags = NLM_F_REQUEST;
+ req.n.nlmsg_type = RTM_GETLINK;
+ req.i.ifi_family = preferred_family;
+ req.i.ifi_index = ifi->ifi_index;
+
+ if (rtnl_talk(&rth, &req.n, 0, 0, &req.n) < 0) {
+get_failed:
+ fprintf(stderr,
+ "Failed to get existing tunnel info.\n");
+ return -1;
+ }
+
+ len = req.n.nlmsg_len;
+ len -= NLMSG_LENGTH(sizeof(*ifi));
+ if (len < 0)
+ goto get_failed;
+
+ parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
+
+ if (!tb[IFLA_LINKINFO])
+ goto get_failed;
+
+ parse_rtattr_nested(linkinfo, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
+
+ if (!linkinfo[IFLA_INFO_DATA])
+ goto get_failed;
+
+ parse_rtattr_nested(iptuninfo, IFLA_IPTUN_MAX,
+ linkinfo[IFLA_INFO_DATA]);
+
+ if (iptuninfo[IFLA_IPTUN_LOCAL])
+ laddr = rta_getattr_u32(iptuninfo[IFLA_IPTUN_LOCAL]);
+
+ if (iptuninfo[IFLA_IPTUN_REMOTE])
+ raddr = rta_getattr_u32(iptuninfo[IFLA_IPTUN_REMOTE]);
+
+ if (iptuninfo[IFLA_IPTUN_TTL])
+ ttl = rta_getattr_u8(iptuninfo[IFLA_IPTUN_TTL]);
+
+ if (iptuninfo[IFLA_IPTUN_TOS])
+ tos = rta_getattr_u8(iptuninfo[IFLA_IPTUN_TOS]);
+
+ if (iptuninfo[IFLA_IPTUN_PMTUDISC])
+ pmtudisc =
+ rta_getattr_u8(iptuninfo[IFLA_IPTUN_PMTUDISC]);
+
+ if (iptuninfo[IFLA_IPTUN_FLAGS])
+ iflags = rta_getattr_u16(iptuninfo[IFLA_IPTUN_FLAGS]);
+
+ if (iptuninfo[IFLA_IPTUN_LINK])
+ link = rta_getattr_u32(iptuninfo[IFLA_IPTUN_LINK]);
+
+ if (iptuninfo[IFLA_IPTUN_6RD_PREFIX])
+ memcpy(&ip6rdprefix,
+ RTA_DATA(iptuninfo[IFLA_IPTUN_6RD_PREFIX]),
+ sizeof(ip6rdprefix));
+
+ if (iptuninfo[IFLA_IPTUN_6RD_PREFIXLEN])
+ ip6rdprefixlen =
+ rta_getattr_u16(iptuninfo[IFLA_IPTUN_6RD_PREFIXLEN]);
+
+ if (iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIX])
+ ip6rdrelayprefix =
+ rta_getattr_u32(iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIX]);
+
+ if (iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIXLEN])
+ ip6rdrelayprefixlen =
+ rta_getattr_u16(iptuninfo[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
+ }
+
+ while (argc > 0) {
+ if (strcmp(*argv, "remote") == 0) {
+ NEXT_ARG();
+ if (strcmp(*argv, "any"))
+ raddr = get_addr32(*argv);
+ else
+ raddr = 0;
+ } else if (strcmp(*argv, "local") == 0) {
+ NEXT_ARG();
+ if (strcmp(*argv, "any"))
+ laddr = get_addr32(*argv);
+ else
+ laddr = 0;
+ } else if (matches(*argv, "dev") == 0) {
+ NEXT_ARG();
+ link = if_nametoindex(*argv);
+ if (link == 0)
+ invarg("\"dev\" is invalid", *argv);
+ } else if (strcmp(*argv, "ttl") == 0 ||
+ strcmp(*argv, "hoplimit") == 0) {
+ NEXT_ARG();
+ if (strcmp(*argv, "inherit") != 0) {
+ if (get_u8(&ttl, *argv, 0))
+ invarg("invalid TTL\n", *argv);
+ } else
+ ttl = 0;
+ } else if (strcmp(*argv, "tos") == 0 ||
+ strcmp(*argv, "tclass") == 0 ||
+ matches(*argv, "dsfield") == 0) {
+ __u32 uval;
+ NEXT_ARG();
+ if (strcmp(*argv, "inherit") != 0) {
+ if (rtnl_dsfield_a2n(&uval, *argv))
+ invarg("bad TOS value", *argv);
+ tos = uval;
+ } else
+ tos = 1;
+ } else if (strcmp(*argv, "nopmtudisc") == 0) {
+ pmtudisc = 0;
+ } else if (strcmp(*argv, "pmtudisc") == 0) {
+ pmtudisc = 1;
+ } else if (strcmp(lu->id, "sit") == 0 &&
+ strcmp(*argv, "isatap") == 0) {
+ iflags |= SIT_ISATAP;
+ } else if (strcmp(*argv, "6rd-prefix") == 0) {
+ inet_prefix prefix;
+ NEXT_ARG();
+ if (get_prefix(&prefix, *argv, AF_INET6))
+ invarg("invalid 6rd-prefix\n", *argv);
+ memcpy(&ip6rdprefix, prefix.data, 16);
+ ip6rdprefixlen = prefix.bitlen;
+ } else if (strcmp(*argv, "6rd-relay_prefix") == 0) {
+ inet_prefix prefix;
+ NEXT_ARG();
+ if (get_prefix(&prefix, *argv, AF_INET))
+ invarg("invalid 6rd-relay_prefix\n", *argv);
+ memcpy(&ip6rdrelayprefix, prefix.data, 4);
+ ip6rdrelayprefixlen = prefix.bitlen;
+ } else if (strcmp(*argv, "6rd-reset") == 0) {
+ inet_prefix prefix;
+ get_prefix(&prefix, "2002::", AF_INET6);
+ memcpy(&ip6rdprefix, prefix.data, 16);
+ ip6rdprefixlen = 16;
+ ip6rdrelayprefix = 0;
+ ip6rdrelayprefixlen = 0;
+ } else
+ usage(strcmp(lu->id, "sit") == 0);
+ argc--, argv++;
+ }
+
+ if (ttl && pmtudisc == 0) {
+ fprintf(stderr, "ttl != 0 and nopmtudisc are incompatible\n");
+ exit(-1);
+ }
+
+ addattr32(n, 1024, IFLA_IPTUN_LINK, link);
+ addattr32(n, 1024, IFLA_IPTUN_LOCAL, laddr);
+ addattr32(n, 1024, IFLA_IPTUN_REMOTE, raddr);
+ addattr8(n, 1024, IFLA_IPTUN_TTL, ttl);
+ addattr8(n, 1024, IFLA_IPTUN_TOS, tos);
+ addattr8(n, 1024, IFLA_IPTUN_PMTUDISC, pmtudisc);
+ if (strcmp(lu->id, "sit") == 0) {
+ addattr16(n, 1024, IFLA_IPTUN_FLAGS, iflags);
+ if (ip6rdprefixlen) {
+ addattr_l(n, 1024, IFLA_IPTUN_6RD_PREFIX,
+ &ip6rdprefix, sizeof(ip6rdprefix));
+ addattr16(n, 1024, IFLA_IPTUN_6RD_PREFIXLEN,
+ ip6rdprefixlen);
+ addattr32(n, 1024, IFLA_IPTUN_6RD_RELAY_PREFIX,
+ ip6rdrelayprefix);
+ addattr16(n, 1024, IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
+ ip6rdrelayprefixlen);
+ }
+ }
+
+ return 0;
+}
+
+static void iptunnel_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+ char s1[1024];
+ char s2[64];
+ const char *local = "any";
+ const char *remote = "any";
+
+ if (!tb)
+ return;
+
+ if (tb[IFLA_IPTUN_REMOTE]) {
+ unsigned addr = rta_getattr_u32(tb[IFLA_IPTUN_REMOTE]);
+
+ if (addr)
+ remote = format_host(AF_INET, 4, &addr, s1, sizeof(s1));
+ }
+
+ fprintf(f, "remote %s ", remote);
+
+ if (tb[IFLA_IPTUN_LOCAL]) {
+ unsigned addr = rta_getattr_u32(tb[IFLA_IPTUN_LOCAL]);
+
+ if (addr)
+ local = format_host(AF_INET, 4, &addr, s1, sizeof(s1));
+ }
+
+ fprintf(f, "local %s ", local);
+
+ if (tb[IFLA_IPTUN_LINK] && rta_getattr_u32(tb[IFLA_IPTUN_LINK])) {
+ unsigned link = rta_getattr_u32(tb[IFLA_IPTUN_LINK]);
+ const char *n = if_indextoname(link, s2);
+
+ if (n)
+ fprintf(f, "dev %s ", n);
+ else
+ fprintf(f, "dev %u ", link);
+ }
+
+ if (tb[IFLA_IPTUN_TTL] && rta_getattr_u8(tb[IFLA_IPTUN_TTL]))
+ fprintf(f, "ttl %d ", rta_getattr_u8(tb[IFLA_IPTUN_TTL]));
+ else
+ fprintf(f, "ttl inherit ");
+
+ if (tb[IFLA_IPTUN_TOS] && rta_getattr_u8(tb[IFLA_IPTUN_TOS])) {
+ int tos = rta_getattr_u8(tb[IFLA_IPTUN_TOS]);
+
+ fputs("tos ", f);
+ if (tos == 1)
+ fputs("inherit ", f);
+ else
+ fprintf(f, "0x%x ", tos);
+ }
+
+ if (tb[IFLA_IPTUN_PMTUDISC] && rta_getattr_u8(tb[IFLA_IPTUN_PMTUDISC]))
+ fprintf(f, "pmtudisc ");
+ else
+ fprintf(f, "nopmtudisc ");
+
+ if (tb[IFLA_IPTUN_FLAGS]) {
+ __u16 iflags = rta_getattr_u16(tb[IFLA_IPTUN_FLAGS]);
+
+ if (iflags & SIT_ISATAP)
+ fprintf(f, "isatap ");
+ }
+
+ if (tb[IFLA_IPTUN_6RD_PREFIXLEN] &&
+ *(__u16 *)RTA_DATA(tb[IFLA_IPTUN_6RD_PREFIXLEN])) {
+ __u16 prefixlen = rta_getattr_u16(tb[IFLA_IPTUN_6RD_PREFIXLEN]);
+ __u16 relayprefixlen =
+ rta_getattr_u16(tb[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
+ __u32 relayprefix =
+ rta_getattr_u32(tb[IFLA_IPTUN_6RD_RELAY_PREFIX]);
+
+ fprintf(f, "6rd-prefix %s/%u ",
+ inet_ntop(AF_INET6, RTA_DATA(tb[IFLA_IPTUN_6RD_PREFIX]),
+ s1, sizeof(s1)),
+ prefixlen);
+ if (relayprefix) {
+ fprintf(f, "6rd-relay_prefix %s/%u ",
+ format_host(AF_INET, 4, &relayprefix, s1,
+ sizeof(s1)),
+ relayprefixlen);
+ }
+ }
+}
+
+struct link_util ipip_link_util = {
+ .id = "ipip",
+ .maxattr = IFLA_IPTUN_MAX,
+ .parse_opt = iptunnel_parse_opt,
+ .print_opt = iptunnel_print_opt,
+};
+
+struct link_util sit_link_util = {
+ .id = "sit",
+ .maxattr = IFLA_IPTUN_MAX,
+ .parse_opt = iptunnel_parse_opt,
+ .print_opt = iptunnel_print_opt,
+};
--
1.8.0.1
* Re: [PATCH] net: filter: return -EINVAL if BPF_S_ANC* operation is not supported
From: Daniel Borkmann @ 2012-12-12 9:38 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Ani Sinha, Eric Dumazet
In-Reply-To: <1355304701-22228-1-git-send-email-dborkman@redhat.com>
On 12/12/2012 10:31 AM, Daniel Borkmann wrote:
> Currently, we return -EINVAL for malicious or wrong BPF filters.
> However, this is not done for BPF_S_ANC* operations, which makes it
> more difficult to detect if it's actually supported or not by the
> BPF machine. Therefore, we should also return -EINVAL if K is within
> the SKF_AD_OFF universe and the ancillary operation did not match.
>
> Cc: Ani Sinha <ani@aristanetworks.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
(Sorry, this is intended for net-next.)
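The check described in the quoted commit message can be sketched in standalone form as follows. This is only an illustration and not the kernel code: the EX_-prefixed constants and the function check_ancillary() are hypothetical names, the constant values are assumed to mirror SKF_AD_OFF and a few SKF_AD_* offsets from linux/filter.h, and only three ancillary operations are treated as supported.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed to mirror linux/filter.h; names are local to this sketch. */
#define EX_SKF_AD_OFF      (-0x1000)
#define EX_SKF_AD_PROTOCOL 0
#define EX_SKF_AD_PKTTYPE  4
#define EX_SKF_AD_IFINDEX  8

/*
 * Classify the K operand of an absolute load instruction.
 * Returns 1 if K names an ancillary operation this sketch supports,
 * 0 if K is an ordinary packet offset, and -EINVAL if K lies in the
 * SKF_AD_OFF range but matches no supported ancillary operation.
 */
static int check_ancillary(uint32_t k)
{
	switch (k) {
	case (uint32_t)(EX_SKF_AD_OFF + EX_SKF_AD_PROTOCOL):
	case (uint32_t)(EX_SKF_AD_OFF + EX_SKF_AD_PKTTYPE):
	case (uint32_t)(EX_SKF_AD_OFF + EX_SKF_AD_IFINDEX):
		return 1;
	}

	/* Inside the ancillary range, but nothing matched above. */
	if (k >= (uint32_t)EX_SKF_AD_OFF)
		return -EINVAL;

	return 0;
}
```

With such a check, user space could probe for an ancillary operation by attaching a one-instruction filter and testing for -EINVAL, instead of silently getting a plain load from a bogus offset.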
* [PATCH iproute2] ip: use rtnetlink to manage mroute
From: Nicolas Dichtel @ 2012-12-12 9:32 UTC (permalink / raw)
To: shemminger; +Cc: netdev, Nicolas Dichtel
'ip mroute' previously read /proc/net/ip_mr_[vif|cache] to display mroute
entries, so only RT_TABLE_DEFAULT and only IPv4 entries were shown.
With rtnetlink, all tables can be displayed for both IPv4 and IPv6. The output
format is unchanged and, as before the patch, statistics are displayed when the
user specifies the '-s' argument.
The patch also adds support for 'ip monitor mroute', which is now possible.
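The statistics line printed with '-s' is built from three 64-bit counters carried in the RTA_MFC_STATS attribute. A minimal standalone sketch of that formatting follows; struct mfc_stats_example and format_mfc_stats() are hypothetical names used only here, and the struct merely mirrors the three fields of struct rta_mfc_stats added by the patch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-in for the three counters in struct rta_mfc_stats. */
struct mfc_stats_example {
	uint64_t packets;
	uint64_t bytes;
	uint64_t wrong_if;
};

/*
 * Format the counters the same way the '-s' output does. The
 * "wrong iif" part is appended only when that counter is non-zero.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_mfc_stats(char *buf, size_t len,
			    const struct mfc_stats_example *s)
{
	int n, m;

	n = snprintf(buf, len, "%" PRIu64 " packets, %" PRIu64 " bytes",
		     s->packets, s->bytes);
	if (n < 0 || (size_t)n >= len)
		return -1;

	if (s->wrong_if) {
		m = snprintf(buf + n, len - (size_t)n,
			     ", %" PRIu64 " arrived on wrong iif.",
			     s->wrong_if);
		if (m < 0 || (size_t)m >= len - (size_t)n)
			return -1;
		n += m;
	}

	return n;
}
```

Using the PRIu64 macros keeps the format string correct on both 32-bit and 64-bit builds, which is why the patch pulls in <inttypes.h> for the new 64-bit counters.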
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
include/linux/rtnetlink.h | 7 ++
ip/ip_common.h | 3 +
ip/ipmonitor.c | 35 +++++-
ip/ipmroute.c | 296 ++++++++++++++++++++++++++++------------------
4 files changed, 220 insertions(+), 121 deletions(-)
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 0e3e0c1..e0595dc 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -283,6 +283,7 @@ enum rtattr_type_t {
RTA_MP_ALGO, /* no longer used */
RTA_TABLE,
RTA_MARK,
+ RTA_MFC_STATS,
__RTA_MAX
};
@@ -403,6 +404,12 @@ struct rta_session {
} u;
};
+struct rta_mfc_stats {
+ __u64 mfcs_packets;
+ __u64 mfcs_bytes;
+ __u64 mfcs_wrong_if;
+};
+
/****
* General form of address family dependent message.
****/
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 2fd66b7..57653b5 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -16,11 +16,14 @@ extern int ipaddr_list_link(int argc, char **argv);
extern int iproute_monitor(int argc, char **argv);
extern void iplink_usage(void) __attribute__((noreturn));
extern void iproute_reset_filter(void);
+extern void ipmroute_reset_filter(void);
extern void ipaddr_reset_filter(int);
extern void ipneigh_reset_filter(void);
extern void ipntable_reset_filter(void);
extern int print_route(const struct sockaddr_nl *who,
struct nlmsghdr *n, void *arg);
+extern int print_mroute(const struct sockaddr_nl *who,
+ struct nlmsghdr *n, void *arg);
extern int print_prefix(const struct sockaddr_nl *who,
struct nlmsghdr *n, void *arg);
extern int print_rule(const struct sockaddr_nl *who,
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 4b1d469..39bfb8e 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -43,10 +43,26 @@ int accept_msg(const struct sockaddr_nl *who,
print_timestamp(fp);
if (n->nlmsg_type == RTM_NEWROUTE || n->nlmsg_type == RTM_DELROUTE) {
- if (prefix_banner)
- fprintf(fp, "[ROUTE]");
- print_route(who, n, arg);
- return 0;
+ struct rtmsg *r = NLMSG_DATA(n);
+
+ if (n->nlmsg_len - NLMSG_LENGTH(sizeof(*r)) < 0) {
+ fprintf(stderr, "BUG: wrong nlmsg len %d\n",
+ n->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
+ return -1;
+ }
+
+ if (r->rtm_family == RTNL_FAMILY_IPMR ||
+ r->rtm_family == RTNL_FAMILY_IP6MR) {
+ if (prefix_banner)
+ fprintf(fp, "[MROUTE]");
+ print_mroute(who, n, arg);
+ return 0;
+ } else {
+ if (prefix_banner)
+ fprintf(fp, "[ROUTE]");
+ print_route(who, n, arg);
+ return 0;
+ }
}
if (n->nlmsg_type == RTM_NEWLINK || n->nlmsg_type == RTM_DELLINK) {
ll_remember_index(who, n, NULL);
@@ -116,12 +132,14 @@ int do_ipmonitor(int argc, char **argv)
int llink=0;
int laddr=0;
int lroute=0;
+ int lmroute=0;
int lprefix=0;
int lneigh=0;
rtnl_close(&rth);
ipaddr_reset_filter(1);
iproute_reset_filter();
+ ipmroute_reset_filter();
ipneigh_reset_filter();
while (argc > 0) {
@@ -137,6 +155,9 @@ int do_ipmonitor(int argc, char **argv)
} else if (matches(*argv, "route") == 0) {
lroute=1;
groups = 0;
+ } else if (matches(*argv, "mroute") == 0) {
+ lmroute=1;
+ groups = 0;
} else if (matches(*argv, "prefix") == 0) {
lprefix=1;
groups = 0;
@@ -169,6 +190,12 @@ int do_ipmonitor(int argc, char **argv)
if (!preferred_family || preferred_family == AF_INET6)
groups |= nl_mgrp(RTNLGRP_IPV6_ROUTE);
}
+ if (lmroute) {
+ if (!preferred_family || preferred_family == AF_INET)
+ groups |= nl_mgrp(RTNLGRP_IPV4_MROUTE);
+ if (!preferred_family || preferred_family == AF_INET6)
+ groups |= nl_mgrp(RTNLGRP_IPV6_MROUTE);
+ }
if (lprefix) {
if (!preferred_family || preferred_family == AF_INET6)
groups |= nl_mgrp(RTNLGRP_IPV6_PREFIX);
diff --git a/ip/ipmroute.c b/ip/ipmroute.c
index 945727d..4c82c6e 100644
--- a/ip/ipmroute.c
+++ b/ip/ipmroute.c
@@ -15,6 +15,7 @@
#include <unistd.h>
#include <syslog.h>
#include <fcntl.h>
+#include <inttypes.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <netinet/in.h>
@@ -26,167 +27,228 @@
#include <linux/if_arp.h>
#include <linux/sockios.h>
+#include <rt_names.h>
#include "utils.h"
-
-char filter_dev[16];
-int filter_family;
+#include "ip_common.h"
static void usage(void) __attribute__((noreturn));
static void usage(void)
{
- fprintf(stderr, "Usage: ip mroute show [ PREFIX ] [ from PREFIX ] [ iif DEVICE ]\n");
+ fprintf(stderr, "Usage: ip mroute show [ [ to ] PREFIX ] [ from PREFIX ] [ iif DEVICE ]\n");
#if 0
fprintf(stderr, "Usage: ip mroute [ add | del ] DESTINATION from SOURCE [ iif DEVICE ] [ oif DEVICE ]\n");
#endif
exit(-1);
}
-static char *viftable[32];
-
struct rtfilter
{
+ int tb;
+ int af;
+ int iif;
inet_prefix mdst;
inet_prefix msrc;
} filter;
-static void read_viftable(void)
+int print_mroute(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
{
- char buf[256];
- FILE *fp = fopen("/proc/net/ip_mr_vif", "r");
-
- if (!fp)
- return;
-
- if (!fgets(buf, sizeof(buf), fp)) {
- fclose(fp);
- return;
+ FILE *fp = (FILE*)arg;
+ struct rtmsg *r = NLMSG_DATA(n);
+ int len = n->nlmsg_len;
+ struct rtattr * tb[RTA_MAX+1];
+ char abuf[256];
+ char obuf[256];
+ SPRINT_BUF(b1);
+ __u32 table;
+ int iif = 0;
+ int family;
+
+ if ((n->nlmsg_type != RTM_NEWROUTE &&
+ n->nlmsg_type != RTM_DELROUTE) ||
+ !(n->nlmsg_flags & NLM_F_MULTI)) {
+ fprintf(stderr, "Not a multicast route: %08x %08x %08x\n",
+ n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
+ return 0;
}
- while (fgets(buf, sizeof(buf), fp)) {
- int vifi;
- char dev[256];
-
- if (sscanf(buf, "%d%s", &vifi, dev) < 2)
- continue;
-
- if (vifi<0 || vifi>31)
- continue;
-
- viftable[vifi] = strdup(dev);
+ len -= NLMSG_LENGTH(sizeof(*r));
+ if (len < 0) {
+ fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
+ return -1;
}
- fclose(fp);
-}
-
-static void read_mroute_list(FILE *ofp)
-{
- char buf[256];
- FILE *fp = fopen("/proc/net/ip_mr_cache", "r");
-
- if (!fp)
- return;
-
- if (!fgets(buf, sizeof(buf), fp)) {
- fclose(fp);
- return;
+ if (r->rtm_type != RTN_MULTICAST) {
+ fprintf(stderr, "Not a multicast route (type: %s)\n",
+ rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1)));
+ return 0;
}
- while (fgets(buf, sizeof(buf), fp)) {
- inet_prefix maddr, msrc;
- unsigned pkts, b, w;
- int vifi;
- char oiflist[256];
- char sbuf[256];
- char mbuf[256];
- char obuf[256];
-
- oiflist[0] = 0;
- if (sscanf(buf, "%x%x%d%u%u%u %[^\n]",
- maddr.data, msrc.data, &vifi,
- &pkts, &b, &w, oiflist) < 6)
- continue;
-
- if (vifi!=-1 && (vifi < 0 || vifi>31))
- continue;
-
- if (filter_dev[0] && (vifi<0 || strcmp(filter_dev, viftable[vifi])))
- continue;
- if (filter.mdst.family && inet_addr_match(&maddr, &filter.mdst, filter.mdst.bitlen))
- continue;
- if (filter.msrc.family && inet_addr_match(&msrc, &filter.msrc, filter.msrc.bitlen))
- continue;
-
- snprintf(obuf, sizeof(obuf), "(%s, %s)",
- format_host(AF_INET, 4, &msrc.data[0], sbuf, sizeof(sbuf)),
- format_host(AF_INET, 4, &maddr.data[0], mbuf, sizeof(mbuf)));
-
- fprintf(ofp, "%-32s Iif: ", obuf);
-
- if (vifi == -1)
- fprintf(ofp, "unresolved ");
- else
- fprintf(ofp, "%-10s ", viftable[vifi]);
-
- if (oiflist[0]) {
- char *next = NULL;
- char *p = oiflist;
- int ovifi, ottl;
-
- fprintf(ofp, "Oifs: ");
-
- while (p) {
- next = strchr(p, ' ');
- if (next) {
- *next = 0;
- next++;
- }
- if (sscanf(p, "%d:%d", &ovifi, &ottl)<2) {
- p = next;
- continue;
- }
- p = next;
-
- fprintf(ofp, "%s", viftable[ovifi]);
- if (ottl>1)
- fprintf(ofp, "(ttl %d) ", ovifi);
- else
- fprintf(ofp, " ");
+ parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
+ table = rtm_get_table(r, tb);
+
+ if (filter.tb > 0 && filter.tb != table)
+ return 0;
+
+ if (tb[RTA_IIF])
+ iif = *(int*)RTA_DATA(tb[RTA_IIF]);
+ if (filter.iif && filter.iif != iif)
+ return 0;
+
+ if (filter.af && filter.af != r->rtm_family)
+ return 0;
+
+ if (tb[RTA_DST] &&
+ filter.mdst.bitlen > 0 &&
+ inet_addr_match(RTA_DATA(tb[RTA_DST]), &filter.mdst, filter.mdst.bitlen))
+ return 0;
+
+ if (tb[RTA_SRC] &&
+ filter.msrc.bitlen > 0 &&
+ inet_addr_match(RTA_DATA(tb[RTA_SRC]), &filter.msrc, filter.msrc.bitlen))
+ return 0;
+
+ family = r->rtm_family == RTNL_FAMILY_IPMR ? AF_INET : AF_INET6;
+
+ if (n->nlmsg_type == RTM_DELROUTE)
+ fprintf(fp, "Deleted ");
+
+ if (tb[RTA_SRC])
+ len = snprintf(obuf, sizeof(obuf),
+ "(%s, ", rt_addr_n2a(family,
+ RTA_PAYLOAD(tb[RTA_SRC]),
+ RTA_DATA(tb[RTA_SRC]),
+ abuf, sizeof(abuf)));
+ else
+ len = sprintf(obuf, "(unknown, ");
+ if (tb[RTA_DST])
+ snprintf(obuf + len, sizeof(obuf) - len,
+ "%s)", rt_addr_n2a(family, RTA_PAYLOAD(tb[RTA_DST]),
+ RTA_DATA(tb[RTA_DST]),
+ abuf, sizeof(abuf)));
+ else
+ snprintf(obuf + len, sizeof(obuf) - len, "unknown) ");
+
+ fprintf(fp, "%-32s Iif: ", obuf);
+ if (iif)
+ fprintf(fp, "%-10s ", ll_index_to_name(iif));
+ else
+ fprintf(fp, "unresolved ");
+
+ if (tb[RTA_MULTIPATH]) {
+ struct rtnexthop *nh = RTA_DATA(tb[RTA_MULTIPATH]);
+ int first = 1;
+
+ len = RTA_PAYLOAD(tb[RTA_MULTIPATH]);
+
+ for (;;) {
+ if (len < sizeof(*nh))
+ break;
+ if (nh->rtnh_len > len)
+ break;
+
+ if (first) {
+ fprintf(fp, "Oifs: ");
+ first = 0;
}
+ fprintf(fp, "%s", ll_index_to_name(nh->rtnh_ifindex));
+ if (nh->rtnh_hops > 1)
+ fprintf(fp, "(ttl %d) ", nh->rtnh_hops);
+ else
+ fprintf(fp, " ");
+ len -= NLMSG_ALIGN(nh->rtnh_len);
+ nh = RTNH_NEXT(nh);
}
-
- if (show_stats && b) {
- fprintf(ofp, "%s %u packets, %u bytes", _SL_, pkts, b);
- if (w)
- fprintf(ofp, ", %u arrived on wrong iif.", w);
- }
- fprintf(ofp, "\n");
}
- fclose(fp);
+ if (show_stats && tb[RTA_MFC_STATS]) {
+ struct rta_mfc_stats *mfcs = RTA_DATA(tb[RTA_MFC_STATS]);
+
+ fprintf(fp, "%s %"PRIu64" packets, %"PRIu64" bytes", _SL_,
+ mfcs->mfcs_packets, mfcs->mfcs_bytes);
+ if (mfcs->mfcs_wrong_if)
+ fprintf(fp, ", %"PRIu64" arrived on wrong iif.",
+ mfcs->mfcs_wrong_if);
+ }
+ fprintf(fp, "\n");
+ fflush(fp);
+ return 0;
}
+void ipmroute_reset_filter(void)
+{
+ memset(&filter, 0, sizeof(filter));
+ filter.mdst.bitlen = -1;
+ filter.msrc.bitlen = -1;
+}
static int mroute_list(int argc, char **argv)
{
+ char *id = NULL;
+ int family;
+
+ ipmroute_reset_filter();
+ if (preferred_family == AF_UNSPEC)
+ family = AF_INET;
+ else
+ family = AF_INET6;
+ if (family == AF_INET) {
+ filter.af = RTNL_FAMILY_IPMR;
+ filter.tb = RT_TABLE_DEFAULT; /* for backward compatibility */
+ } else
+ filter.af = RTNL_FAMILY_IP6MR;
+
while (argc > 0) {
- if (strcmp(*argv, "iif") == 0) {
+ if (matches(*argv, "table") == 0) {
+ __u32 tid;
NEXT_ARG();
- strncpy(filter_dev, *argv, sizeof(filter_dev)-1);
+ if (rtnl_rttable_a2n(&tid, *argv)) {
+ if (strcmp(*argv, "all") == 0) {
+ filter.tb = 0;
+ } else if (strcmp(*argv, "help") == 0) {
+ usage();
+ } else {
+ invarg("table id value is invalid\n", *argv);
+ }
+ } else
+ filter.tb = tid;
+ } else if (strcmp(*argv, "iif") == 0) {
+ NEXT_ARG();
+ id = *argv;
} else if (matches(*argv, "from") == 0) {
NEXT_ARG();
- get_prefix(&filter.msrc, *argv, AF_INET);
+ get_prefix(&filter.msrc, *argv, family);
} else {
if (strcmp(*argv, "to") == 0) {
NEXT_ARG();
}
if (matches(*argv, "help") == 0)
usage();
- get_prefix(&filter.mdst, *argv, AF_INET);
+ get_prefix(&filter.mdst, *argv, family);
}
- argv++; argc--;
+ argc--; argv++;
}
- read_viftable();
- read_mroute_list(stdout);
- return 0;
+ ll_init_map(&rth);
+
+ if (id) {
+ int idx;
+
+ if ((idx = ll_name_to_index(id)) == 0) {
+ fprintf(stderr, "Cannot find device \"%s\"\n", id);
+ return -1;
+ }
+ filter.iif = idx;
+ }
+
+ if (rtnl_wilddump_request(&rth, filter.af, RTM_GETROUTE) < 0) {
+ perror("Cannot send dump request");
+ return 1;
+ }
+
+ if (rtnl_dump_filter(&rth, print_mroute, stdout) < 0) {
+ fprintf(stderr, "Dump terminated\n");
+ exit(1);
+ }
+
+ exit(0);
}
int do_multiroute(int argc, char **argv)
--
1.8.0.1
^ permalink raw reply related
* [PATCH] net: filter: return -EINVAL if BPF_S_ANC* operation is not supported
From: Daniel Borkmann @ 2012-12-12 9:31 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Daniel Borkmann, Ani Sinha, Eric Dumazet
Currently, we return -EINVAL for malicious or wrong BPF filters.
However, this is not done for BPF_S_ANC* operations, which makes it
more difficult to detect if it's actually supported or not by the
BPF machine. Therefore, we should also return -EINVAL if K is within
the SKF_AD_OFF universe and the ancillary operation did not match.
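
The intended check can be sketched outside the kernel as follows. This is
only an illustration of the idea, not the code from the patch: the demo_*
names and the reduced set of ancillary operations are assumptions made for
the example, while the SKF_AD_* values are meant to mirror linux/filter.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values intended to mirror <linux/filter.h>; treat them as illustrative. */
#define SKF_AD_OFF        (-0x1000)
#define SKF_AD_PROTOCOL   0
#define SKF_AD_PKTTYPE    4

/* Hypothetical stand-ins for the kernel's internal BPF_S_* opcodes. */
enum demo_code {
	DEMO_LD_W_ABS = 1,
	DEMO_ANC_PROTOCOL,
	DEMO_ANC_PKTTYPE,
};

/*
 * Translate the K operand of an absolute load. Returns the internal
 * opcode, or -EINVAL when K lies in the ancillary range but names no
 * operation known to this checker.
 */
static int demo_check_ld_abs(uint32_t k)
{
	int anc_found = 0;
	int code = DEMO_LD_W_ABS;

	switch (k) {
	case (uint32_t)(SKF_AD_OFF + SKF_AD_PROTOCOL):
		code = DEMO_ANC_PROTOCOL;
		anc_found = 1;
		break;
	case (uint32_t)(SKF_AD_OFF + SKF_AD_PKTTYPE):
		code = DEMO_ANC_PKTTYPE;
		anc_found = 1;
		break;
	}

	/* K is unsigned, so everything at or above SKF_AD_OFF is ancillary. */
	if (!anc_found && k >= (uint32_t)SKF_AD_OFF)
		return -EINVAL;

	return code;
}
```

An ordinary packet offset such as 14 still maps to the plain load, while an
offset inside the ancillary range that matches no known operation is
rejected instead of being silently accepted.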
Cc: Ani Sinha <ani@aristanetworks.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
net/core/filter.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index c23543c..de9bed4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -531,7 +531,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
[BPF_JMP|BPF_JSET|BPF_K] = BPF_S_JMP_JSET_K,
[BPF_JMP|BPF_JSET|BPF_X] = BPF_S_JMP_JSET_X,
};
- int pc;
+ int pc, anc_found;
if (flen == 0 || flen > BPF_MAXINSNS)
return -EINVAL;
@@ -592,8 +592,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
case BPF_S_LD_W_ABS:
case BPF_S_LD_H_ABS:
case BPF_S_LD_B_ABS:
+ anc_found = 0;
#define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \
code = BPF_S_ANC_##CODE; \
+ anc_found = 1; \
break
switch (ftest->k) {
ANCILLARY(PROTOCOL);
@@ -610,6 +612,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
ANCILLARY(VLAN_TAG);
ANCILLARY(VLAN_TAG_PRESENT);
}
+
+ /* ancillary operation unknown or unsupported */
+ if (anc_found == 0 && ftest->k >= SKF_AD_OFF)
+ return -EINVAL;
}
ftest->code = code;
}
--
1.7.11.7
^ permalink raw reply related
* Re: [RFC PATCH v2 3/3] tun: fix LSM/SELinux labeling of tun/tap devices
From: Michael S. Tsirkin @ 2012-12-12 9:22 UTC (permalink / raw)
To: Paul Moore; +Cc: netdev, linux-security-module, selinux, jasowang
In-Reply-To: <20121205202619.18626.98778.stgit@localhost>
On Wed, Dec 05, 2012 at 03:26:19PM -0500, Paul Moore wrote:
> This patch corrects some problems with LSM/SELinux that were introduced
> with the multiqueue patchset. The problem stems from the fact that the
> multiqueue work changed the relationship between the tun device and its
> associated socket; before the socket persisted for the life of the
> device, however after the multiqueue changes the socket only persisted
> for the life of the userspace connection (fd open). For non-persistent
> devices this is not an issue, but for persistent devices this can cause
> the tun device to lose its SELinux label.
>
> We correct this problem by adding an opaque LSM security blob to the
> tun device struct which allows us to have the LSM security state, e.g.
> SELinux labeling information, persist for the lifetime of the tun
> device. In the process we tweak the LSM hooks to work with this new
> approach to TUN device/socket labeling and introduce a new LSM hook,
> security_tun_dev_create_queue(), to approve requests to create a new
> TUN queue via TUNSETQUEUE.
>
> The SELinux code has been adjusted to match the new LSM hooks, the
> other LSMs do not make use of the LSM TUN controls. This patch makes
> use of the recently added "tun_socket:create_queue" permission to
> restrict access to the TUNSETQUEUE operation. On older SELinux
> policies which do not define the "tun_socket:create_queue" permission
> the access control decision for TUNSETQUEUE will be handled according
> to the SELinux policy's unknown permission setting.
>
> Signed-off-by: Paul Moore <pmoore@redhat.com>
> ---
> drivers/net/tun.c | 26 +++++++++++++---
> include/linux/security.h | 59 +++++++++++++++++++++++++++++--------
> security/capability.c | 24 +++++++++++++--
> security/security.c | 28 ++++++++++++++----
> security/selinux/hooks.c | 50 ++++++++++++++++++++++++-------
> security/selinux/include/objsec.h | 4 +++
> 6 files changed, 153 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 14a0454..fb8148b 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -182,6 +182,7 @@ struct tun_struct {
> struct hlist_head flows[TUN_NUM_FLOW_ENTRIES];
> struct timer_list flow_gc_timer;
> unsigned long ageing_time;
> + void *security;
> };
>
> static inline u32 tun_hashfn(u32 rxhash)
> @@ -465,6 +466,10 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
> struct tun_file *tfile = file->private_data;
> int err;
>
> + err = security_tun_dev_attach(tfile->socket.sk, tun->security);
> + if (err < 0)
> + goto out;
> +
> err = -EINVAL;
> if (rcu_dereference_protected(tfile->tun, lockdep_rtnl_is_held()))
> goto out;
This hook triggers with both set_queue and set_iff,
and it also seems to trigger when attaching to a
persistent device and when creating a new one. But I
believe we might want to be able to allow one but not the other.
For example:
- we might want to allow qemu to do set_queue but not set_iff
- we might want to configure persistent devices and
prevent a user from adding new ones
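
To illustrate the kind of distinction I mean, here is a rough sketch. It is
not a proposal for the actual hook signatures: every name below (the
demo_tun_* enum, struct and function) is made up for the example and none
of it exists in the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical operation labels; none of these exist in the patch. */
enum demo_tun_op {
	DEMO_TUN_CREATE_DEV   = 1 << 0, /* set_iff creating a new device */
	DEMO_TUN_ATTACH_DEV   = 1 << 1, /* set_iff on a persistent device */
	DEMO_TUN_CREATE_QUEUE = 1 << 2, /* set_queue */
};

/* Hypothetical per-task policy: a bitmask of permitted operations. */
struct demo_tun_policy {
	unsigned int allowed;
};

/* Returns 0 if the policy permits op, -EPERM otherwise. */
static int demo_tun_check(const struct demo_tun_policy *p,
			  enum demo_tun_op op)
{
	return (p->allowed & op) ? 0 : -EPERM;
}
```

With the operations kept apart like this, a policy could grant qemu only
DEMO_TUN_CREATE_QUEUE, or grant a user only DEMO_TUN_ATTACH_DEV, which a
single hook shared by all three paths cannot express.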
> @@ -1348,6 +1353,7 @@ static void tun_free_netdev(struct net_device *dev)
> struct tun_struct *tun = netdev_priv(dev);
>
> tun_flow_uninit(tun);
> + security_tun_dev_free_security(tun->security);
> free_netdev(dev);
> }
>
> @@ -1534,7 +1540,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>
> if (tun_not_capable(tun))
> return -EPERM;
> - err = security_tun_dev_attach(tfile->socket.sk);
> + err = security_tun_dev_open(tun->security);
> if (err < 0)
> return err;
>
> @@ -1587,7 +1593,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>
> spin_lock_init(&tun->lock);
>
> - security_tun_dev_post_create(&tfile->sk);
> + err = security_tun_dev_alloc_security(&tun->security);
> + if (err < 0)
> + goto err_free_dev;
>
> tun_net_init(dev);
>
> @@ -1767,12 +1775,18 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr)
>
> tun = netdev_priv(dev);
> if (dev->netdev_ops != &tap_netdev_ops &&
> - dev->netdev_ops != &tun_netdev_ops)
> + dev->netdev_ops != &tun_netdev_ops) {
> ret = -EINVAL;
> - else if (tun_not_capable(tun))
> + goto unlock;
> + }
> + if (tun_not_capable(tun)) {
> ret = -EPERM;
> - else
> - ret = tun_attach(tun, file);
> + goto unlock;
> + }
> + ret = security_tun_dev_create_queue(tun->security);
> + if (ret < 0)
> + goto unlock;
> + ret = tun_attach(tun, file);
> } else if (ifr->ifr_flags & IFF_DETACH_QUEUE)
> __tun_detach(tfile, false);
> else
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 05e88bd..8ea923b 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -983,17 +983,29 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
> * tells the LSM to decrement the number of secmark labeling rules loaded
> * @req_classify_flow:
> * Sets the flow's sid to the openreq sid.
> + * @tun_dev_alloc_security:
> + * This hook allows a module to allocate a security structure for a TUN
> + * device.
> + * @security pointer to a security structure pointer.
> + * Returns a zero on success, negative values on failure.
> + * @tun_dev_free_security:
> + * This hook allows a module to free the security structure for a TUN
> + * device.
> + * @security pointer to the TUN device's security structure
> * @tun_dev_create:
> * Check permissions prior to creating a new TUN device.
> - * @tun_dev_post_create:
> - * This hook allows a module to update or allocate a per-socket security
> - * structure.
> - * @sk contains the newly created sock structure.
I worry that removing a hook will hurt users that rely on it in their
security policy.
> + * @tun_dev_create_queue:
> + * Check permissions prior to creating a new TUN device queue.
> + * @security pointer to the TUN device's security structure.
> * @tun_dev_attach:
> - * Check permissions prior to attaching to a persistent TUN device. This
> - * hook can also be used by the module to update any security state
> + * This hook can be used by the module to update any security state
> * associated with the TUN device's sock structure.
> * @sk contains the existing sock structure.
> + * @security pointer to the TUN device's security structure.
> + * @tun_dev_open:
> + * This hook can be used by the module to update any security state
> + * associated with the TUN device's security structure.
> + * @security pointer to the TUN device's security structure.
> *
> * Security hooks for XFRM operations.
> *
> @@ -1613,9 +1625,12 @@ struct security_operations {
> void (*secmark_refcount_inc) (void);
> void (*secmark_refcount_dec) (void);
> void (*req_classify_flow) (const struct request_sock *req, struct flowi *fl);
> - int (*tun_dev_create)(void);
> - void (*tun_dev_post_create)(struct sock *sk);
> - int (*tun_dev_attach)(struct sock *sk);
> + int (*tun_dev_alloc_security) (void **security);
> + void (*tun_dev_free_security) (void *security);
> + int (*tun_dev_create) (void);
> + int (*tun_dev_create_queue) (void *security);
> + int (*tun_dev_attach) (struct sock *sk, void *security);
> + int (*tun_dev_open) (void *security);
> #endif /* CONFIG_SECURITY_NETWORK */
>
> #ifdef CONFIG_SECURITY_NETWORK_XFRM
> @@ -2553,9 +2568,12 @@ void security_inet_conn_established(struct sock *sk,
> int security_secmark_relabel_packet(u32 secid);
> void security_secmark_refcount_inc(void);
> void security_secmark_refcount_dec(void);
> +int security_tun_dev_alloc_security(void **security);
> +void security_tun_dev_free_security(void *security);
> int security_tun_dev_create(void);
> -void security_tun_dev_post_create(struct sock *sk);
> -int security_tun_dev_attach(struct sock *sk);
> +int security_tun_dev_create_queue(void *security);
> +int security_tun_dev_attach(struct sock *sk, void *security);
> +int security_tun_dev_open(void *security);
>
> #else /* CONFIG_SECURITY_NETWORK */
> static inline int security_unix_stream_connect(struct sock *sock,
> @@ -2720,16 +2738,31 @@ static inline void security_secmark_refcount_dec(void)
> {
> }
>
> +static inline int security_tun_dev_alloc_security(void **security)
> +{
> + return 0;
> +}
> +
> +static inline void security_tun_dev_free_security(void *security)
> +{
> +}
> +
> static inline int security_tun_dev_create(void)
> {
> return 0;
> }
>
> -static inline void security_tun_dev_post_create(struct sock *sk)
> +static inline int security_tun_dev_create_queue(void *security)
> +{
> + return 0;
> +}
> +
> +static inline int security_tun_dev_attach(struct sock *sk, void *security)
> {
> + return 0;
> }
>
> -static inline int security_tun_dev_attach(struct sock *sk)
> +static inline int security_tun_dev_open(void *security)
> {
> return 0;
> }
> diff --git a/security/capability.c b/security/capability.c
> index b14a30c..bf4cbf2 100644
> --- a/security/capability.c
> +++ b/security/capability.c
> @@ -704,16 +704,31 @@ static void cap_req_classify_flow(const struct request_sock *req,
> {
> }
>
> +static int cap_tun_dev_alloc_security(void **security)
> +{
> + return 0;
> +}
> +
> +static void cap_tun_dev_free_security(void *security)
> +{
> +}
> +
> static int cap_tun_dev_create(void)
> {
> return 0;
> }
>
> -static void cap_tun_dev_post_create(struct sock *sk)
> +static int cap_tun_dev_create_queue(void *security)
> +{
> + return 0;
> +}
> +
> +static int cap_tun_dev_attach(struct sock *sk, void *security)
> {
> + return 0;
> }
>
> -static int cap_tun_dev_attach(struct sock *sk)
> +static int cap_tun_dev_open(void *security)
> {
> return 0;
> }
> @@ -1044,8 +1059,11 @@ void __init security_fixup_ops(struct security_operations *ops)
> set_to_cap_if_null(ops, secmark_refcount_inc);
> set_to_cap_if_null(ops, secmark_refcount_dec);
> set_to_cap_if_null(ops, req_classify_flow);
> + set_to_cap_if_null(ops, tun_dev_alloc_security);
> + set_to_cap_if_null(ops, tun_dev_free_security);
> set_to_cap_if_null(ops, tun_dev_create);
> - set_to_cap_if_null(ops, tun_dev_post_create);
> + set_to_cap_if_null(ops, tun_dev_create_queue);
> + set_to_cap_if_null(ops, tun_dev_open);
> set_to_cap_if_null(ops, tun_dev_attach);
> #endif /* CONFIG_SECURITY_NETWORK */
> #ifdef CONFIG_SECURITY_NETWORK_XFRM
> diff --git a/security/security.c b/security/security.c
> index 8dcd4ae..4d82654 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1244,24 +1244,42 @@ void security_secmark_refcount_dec(void)
> }
> EXPORT_SYMBOL(security_secmark_refcount_dec);
>
> +int security_tun_dev_alloc_security(void **security)
> +{
> + return security_ops->tun_dev_alloc_security(security);
> +}
> +EXPORT_SYMBOL(security_tun_dev_alloc_security);
> +
> +void security_tun_dev_free_security(void *security)
> +{
> + security_ops->tun_dev_free_security(security);
> +}
> +EXPORT_SYMBOL(security_tun_dev_free_security);
> +
> int security_tun_dev_create(void)
> {
> return security_ops->tun_dev_create();
> }
> EXPORT_SYMBOL(security_tun_dev_create);
>
> -void security_tun_dev_post_create(struct sock *sk)
> +int security_tun_dev_create_queue(void *security)
> {
> - return security_ops->tun_dev_post_create(sk);
> + return security_ops->tun_dev_create_queue(security);
> }
> -EXPORT_SYMBOL(security_tun_dev_post_create);
> +EXPORT_SYMBOL(security_tun_dev_create_queue);
>
> -int security_tun_dev_attach(struct sock *sk)
> +int security_tun_dev_attach(struct sock *sk, void *security)
> {
> - return security_ops->tun_dev_attach(sk);
> + return security_ops->tun_dev_attach(sk, security);
> }
> EXPORT_SYMBOL(security_tun_dev_attach);
>
> +int security_tun_dev_open(void *security)
> +{
> + return security_ops->tun_dev_open(security);
> +}
> +EXPORT_SYMBOL(security_tun_dev_open);
> +
> #endif /* CONFIG_SECURITY_NETWORK */
>
> #ifdef CONFIG_SECURITY_NETWORK_XFRM
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 61a5336..f1efb08 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -4399,6 +4399,24 @@ static void selinux_req_classify_flow(const struct request_sock *req,
> fl->flowi_secid = req->secid;
> }
>
> +static int selinux_tun_dev_alloc_security(void **security)
> +{
> + struct tun_security_struct *tunsec;
> +
> + tunsec = kzalloc(sizeof(*tunsec), GFP_KERNEL);
> + if (!tunsec)
> + return -ENOMEM;
> + tunsec->sid = current_sid();
> +
> + *security = tunsec;
> + return 0;
> +}
> +
> +static void selinux_tun_dev_free_security(void *security)
> +{
> + kfree(security);
> +}
> +
> static int selinux_tun_dev_create(void)
> {
> u32 sid = current_sid();
> @@ -4414,8 +4432,17 @@ static int selinux_tun_dev_create(void)
> NULL);
> }
>
> -static void selinux_tun_dev_post_create(struct sock *sk)
> +static int selinux_tun_dev_create_queue(void *security)
> {
> + struct tun_security_struct *tunsec = security;
> +
> + return avc_has_perm(current_sid(), tunsec->sid, SECCLASS_TUN_SOCKET,
> + TUN_SOCKET__CREATE_QUEUE, NULL);
> +}
> +
> +static int selinux_tun_dev_attach(struct sock *sk, void *security)
> +{
> + struct tun_security_struct *tunsec = security;
> struct sk_security_struct *sksec = sk->sk_security;
>
> /* we don't currently perform any NetLabel based labeling here and it
> @@ -4425,20 +4452,19 @@ static void selinux_tun_dev_post_create(struct sock *sk)
> * cause confusion to the TUN user that had no idea network labeling
> * protocols were being used */
>
> - /* see the comments in selinux_tun_dev_create() about why we don't use
> - * the sockcreate SID here */
> -
> - sksec->sid = current_sid();
> + sksec->sid = tunsec->sid;
> sksec->sclass = SECCLASS_TUN_SOCKET;
> +
> + return 0;
> }
>
> -static int selinux_tun_dev_attach(struct sock *sk)
> +static int selinux_tun_dev_open(void *security)
> {
> - struct sk_security_struct *sksec = sk->sk_security;
> + struct tun_security_struct *tunsec = security;
> u32 sid = current_sid();
> int err;
>
> - err = avc_has_perm(sid, sksec->sid, SECCLASS_TUN_SOCKET,
> + err = avc_has_perm(sid, tunsec->sid, SECCLASS_TUN_SOCKET,
> TUN_SOCKET__RELABELFROM, NULL);
> if (err)
> return err;
> @@ -4446,8 +4472,7 @@ static int selinux_tun_dev_attach(struct sock *sk)
> TUN_SOCKET__RELABELTO, NULL);
> if (err)
> return err;
> -
> - sksec->sid = sid;
> + tunsec->sid = sid;
>
> return 0;
> }
> @@ -5642,9 +5667,12 @@ static struct security_operations selinux_ops = {
> .secmark_refcount_inc = selinux_secmark_refcount_inc,
> .secmark_refcount_dec = selinux_secmark_refcount_dec,
> .req_classify_flow = selinux_req_classify_flow,
> + .tun_dev_alloc_security = selinux_tun_dev_alloc_security,
> + .tun_dev_free_security = selinux_tun_dev_free_security,
> .tun_dev_create = selinux_tun_dev_create,
> - .tun_dev_post_create = selinux_tun_dev_post_create,
> + .tun_dev_create_queue = selinux_tun_dev_create_queue,
> .tun_dev_attach = selinux_tun_dev_attach,
> + .tun_dev_open = selinux_tun_dev_open,
>
> #ifdef CONFIG_SECURITY_NETWORK_XFRM
> .xfrm_policy_alloc_security = selinux_xfrm_policy_alloc,
> diff --git a/security/selinux/include/objsec.h b/security/selinux/include/objsec.h
> index 26c7eee..aa47bca 100644
> --- a/security/selinux/include/objsec.h
> +++ b/security/selinux/include/objsec.h
> @@ -110,6 +110,10 @@ struct sk_security_struct {
> u16 sclass; /* sock security class */
> };
>
> +struct tun_security_struct {
> + u32 sid; /* SID for the tun device sockets */
> +};
> +
> struct key_security_struct {
> u32 sid; /* SID of key */
> };
^ permalink raw reply
* Re: [RFC PATCH v2 3/3] tun: fix LSM/SELinux labeling of tun/tap devices
From: Michael S. Tsirkin @ 2012-12-12 9:10 UTC (permalink / raw)
To: Paul Moore; +Cc: netdev, linux-security-module, selinux, jasowang
In-Reply-To: <1963349.P9uq3yvlyR@sifl>
On Mon, Dec 10, 2012 at 05:43:49PM -0500, Paul Moore wrote:
> On Monday, December 10, 2012 07:50:35 PM Michael S. Tsirkin wrote:
> > On Mon, Dec 10, 2012 at 12:33:49PM -0500, Paul Moore wrote:
> > > On Monday, December 10, 2012 07:26:56 PM Michael S. Tsirkin wrote:
> > > > On Mon, Dec 10, 2012 at 12:04:35PM -0500, Paul Moore wrote:
> > > > > On Friday, December 07, 2012 02:25:16 PM Michael S. Tsirkin wrote:
> > > > > > On Thu, Dec 06, 2012 at 04:09:51PM -0500, Paul Moore wrote:
> > > > > > > On Thursday, December 06, 2012 10:57:16 PM Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Dec 06, 2012 at 11:56:45AM -0500, Paul Moore wrote:
> > > > > > > > > > The SETQUEUE/tun_socket:create_queue permissions do not yet exist
> > > > > > > > > > in any released SELinux policy as we are just now adding them with
> > > > > > > > > > this patchset. With current policies loaded into a kernel with
> > > > > > > > > > this patchset applied the SETQUEUE/tun_socket:create_queue
> > > > > > > > > > permission would be treated according to the policy's unknown
> > > > > > > > > > permission setting.
> > > > > > > >
> > > > > > > > > OK I think we need to rethink what we are doing here: what you sent
> > > > > > > > > addresses the problem as stated but I think we mis-stated it. Let
> > > > > > > > > me try to restate the problem: it is not just an SELinux problem.
> > > > > > > > > Let's assume qemu wants to use tun, I (libvirt) don't want to run
> > > > > > > > > it as root.
> > > > > > > > >
> > > > > > > > > 1. TUNSETIFF: I can open tun, attach an fd and pass it to qemu.
> > > > > > > > > Now, qemu does not invoke TUNSETIFF so it can run without
> > > > > > > > > kernel privileges.
> > > > > > >
> > > > > > > > Correct me if I'm wrong, but I believe libvirt does this while running
> > > > > > > > as root. Assuming that is the case, why not simply setuid()/setgid()
> > > > > > > > to the same credentials as the QEMU instance before creating the TUN
> > > > > > > > device? You can always (re)configure the device afterwards while
> > > > > > > > running as root/CAP_NET_ADMIN.
> > > > > >
> > > > > > We want isolation between qemu instances.
> > > > >
> > > > > Understood, I agree.
> > > > >
> > > > > > Achieving separation via SELinux is easily done, with libvirt/sVirt
> > > > > > already doing this for us automatically in most cases; the only thing we
> > > > > > will want to do is make sure the SELinux policy is aware of the new
> > > > > > permission.
> > > > > >
> > > > > > Achieving separation via DAC should also be easily done, simply run each
> > > > > > QEMU instance with a separate UID and/or GID.
> > > > >
> > > > > > Giving qemu right to open tun and SETIFF would give it rights
> > > > > > to access any tun device.
> > > > >
> > > > > > I quickly looked at tun_chr_open() again and I don't see any special
> > > > > rights/privileges required, the same for tun_chr_ioctl() and
> > > > > __tun_chr_ioctl(). Looking at tun_set_queue() I see we call
> > > > > tun_not_capable() which does a simple DAC check; it must have the same
> > > > > UID/GID or have CAP_NET_ADMIN.
> > > > >
> > > > > I'm having a hard time seeing the problem you are describing; help me
> > > > > understand.
> > > >
> > > > The issue is guest controls the number of queues in use.
> > > > So qemu would be required to be allowed to call tun_set_queue.
> > > > If we allow this we have a problem as one qemu will be
> > > > able to access any tun.
> > >
> > > QEMU can call tun_set_queue() as long as it satisfies tun_not_capable(),
> > > which from a practical point of view means that the TUN device was
> > > created with the same UID/GID as the QEMU instance. If you want TUN
> > > device separation between QEMU instances using DAC you need to run each
> > > QEMU instance with a different UID/GID (which you should be doing anyway
> > > if you want DAC enforced general separation).
> > >
> > > I believe I've stated this point several times now and I don't feel you've
> > > addressed it properly.
> >
> > Look at how it works at the moment:
> > > a privileged libvirt server calls tun_set_iff
> > > and passes the fd to qemu which is not privileged.
> >
> > The result is isolation between qemu instances without
> > need to create uid per qemu instance.
>
> Okay, good. That is my understanding.
>
> > How do we create multiple queues? It makes sense to
> > follow this model and pass in fds for individual queues.
>
> Okay.
>
> > However they need to be disabled initially
> > so libvirt can not do tun_set_queue for us.
>
> Unrelated question: why do the queues need to be disabled initially? Is this
> to prevent traffic from being queued up? Some other reason? I'm just curious
> as to the reason ...
Yes.
Basically because old guests only use a single queue.
If a guest comes along and declares multiqueue support
we can queue up traffic on new queues but if we
do this with a legacy guest it will not be able to
consume it.
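
A toy model of what I mean, with made-up names throughout (demo_dev and the
demo_* functions are illustrative only and do not reflect the tun driver's
data structures): queues are attached up front but traffic is steered only
to the ones the guest has enabled.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_QUEUES 8

/* Hypothetical model of a multiqueue device; not the tun driver's layout. */
struct demo_dev {
	unsigned int nattached; /* queues attached by the management layer */
	unsigned int nenabled;  /* queues the guest has switched on */
};

/* Attach a queue in the disabled state; only the first one starts enabled. */
static int demo_attach(struct demo_dev *d)
{
	if (d->nattached == DEMO_MAX_QUEUES)
		return -1;
	d->nattached++;
	if (d->nenabled == 0)
		d->nenabled = 1;
	return 0;
}

/* Guest declares multiqueue support: enable up to n attached queues. */
static void demo_enable(struct demo_dev *d, unsigned int n)
{
	if (n < 1)
		n = 1;
	d->nenabled = n < d->nattached ? n : d->nattached;
}

/* Steer a flow hash to an enabled queue only. */
static unsigned int demo_select_queue(const struct demo_dev *d, uint32_t hash)
{
	return d->nenabled ? hash % d->nenabled : 0;
}
```

A legacy guest never calls demo_enable(), so everything lands on queue 0,
which it can consume; a multiqueue guest enables more queues and traffic
starts spreading across them.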
> > can't utilize multiqueue.
>
> I still don't understand why in the multiqueue case libvirt doesn't just
> change its effective UID/GID when creating the TUN device, or just use the
> TUNSETOWNER/TUNSETGROUP commands. This would solve the problem you describe
> above and - at least to me - seems like a better solution conceptually.
>
> Help me understand why you believe that will not work.
>
> Do you not want to give ownership of the TUN device to QEMU? That would be
> the only reason I can think of, but all of your comments that I can recall
> have been about isolation between QEMU instances and not access control
> between a QEMU instance and its assigned TUN device.
I think I might have confused things more than clarified them.
Let me comment on the specific lines in the patch that worry me;
I hope that will make it clear.
> > My solution is an unprivileged variant
> > of tun_set_queue that only enables/disables
> > a queue without attach/detach.
>
> --
> paul moore
> security and virtualization @ redhat
^ permalink raw reply