Netdev List
* Re: [PATCH net-next 6/7] sch_netem: clear old rate when old qdisc's replaced
From: Eric Dumazet @ 2014-02-14 12:43 UTC (permalink / raw)
  To: Yang Yingliang; +Cc: netdev, davem, stephen
In-Reply-To: <1392366970-11592-7-git-send-email-yangyingliang@huawei.com>

On Fri, 2014-02-14 at 16:36 +0800, Yang Yingliang wrote:
> If we set a netem qdisc with rate option, while we
> use "#tc qdisc replace ..." that without rate option
> to replace the old qdisc, the old rate is still there.
> We need clear old rate after qdisc's replaced.

Wait... Have you tested :

tc qdisc change ...

This is far more needed than 'replace' : 

You (meaning user scripts) can implement replace by delete + create, but
'tc qdisc change' needs current code.


* Re: [PATCH net-next 0/7] clear old options when old qdisc's replaced
From: Eric Dumazet @ 2014-02-14 12:44 UTC (permalink / raw)
  To: Yang Yingliang; +Cc: netdev, davem, stephen
In-Reply-To: <1392366970-11592-1-git-send-email-yangyingliang@huawei.com>

On Fri, 2014-02-14 at 16:36 +0800, Yang Yingliang wrote:
> I've added a netem qdisc with rate option, then I replace this qdisc
> without rate option but with latency option. The rate option is still
> there.
> 
> E.g.
>   # tc qdisc add dev eth4 handle 1: root netem rate 10mbit
>   # tc qdisc show
>     qdisc netem 1: dev eth4 root refcnt 2 limit 1000 rate 10Mbit
> 
>   # tc qdisc replace dev eth4 handle 1: root netem latency 10ms
>   # tc qdisc show
>     qdisc netem 1: dev eth4 root refcnt 2 limit 1000 delay 10.0ms rate 10Mbit
> 
> The old options need be cleared after the qdisc is replaced.

Not at all. Test your changes with "tc qdisc change ... "


* [PATCH] net: usb: sr9800: Use '%zu' to print size_t format
From: Fabio Estevam @ 2014-02-14 13:25 UTC (permalink / raw)
  To: davem; +Cc: liujunliang_ljl, netdev, Fabio Estevam

Fix the following build warning on ARM:

drivers/net/usb/sr9800.c:826:2: warning: format '%ld' expects argument of type 'long int', but argument 5 has type 'size_t' [-Wformat]

Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
---
 drivers/net/usb/sr9800.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/sr9800.c b/drivers/net/usb/sr9800.c
index 4175eb9..8017108 100644
--- a/drivers/net/usb/sr9800.c
+++ b/drivers/net/usb/sr9800.c
@@ -823,7 +823,7 @@ static int sr9800_bind(struct usbnet *dev, struct usb_interface *intf)
 		dev->rx_urb_size =
 			SR9800_BULKIN_SIZE[SR9800_MAX_BULKIN_2K].size;
 	}
-	netdev_dbg(dev->net, "%s : setting rx_urb_size with : %ld\n", __func__,
+	netdev_dbg(dev->net, "%s : setting rx_urb_size with : %zu\n", __func__,
 		   dev->rx_urb_size);
 	return 0;
 
-- 
1.8.1.2
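
For reference, the same format-specifier rule can be checked in userspace. The
sketch below is not part of the patch; the helper name format_urb_size is made
up for illustration, and snprintf() stands in for netdev_dbg(). It only shows
that '%zu' is the portable conversion for a size_t argument, whereas '%ld'
matches size_t only on ABIs where size_t happens to be as wide as long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a size_t the way the fixed debug message
 * does. "%zu" is the C99 conversion for size_t, so no cast is needed and
 * the compiler's -Wformat check stays quiet on every architecture.
 */
static int format_urb_size(char *buf, size_t buflen, size_t rx_urb_size)
{
	assert(buf != NULL && buflen > 0);

	return snprintf(buf, buflen, "setting rx_urb_size with : %zu",
			rx_urb_size);
}
```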


* [PATCH v3] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Matija Glavinic Pecotic @ 2014-02-14 13:51 UTC (permalink / raw)
  To: linux-sctp@vger.kernel.org; +Cc: netdev@vger.kernel.org, Alexander Sverdlin

The implementation of the (a)rwnd calculation might lead to severe performance
issues and to associations stalling completely. These problems are described
below, and a solution is proposed which improves lksctp's robustness in the
congestion state.

1) Sudden drop of a_rwnd and incomplete window recovery afterwards

The data accounted in sctp_assoc_rwnd_decrease covers only the payload size
(sctp data), but the size of the sk_buff, which is charged against the receiver
buffer, is not accounted for in rwnd. Theoretically, this should not be a
problem, as the actual size of the buffer is double the amount requested on the
socket (SO_RCVBUF). The problem is that this scales badly for data which is
smaller than sizeof(struct sk_buff). E.g. in 4G (LTE) networks, the link
interfacing the radio side will carry a large portion of traffic of this size
(less than 100B).

An example of a sudden drop and incomplete window recovery is given below. Node B
exhibits the problematic behavior. Node A initiates the association and B is
configured to advertise an rwnd of 10000. A sends messages of size 43B (the size
of a typical sctp message in a 4G (LTE) network). On B, data is left in the
buffer by not reading the socket in userspace.

Let's examine when we will hit the pressure state and declare rwnd to be 0 for
the scenario with the above stated parameters (rwnd == 10000, chunk size == 43,
each chunk is sent in a separate sctp packet).

The logic is implemented in sctp_assoc_rwnd_decrease:

socket_buffer (see below) is the maximum size which can be held in the socket
buffer (sk_rcvbuf). currently_alloced is the amount of data currently allocated
(rx_count).

A simple expression is given below, from which we derive after how many packets,
for the above stated parameters, we enter the pressure state.

We start with the condition which has to be met in order to enter the pressure
state:

	socket_buffer < currently_alloced;

currently_alloced is represented as the size of the sctp packets received so far
and not yet delivered to userspace. x is the number of chunks/packets (since
there is no bundling, and each chunk is delivered in a separate packet, we can
regard each chunk also as an sctp packet and, what is important here, as having
its own sk_buff):

	socket_buffer < x*each_sctp_packet;

each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
twice the initially requested size of the socket buffer, which in the case of
sctp is twice the a_rwnd requested:

	2*rwnd < x*(payload+sizeof(struct sk_buff));

sizeof(struct sk_buff) is 190 (3.13.0-rc4+). As stated above, rwnd is 10000
and each payload size is 43:

	20000 < x(43+190);

	x > 20000/233;

	x ~> 84;
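
The arithmetic above can be checked with a few lines of userspace C. This is
only a sketch under the assumptions of this description: each packet is charged
as payload + sizeof(struct sk_buff), and 190 is the sk_buff size quoted for
3.13.0-rc4+. The function names are made up for illustration. With these
figures the strict inequality is first satisfied at x = 86, in the same
ballpark as the ~84 given here; the exact point depends on how much memory is
really charged per skb.

```c
#include <assert.h>

/*
 * Smallest number of packets x for which
 *     2 * rwnd < x * (payload + skb_overhead)
 * holds, i.e. the point where this simplified model enters the pressure
 * state. skb_overhead is a parameter because sizeof(struct sk_buff)
 * depends on kernel version and configuration.
 */
static unsigned int packets_until_pressure(unsigned int rwnd,
					   unsigned int payload,
					   unsigned int skb_overhead)
{
	unsigned int per_packet = payload + skb_overhead;

	assert(per_packet > 0);

	/* Strict inequality: one more than the integer quotient. */
	return (2 * rwnd) / per_packet + 1;
}

/* Bytes of sctp payload that rwnd has been decreased by after x packets. */
static unsigned int payload_accounted(unsigned int x, unsigned int payload)
{
	return x * payload;
}
```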

After ~84 messages, the pressure state is entered and 0 rwnd is advertised while
only 84*43B = 3612B of sctp data has been received. This is why an external
observer notices a sudden drop from 6474 to 0, as shown in this example:

IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
<...>
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]

--> Sudden drop

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

At this point, rwnd_press stores the current rwnd value so that it can later be
restored in sctp_assoc_rwnd_increase. This however doesn't happen, as the
condition to start slowly increasing rwnd until rwnd_press is returned to rwnd
is never met. The condition is not met because rwnd, after it has hit 0, must
first reach rwnd_press by adding the amount which is read from userspace. Let us
observe the values in the above example. The initial a_rwnd is 10000, pressure
was hit when rwnd was ~6500, and the amount of actual sctp data currently waiting
to be delivered to userspace is ~3500. When userspace starts to read,
sctp_assoc_rwnd_increase will be credited only with the sctp data, which is
~3500. The condition is never met, and when userspace has read all data, rwnd
stays at 3569.

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

--> At this point userspace read everything, rwnd recovered only to 3569

IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

Reproduction is straightforward: it is enough for the sender to send packets of
size less than sizeof(struct sk_buff) and for the receiver to keep them in its
buffers.

2) Minute size window for associations sharing the same socket buffer

In case multiple associations share the same socket, and the same socket buffer
(sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
of the associations can permanently drop the rwnd of the other association(s).

The situation will typically be observed as one association suddenly having its
rwnd dropped to the size of the last packet received and never recovering beyond
that point. Different scenarios will lead to it, but all have in common that one
of the associations (let it be the association from 1)) has nearly depleted the
socket buffer, and the other association charges the socket buffer with just
enough to start the pressure. This association will enter the pressure state,
set rwnd_press and announce 0 rwnd.
When data is read by userspace, a situation similar to 1) will occur: rwnd will
increase just by the size read by userspace, but rwnd_press will be so high that
the association doesn't have enough credit to reach rwnd_press and restore the
previous state. This case is a special case of 1), and a worse one, as in the
worst case there is only one packet in the buffer by whose size rwnd will be
increased. The consequence is an association which has a very low maximum rwnd
('minute size', in our case down to 43B - the size of the packet which caused
the pressure) and is as such unusable.

This scenario happened frequently in the field and in labs after a congestion
state (link breaks, different probabilities of packet drop, packet reordering)
and with scenario 1) preceding. A deterministic scenario for reproduction is
given here:

From node A establish two associations on the same socket, with rcvbuf_policy
set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
repeat the scenario from 1), that is, bring it down to 0 and restore it. Observe
scenario 1). Use a small payload size (here we use 43). Once rwnd is
'recovered', bring it down close to 0, so that just one more packet would close
it. As a consequence, association number 2 is able to receive (at least) one
more packet, which will bring it into the pressure state. E.g. if association 2
had an rwnd of 10000, the packet received was 43, and we enter pressure at this
point, rwnd_press will hold 9957. Once the payload is delivered to userspace,
rwnd will increase by 43, but the conditions to restore rwnd to its original
state, just as in 1), will never be satisfied.

--> Association 1, between A.y and B.12345

IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.55915: sctp (1) [COOKIE ACK]

--> Association 2, between A.z and B.12346

IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
IP B.12346 > A.55915: sctp (1) [COOKIE ACK]

--> Deplete socket buffer by sending messages of size 43B over association 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]

<...>

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]

--> Sudden drop on 1
 
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Here userspace read, rwnd 'recovered' to 3698, now deplete again using
    association 1 so there is place in buffer for only one more packet

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

--> Socket buffer is almost depleted, but there is space for one more packet,
    send them over association 2, size 43B

IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Immediate drop

IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Read everything from the socket, both association recover up to maximum rwnd
    they are capable of reaching, note that association 1 recovered up to 3698,
    and association 2 recovered only to 43

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

A careful reader might wonder why it is necessary to reproduce 1) prior to
reproducing 2). It simply makes it easier to observe when to send the packet over
association 2 which will push the association into the pressure state.
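
Both scenarios can also be replayed in a small userspace model of the existing
sctp_assoc_rwnd_decrease/sctp_assoc_rwnd_increase logic. The sketch below is an
approximation, not kernel code: the struct and function names are made up, each
packet is assumed to charge payload + 190 bytes to the receive buffer, and SACK
generation is left out. Under these assumptions a single association recovers
only to 3698 after userspace has read everything, and an association that hits
pressure on its first packet in a shared buffer recovers only to 43.

```c
#include <assert.h>

#define SKB_OVERHEAD 190	/* assumed sizeof(struct sk_buff) */

struct rcvbuf_model {		/* receive buffer, possibly shared */
	int rcvbuf;		/* sk_rcvbuf: twice the requested size */
	int rmem_alloc;		/* bytes charged, skb overhead included */
};

struct asoc_model {		/* per-association window state */
	unsigned int rwnd, rwnd_over, rwnd_press, pathmtu;
};

/* Models sctp_assoc_rwnd_decrease(); the buffer is charged first. */
static void model_receive(struct asoc_model *a, struct rcvbuf_model *b,
			  unsigned int len)
{
	int over;

	b->rmem_alloc += (int)len + SKB_OVERHEAD;
	over = b->rmem_alloc >= b->rcvbuf;

	if (a->rwnd >= len) {
		a->rwnd -= len;
		if (over) {
			a->rwnd_press += a->rwnd;
			a->rwnd = 0;
		}
	} else {
		a->rwnd_over = len - a->rwnd;
		a->rwnd = 0;
	}
}

/* Models sctp_assoc_rwnd_increase(), without the SACK part. */
static void model_read(struct asoc_model *a, struct rcvbuf_model *b,
		       unsigned int len)
{
	unsigned int change;

	assert(b->rmem_alloc >= (int)len + SKB_OVERHEAD);
	b->rmem_alloc -= (int)len + SKB_OVERHEAD;

	if (a->rwnd_over >= len) {
		a->rwnd_over -= len;
	} else {
		a->rwnd += len - a->rwnd_over;
		a->rwnd_over = 0;
	}

	/* Slow recovery starts only once rwnd has reached rwnd_press. */
	if (a->rwnd_press && a->rwnd >= a->rwnd_press) {
		change = a->pathmtu < a->rwnd_press ? a->pathmtu : a->rwnd_press;
		a->rwnd += change;
		a->rwnd_press -= change;
	}
}

/* Scenario 1: fill one association until rwnd is 0, then read it all. */
static unsigned int scenario_single(void)
{
	struct rcvbuf_model buf = { .rcvbuf = 20000, .rmem_alloc = 0 };
	struct asoc_model a = { .rwnd = 10000, .pathmtu = 1500 };
	unsigned int n = 0, i;

	while (a.rwnd) {
		model_receive(&a, &buf, 43);
		n++;
	}
	for (i = 0; i < n; i++)
		model_read(&a, &buf, 43);
	return a.rwnd;
}

/* Scenario 2: association 1 nearly fills the shared buffer, then
 * association 2 receives a single packet and everything is read.
 */
static unsigned int scenario_shared(void)
{
	struct rcvbuf_model buf = { .rcvbuf = 20000, .rmem_alloc = 0 };
	struct asoc_model a1 = { .rwnd = 10000, .pathmtu = 1500 };
	struct asoc_model a2 = { .rwnd = 10000, .pathmtu = 1500 };
	unsigned int i;

	for (i = 0; i < 85; i++)
		model_receive(&a1, &buf, 43);
	model_receive(&a2, &buf, 43);
	assert(a2.rwnd == 0 && a2.rwnd_press == 9957);

	for (i = 0; i < 85; i++)
		model_read(&a1, &buf, 43);
	model_read(&a2, &buf, 43);
	return a2.rwnd;
}
```

The model is deliberately simpler than the reproduction above: association 1 is
not driven through scenario 1) first, since that step only makes the timing
easier to observe on a live system.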

Proposed solution:

Both problems share the same root cause, which is improper scaling of the socket
buffer with rwnd. A solution in which sizeof(sk_buff) is taken into account while
calculating rwnd is not possible, due to the fact that there is no linear
relationship between the amount of data charged in increase/decrease and the IP
packet in which the payload arrived. Even if such a solution were followed, the
complexity of the code would increase. Due to the nature of the current rwnd
handling, the slow increase (in sctp_assoc_rwnd_increase) of rwnd after the
pressure state is entered is rational, but it gives the sender a false
representation of the current buffer space. Furthermore, it implements an
additional congestion control mechanism which is defined by the implementation,
and not by the standard.

The proposed solution simplifies the whole algorithm, keeping in mind the
definition from the RFC:

o  Receiver Window (rwnd): This gives the sender an indication of the space
   available in the receiver's inbound buffer.

The core of the proposed solution is given by these lines:

sctp_assoc_rwnd_update:
	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
	else
		asoc->rwnd = 0;

We advertise to the sender (half of) the actual space we have. 'Half' is in
parentheses because it depends on whether you regard the size of the socket
buffer as SO_RCVBUF or as twice that amount, i.e. the size as visible from
userspace or from kernelspace.
In this way the sender is given a good approximation of our buffer space,
regardless of the buffer policy - we always advertise what we have. The proposed
solution fixes the described problems and removes the need for the rwnd
restoration algorithm. Finally, as the proposed solution is a simplification,
some lines of code, along with some bytes in struct sctp_association, are saved.
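
As a worked example, the computation can be tried out in userspace. The helper
below is a sketch with a made-up name that takes sk_rcvbuf and rx_count as
plain integers; the 233 bytes per packet (43B payload plus an assumed 190B
sk_buff) are the figures used in the examples of this description.

```c
#include <assert.h>

/*
 * Userspace sketch of the proposed rwnd computation: advertise half of
 * the free space in the receive buffer, or 0 once the buffer is used up.
 */
static unsigned int rwnd_from_buffer(int sk_rcvbuf, int rx_count)
{
	assert(sk_rcvbuf >= 0 && rx_count >= 0);

	if (sk_rcvbuf - rx_count > 0)
		return (unsigned int)(sk_rcvbuf - rx_count) >> 1;
	return 0;
}
```

With sk_rcvbuf == 20000 an empty buffer gives an rwnd of 10000, 85 queued
packets (19805 bytes charged) give 97, and 86 queued packets (20038 bytes) give
0. Once userspace has read everything the result is 10000 again, so no separate
restoration state is needed.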

Version 2 of the patch addressed comments from Vlad. The name of the function is
set to be more descriptive, and two parts of the code are changed: in one, the
superfluous call to sctp_assoc_rwnd_update is removed, since the call would not
result in an update of rwnd, and in the other the code is reordered so that the
call to sctp_assoc_rwnd_update updates rwnd. Version 3 corrected the change
introduced in v2 so that the existing function is not reordered/copied inline,
but is called correctly. Thanks to Vlad for the suggestion.

Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>

---

 - v1 -> v2
	- Taken into account comments from Vlad, name of the function set to be
	  more descriptive. Two code paths were changed: one removed a
	  superfluous call, the other reordered code.
 - v2 -> v3
	- Open-coded function removed and written properly. Thanks Vlad for
	  suggesting

--- net-next.orig/net/sctp/associola.c
+++ net-next/net/sctp/associola.c
@@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
 	return false;
 }
 
-/* Increase asoc's rwnd by len and send any window update SACK if needed. */
-void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
+/* Update asoc's rwnd for the approximated state in the buffer,
+ * and check whether SACK needs to be sent.
+ */
+void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
 {
+	int rx_count;
 	struct sctp_chunk *sack;
 	struct timer_list *timer;
 
-	if (asoc->rwnd_over) {
-		if (asoc->rwnd_over >= len) {
-			asoc->rwnd_over -= len;
-		} else {
-			asoc->rwnd += (len - asoc->rwnd_over);
-			asoc->rwnd_over = 0;
-		}
-	} else {
-		asoc->rwnd += len;
-	}
+	if (asoc->ep->rcvbuf_policy)
+		rx_count = atomic_read(&asoc->rmem_alloc);
+	else
+		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
 
-	/* If we had window pressure, start recovering it
-	 * once our rwnd had reached the accumulated pressure
-	 * threshold.  The idea is to recover slowly, but up
-	 * to the initial advertised window.
-	 */
-	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
-		int change = min(asoc->pathmtu, asoc->rwnd_press);
-		asoc->rwnd += change;
-		asoc->rwnd_press -= change;
-	}
+	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
+		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
+	else
+		asoc->rwnd = 0;
 
-	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
-		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
-		 asoc->a_rwnd);
+	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
+		 __func__, asoc, asoc->rwnd, rx_count,
+		 asoc->base.sk->sk_rcvbuf);
 
 	/* Send a window update SACK if the rwnd has increased by at least the
 	 * minimum of the association's PMTU and half of the receive buffer.
 	 * The algorithm used is similar to the one described in
 	 * Section 4.2.3.3 of RFC 1122.
 	 */
-	if (sctp_peer_needs_update(asoc)) {
+	if (update_peer && sctp_peer_needs_update(asoc)) {
 		asoc->a_rwnd = asoc->rwnd;
 
 		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
@@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
 	}
 }
 
-/* Decrease asoc's rwnd by len. */
-void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
-{
-	int rx_count;
-	int over = 0;
-
-	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
-		pr_debug("%s: association:%p has asoc->rwnd:%u, "
-			 "asoc->rwnd_over:%u!\n", __func__, asoc,
-			 asoc->rwnd, asoc->rwnd_over);
-
-	if (asoc->ep->rcvbuf_policy)
-		rx_count = atomic_read(&asoc->rmem_alloc);
-	else
-		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
-
-	/* If we've reached or overflowed our receive buffer, announce
-	 * a 0 rwnd if rwnd would still be positive.  Store the
-	 * the potential pressure overflow so that the window can be restored
-	 * back to original value.
-	 */
-	if (rx_count >= asoc->base.sk->sk_rcvbuf)
-		over = 1;
-
-	if (asoc->rwnd >= len) {
-		asoc->rwnd -= len;
-		if (over) {
-			asoc->rwnd_press += asoc->rwnd;
-			asoc->rwnd = 0;
-		}
-	} else {
-		asoc->rwnd_over = len - asoc->rwnd;
-		asoc->rwnd = 0;
-	}
-
-	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
-		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
-		 asoc->rwnd_press);
-}
 
 /* Build the bind address list for the association based on info from the
  * local endpoint and the remote peer.
--- net-next.orig/include/net/sctp/structs.h
+++ net-next/include/net/sctp/structs.h
@@ -1653,17 +1653,6 @@ struct sctp_association {
 	/* This is the last advertised value of rwnd over a SACK chunk. */
 	__u32 a_rwnd;
 
-	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
-	 * to slop over a maximum of the association's frag_point.
-	 */
-	__u32 rwnd_over;
-
-	/* Keeps treack of rwnd pressure.  This happens when we have
-	 * a window, but not recevie buffer (i.e small packets).  This one
-	 * is releases slowly (1 PMTU at a time ).
-	 */
-	__u32 rwnd_press;
-
 	/* This is the sndbuf size in use for the association.
 	 * This corresponds to the sndbuf size for the association,
 	 * as specified in the sk->sndbuf.
@@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
 __u32 sctp_association_get_next_tsn(struct sctp_association *);
 
 void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
-void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
-void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
+void sctp_assoc_rwnd_update(struct sctp_association *, bool);
 void sctp_assoc_set_primary(struct sctp_association *,
 			    struct sctp_transport *);
 void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
--- net-next.orig/net/sctp/sm_statefuns.c
+++ net-next/net/sctp/sm_statefuns.c
@@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
 	 * PMTU.  In cases, such as loopback, this might be a rather
 	 * large spill over.
 	 */
-	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
+	if ((!chunk->data_accepted) && (!asoc->rwnd ||
 	    (datalen > asoc->rwnd + asoc->frag_point))) {
 
 		/* If this is the next TSN, consider reneging to make
--- net-next.orig/net/sctp/socket.c
+++ net-next/net/sctp/socket.c
@@ -2092,12 +2092,6 @@ static int sctp_recvmsg(struct kiocb *io
 		sctp_skb_pull(skb, copied);
 		skb_queue_head(&sk->sk_receive_queue, skb);
 
-		/* When only partial message is copied to the user, increase
-		 * rwnd by that amount. If all the data in the skb is read,
-		 * rwnd is updated when the event is freed.
-		 */
-		if (!sctp_ulpevent_is_notification(event))
-			sctp_assoc_rwnd_increase(event->asoc, copied);
 		goto out;
 	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
 		   (event->msg_flags & MSG_EOR))
--- net-next.orig/net/sctp/ulpevent.c
+++ net-next/net/sctp/ulpevent.c
@@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
 	skb = sctp_event2skb(event);
 	/* Set the owner and charge rwnd for bytes received.  */
 	sctp_ulpevent_set_owner(event, asoc);
-	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
+	sctp_assoc_rwnd_update(asoc, false);
 
 	if (!skb->data_len)
 		return;
@@ -1011,6 +1011,7 @@ static void sctp_ulpevent_release_data(s
 {
 	struct sk_buff *skb, *frag;
 	unsigned int	len;
+	struct sctp_association *asoc;
 
 	/* Current stack structures assume that the rcv buffer is
 	 * per socket.   For UDP style sockets this is not true as
@@ -1035,8 +1036,11 @@ static void sctp_ulpevent_release_data(s
 	}
 
 done:
-	sctp_assoc_rwnd_increase(event->asoc, len);
+	asoc = event->asoc;
+	sctp_association_hold(asoc);
 	sctp_ulpevent_release_owner(event);
+	sctp_assoc_rwnd_update(asoc, true);
+	sctp_association_put(asoc);
 }
 
 static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)


* Re: [PATCH V2 net-next 0/5] xen-net{back,front}: Multiple transmit and receive queues
From: Wei Liu @ 2014-02-14 14:06 UTC (permalink / raw)
  To: Andrew J. Bennieston
  Cc: xen-devel, ian.campbell, wei.liu2, paul.durrant, netdev
In-Reply-To: <1392378624-6123-1-git-send-email-andrew.bennieston@citrix.com>

On Fri, Feb 14, 2014 at 11:50:19AM +0000, Andrew J. Bennieston wrote:
> 
> This patch series implements multiple transmit and receive queues (i.e.
> multiple shared rings) for the xen virtual network interfaces.
> 
> The series is split up as follows:
>  - Patches 1 and 3 factor out the queue-specific data for netback and
>     netfront respectively, and modify the rest of the code to use these
>     as appropriate.
>  - Patches 2 and 4 introduce new XenStore keys to negotiate and use
>    multiple shared rings and event channels, and code to connect these
>    as appropriate.
>  - Patch 5 documents the XenStore keys required for the new feature
>    in include/xen/interface/io/netif.h
> 
> All other transmit and receive processing remains unchanged, i.e. there
> is a kthread per queue and a NAPI context per queue.
> 
> The performance of these patches has been analysed in detail, with
> results available at:
> 
> http://wiki.xenproject.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing
> 
> To summarise:
>   * Using multiple queues allows a VM to transmit at line rate on a 10
>     Gbit/s NIC, compared with a maximum aggregate throughput of 6 Gbit/s
>     with a single queue.
>   * For intra-host VM--VM traffic, eight queues provide 171% of the
>     throughput of a single queue; almost 12 Gbit/s instead of 6 Gbit/s.
>   * There is a corresponding increase in total CPU usage, i.e. this is a
>     scaling out over available resources, not an efficiency improvement.
>   * Results depend on the availability of sufficient CPUs, as well as the
>     distribution of interrupts and the distribution of TCP streams across
>     the queues.
> 
> Queue selection is currently achieved via an L4 hash on the packet (i.e.
> TCP src/dst port, IP src/dst address) and is not negotiated between the
> frontend and backend, since only one option exists. Future patches to
> support other frontends (particularly Windows) will need to add some
> capability to negotiate not only the hash algorithm selection, but also
> allow the frontend to specify some parameters to this.
> 

This has an impact on the protocol. If the key to select hash algorithm
is missing then we're assuming L4 is in use.

This either needs to be documented (which is missing in your patch to
netif.h) or you need to write that key explicitly in XenStore.

I also have a question: what would happen if one end advertises one hash
algorithm and then uses a different one? This can happen when the
driver is rogue or buggy. Will it cause the "good guy" to stall? We
certainly don't want to stall the backend, at the very least.

I don't see relevant code in this series to handle a "rogue other end". I
presume that for a simple hash algorithm like L4 this is not very important
(say, even if a packet ends up in the wrong queue we can still safely process
it), or that the core driver can deal with this all by itself (dropping)?

Wei.


* Re: [PATCH V2 net-next 2/5] xen-netback: Add support for multiple queues
From: Wei Liu @ 2014-02-14 14:11 UTC (permalink / raw)
  To: Andrew J. Bennieston
  Cc: xen-devel, ian.campbell, wei.liu2, paul.durrant, netdev
In-Reply-To: <1392378624-6123-3-git-send-email-andrew.bennieston@citrix.com>

On Fri, Feb 14, 2014 at 11:50:21AM +0000, Andrew J. Bennieston wrote:
[...]
>  
> +extern unsigned int xenvif_max_queues;
> +
>  #endif /* __XEN_NETBACK__COMMON_H__ */
> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
> index 4cde112..4dc092c 100644
> --- a/drivers/net/xen-netback/interface.c
> +++ b/drivers/net/xen-netback/interface.c
> @@ -373,7 +373,12 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
>  	char name[IFNAMSIZ] = {};
>  
>  	snprintf(name, IFNAMSIZ - 1, "vif%u.%u", domid, handle);
> -	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup, 1);
> +	/* Allocate a netdev with the max. supported number of queues.
> +	 * When the guest selects the desired number, it will be updated
> +	 * via netif_set_real_num_tx_queues().
> +	 */
> +	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup,
> +						  xenvif_max_queues);

Indentation.

>  	if (dev == NULL) {
>  		pr_warn("Could not allocate netdev for %s\n", name);
>  		return ERR_PTR(-ENOMEM);
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 46b2f5b..aeb5ffa 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -54,6 +54,9 @@
[...]
> @@ -490,6 +497,23 @@ static void connect(struct backend_info *be)
>  	unsigned long credit_bytes, credit_usec;
>  	unsigned int queue_index;
>  	struct xenvif_queue *queue;
> +	unsigned int requested_num_queues;
> +
> +	/* Check whether the frontend requested multiple queues
> +	 * and read the number requested.
> +	 */
> +	err = xenbus_scanf(XBT_NIL, dev->otherend,
> +			"multi-queue-num-queues",
> +			"%u", &requested_num_queues);
> +	if (err < 0) {
> +		requested_num_queues = 1; /* Fall back to single queue */
> +	} else if (requested_num_queues > xenvif_max_queues) {
> +		/* buggy or malicious guest */
> +		xenbus_dev_fatal(dev, err,
> +						 "guest requested %u queues, exceeding the maximum of %u.",
> +						 requested_num_queues, xenvif_max_queues);

Indentation.

> +		return;
> +	}
>  
[...]
> @@ -547,29 +575,52 @@ static int connect_rings(struct backend_info *be, struct xenvif_queue *queue)
>  	unsigned long tx_ring_ref, rx_ring_ref;
>  	unsigned int tx_evtchn, rx_evtchn;
>  	int err;
> +	char *xspath = NULL;
> +	size_t xspathsize;
> +	const size_t xenstore_path_ext_size = 11; /* sufficient for "/queue-NNN" */
> +
> +	/* If the frontend requested 1 queue, or we have fallen back
> +	 * to single queue due to lack of frontend support for multi-
> +	 * queue, expect the remaining XenStore keys in the toplevel
> +	 * directory. Otherwise, expect them in a subdirectory called
> +	 * queue-N.
> +	 */
> +	if (queue->vif->num_queues == 1)
> +		xspath = (char *)dev->otherend;

Coding style.

> +	else {

Wei.


* Re: [PATCH V2 net-next 4/5] xen-netfront: Add support for multiple queues
From: Wei Liu @ 2014-02-14 14:13 UTC (permalink / raw)
  To: Andrew J. Bennieston
  Cc: xen-devel, ian.campbell, wei.liu2, paul.durrant, netdev
In-Reply-To: <1392378624-6123-5-git-send-email-andrew.bennieston@citrix.com>

On Fri, Feb 14, 2014 at 11:50:23AM +0000, Andrew J. Bennieston wrote:
> From: "Andrew J. Bennieston" <andrew.bennieston@citrix.com>
> 
> Build on the refactoring of the previous patch to implement multiple
> queues between xen-netfront and xen-netback.
> 
> Check XenStore for multi-queue support, and set up the rings and event
> channels accordingly.
> 
> Write ring references and event channels to XenStore in a queue
> hierarchy if appropriate, or flat when using only one queue.
> 
> Update the xennet_select_queue() function to choose the queue on which
> to transmit a packet based on the skb hash result.
> 
> Signed-off-by: Andrew J. Bennieston <andrew.bennieston@citrix.com>
> ---
>  drivers/net/xen-netfront.c |  176 ++++++++++++++++++++++++++++++++++----------
>  1 file changed, 138 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index d4239b9..d584fa4 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -57,6 +57,10 @@
>  #include <xen/interface/memory.h>
>  #include <xen/interface/grant_table.h>
>  
> +/* Module parameters */
> +unsigned int xennet_max_queues;
> +module_param(xennet_max_queues, uint, 0644);
> +
>  static const struct ethtool_ops xennet_ethtool_ops;
>  
>  struct netfront_cb {
> @@ -565,10 +569,22 @@ static int xennet_count_skb_frag_slots(struct sk_buff *skb)
>  	return pages;
>  }
>  
> -static u16 xennet_select_queue(struct net_device *dev, struct sk_buff *skb)
> +static u16 xennet_select_queue(struct net_device *dev, struct sk_buff *skb,
> +							   void *accel_priv)

Indentation.

>  {
> -	/* Stub for later implementation of queue selection */
> -	return 0;
> +	struct netfront_info *info = netdev_priv(dev);
> +	u32 hash;
> +	u16 queue_idx;
> +
> +	/* First, check if there is only one queue */
> +	if (info->num_queues == 1)
> +		queue_idx = 0;

Coding style. Need to put braces around this single statement.

Wei.

^ permalink raw reply
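
The quoted hunk above stops before the multi-queue branch of xennet_select_queue(), so the hash-to-queue mapping itself is not visible here. As a rough illustration only, and assuming a plain modulo reduction of the flow hash (select_queue() and its parameters are hypothetical names, not taken from the patch), the idea can be sketched in user space as:

```c
#include <stdint.h>

/* Hypothetical helper, not from the patch: map a flow hash to a queue
 * index. With zero or one queue everything goes to queue 0; otherwise
 * the hash is reduced modulo the number of queues (an assumption).
 */
static uint16_t select_queue(uint32_t hash, unsigned int num_queues)
{
	if (num_queues <= 1)
		return 0;

	return (uint16_t)(hash % num_queues);
}
```

Packets of one flow share a hash, so they stay on one queue and are not reordered across queues; how evenly flows spread depends on the hash distribution.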

* drivers/net: tulip_remove_one needs to call pci_disable_device()
From: Sebastian Andrzej Siewior @ 2014-02-14 14:32 UTC (permalink / raw)
  To: Grant Grundler
  Cc: netdev, David S. Miller, Ingo Molnar, Thomas Gleixner,
	Sebastian Andrzej Siewior

From: Ingo Molnar <mingo@elte.hu>

Otherwise the device is not completely shut down.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
The patch is from "Date: Fri, 3 Jul 2009 08:30:18 -0500" and has been
in -RT since. Now that I stumbled upon it and didn't notice anything -RT
specific, here it comes.

diff --git a/drivers/net/ethernet/dec/tulip/tulip_core.c b/drivers/net/ethernet/dec/tulip/tulip_core.c
index add05f1..1642de7 100644
--- a/drivers/net/ethernet/dec/tulip/tulip_core.c
+++ b/drivers/net/ethernet/dec/tulip/tulip_core.c
@@ -1939,6 +1939,7 @@ static void tulip_remove_one(struct pci_dev *pdev)
 	pci_iounmap(pdev, tp->base_addr);
 	free_netdev (dev);
 	pci_release_regions (pdev);
+	pci_disable_device(pdev);
 
 	/* pci_power_off (pdev, -1); */
 }
-- 
1.9.0.rc3

^ permalink raw reply related

* [PATCH ipsec-next v2] ipsec: add support of limited SA dump
From: Nicolas Dichtel @ 2014-02-14 14:30 UTC (permalink / raw)
  To: steffen.klassert, herbert, davem
  Cc: netdev, fengguang.wu, kbuild-all, Nicolas Dichtel
In-Reply-To: <1392223581-25554-1-git-send-email-nicolas.dichtel@6wind.com>

The goal of this patch is to allow userland to dump only a subset of the SAs
by specifying a filter during the dump.
The kernel does the filtering, which avoids generating useless netlink
traffic (and also saves some CPU cycles). This is particularly useful when
a large number of SAs are configured on the system.

Note that I removed the union in struct xfrm_state_walk to fix a problem on arm.
struct netlink_callback->args is defined as an array of 6 longs, and the first
long is used in the xfrm code to flag the cb as initialized. Hence, we must have:
sizeof(struct xfrm_state_walk) <= sizeof(long) * 5.
With the union, this was false on arm (sizeof(struct xfrm_state_walk) was
sizeof(long) * 7) because of padding.
In fact, whatever the arch, this union seems useless: there will always be
padding after it. Removing it does not increase the size of the struct (and
reduces it on arm).

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---

Note that iproute2 patches are available and will be submitted once
this patch is accepted. I can send them on request.

v2: fix build on arm

 include/net/xfrm.h           | 10 +++++-----
 include/uapi/linux/pfkeyv2.h | 15 ++++++++++++++-
 include/uapi/linux/xfrm.h    | 10 ++++++++++
 net/key/af_key.c             | 19 ++++++++++++++++++-
 net/xfrm/xfrm_state.c        | 25 ++++++++++++++++++++++++-
 net/xfrm/xfrm_user.c         | 28 +++++++++++++++++++++++++++-
 6 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 5313ccfdeedf..45332acac022 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -118,11 +118,10 @@
 struct xfrm_state_walk {
 	struct list_head	all;
 	u8			state;
-	union {
-		u8		dying;
-		u8		proto;
-	};
+	u8			dying;
+	u8			proto;
 	u32			seq;
+	struct xfrm_filter	*filter;
 };
 
 /* Full description of state of transformer. */
@@ -1406,7 +1405,8 @@ static inline void xfrm_sysctl_fini(struct net *net)
 }
 #endif
 
-void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto);
+void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto,
+			  struct xfrm_filter *filter);
 int xfrm_state_walk(struct net *net, struct xfrm_state_walk *walk,
 		    int (*func)(struct xfrm_state *, int, void*), void *);
 void xfrm_state_walk_done(struct xfrm_state_walk *walk, struct net *net);
diff --git a/include/uapi/linux/pfkeyv2.h b/include/uapi/linux/pfkeyv2.h
index 0b80c806631f..ada7f0171ccc 100644
--- a/include/uapi/linux/pfkeyv2.h
+++ b/include/uapi/linux/pfkeyv2.h
@@ -235,6 +235,18 @@ struct sadb_x_kmaddress {
 } __attribute__((packed));
 /* sizeof(struct sadb_x_kmaddress) == 8 */
 
+/* To specify the SA dump filter */
+struct sadb_x_filter {
+	__u16	sadb_x_filter_len;
+	__u16	sadb_x_filter_exttype;
+	__u32	sadb_x_filter_saddr[4];
+	__u32	sadb_x_filter_daddr[4];
+	__u16	sadb_x_filter_family;
+	__u8	sadb_x_filter_splen;
+	__u8	sadb_x_filter_dplen;
+} __attribute__((packed));
+/* sizeof(struct sadb_x_filter) == 40 */
+
 /* Message types */
 #define SADB_RESERVED		0
 #define SADB_GETSPI		1
@@ -358,7 +370,8 @@ struct sadb_x_kmaddress {
 #define SADB_X_EXT_SEC_CTX		24
 /* Used with MIGRATE to pass @ to IKE for negotiation */
 #define SADB_X_EXT_KMADDRESS		25
-#define SADB_EXT_MAX			25
+#define SADB_X_EXT_FILTER		26
+#define SADB_EXT_MAX			26
 
 /* Identity Extension values */
 #define SADB_IDENTTYPE_RESERVED	0
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index a8cd6a4a2970..6550c679584f 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -298,6 +298,8 @@ enum xfrm_attr_type_t {
 	XFRMA_TFCPAD,		/* __u32 */
 	XFRMA_REPLAY_ESN_VAL,	/* struct xfrm_replay_esn */
 	XFRMA_SA_EXTRA_FLAGS,	/* __u32 */
+	XFRMA_PROTO,		/* __u8 */
+	XFRMA_FILTER,		/* struct xfrm_filter */
 	__XFRMA_MAX
 
 #define XFRMA_MAX (__XFRMA_MAX - 1)
@@ -474,6 +476,14 @@ struct xfrm_user_mapping {
 	__be16				new_sport;
 };
 
+struct xfrm_filter {
+	xfrm_address_t			saddr;
+	xfrm_address_t			daddr;
+	__u16				family;
+	__u8				splen;
+	__u8				dplen;
+};
+
 #ifndef __KERNEL__
 /* backwards compatibility for userspace */
 #define XFRMGRP_ACQUIRE		1
diff --git a/net/key/af_key.c b/net/key/af_key.c
index e1c69d024197..f0879c19f452 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -1798,6 +1798,7 @@ static void pfkey_dump_sa_done(struct pfkey_sock *pfk)
 static int pfkey_dump(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
 {
 	u8 proto;
+	struct xfrm_filter *filter = NULL;
 	struct pfkey_sock *pfk = pfkey_sk(sk);
 
 	if (pfk->dump.dump != NULL)
@@ -1807,11 +1808,27 @@ static int pfkey_dump(struct sock *sk, struct sk_buff *skb, const struct sadb_ms
 	if (proto == 0)
 		return -EINVAL;
 
+	if (ext_hdrs[SADB_X_EXT_FILTER - 1]) {
+		struct sadb_x_filter *xfilter = ext_hdrs[SADB_X_EXT_FILTER - 1];
+
+		filter = kmalloc(sizeof(*filter), GFP_KERNEL);
+		if (filter == NULL)
+			return -ENOMEM;
+
+		memcpy(&filter->saddr, &xfilter->sadb_x_filter_saddr,
+		       sizeof(xfrm_address_t));
+		memcpy(&filter->daddr, &xfilter->sadb_x_filter_daddr,
+		       sizeof(xfrm_address_t));
+		filter->family = xfilter->sadb_x_filter_family;
+		filter->splen = xfilter->sadb_x_filter_splen;
+		filter->dplen = xfilter->sadb_x_filter_dplen;
+	}
+
 	pfk->dump.msg_version = hdr->sadb_msg_version;
 	pfk->dump.msg_portid = hdr->sadb_msg_pid;
 	pfk->dump.dump = pfkey_dump_sa;
 	pfk->dump.done = pfkey_dump_sa_done;
-	xfrm_state_walk_init(&pfk->dump.u.state, proto);
+	xfrm_state_walk_init(&pfk->dump.u.state, proto, filter);
 
 	return pfkey_do_dump(pfk);
 }
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 0bf12f665b9b..a750901ac3db 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1603,6 +1603,23 @@ unlock:
 }
 EXPORT_SYMBOL(xfrm_alloc_spi);
 
+static bool __xfrm_state_filter_match(struct xfrm_state *x,
+				      struct xfrm_filter *filter)
+{
+	if (filter) {
+		if ((filter->family == AF_INET ||
+		     filter->family == AF_INET6) &&
+		    x->props.family != filter->family)
+			return false;
+
+		return addr_match(&x->props.saddr, &filter->saddr,
+				  filter->splen) &&
+		       addr_match(&x->id.daddr, &filter->daddr,
+				  filter->dplen);
+	}
+	return true;
+}
+
 int xfrm_state_walk(struct net *net, struct xfrm_state_walk *walk,
 		    int (*func)(struct xfrm_state *, int, void*),
 		    void *data)
@@ -1625,6 +1642,8 @@ int xfrm_state_walk(struct net *net, struct xfrm_state_walk *walk,
 		state = container_of(x, struct xfrm_state, km);
 		if (!xfrm_id_proto_match(state->id.proto, walk->proto))
 			continue;
+		if (!__xfrm_state_filter_match(state, walk->filter))
+			continue;
 		err = func(state, walk->seq, data);
 		if (err) {
 			list_move_tail(&walk->all, &x->all);
@@ -1643,17 +1662,21 @@ out:
 }
 EXPORT_SYMBOL(xfrm_state_walk);
 
-void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
+void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto,
+			  struct xfrm_filter *filter)
 {
 	INIT_LIST_HEAD(&walk->all);
 	walk->proto = proto;
 	walk->state = XFRM_STATE_DEAD;
 	walk->seq = 0;
+	walk->filter = filter;
 }
 EXPORT_SYMBOL(xfrm_state_walk_init);
 
 void xfrm_state_walk_done(struct xfrm_state_walk *walk, struct net *net)
 {
+	kfree(walk->filter);
+
 	if (list_empty(&walk->all))
 		return;
 
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index d7694f258294..023e5e7ea4c6 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -887,6 +887,7 @@ static int xfrm_dump_sa_done(struct netlink_callback *cb)
 	return 0;
 }
 
+static const struct nla_policy xfrma_policy[XFRMA_MAX+1];
 static int xfrm_dump_sa(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct net *net = sock_net(skb->sk);
@@ -902,8 +903,31 @@ static int xfrm_dump_sa(struct sk_buff *skb, struct netlink_callback *cb)
 	info.nlmsg_flags = NLM_F_MULTI;
 
 	if (!cb->args[0]) {
+		struct nlattr *attrs[XFRMA_MAX+1];
+		struct xfrm_filter *filter = NULL;
+		u8 proto = 0;
+		int err;
+
 		cb->args[0] = 1;
-		xfrm_state_walk_init(walk, 0);
+
+		err = nlmsg_parse(cb->nlh, 0, attrs, XFRMA_MAX,
+				  xfrma_policy);
+		if (err < 0)
+			return err;
+
+		if (attrs[XFRMA_FILTER]) {
+			filter = kmalloc(sizeof(*filter), GFP_KERNEL);
+			if (filter == NULL)
+				return -ENOMEM;
+
+			memcpy(filter, nla_data(attrs[XFRMA_FILTER]),
+			       sizeof(*filter));
+		}
+
+		if (attrs[XFRMA_PROTO])
+			proto = nla_get_u8(attrs[XFRMA_PROTO]);
+
+		xfrm_state_walk_init(walk, proto, filter);
 	}
 
 	(void) xfrm_state_walk(net, walk, dump_one_state, &info);
@@ -2309,6 +2333,8 @@ static const struct nla_policy xfrma_policy[XFRMA_MAX+1] = {
 	[XFRMA_TFCPAD]		= { .type = NLA_U32 },
 	[XFRMA_REPLAY_ESN_VAL]	= { .len = sizeof(struct xfrm_replay_state_esn) },
 	[XFRMA_SA_EXTRA_FLAGS]	= { .type = NLA_U32 },
+	[XFRMA_PROTO]		= { .type = NLA_U8 },
+	[XFRMA_FILTER]		= { .len = sizeof(struct xfrm_filter) },
 };
 
 static const struct xfrm_link {
-- 
1.8.5.4

^ permalink raw reply related
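
The filter in the patch above compares the SA's source and destination addresses against the filter addresses under the prefix lengths splen and dplen. A minimal user-space sketch of that kind of prefix comparison follows; prefix_match() is a hypothetical stand-in written for illustration and is not the kernel's addr_match():

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return true if the first prefixlen bits of the
 * two addresses (in network byte order) are equal. The caller must make
 * sure prefixlen does not exceed the address length in bits.
 */
static bool prefix_match(const void *a, const void *b, unsigned int prefixlen)
{
	const uint8_t *pa = a;
	const uint8_t *pb = b;
	unsigned int full = prefixlen / 8;	/* whole bytes to compare */
	unsigned int rem = prefixlen % 8;	/* leftover leading bits */

	if (full && memcmp(pa, pb, full) != 0)
		return false;

	if (rem) {
		uint8_t mask = (uint8_t)(0xff << (8 - rem));

		if ((pa[full] ^ pb[full]) & mask)
			return false;
	}

	return true;
}
```

A prefix length of 0 matches every address, which corresponds to a filter that does not restrict that side.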

* Does ICMP_FRAG_NEEDED automatically update the routing cache?
From: David Howells @ 2014-02-14 14:36 UTC (permalink / raw)
  To: netdev; +Cc: dhowells


Am I reading the ipv4 code right?  If a UDP packet we send results in an
ICMP_FRAG_NEEDED packet being received, the cached routing information for the
peer will automatically be updated by __udp4_lib_err()?

And something similar in __udp6_lib_err()?

David

^ permalink raw reply

* Re: [PATCH v3] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Vlad Yasevich @ 2014-02-14 14:41 UTC (permalink / raw)
  To: Matija Glavinic Pecotic, linux-sctp@vger.kernel.org
  Cc: netdev@vger.kernel.org, Alexander Sverdlin
In-Reply-To: <52FE1F56.2020007@nsn.com>

On 02/14/2014 08:51 AM, Matija Glavinic Pecotic wrote:
> 
> The proposed solution simplifies the whole algorithm, keeping in mind the
> definition from the RFC:
> 
> o  Receiver Window (rwnd): This gives the sender an indication of the space
>    available in the receiver's inbound buffer.
> 
> Core of the proposed solution is given with these lines:
> 
> sctp_assoc_rwnd_update:
> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> 	else
> 		asoc->rwnd = 0;
> 
> We advertise to the sender (half of) the space we actually have. "Half" is in
> parentheses because it depends on whether you regard the socket buffer size
> as the SO_RCVBUF value visible from userspace or as twice that amount, which
> is the value seen in kernelspace.
> This gives the sender a good approximation of our buffer space,
> regardless of the buffer policy: we always advertise what we have. The proposed
> solution fixes the described problems and removes the need for the rwnd
> restoration algorithm. Finally, as the proposed solution is a simplification,
> some lines of code, along with some bytes in struct sctp_association, are saved.
> 
> Version 2 of the patch addressed comments from Vlad. Name of the function is set
> to be more descriptive, and two parts of code are changed, in one removing the
> superfluous call to sctp_assoc_rwnd_update since call would not result in update
> of rwnd, and the other being reordering of the code in a way that call to
> sctp_assoc_rwnd_update updates rwnd. Version 3 corrected change introduced in v2
> in a way that existing function is not reordered/copied in line, but it is
> correctly called. Thanks Vlad for suggesting.
> 
> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
> 

Acked-by: Vlad Yasevich <vyasevich@gmail.com>

-vlad

> ---
> 
>  - v1 -> v2
> 	- Taken into account comments from Vlad, name of the function set to be
> 	  more descriptive. Two code paths were changed, one removed superfluous
> 	  call. In second code reordering.
>  - v2 -> v3
> 	- Open-coded function removed and written properly. Thanks Vlad for
> 	  suggesting
> 
> --- net-next.orig/net/sctp/associola.c
> +++ net-next/net/sctp/associola.c
> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
>  	return false;
>  }
>  
> -/* Increase asoc's rwnd by len and send any window update SACK if needed. */
> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
> +/* Update asoc's rwnd for the approximated state in the buffer,
> + * and check whether SACK needs to be sent.
> + */
> +void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
>  {
> +	int rx_count;
>  	struct sctp_chunk *sack;
>  	struct timer_list *timer;
>  
> -	if (asoc->rwnd_over) {
> -		if (asoc->rwnd_over >= len) {
> -			asoc->rwnd_over -= len;
> -		} else {
> -			asoc->rwnd += (len - asoc->rwnd_over);
> -			asoc->rwnd_over = 0;
> -		}
> -	} else {
> -		asoc->rwnd += len;
> -	}
> +	if (asoc->ep->rcvbuf_policy)
> +		rx_count = atomic_read(&asoc->rmem_alloc);
> +	else
> +		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>  
> -	/* If we had window pressure, start recovering it
> -	 * once our rwnd had reached the accumulated pressure
> -	 * threshold.  The idea is to recover slowly, but up
> -	 * to the initial advertised window.
> -	 */
> -	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
> -		int change = min(asoc->pathmtu, asoc->rwnd_press);
> -		asoc->rwnd += change;
> -		asoc->rwnd_press -= change;
> -	}
> +	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> +		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> +	else
> +		asoc->rwnd = 0;
>  
> -	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> -		 asoc->a_rwnd);
> +	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
> +		 __func__, asoc, asoc->rwnd, rx_count,
> +		 asoc->base.sk->sk_rcvbuf);
>  
>  	/* Send a window update SACK if the rwnd has increased by at least the
>  	 * minimum of the association's PMTU and half of the receive buffer.
>  	 * The algorithm used is similar to the one described in
>  	 * Section 4.2.3.3 of RFC 1122.
>  	 */
> -	if (sctp_peer_needs_update(asoc)) {
> +	if (update_peer && sctp_peer_needs_update(asoc)) {
>  		asoc->a_rwnd = asoc->rwnd;
>  
>  		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
>  	}
>  }
>  
> -/* Decrease asoc's rwnd by len. */
> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
> -{
> -	int rx_count;
> -	int over = 0;
> -
> -	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
> -		pr_debug("%s: association:%p has asoc->rwnd:%u, "
> -			 "asoc->rwnd_over:%u!\n", __func__, asoc,
> -			 asoc->rwnd, asoc->rwnd_over);
> -
> -	if (asoc->ep->rcvbuf_policy)
> -		rx_count = atomic_read(&asoc->rmem_alloc);
> -	else
> -		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
> -
> -	/* If we've reached or overflowed our receive buffer, announce
> -	 * a 0 rwnd if rwnd would still be positive.  Store the
> -	 * the potential pressure overflow so that the window can be restored
> -	 * back to original value.
> -	 */
> -	if (rx_count >= asoc->base.sk->sk_rcvbuf)
> -		over = 1;
> -
> -	if (asoc->rwnd >= len) {
> -		asoc->rwnd -= len;
> -		if (over) {
> -			asoc->rwnd_press += asoc->rwnd;
> -			asoc->rwnd = 0;
> -		}
> -	} else {
> -		asoc->rwnd_over = len - asoc->rwnd;
> -		asoc->rwnd = 0;
> -	}
> -
> -	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> -		 asoc->rwnd_press);
> -}
>  
>  /* Build the bind address list for the association based on info from the
>   * local endpoint and the remote peer.
> --- net-next.orig/include/net/sctp/structs.h
> +++ net-next/include/net/sctp/structs.h
> @@ -1653,17 +1653,6 @@ struct sctp_association {
>  	/* This is the last advertised value of rwnd over a SACK chunk. */
>  	__u32 a_rwnd;
>  
> -	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
> -	 * to slop over a maximum of the association's frag_point.
> -	 */
> -	__u32 rwnd_over;
> -
> -	/* Keeps treack of rwnd pressure.  This happens when we have
> -	 * a window, but not recevie buffer (i.e small packets).  This one
> -	 * is releases slowly (1 PMTU at a time ).
> -	 */
> -	__u32 rwnd_press;
> -
>  	/* This is the sndbuf size in use for the association.
>  	 * This corresponds to the sndbuf size for the association,
>  	 * as specified in the sk->sndbuf.
> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
>  __u32 sctp_association_get_next_tsn(struct sctp_association *);
>  
>  void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
> +void sctp_assoc_rwnd_update(struct sctp_association *, bool);
>  void sctp_assoc_set_primary(struct sctp_association *,
>  			    struct sctp_transport *);
>  void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
> --- net-next.orig/net/sctp/sm_statefuns.c
> +++ net-next/net/sctp/sm_statefuns.c
> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
>  	 * PMTU.  In cases, such as loopback, this might be a rather
>  	 * large spill over.
>  	 */
> -	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
> +	if ((!chunk->data_accepted) && (!asoc->rwnd ||
>  	    (datalen > asoc->rwnd + asoc->frag_point))) {
>  
>  		/* If this is the next TSN, consider reneging to make
> --- net-next.orig/net/sctp/socket.c
> +++ net-next/net/sctp/socket.c
> @@ -2092,12 +2092,6 @@ static int sctp_recvmsg(struct kiocb *io
>  		sctp_skb_pull(skb, copied);
>  		skb_queue_head(&sk->sk_receive_queue, skb);
>  
> -		/* When only partial message is copied to the user, increase
> -		 * rwnd by that amount. If all the data in the skb is read,
> -		 * rwnd is updated when the event is freed.
> -		 */
> -		if (!sctp_ulpevent_is_notification(event))
> -			sctp_assoc_rwnd_increase(event->asoc, copied);
>  		goto out;
>  	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
>  		   (event->msg_flags & MSG_EOR))
> --- net-next.orig/net/sctp/ulpevent.c
> +++ net-next/net/sctp/ulpevent.c
> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
>  	skb = sctp_event2skb(event);
>  	/* Set the owner and charge rwnd for bytes received.  */
>  	sctp_ulpevent_set_owner(event, asoc);
> -	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
> +	sctp_assoc_rwnd_update(asoc, false);
>  
>  	if (!skb->data_len)
>  		return;
> @@ -1011,6 +1011,7 @@ static void sctp_ulpevent_release_data(s
>  {
>  	struct sk_buff *skb, *frag;
>  	unsigned int	len;
> +	struct sctp_association *asoc;
>  
>  	/* Current stack structures assume that the rcv buffer is
>  	 * per socket.   For UDP style sockets this is not true as
> @@ -1035,8 +1036,11 @@ static void sctp_ulpevent_release_data(s
>  	}
>  
>  done:
> -	sctp_assoc_rwnd_increase(event->asoc, len);
> +	asoc = event->asoc;
> +	sctp_association_hold(asoc);
>  	sctp_ulpevent_release_owner(event);
> +	sctp_assoc_rwnd_update(asoc, true);
> +	sctp_association_put(asoc);
>  }
>  
>  static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)

^ permalink raw reply
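
The rwnd calculation quoted in the message above can be tried out in isolation. The sketch below mirrors the quoted lines in user space; rwnd_from_buffer() is a hypothetical name, and plain integers stand in for sk_rcvbuf and the rmem_alloc counters:

```c
#include <stdint.h>

/* Hypothetical user-space mirror of the quoted calculation: advertise
 * half of the free receive-buffer space, or 0 when the buffer is full
 * or overcommitted.
 */
static uint32_t rwnd_from_buffer(int sk_rcvbuf, int rx_count)
{
	if (sk_rcvbuf - rx_count > 0)
		return (uint32_t)(sk_rcvbuf - rx_count) >> 1;

	return 0;
}
```

Because the window is recomputed from the current buffer occupancy each time, no separate state is needed to remember how far the window was overrun.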

* Re: Does ICMP_FRAG_NEEDED automatically update the routing cache?
From: Hannes Frederic Sowa @ 2014-02-14 14:44 UTC (permalink / raw)
  To: David Howells; +Cc: netdev
In-Reply-To: <26474.1392388608@warthog.procyon.org.uk>

Hi!

On Fri, Feb 14, 2014 at 02:36:48PM +0000, David Howells wrote:
> Am I reading the ipv4 code right?  If a UDP packet we send results in an
> ICMP_FRAG_NEEDED packet being received, the cached routing information for the
> peer will automatically be updated by __udp4_lib_err()?

A next hop exception will be generated (or reused) which stores the path
mtu towards that target, yes (there is no more routing cache).

Prior to that a validation check happens if the socket really exists (this is
e.g. needed to identify the namespace or routing table the update should occur
on).

> And something similar in __udp6_lib_err()?

Exactly.

Greetings,

  Hannes

^ permalink raw reply
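
One way to observe the path MTU the kernel currently holds for a destination is the Linux IP_MTU socket option, which is only valid on a connected socket. The following is a user-space sketch under that assumption; query_path_mtu() is a hypothetical helper and is not part of any code discussed in this thread:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (Linux only): connect a UDP socket to ip:port and
 * return the kernel's current path MTU estimate for that destination,
 * or -1 on any error.
 */
static int query_path_mtu(const char *ip, uint16_t port)
{
	struct sockaddr_in dst;
	int val = IP_PMTUDISC_DO;	/* set DF, never fragment locally */
	int mtu = -1;
	socklen_t len = sizeof(mtu);
	int fd;

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	/* IP_MTU can only be read once the socket is connected. */
	if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val)) < 0 ||
	    connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0 ||
	    getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) < 0)
		mtu = -1;

	close(fd);
	return mtu;
}
```

Calling it again after a datagram has triggered an ICMP_FRAG_NEEDED reply should show the reduced value, provided the kernel accepted the ICMP error as described above.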

* Re: [PATCH V2 net-next 0/5] xen-net{back,front}: Multiple transmit and receive queues
From: Andrew Bennieston @ 2014-02-14 14:53 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, ian.campbell, paul.durrant, netdev
In-Reply-To: <20140214140635.GA18398@zion.uk.xensource.com>

On 14/02/14 14:06, Wei Liu wrote:
> On Fri, Feb 14, 2014 at 11:50:19AM +0000, Andrew J. Bennieston wrote:
>>
>> This patch series implements multiple transmit and receive queues (i.e.
>> multiple shared rings) for the xen virtual network interfaces.
>>
>> The series is split up as follows:
>>   - Patches 1 and 3 factor out the queue-specific data for netback and
>>      netfront respectively, and modify the rest of the code to use these
>>      as appropriate.
>>   - Patches 2 and 4 introduce new XenStore keys to negotiate and use
>>     multiple shared rings and event channels, and code to connect these
>>     as appropriate.
>>   - Patch 5 documents the XenStore keys required for the new feature
>>     in include/xen/interface/io/netif.h
>>
>> All other transmit and receive processing remains unchanged, i.e. there
>> is a kthread per queue and a NAPI context per queue.
>>
>> The performance of these patches has been analysed in detail, with
>> results available at:
>>
>> http://wiki.xenproject.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing
>>
>> To summarise:
>>    * Using multiple queues allows a VM to transmit at line rate on a 10
>>      Gbit/s NIC, compared with a maximum aggregate throughput of 6 Gbit/s
>>      with a single queue.
>>    * For intra-host VM--VM traffic, eight queues provide 171% of the
>>      throughput of a single queue; almost 12 Gbit/s instead of 6 Gbit/s.
>>    * There is a corresponding increase in total CPU usage, i.e. this is a
>>      scaling out over available resources, not an efficiency improvement.
>>    * Results depend on the availability of sufficient CPUs, as well as the
>>      distribution of interrupts and the distribution of TCP streams across
>>      the queues.
>>
>> Queue selection is currently achieved via an L4 hash on the packet (i.e.
>> TCP src/dst port, IP src/dst address) and is not negotiated between the
>> frontend and backend, since only one option exists. Future patches to
>> support other frontends (particularly Windows) will need to add some
>> capability to negotiate not only the hash algorithm selection, but also
>> allow the frontend to specify some parameters to this.
>>
>
> This has an impact on the protocol. If the key to select hash algorithm
> is missing then we're assuming L4 is in use.
>
> This either needs to be documented (which is missing in your patch to
> netif.h) or you need to write that key explicitly in XenStore.
>
> I also have a question what would happen if one end advertises one hash
> algorithm then use a different one. This can happen when the
> driver is rogue or buggy. Will it cause the "good guy" to stall? We
> certainly don't want to stall backend, at the very least.

I'm not sure I understand. There is no negotiable selection of hash 
algorithm here. This paragraph refers to a possible future in which we 
may have to support multiple such. These issues will absolutely have to 
be addressed then, but it is completely irrelevant for now.

Andrew.
>
> I don't see relevant code in this series to handle "rogue other end". I
> presume for a simple hash algorithm like L4 is not very important (say,
> even a packet ends up in the wrong queue we can still safely process
> it), or core driver can deal with this all by itself (dropping)?
>
> Wei.
>

^ permalink raw reply

* Re: [PATCH V2 net-next 2/5] xen-netback: Add support for multiple queues
From: Andrew Bennieston @ 2014-02-14 14:57 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, ian.campbell, paul.durrant, netdev
In-Reply-To: <20140214141125.GB18398@zion.uk.xensource.com>

On 14/02/14 14:11, Wei Liu wrote:
> On Fri, Feb 14, 2014 at 11:50:21AM +0000, Andrew J. Bennieston wrote:
> [...]
>>
>> +extern unsigned int xenvif_max_queues;
>> +
>>   #endif /* __XEN_NETBACK__COMMON_H__ */
>> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
>> index 4cde112..4dc092c 100644
>> --- a/drivers/net/xen-netback/interface.c
>> +++ b/drivers/net/xen-netback/interface.c
>> @@ -373,7 +373,12 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
>>   	char name[IFNAMSIZ] = {};
>>
>>   	snprintf(name, IFNAMSIZ - 1, "vif%u.%u", domid, handle);
>> -	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup, 1);
>> +	/* Allocate a netdev with the max. supported number of queues.
>> +	 * When the guest selects the desired number, it will be updated
>> +	 * via netif_set_real_num_tx_queues().
>> +	 */
>> +	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup,
>> +						  xenvif_max_queues);
>
> Indentation.

How would you like this to be indented? The CodingStyle says (and I quote):
Chapter 2: Breaking long lines and strings:
	... descendants are always substantially shorter than the
	parent and placed substantially to the right...

There is no further advice to this point in CodingStyle, so please 
explain how you'd prefer this.

>
>>   	if (dev == NULL) {
>>   		pr_warn("Could not allocate netdev for %s\n", name);
>>   		return ERR_PTR(-ENOMEM);
>> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
>> index 46b2f5b..aeb5ffa 100644
>> --- a/drivers/net/xen-netback/netback.c
>> +++ b/drivers/net/xen-netback/netback.c
>> @@ -54,6 +54,9 @@
> [...]
>> @@ -490,6 +497,23 @@ static void connect(struct backend_info *be)
>>   	unsigned long credit_bytes, credit_usec;
>>   	unsigned int queue_index;
>>   	struct xenvif_queue *queue;
>> +	unsigned int requested_num_queues;
>> +
>> +	/* Check whether the frontend requested multiple queues
>> +	 * and read the number requested.
>> +	 */
>> +	err = xenbus_scanf(XBT_NIL, dev->otherend,
>> +			"multi-queue-num-queues",
>> +			"%u", &requested_num_queues);
>> +	if (err < 0) {
>> +		requested_num_queues = 1; /* Fall back to single queue */
>> +	} else if (requested_num_queues > xenvif_max_queues) {
>> +		/* buggy or malicious guest */
>> +		xenbus_dev_fatal(dev, err,
>> +						 "guest requested %u queues, exceeding the maximum of %u.",
>> +						 requested_num_queues, xenvif_max_queues);
>
> Indentation.
Ditto.

>
>> +		return;
>> +	}
>>
> [...]
>> @@ -547,29 +575,52 @@ static int connect_rings(struct backend_info *be, struct xenvif_queue *queue)
>>   	unsigned long tx_ring_ref, rx_ring_ref;
>>   	unsigned int tx_evtchn, rx_evtchn;
>>   	int err;
>> +	char *xspath = NULL;
>> +	size_t xspathsize;
>> +	const size_t xenstore_path_ext_size = 11; /* sufficient for "/queue-NNN" */
>> +
>> +	/* If the frontend requested 1 queue, or we have fallen back
>> +	 * to single queue due to lack of frontend support for multi-
>> +	 * queue, expect the remaining XenStore keys in the toplevel
>> +	 * directory. Otherwise, expect them in a subdirectory called
>> +	 * queue-N.
>> +	 */
>> +	if (queue->vif->num_queues == 1)
>> +		xspath = (char *)dev->otherend;
>
> Coding style.
>
Ok; I thought I'd caught all of those. I'll change it.

>> +	else {
>
> Wei.
>

^ permalink raw reply

* Re: [PATCH V2 net-next 4/5] xen-netfront: Add support for multiple queues
From: Andrew Bennieston @ 2014-02-14 14:58 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, ian.campbell, paul.durrant, netdev
In-Reply-To: <20140214141332.GC18398@zion.uk.xensource.com>

On 14/02/14 14:13, Wei Liu wrote:
> On Fri, Feb 14, 2014 at 11:50:23AM +0000, Andrew J. Bennieston wrote:
>> From: "Andrew J. Bennieston" <andrew.bennieston@citrix.com>
>>
>> Build on the refactoring of the previous patch to implement multiple
>> queues between xen-netfront and xen-netback.
>>
>> Check XenStore for multi-queue support, and set up the rings and event
>> channels accordingly.
>>
>> Write ring references and event channels to XenStore in a queue
>> hierarchy if appropriate, or flat when using only one queue.
>>
>> Update the xennet_select_queue() function to choose the queue on which
>> to transmit a packet based on the skb hash result.
>>
>> Signed-off-by: Andrew J. Bennieston <andrew.bennieston@citrix.com>
>> ---
>>   drivers/net/xen-netfront.c |  176 ++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 138 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> index d4239b9..d584fa4 100644
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -57,6 +57,10 @@
>>   #include <xen/interface/memory.h>
>>   #include <xen/interface/grant_table.h>
>>
>> +/* Module parameters */
>> +unsigned int xennet_max_queues;
>> +module_param(xennet_max_queues, uint, 0644);
>> +
>>   static const struct ethtool_ops xennet_ethtool_ops;
>>
>>   struct netfront_cb {
>> @@ -565,10 +569,22 @@ static int xennet_count_skb_frag_slots(struct sk_buff *skb)
>>   	return pages;
>>   }
>>
>> -static u16 xennet_select_queue(struct net_device *dev, struct sk_buff *skb)
>> +static u16 xennet_select_queue(struct net_device *dev, struct sk_buff *skb,
>> +							   void *accel_priv)
>
> Indentation.
>
>>   {
>> -	/* Stub for later implementation of queue selection */
>> -	return 0;
>> +	struct netfront_info *info = netdev_priv(dev);
>> +	u32 hash;
>> +	u16 queue_idx;
>> +
>> +	/* First, check if there is only one queue */
>> +	if (info->num_queues == 1)
>> +		queue_idx = 0;
>
> Coding style. Need to put braces around this single statement.
>

Good catch; thanks.
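
For reference, a minimal user-space sketch of the selection logic with braces on both branches, which is what CodingStyle asks for once either branch needs them. The name pick_queue() and the plain modulo reduction are assumptions for illustration only; the patch derives the hash from the skb and may reduce it differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a 32-bit flow hash onto one of num_queues queues. */
static uint16_t pick_queue(uint32_t hash, uint16_t num_queues)
{
	uint16_t queue_idx;

	assert(num_queues > 0);

	/* Braces on both branches, because the else branch needs them. */
	if (num_queues == 1) {
		queue_idx = 0;
	} else {
		/* Assumed reduction: modulo keeps the index in [0, num_queues). */
		uint32_t reduced = hash % num_queues;

		queue_idx = (uint16_t)reduced;
	}

	return queue_idx;
}
```

With a single queue every packet lands on queue 0; otherwise packets with the same hash always map to the same queue.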

> Wei.
>

^ permalink raw reply

* Re: Does ICMP_FRAG_NEEDED automatically update the routing cache?
From: David Howells @ 2014-02-14 15:00 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: dhowells, netdev
In-Reply-To: <20140214144412.GA27343@order.stressinduktion.org>

Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:

> On Fri, Feb 14, 2014 at 02:36:48PM +0000, David Howells wrote:
> > Am I reading the ipv4 code right?  If a UDP packet we send results in an
> > ICMP_FRAG_NEEDED packet being received, the cached routing information for
> > the peer will automatically be updated by __udp4_lib_err()?
> 
> A next hop exception will be generated (or reused) which stores the path
> mtu towards that target, yes (there is no more routing cache).
> 
> Prior to that a validation check happens if the socket really exists (this is
> e.g. needed to identify the namespace or routing table the update should occur
> on).

Sounds good.  Does this work even if the socket is not connected (ie. the UDP
packets are being routed by the address fields in struct msghdr)?

David

^ permalink raw reply

* Re: Does ICMP_FRAG_NEEDED automatically update the routing cache?
From: Hannes Frederic Sowa @ 2014-02-14 15:03 UTC (permalink / raw)
  To: David Howells; +Cc: netdev
In-Reply-To: <26769.1392390042@warthog.procyon.org.uk>

On Fri, Feb 14, 2014 at 03:00:42PM +0000, David Howells wrote:
> Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> 
> > On Fri, Feb 14, 2014 at 02:36:48PM +0000, David Howells wrote:
> > > Am I reading the ipv4 code right?  If a UDP packet we send results in an
> > > ICMP_FRAG_NEEDED packet being received, the cached routing information for
> > > the peer will automatically be updated by __udp4_lib_err()?
> > 
> > A next hop exception will be generated (or reused) which stores the path
> > mtu towards that target, yes (there is no more routing cache).
> > 
> > Prior to that a validation check happens if the socket really exists (this is
> > e.g. needed to identify the namespace or routing table the update should occur
> > on).
> 
> Sounds good.  Does this work even if the socket is not connected (ie. the UDP
> packets are being routed by the address fields in struct msghdr)?

Yes, but connected sockets are checked prior to unconnected sockets, so the
most specific one wins.

For unconnected ones only the local ip/port is checked, because the kernel
does not know the past destination addresses.
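
To illustrate the ordering, here is a simplified user-space sketch. It is not the kernel's lookup code, and struct sock_entry and find_error_socket() are made-up names. Connected sockets are matched on the full 4-tuple first; only then are unconnected sockets matched on local ip/port.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified socket record; not a kernel structure. */
struct sock_entry {
	uint32_t local_ip;
	uint16_t local_port;
	uint32_t remote_ip;	/* meaningful only when connected != 0 */
	uint16_t remote_port;	/* meaningful only when connected != 0 */
	int connected;
};

/*
 * Return the index of the socket an ICMP error should be attributed to,
 * or -1 if no socket matches.
 */
static int find_error_socket(const struct sock_entry *tbl, size_t n,
			     uint32_t lip, uint16_t lport,
			     uint32_t rip, uint16_t rport)
{
	size_t i;

	/* Pass 1: connected sockets must match the full 4-tuple. */
	for (i = 0; i < n; i++) {
		if (tbl[i].connected &&
		    tbl[i].local_ip == lip && tbl[i].local_port == lport &&
		    tbl[i].remote_ip == rip && tbl[i].remote_port == rport)
			return (int)i;
	}

	/* Pass 2: unconnected sockets match on local address and port only. */
	for (i = 0; i < n; i++) {
		if (!tbl[i].connected &&
		    tbl[i].local_ip == lip && tbl[i].local_port == lport)
			return (int)i;
	}

	return -1;
}
```

An error for a destination that no connected socket is bound to falls through to the unconnected socket on the same local port, if one exists.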

Greetings,

  Hannes

^ permalink raw reply

* [PATCH] ss: Add support for retrieving SELinux contexts
From: Richard Haines @ 2014-02-14 15:20 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA; +Cc: selinux-+05T5uksL2qpZYMLLGbcSA

The process SELinux contexts can be added to the output using the -Z
option. Using the -z option will show the process and socket contexts (see
the man page for details).
For netlink sockets: if the pid refers to a valid process, show the process
context; if pid = 0, show the kernel initial context; otherwise show
"not available".

Signed-off-by: Richard Haines <richard_c_haines-FhtRXb7CoQBt1OO0OYaSVA@public.gmane.org>
---
 configure     |  16 +++
 man/man8/ss.8 |  34 ++++++
 misc/Makefile |  12 ++
 misc/ss.c     | 375 ++++++++++++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 387 insertions(+), 50 deletions(-)

diff --git a/configure b/configure
index da01c19..854837e 100755
--- a/configure
+++ b/configure
@@ -231,6 +231,19 @@ EOF
     rm -f $TMPDIR/ipsettest.c $TMPDIR/ipsettest
 }
 
+check_selinux()
+# SELinux is a compile time option in the ss utility
+{
+	SELINUX_LIB=$(${PKG_CONFIG} --silence-errors libselinux --libs)
+	if [ -n "$SELINUX_LIB" ]
+	then
+		echo "HAVE_SELINUX:=y" >>Config
+		echo "yes"
+	else
+		echo "no"
+	fi
+}
+
 echo "# Generated config based on" $INCLUDE >Config
 check_toolchain
 
@@ -253,3 +266,6 @@ check_ipt_lib_dir
 
 echo -n "libc has setns: "
 check_setns
+
+echo -n "SELinux support: "
+check_selinux
diff --git a/man/man8/ss.8 b/man/man8/ss.8
index 807d9dc..d6e43ba 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -53,6 +53,37 @@ Print summary statistics. This option does not parse socket lists obtaining
 summary from various sources. It is useful when amount of sockets is so huge
 that parsing /proc/net/tcp is painful.
 .TP
+.B \-Z, \-\-context
+As the
+.B \-p
+option but also shows process security context.
+.sp
+For
+.BR netlink (7)
+sockets the initiating process context is displayed as follows:
+.RS
+.RS
+.IP "1." 4
+If the pid is valid, show the process context.
+.IP "2." 4
+If the destination is the kernel (pid = 0), show the kernel initial context.
+.IP "3." 4
+If a unique identifier has been allocated by the kernel or netlink user,
+show context as "not available". This will generally indicate that a
+process has more than one netlink socket active.
+.RE
+.RE
+.TP
+.B \-z, \-\-contexts
+As the
+.B \-Z
+option but also shows the socket context. The socket context is
+taken from the associated inode and is not the actual socket
+context held by the kernel. Sockets are typically labeled with the
+context of the creating process, however the context shown will reflect
+any policy role, type and/or range transition rules applied,
+and is therefore a useful reference.
+.TP
 .B \-b, \-\-bpf
 Show socket BPF filters (only administrators are allowed to get these information).
 .TP
@@ -103,6 +134,9 @@ Please take a look at the official documentation (Debian package iproute-doc) fo
 .B ss -t -a
 Display all TCP sockets.
 .TP
+.B ss -t -a -Z
+Display all TCP sockets with process SELinux security contexts.
+.TP
 .B ss -u -a
 Display all UDP sockets.
 .TP
diff --git a/misc/Makefile b/misc/Makefile
index a59ff87..a946a85 100644
--- a/misc/Makefile
+++ b/misc/Makefile
@@ -8,6 +8,18 @@ include ../Config
 all: $(TARGETS)
 
 ss: $(SSOBJ)
+ifeq ($(HAVE_SELINUX),y)
+	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $(SSOBJ) $(LDLIBS) -lselinux
+else
+	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $(SSOBJ) $(LDLIBS)
+endif
+
+ss.o: ss.c
+ifeq ($(HAVE_SELINUX),y)
+	$(CC) $(CFLAGS) -DHAVE_SELINUX -c $+
+else
+	$(CC) $(CFLAGS) -c $+
+endif
 
 nstat: nstat.c
 	$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c -lm
diff --git a/misc/ss.c b/misc/ss.c
index 764ffe2..e14c645 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -40,6 +40,9 @@
 #include <linux/filter.h>
 #include <linux/packet_diag.h>
 #include <linux/netlink_diag.h>
+#if HAVE_SELINUX
+#include <selinux/selinux.h>
+#endif
 
 int resolve_hosts = 0;
 int resolve_services = 1;
@@ -50,6 +53,12 @@ int show_users = 0;
 int show_mem = 0;
 int show_tcpinfo = 0;
 int show_bpf = 0;
+#if HAVE_SELINUX
+int show_proc_ctx = 0;
+int show_sock_ctx = 0;
+/* If show_users & show_proc_ctx only do user_ent_hash_build() once */
+int user_ent_hash_build_init = 0;
+#endif
 
 int netid_width;
 int state_width;
@@ -207,7 +216,11 @@ struct user_ent {
 	unsigned int	ino;
 	int		pid;
 	int		fd;
-	char		process[0];
+	char	*process;
+#if HAVE_SELINUX
+	security_context_t	process_ctx;
+	security_context_t	socket_ctx;
+#endif
 };
 
 #define USER_ENT_HASH_SIZE	256
@@ -220,26 +233,58 @@ static int user_ent_hashfn(unsigned int ino)
 	return val & (USER_ENT_HASH_SIZE - 1);
 }
 
-static void user_ent_add(unsigned int ino, const char *process, int pid, int fd)
+#if HAVE_SELINUX
+static void user_ent_add(unsigned int ino, char *process,
+					int pid, int fd,
+					security_context_t proc_ctx,
+					security_context_t sock_ctx)
+#else
+static void user_ent_add(unsigned int ino, char *process, int pid, int fd)
+#endif
 {
 	struct user_ent *p, **pp;
-	int str_len;
 
-	str_len = strlen(process) + 1;
-	p = malloc(sizeof(struct user_ent) + str_len);
-	if (!p)
+	p = malloc(sizeof(struct user_ent));
+	if (!p) {
+		fprintf(stderr, "ss: failed to malloc buffer\n");
 		abort();
+	}
 	p->next = NULL;
 	p->ino = ino;
 	p->pid = pid;
 	p->fd = fd;
-	strcpy(p->process, process);
+	p->process = strdup(process);
+#if HAVE_SELINUX
+	p->process_ctx = strdup(proc_ctx);
+	p->socket_ctx = strdup(sock_ctx);
+#endif
 
 	pp = &user_ent_hash[user_ent_hashfn(ino)];
 	p->next = *pp;
 	*pp = p;
 }
 
+static void user_ent_destroy(void)
+{
+	struct user_ent *p, *p_next;
+	int cnt = 0;
+
+	while (cnt != USER_ENT_HASH_SIZE) {
+		p = user_ent_hash[cnt];
+		while (p) {
+			free(p->process);
+#if HAVE_SELINUX
+			freecon(p->process_ctx);
+			freecon(p->socket_ctx);
+#endif
+			p_next = p->next;
+			free(p);
+			p = p_next;
+		}
+		cnt++;
+	}
+}
+
 static void user_ent_hash_build(void)
 {
 	const char *root = getenv("PROC_ROOT") ? : "/proc/";
@@ -247,6 +292,17 @@ static void user_ent_hash_build(void)
 	char name[1024];
 	int nameoff;
 	DIR *dir;
+#if HAVE_SELINUX
+	security_context_t pid_context;
+	security_context_t sock_context;
+	security_context_t no_ctx = "not available";
+
+	/* If show_users and show_proc_ctx set only do this once */
+	if (user_ent_hash_build_init != 0)
+		return;
+
+	user_ent_hash_build_init = 1;
+#endif
 
 	strcpy(name, root);
 	if (strlen(name) == 0 || name[strlen(name)-1] != '/')
@@ -261,19 +317,24 @@ static void user_ent_hash_build(void)
 	while ((d = readdir(dir)) != NULL) {
 		struct dirent *d1;
 		char process[16];
+		char *p;
 		int pid, pos;
 		DIR *dir1;
 		char crap;
 
 		if (sscanf(d->d_name, "%d%c", &pid, &crap) != 1)
 			continue;
-
+#if HAVE_SELINUX
+		if (getpidcon(pid, &pid_context) != 0)
+			pid_context = strdup(no_ctx);
+#endif
 		sprintf(name + nameoff, "%d/fd/", pid);
 		pos = strlen(name);
 		if ((dir1 = opendir(name)) == NULL)
 			continue;
 
 		process[0] = '\0';
+		p = process;
 
 		while ((d1 = readdir(dir1)) != NULL) {
 			const char *pattern = "socket:[";
@@ -281,6 +342,7 @@ static void user_ent_hash_build(void)
 			char lnk[64];
 			int fd;
 			ssize_t link_len;
+			char tmp[1024];
 
 			if (sscanf(d1->d_name, "%d%c", &fd, &crap) != 1)
 				continue;
@@ -296,56 +358,122 @@ static void user_ent_hash_build(void)
 				continue;
 
 			sscanf(lnk, "socket:[%u]", &ino);
-
-			if (process[0] == '\0') {
-				char tmp[1024];
+#if HAVE_SELINUX
+			snprintf(tmp, sizeof(tmp), "%s/%d/fd/%s",
+					root, pid, d1->d_name);
+
+			if (getfilecon(tmp, &sock_context) < 0)
+				sock_context = strdup(no_ctx);
+#endif
+			if (*p == '\0') {
 				FILE *fp;
 
-				snprintf(tmp, sizeof(tmp), "%s/%d/stat", root, pid);
+				snprintf(tmp, sizeof(tmp), "%s/%d/stat",
+					root, pid);
 				if ((fp = fopen(tmp, "r")) != NULL) {
-					fscanf(fp, "%*d (%[^)])", process);
+					fscanf(fp, "%*d (%[^)])", p);
 					fclose(fp);
 				}
 			}
-
-			user_ent_add(ino, process, pid, fd);
+#if HAVE_SELINUX
+			user_ent_add(ino, p, pid, fd,
+				pid_context, sock_context);
+			freecon(sock_context);
+		}
+		freecon(pid_context);
+		closedir(dir1);
+#else
+			user_ent_add(ino, p, pid, fd);
 		}
 		closedir(dir1);
+#endif
 	}
 	closedir(dir);
 }
 
-static int find_users(unsigned ino, char *buf, int buflen)
+#if HAVE_SELINUX
+enum entry_types {
+	USERS,
+	PROC_CTX,
+	PROC_SOCK_CTX
+};
+#else
+enum entry_types {
+	USERS
+};
+#endif
+
+#define ENTRY_BUF_SIZE 512
+static int find_entry(unsigned ino, char **buf, int type)
 {
 	struct user_ent *p;
 	int cnt = 0;
 	char *ptr;
+	char **new_buf = buf;
+	int len, new_buf_len;
+	int buf_used = 0;
+	int buf_len = 0;
 
 	if (!ino)
 		return 0;
 
 	p = user_ent_hash[user_ent_hashfn(ino)];
-	ptr = buf;
+	ptr = *buf = NULL;
 	while (p) {
 		if (p->ino != ino)
 			goto next;
 
-		if (ptr - buf >= buflen - 1)
-			break;
+		while (1) {
+			ptr = *buf + buf_used;
+			switch (type) {
+			case USERS:
+				len = snprintf(ptr, buf_len - buf_used,
+					"(\"%s\",pid=%d,fd=%d),",
+					p->process, p->pid, p->fd);
+				break;
+#if HAVE_SELINUX
+			case PROC_CTX:
+				len = snprintf(ptr, buf_len - buf_used,
+					"(\"%s\",pid=%d,proc_ctx=%s,fd=%d),",
+					p->process, p->pid,
+					p->process_ctx, p->fd);
+				break;
+			case PROC_SOCK_CTX:
+				len = snprintf(ptr, buf_len - buf_used,
+					"(\"%s\",pid=%d,proc_ctx=%s,fd=%d,sock_ctx=%s),",
+					p->process, p->pid,
+					p->process_ctx, p->fd,
+					p->socket_ctx);
+				break;
+#endif
+			default:
+				fprintf(stderr, "ss: invalid type: %d\n", type);
+				abort();
+			}
 
-		snprintf(ptr, buflen - (ptr - buf),
-			 "(\"%s\",%d,%d),",
-			 p->process, p->pid, p->fd);
-		ptr += strlen(ptr);
+			if (len < 0 || len >= buf_len - buf_used) {
+				new_buf_len = buf_len + ENTRY_BUF_SIZE;
+				*new_buf = realloc(*buf, new_buf_len);
+				if (!*new_buf) {
+					fprintf(stderr, "ss: failed to realloc buffer\n");
+					abort();
+				}
+				/* new_buf aliases buf, so *buf is already updated */
+				buf_len = new_buf_len;
+				continue;
+			} else {
+				buf_used += len;
+				break;
+			}
+		}
 		cnt++;
-
-	next:
+next:
 		p = p->next;
 	}
-
-	if (ptr != buf)
+	if (buf_used) {
+		ptr = *buf + buf_used;
 		ptr[-1] = '\0';
-
+	}
 	return cnt;
 }
 
@@ -1282,11 +1410,25 @@ static int tcp_show_line(char *line, const struct filter *f, int family)
 		if (s.qack&1)
 			printf(" bidir");
 	}
+	char *buf = NULL;
+#if HAVE_SELINUX
+	if (show_proc_ctx || show_sock_ctx) {
+		if (find_entry(s.ino, &buf,
+					(show_proc_ctx & show_sock_ctx) ?
+					PROC_SOCK_CTX : PROC_CTX) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
+	} else if (show_users) {
+#else
 	if (show_users) {
-		char ubuf[4096];
-		if (find_users(s.ino, ubuf, sizeof(ubuf)) > 0)
-			printf(" users:(%s)", ubuf);
+#endif
+		if (find_entry(s.ino, &buf, USERS) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
 	}
+
 	if (show_details) {
 		if (s.uid)
 			printf(" uid:%u", (unsigned)s.uid);
@@ -1506,11 +1648,25 @@ static int inet_show_sock(struct nlmsghdr *nlh, struct filter *f)
 			       r->idiag_retrans);
 		}
 	}
+	char *buf = NULL;
+#if HAVE_SELINUX
+	if (show_proc_ctx || show_sock_ctx) {
+		if (find_entry(r->idiag_inode, &buf,
+					(show_proc_ctx & show_sock_ctx) ?
+					PROC_SOCK_CTX : PROC_CTX) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
+	} else if (show_users) {
+#else
 	if (show_users) {
-		char ubuf[4096];
-		if (find_users(r->idiag_inode, ubuf, sizeof(ubuf)) > 0)
-			printf(" users:(%s)", ubuf);
+#endif
+		if (find_entry(r->idiag_inode, &buf, USERS) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
 	}
+
 	if (show_details) {
 		if (r->idiag_uid)
 			printf(" uid:%u", (unsigned)r->idiag_uid);
@@ -1995,10 +2151,23 @@ static int dgram_show_line(char *line, const struct filter *f, int family)
 	formatted_print(&s.local, s.lport);
 	formatted_print(&s.remote, s.rport);
 
+	char *buf = NULL;
+#if HAVE_SELINUX
+	if (show_proc_ctx || show_sock_ctx) {
+		if (find_entry(s.ino, &buf,
+				(show_proc_ctx & show_sock_ctx) ?
+				PROC_SOCK_CTX : PROC_CTX) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
+	} else if (show_users) {
+#else
 	if (show_users) {
-		char ubuf[4096];
-		if (find_users(s.ino, ubuf, sizeof(ubuf)) > 0)
-			printf(" users:(%s)", ubuf);
+#endif
+		if (find_entry(s.ino, &buf, USERS) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
 	}
 
 	if (show_details) {
@@ -2185,10 +2354,23 @@ static void unix_list_print(struct unixstat *list, struct filter *f)
 		printf("%*s %-*d %*s %-*d",
 		       addr_width, s->name ? : "*", serv_width, s->ino,
 		       addr_width, peer, serv_width, s->peer);
+		char *buf = NULL;
+#if HAVE_SELINUX
+		if (show_proc_ctx || show_sock_ctx) {
+			if (find_entry(s->ino, &buf,
+					(show_proc_ctx & show_sock_ctx) ?
+					PROC_SOCK_CTX : PROC_CTX) > 0) {
+				printf(" users:(%s)", buf);
+				free(buf);
+			}
+		} else if (show_users) {
+#else
 		if (show_users) {
-			char ubuf[4096];
-			if (find_users(s->ino, ubuf, sizeof(ubuf)) > 0)
-				printf(" users:(%s)", ubuf);
+#endif
+			if (find_entry(s->ino, &buf, USERS) > 0) {
+				printf(" users:(%s)", buf);
+				free(buf);
+			}
 		}
 		printf("\n");
 	}
@@ -2250,10 +2432,23 @@ static int unix_show_sock(struct nlmsghdr *nlh, struct filter *f)
 			addr_width, "*", /* FIXME */
 			serv_width, peer_ino);
 
+	char *buf = NULL;
+#if HAVE_SELINUX
+	if (show_proc_ctx || show_sock_ctx) {
+		if (find_entry(r->udiag_ino, &buf,
+				(show_proc_ctx & show_sock_ctx) ?
+				PROC_SOCK_CTX : PROC_CTX) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
+	} else if (show_users) {
+#else
 	if (show_users) {
-		char ubuf[4096];
-		if (find_users(r->udiag_ino, ubuf, sizeof(ubuf)) > 0)
-			printf(" users:(%s)", ubuf);
+#endif
+		if (find_entry(r->udiag_ino, &buf, USERS) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
 	}
 
 	if (show_mem) {
@@ -2511,11 +2706,25 @@ static int packet_show_sock(struct nlmsghdr *nlh, struct filter *f)
 	printf("%*s*%-*s",
 	       addr_width, "", serv_width, "");
 
+	char *buf = NULL;
+#if HAVE_SELINUX
+	if (show_proc_ctx || show_sock_ctx) {
+		if (find_entry(r->pdiag_ino, &buf,
+				(show_proc_ctx & show_sock_ctx) ?
+				PROC_SOCK_CTX : PROC_CTX) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
+	} else if (show_users) {
+#else
 	if (show_users) {
-		char ubuf[4096];
-		if (find_users(r->pdiag_ino, ubuf, sizeof(ubuf)) > 0)
-			printf(" users:(%s)", ubuf);
+#endif
+		if (find_entry(r->pdiag_ino, &buf, USERS) > 0) {
+			printf(" users:(%s)", buf);
+			free(buf);
+		}
 	}
+
 	if (show_details) {
 		__u32 uid = 0;
 
@@ -2706,11 +2915,25 @@ static int packet_show(struct filter *f)
 		printf("%*s*%-*s",
 		       addr_width, "", serv_width, "");
 
+		char *buf = NULL;
+#if HAVE_SELINUX
+		if (show_proc_ctx || show_sock_ctx) {
+			if (find_entry(ino, &buf,
+					(show_proc_ctx & show_sock_ctx) ?
+					PROC_SOCK_CTX : PROC_CTX) > 0) {
+				printf(" users:(%s)", buf);
+				free(buf);
+			}
+		} else if (show_users) {
+#else
 		if (show_users) {
-			char ubuf[4096];
-			if (find_users(ino, ubuf, sizeof(ubuf)) > 0)
-				printf(" users:(%s)", ubuf);
+#endif
+			if (find_entry(ino, &buf, USERS) > 0) {
+				printf(" users:(%s)", buf);
+				free(buf);
+			}
 		}
+
 		if (show_details) {
 			printf(" ino=%u uid=%u sk=%llx", ino, uid, sk);
 		}
@@ -2785,6 +3008,29 @@ static void netlink_show_one(struct filter *f,
 		printf("%*s*%-*s",
 		       addr_width, "", serv_width, "");
 	}
+#if HAVE_SELINUX
+	security_context_t pid_context = NULL;
+
+	if (show_proc_ctx) {
+		/* The pid value will either be:
+		 *   0 if destination kernel - show kernel initial context.
+		 *   A valid process pid - use getpidcon.
+		 *   A unique value allocated by the kernel or netlink user
+		 *   to the process - show context as "not available".
+		 */
+		if (!pid)
+			security_get_initial_context("kernel", &pid_context);
+		else if (pid > 0)
+			getpidcon(pid, &pid_context);
+
+		if (pid_context != NULL) {
+			printf("proc_ctx=%-*s ", serv_width, pid_context);
+			freecon(pid_context);
+		} else {
+			printf("%-*s ", serv_width, "context not available");
+		}
+	}
+#endif
 
 	if (show_details) {
 		printf(" sk=%llx cb=%llx groups=0x%08x", sk, cb, groups);
@@ -3060,6 +3306,8 @@ static void _usage(FILE *dest)
 "   -i, --info		show internal TCP information\n"
 "   -s, --summary	show socket usage summary\n"
 "   -b, --bpf           show bpf filter socket information\n"
+"   -Z, --context	display process SELinux security contexts\n"
+"   -z, --contexts	display process and socket SELinux security contexts\n"
 "\n"
 "   -4, --ipv4          display only IP version 4 sockets\n"
 "   -6, --ipv6          display only IP version 6 sockets\n"
@@ -3149,6 +3397,8 @@ static const struct option long_opts[] = {
 	{ "filter", 1, 0, 'F' },
 	{ "version", 0, 0, 'V' },
 	{ "help", 0, 0, 'h' },
+	{ "context", 0, 0, 'Z' },
+	{ "contexts", 0, 0, 'z' },
 	{ 0 }
 
 };
@@ -3167,7 +3417,7 @@ int main(int argc, char *argv[])
 
 	current_filter.states = default_filter.states;
 
-	while ((ch = getopt_long(argc, argv, "dhaletuwxnro460spbf:miA:D:F:vV",
+	while ((ch = getopt_long(argc, argv, "dhaletuwxnro460spbf:miA:D:F:vVzZ",
 				 long_opts, NULL)) != EOF) {
 		switch(ch) {
 		case 'n':
@@ -3327,6 +3577,23 @@ int main(int argc, char *argv[])
 		case 'V':
 			printf("ss utility, iproute2-ss%s\n", SNAPSHOT);
 			exit(0);
+		case 'z':
+#if HAVE_SELINUX
+			show_sock_ctx++;
+#endif
+		case 'Z':
+#if HAVE_SELINUX
+			if (is_selinux_enabled() <= 0) {
+				fprintf(stderr, "ss: SELinux is not enabled.\n");
+				exit(1);
+			}
+			show_proc_ctx++;
+			user_ent_hash_build();
+#else
+			fprintf(stderr, "ss: version does not support SELinux.\n");
+			exit(1);
+#endif
+			break;
 		case 'h':
 		case '?':
 			help();
@@ -3514,5 +3781,13 @@ int main(int argc, char *argv[])
 		tcp_show(&current_filter, IPPROTO_TCP);
 	if (current_filter.dbs & (1<<DCCP_DB))
 		tcp_show(&current_filter, IPPROTO_DCCP);
+
+#if HAVE_SELINUX
+	if (show_users || show_proc_ctx || show_sock_ctx)
+#else
+	if (show_users)
+#endif
+		user_ent_destroy();
+
 	return 0;
 }
-- 
1.8.5.3

^ permalink raw reply related

* Re: [PATCH V2 net-next 0/5] xen-net{back,front}: Multiple transmit and receive queues
From: Wei Liu @ 2014-02-14 15:25 UTC (permalink / raw)
  To: Andrew Bennieston; +Cc: Wei Liu, xen-devel, ian.campbell, paul.durrant, netdev
In-Reply-To: <52FE2DFC.8050702@citrix.com>

On Fri, Feb 14, 2014 at 02:53:48PM +0000, Andrew Bennieston wrote:
> On 14/02/14 14:06, Wei Liu wrote:
> >On Fri, Feb 14, 2014 at 11:50:19AM +0000, Andrew J. Bennieston wrote:
> >>
> >>This patch series implements multiple transmit and receive queues (i.e.
> >>multiple shared rings) for the xen virtual network interfaces.
> >>
> >>The series is split up as follows:
> >>  - Patches 1 and 3 factor out the queue-specific data for netback and
> >>     netfront respectively, and modify the rest of the code to use these
> >>     as appropriate.
> >>  - Patches 2 and 4 introduce new XenStore keys to negotiate and use
> >>    multiple shared rings and event channels, and code to connect these
> >>    as appropriate.
> >>  - Patch 5 documents the XenStore keys required for the new feature
> >>    in include/xen/interface/io/netif.h
> >>
> >>All other transmit and receive processing remains unchanged, i.e. there
> >>is a kthread per queue and a NAPI context per queue.
> >>
> >>The performance of these patches has been analysed in detail, with
> >>results available at:
> >>
> >>http://wiki.xenproject.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing
> >>
> >>To summarise:
> >>   * Using multiple queues allows a VM to transmit at line rate on a 10
> >>     Gbit/s NIC, compared with a maximum aggregate throughput of 6 Gbit/s
> >>     with a single queue.
> >>   * For intra-host VM--VM traffic, eight queues provide 171% of the
> >>     throughput of a single queue; almost 12 Gbit/s instead of 6 Gbit/s.
> >>   * There is a corresponding increase in total CPU usage, i.e. this is a
> >>     scaling out over available resources, not an efficiency improvement.
> >>   * Results depend on the availability of sufficient CPUs, as well as the
> >>     distribution of interrupts and the distribution of TCP streams across
> >>     the queues.
> >>
> >>Queue selection is currently achieved via an L4 hash on the packet (i.e.
> >>TCP src/dst port, IP src/dst address) and is not negotiated between the
> >>frontend and backend, since only one option exists. Future patches to
> >>support other frontends (particularly Windows) will need to add some
> >>capability to negotiate not only the hash algorithm selection, but also
> >>allow the frontend to specify some parameters to this.
> >>
> >
> >This has an impact on the protocol. If the key to select hash algorithm
> >is missing then we're assuming L4 is in use.
> >
> >This either needs to be documented (which is missing in your patch to
> >netif.h) or you need to write that key explicitly in XenStore.
> >

a)

> >I also have a question what would happen if one end advertises one hash
> >algorithm then use a different one. This can happen when the
> >driver is rogue or buggy. Will it cause the "good guy" to stall? We
> >certainly don't want to stall backend, at the very least.
> 

b)

> I'm not sure I understand. There is no negotiable selection of hash
> algorithm here. This paragraph refers to a possible future in which
> we may have to support multiple such. These issues will absolutely
> have to be addressed then, but it is completely irrelevant for now.
> 

There are actually two questions.

I suspect your reply above was for a). My starting point for a) is: if
I'm to write a driver, either backend or frontend, for any random OS,
will I be able to get a basic idea of the correct behavior by
looking at netif.h only? The current answer for multiqueue hash
algorithm selection is "no", given that 1) the document is not clear
that L4 is the default algorithm if no key is specified, and 2) the key
to select the algorithm is not mandatory in the current protocol.

I was not very clear in my previous reply, especially the "write that key
explicitly in XenStore" part, sorry. What you need to do is either:
1) document that L4 will be selected if algorithm selection is missing, or
2) document that the algorithm key is mandatory and implement negotiation.
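
As a rough illustration of option 1), either end could fall back to L4 when the key is absent. This is only a sketch: the enum, the function name and the "l4" value string are assumptions, since no such key is defined in netif.h yet.

```c
#include <stddef.h>
#include <string.h>

enum hash_alg {
	HASH_ALG_L4,		/* default: hash over IP addresses and TCP ports */
	HASH_ALG_UNKNOWN	/* key present, but the value is not recognised */
};

/*
 * key_value is the string read from the (hypothetical) algorithm key,
 * or NULL when the other end did not write the key at all.
 */
static enum hash_alg parse_hash_alg(const char *key_value)
{
	if (key_value == NULL)
		return HASH_ALG_L4;	/* option 1: an absent key means L4 */

	if (strcmp(key_value, "l4") == 0)
		return HASH_ALG_L4;

	return HASH_ALG_UNKNOWN;
}
```

Under option 2) the NULL case would instead be treated as a protocol error, and HASH_ALG_UNKNOWN would be the point at which negotiation fails.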

For question b): say I'm writing a malicious frontend driver. I
advertise that I want L4 but actually always select one particular queue,
or deliberately select a random queue. Will that cause problems for the
backend? If we are to use a more complex algorithm, will a rogue
frontend cause problems for the backend?

Wei.

> Andrew.
> >
> >I don't see relevant code in this series to handle "rogue other end". I
> >presume for a simple hash algorithm like L4 is not very important (say,
> >even a packet ends up in the wrong queue we can still safely process
> >it), or core driver can deal with this all by itself (dropping)?
> >
> >Wei.
> >

^ permalink raw reply

* Re: [PATCH V2 net-next 2/5] xen-netback: Add support for multiple queues
From: Wei Liu @ 2014-02-14 15:36 UTC (permalink / raw)
  To: Andrew Bennieston; +Cc: Wei Liu, xen-devel, ian.campbell, paul.durrant, netdev
In-Reply-To: <52FE2ED5.3020905@citrix.com>

On Fri, Feb 14, 2014 at 02:57:25PM +0000, Andrew Bennieston wrote:
> On 14/02/14 14:11, Wei Liu wrote:
> >On Fri, Feb 14, 2014 at 11:50:21AM +0000, Andrew J. Bennieston wrote:
> >[...]
> >>
> >>+extern unsigned int xenvif_max_queues;
> >>+
> >>  #endif /* __XEN_NETBACK__COMMON_H__ */
> >>diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
> >>index 4cde112..4dc092c 100644
> >>--- a/drivers/net/xen-netback/interface.c
> >>+++ b/drivers/net/xen-netback/interface.c
> >>@@ -373,7 +373,12 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
> >>  	char name[IFNAMSIZ] = {};
> >>
> >>  	snprintf(name, IFNAMSIZ - 1, "vif%u.%u", domid, handle);
> >>-	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup, 1);
> >>+	/* Allocate a netdev with the max. supported number of queues.
> >>+	 * When the guest selects the desired number, it will be updated
> >>+	 * via netif_set_real_num_tx_queues().
> >>+	 */
> >>+	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup,
> >>+						  xenvif_max_queues);
> >
> >Indentation.
> 
> How would you like this to be indented? The CodingStyle says (and I quote):
> Chapter 2: Breaking long lines and strings:
> 	... descendants are always substantially shorter than the
> 	parent and placed substantially to the right...
> 
> There is no further advice to this point in CodingStyle, so please
> explain how you'd prefer this.
> 

Kernel code in general uses an indentation style like

	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup,
			      xenvif_max_queues);

You can find lots of examples in existing kernel code.

Probably "place substantially to the right" is just too vague. :-)

Wei.

^ permalink raw reply

* Re: [PATCH V2 net-next 0/5] xen-net{back,front}: Multiple transmit and receive queues
From: Andrew Bennieston @ 2014-02-14 15:40 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, ian.campbell, paul.durrant, netdev
In-Reply-To: <20140214152539.GD18398@zion.uk.xensource.com>

On 14/02/14 15:25, Wei Liu wrote:
> On Fri, Feb 14, 2014 at 02:53:48PM +0000, Andrew Bennieston wrote:
>> On 14/02/14 14:06, Wei Liu wrote:
>>> On Fri, Feb 14, 2014 at 11:50:19AM +0000, Andrew J. Bennieston wrote:
>>>>
>>>> This patch series implements multiple transmit and receive queues (i.e.
>>>> multiple shared rings) for the xen virtual network interfaces.
>>>>
>>>> The series is split up as follows:
>>>>   - Patches 1 and 3 factor out the queue-specific data for netback and
>>>>      netfront respectively, and modify the rest of the code to use these
>>>>      as appropriate.
>>>>   - Patches 2 and 4 introduce new XenStore keys to negotiate and use
>>>>     multiple shared rings and event channels, and code to connect these
>>>>     as appropriate.
>>>>   - Patch 5 documents the XenStore keys required for the new feature
>>>>     in include/xen/interface/io/netif.h
>>>>
>>>> All other transmit and receive processing remains unchanged, i.e. there
>>>> is a kthread per queue and a NAPI context per queue.
>>>>
>>>> The performance of these patches has been analysed in detail, with
>>>> results available at:
>>>>
>>>> http://wiki.xenproject.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing
>>>>
>>>> To summarise:
>>>>    * Using multiple queues allows a VM to transmit at line rate on a 10
>>>>      Gbit/s NIC, compared with a maximum aggregate throughput of 6 Gbit/s
>>>>      with a single queue.
>>>>    * For intra-host VM--VM traffic, eight queues provide 171% of the
>>>>      throughput of a single queue; almost 12 Gbit/s instead of 6 Gbit/s.
>>>>    * There is a corresponding increase in total CPU usage, i.e. this is a
>>>>      scaling out over available resources, not an efficiency improvement.
>>>>    * Results depend on the availability of sufficient CPUs, as well as the
>>>>      distribution of interrupts and the distribution of TCP streams across
>>>>      the queues.
>>>>
>>>> Queue selection is currently achieved via an L4 hash on the packet (i.e.
>>>> TCP src/dst port, IP src/dst address) and is not negotiated between the
>>>> frontend and backend, since only one option exists. Future patches to
>>>> support other frontends (particularly Windows) will need to add some
>>>> capability to negotiate not only the hash algorithm selection, but also
>>>> allow the frontend to specify some parameters to this.
>>>>
>>>
>>> This has an impact on the protocol. If the key to select hash algorithm
>>> is missing then we're assuming L4 is in use.
>>>
>>> This either needs to be documented (which is missing in your patch to
>>> netif.h) or you need to write that key explicitly in XenStore.
>>>
>
> a)
>
>>> I also have a question: what would happen if one end advertises one
>>> hash algorithm and then uses a different one? This can happen when the
>>> driver is rogue or buggy. Will it cause the "good guy" to stall? We
>>> certainly don't want to stall the backend, at the very least.
>>
>
> b)
>
>> I'm not sure I understand. There is no negotiable selection of hash
>> algorithm here. This paragraph refers to a possible future in which
>> we may have to support multiple such. These issues will absolutely
>> have to be addressed then, but it is completely irrelevant for now.
>>
>
> There are actually two questions.
>
> I suspect your above reply was for a). My starting point of a) is, if
> I'm to write a driver, either backend or frontend, for any random OS,
> will I be able to have some basic idea what the correct behavior is by
> looking at netif.h only? The current answer for multiqueue hash
> algorithm selection is "no", given that 1) the document does not make
> clear that L4 is the default algorithm if no key is specified, and 2) the
> key to select the algorithm is not mandatory in the current protocol.
>
> I was not very clear in previous reply, especially the "write that key
> explicitly in XenStore", sorry. The thing you need to do would be:
> 1) document L4 will be selected if algorithm selection is missing, or
> 2) document algorithm key is mandatory and implement negotiation.
>
> For question b): say I'm writing a malicious frontend driver. I
> advertise that I want L4, but actually I always select a particular
> queue, or deliberately select a random queue. Will that cause a problem
> for the backend? If we are to use a more complex algorithm, will a rogue
> frontend cause a problem for the backend?
>
> Wei.

Let me attempt to clear this up. Bear with me...

Queue selection is a decision by a transmitting system about which queue 
it uses for a particular packet. A well-behaved receiving system will 
pick up packets on any queue and throw them up into its network stack as 
normal. In this manner, the details of queue selection don't matter from 
the point of view of a receiving guest (either frontend or backend). 
That is, if a "malicious" frontend sends all of its packets on a single 
queue, then it is only damaging itself - by reducing its effective 
throughput to that of a single queue. This will not cause a problem to 
the backend. The same goes for the "select a random queue" scenario, 
although here you probably shouldn't expect decent TCP performance. 
Certainly there will be no badness in terms of affecting the backend or 
other systems, beyond that which a guest could achieve with a broken TCP 
stack anyway.

In light of this, algorithm selection is (mostly) a function of the 
transmitting side. The receiving side should be prepared to receive 
packets on any of the legitimately established queues. It just happens 
that the Linux netback and Linux netfront both use skb_get_hash() to 
determine this value.
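
As an illustration of the transmit-side selection described above, here is a
minimal userspace sketch. It assumes a stand-in FNV-1a hash over the L4
4-tuple; this is not the hash that skb_get_hash() computes in the kernel, and
the names flow4, flow_hash and select_queue are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key: IP src/dst address and TCP src/dst port. */
struct flow4 {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* Mix one 32-bit word into an FNV-1a hash, one byte at a time. */
uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
	for (int i = 0; i < 4; i++) {
		h ^= (v >> (8 * i)) & 0xff;
		h *= 16777619u;
	}
	return h;
}

/* Stand-in flow hash; the kernel's flow hash differs. */
uint32_t flow_hash(const struct flow4 *f)
{
	uint32_t h = 2166136261u;

	h = fnv1a_u32(h, f->saddr);
	h = fnv1a_u32(h, f->daddr);
	h = fnv1a_u32(h, ((uint32_t)f->sport << 16) | f->dport);
	return h;
}

/* Transmit side only: map a flow onto one of num_queues queues. */
unsigned int select_queue(const struct flow4 *f, unsigned int num_queues)
{
	assert(num_queues > 0);
	return flow_hash(f) % num_queues;
}
```

Packets of the same flow always map to the same queue, which keeps them in
order; the receiving side never needs to recompute or verify the hash, which
is why a sender that ignores it only hurts its own throughput.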

In the future, some frontends (e.g. Windows) may need to do complex 
things like pushing hash state to the backend. This will be taken care 
of with extensions to the protocol at the point these are implemented.

Andrew.

>
>> Andrew.
>>>
>>> I don't see relevant code in this series to handle a "rogue other end".
>>> I presume that for a simple hash algorithm like L4 this is not very
>>> important (say, even if a packet ends up in the wrong queue we can still
>>> safely process it), or the core driver can deal with this all by itself
>>> (by dropping)?
>>>
>>> Wei.
>>>

^ permalink raw reply

* Re: [PATCH V2 net-next 2/5] xen-netback: Add support for multiple queues
From: Andrew Bennieston @ 2014-02-14 15:42 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, ian.campbell, paul.durrant, netdev
In-Reply-To: <20140214153620.GE18398@zion.uk.xensource.com>

On 14/02/14 15:36, Wei Liu wrote:
> On Fri, Feb 14, 2014 at 02:57:25PM +0000, Andrew Bennieston wrote:
>> On 14/02/14 14:11, Wei Liu wrote:
>>> On Fri, Feb 14, 2014 at 11:50:21AM +0000, Andrew J. Bennieston wrote:
>>> [...]
>>>>
>>>> +extern unsigned int xenvif_max_queues;
>>>> +
>>>>   #endif /* __XEN_NETBACK__COMMON_H__ */
>>>> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
>>>> index 4cde112..4dc092c 100644
>>>> --- a/drivers/net/xen-netback/interface.c
>>>> +++ b/drivers/net/xen-netback/interface.c
>>>> @@ -373,7 +373,12 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
>>>>   	char name[IFNAMSIZ] = {};
>>>>
>>>>   	snprintf(name, IFNAMSIZ - 1, "vif%u.%u", domid, handle);
>>>> -	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup, 1);
>>>> +	/* Allocate a netdev with the max. supported number of queues.
>>>> +	 * When the guest selects the desired number, it will be updated
>>>> +	 * via netif_set_real_num_tx_queues().
>>>> +	 */
>>>> +	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup,
>>>> +						  xenvif_max_queues);
>>>
>>> Indentation.
>>
>> How would you like this to be indented? The CodingStyle says (and I quote):
>> Chapter 2: Breaking long lines and strings:
>> 	... descendants are always substantially shorter than the
>> 	parent and placed substantially to the right...
>>
>> There is no further advice to this point in CodingStyle, so please
>> explain how you'd prefer this.
>>
>
> Kernel code in general uses an indentation style like
>
> 	dev = alloc_netdev_mq(sizeof(struct xenvif), name, ether_setup,
> 			      xenvif_max_queues);
>
> You can find lots of examples in existing kernel code.
>
> Probably "place substantially to the right" is just too vague. :-)

Ah, I think the issue here is that my editor was configured to have a 
tab width of 4, so the offending line _did_ look to be aligned to the 
opening ( of the line above, to me. I'll set the appropriate tab width 
and change it.
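
A small sketch of the arithmetic behind this, assuming the continuation line
in the patch was six tabs followed by two spaces (the helper name
visual_width is hypothetical):

```c
#include <stddef.h>

/* Column reached after displaying s, with tab stops every tab_width columns. */
size_t visual_width(const char *s, size_t tab_width)
{
	size_t col = 0;

	for (; *s != '\0'; s++) {
		if (*s == '\t')
			col += tab_width - (col % tab_width);	/* jump to next tab stop */
		else
			col++;
	}
	return col;
}
```

With a tab width of 4, both "\tdev = alloc_netdev_mq(" and the six-tab
continuation end at column 26, so they look aligned. With the kernel's tab
width of 8, the opening parenthesis ends at column 30 while the continuation
starts at column 50; three tabs plus six spaces would land on column 30.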

Cheers,
Andrew
>
> Wei.
>

^ permalink raw reply

* Re: Does ICMP_FRAG_NEEDED automatically update the routing cache?
From: David Howells @ 2014-02-14 15:42 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: dhowells, netdev
In-Reply-To: <20140214150328.GB27343@order.stressinduktion.org>

Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:

> Yes, but connected sockets are checked prior to unconnected sockets, so the
> most specific one wins.
> 
> For unconnected ones only the local ip/port is checked because kernel
> does not know the past destination addresses.

Okay, thanks!

David

^ permalink raw reply

* Getting a NIC's MTU size
From: David Howells @ 2014-02-14 15:49 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: dhowells, netdev
In-Reply-To: <20140214150328.GB27343@order.stressinduktion.org>


One further question:  If I want to get the MTU size of the NIC through which
packets will go to get to a particular peer, can I do:

	struct rtable *rt;
	struct flowi4 fl4;
	unsigned int if_mtu;

	rt = ip_route_output_ports(&init_net, &fl4, NULL,
				   peer->srx.transport.sin.sin_addr.s_addr, 0,
				   htons(7000), htons(7001),
				   IPPROTO_UDP, 0, 0);
	if (IS_ERR(rt))
		return PTR_ERR(rt);	/* no route; assumes an int-returning caller */

	/* dst is embedded in struct rtable, so it is accessed with '.' */
	if_mtu = rt->dst.dev->mtu;

	dst_release(&rt->dst);

Or might this go wrong if rt->dst.dev changes under me?  Can it change
without replacing the dst record?

David

^ permalink raw reply

