Netdev List


* Re: [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
From: Veaceslav Falico @ 2014-01-17  8:02 UTC
  To: David Miller; +Cc: fubar, netdev, andy
In-Reply-To: <20140115.220132.1518410490710218099.davem@davemloft.net>

On Wed, Jan 15, 2014 at 10:01:32PM -0800, David Miller wrote:
>From: Jay Vosburgh <fubar@us.ibm.com>
>Date: Wed, 15 Jan 2014 21:09:57 -0800
>
>> 	The main reason for preserving the non-validate behavior (any
>> traffic counts) is for the loadbalance (xor and rr) modes.  In those
>> modes, the switch decides which slave receives the incoming traffic, and
>> so it's to our advantage to permit any incoming traffic to count for
>> "up-ness."  The arp_validate option is not allowed in these modes
>> because it won't work.
>>
>> 	With these changes, I suspect that the loadbalance ARP monitor
>> will be less reliable (granted that it's already a
>> bit dodgy in its dependence on the switch to hit all slaves with
>> incoming packets regularly).  Particularly if the switch ports are
>> configured into an Etherchannel ("static link aggregation") group, in
>> which case only one slave will receive any given frame (broadcast /
>> multicast traffic will not be duplicated across all slaves).
>>
>> 	I'm not sure that this change (the "only count ARPs even without
>> arp_validate" bit) won't break existing configurations.  Did you test
>> the -rr and -xor modes with ARP monitor after your changes (with and
>> without configuring a channel group on the switch ports)?
>
>Sorry Jay, I only read this just now.  I won't push these changes
>out until you've had some time to discuss them.

Sorry David, this change indeed might break some configurations (leaving
aside the question if they're legit or not...). So, please, drop them, I'll
discuss with Jay and send a v2.

Thanks a lot, and sorry for the noise.

>
>To my untrained eye they looked rather straightforward :-)


* Re: [PATCH net-next v6 0/6] virtio-net: mergeable rx buffer size auto-tuning
From: David Miller @ 2014-01-17  7:51 UTC
  To: mwdalton; +Cc: mst, netdev, virtualization, edumazet, bhutchings
In-Reply-To: <1389939810-14998-1-git-send-email-mwdalton@google.com>

From: Michael Dalton <mwdalton@google.com>
Date: Thu, 16 Jan 2014 22:23:24 -0800

> The virtio-net device currently uses aligned MTU-sized mergeable receive
> packet buffers. Network throughput for workloads with large average
> packet size can be improved by posting larger receive packet buffers.
> However, due to SKB truesize effects, posting large (e.g., PAGE_SIZE)
> buffers reduces the throughput of workloads that do not benefit from GRO
> and have no large inbound packets.
> 
> This patchset introduces virtio-net mergeable buffer size auto-tuning,
> with buffer sizes ranging from aligned MTU-size to PAGE_SIZE. Packet
> buffer size is chosen based on a per-receive queue EWMA of incoming
> packet size.
> 
> To unify mergeable receive buffer memory allocation and improve
> SKB frag coalescing, all mergeable buffer memory allocation is
> migrated to per-receive queue page frag allocators.
> 
> The per-receive queue mergeable packet buffer size is exported via
> sysfs, and the network device sysfs layer has been extended to add
> support for device-specific per-receive queue sysfs attribute groups.

Series applied, thanks.
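
The sizing policy described in the cover letter can be pictured with a small
userspace sketch. Everything here is an assumption made for illustration:
ewma_model, ewma_model_add() and pick_buf_len() are hypothetical names, the
1/2^shift weight is arbitrary, and the driver's real EWMA helper, weights and
alignment rules may differ.

```c
#include <stddef.h>

/* Hypothetical integer EWMA of packet sizes (not the kernel's helper). */
struct ewma_model {
	unsigned long avg;  /* running average, 0 means "no sample yet" */
	unsigned int shift; /* each new sample is weighted by 1 / 2^shift */
};

/* Fold one observed packet length into the running average. */
static void ewma_model_add(struct ewma_model *e, unsigned long sample)
{
	if (e->avg == 0)
		e->avg = sample;
	else if (sample >= e->avg)
		e->avg += (sample - e->avg) >> e->shift;
	else
		e->avg -= (e->avg - sample) >> e->shift;
}

/* Choose a receive buffer length: the average, clamped to [min_len, max_len].
 * min_len stands in for the aligned MTU size, max_len for PAGE_SIZE.
 */
static size_t pick_buf_len(const struct ewma_model *e, size_t min_len,
			   size_t max_len)
{
	size_t len = e->avg;

	if (len < min_len)
		len = min_len;
	if (len > max_len)
		len = max_len;
	return len;
}
```

A queue that mostly sees small packets stays at min_len, while a queue with
large inbound packets drifts towards max_len.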


* Re: [PATCH net-next v2 0/2] bonding: add slave netlink and sysfs support
From: Ding Tianhong @ 2014-01-17  7:47 UTC
  To: Scott Feldman, vfalico, fubar, andy; +Cc: netdev, roopa, shm
In-Reply-To: <20140117065316.3194.94624.stgit@monster-03.cumulusnetworks.com>

On 2014/1/17 14:57, Scott Feldman wrote:
> v2:
> 
>   - Address review comment from Ding (and Veaceslav): handle kobj cleanup
>     if sysfs_create_file() fails when adding slave attribute nodes.
> 
> v1:
> 
>   The following series adds bonding slave netlink and sysfs interfaces.
>   Slave interfaces get a new IFLA_SLAVE set of netlink attributes, along
>   with RTM_NEWLINK notification when slave's active status changes.  The
>   sysfs interface adds read-only nodes for slave attributes under a /slave
>   dir, similar to how bond interfaces get a /bonding dir for bonding
>   attributes.
> 
> ---
> 
> Scott Feldman (2):
>       bonding: add sysfs /slave dir for bond slave devices.
>       bonding: add netlink attributes to slave link dev
> 
> 
>  drivers/net/bonding/Makefile           |    2 
>  drivers/net/bonding/bond_main.c        |   28 ++++++
>  drivers/net/bonding/bond_netlink.c     |   36 ++++++++
>  drivers/net/bonding/bond_procfs.c      |   12 ---
>  drivers/net/bonding/bond_sysfs_slave.c |  144 ++++++++++++++++++++++++++++++++
>  drivers/net/bonding/bonding.h          |   15 +++
>  include/linux/netdevice.h              |    5 +
>  include/uapi/linux/if_link.h           |   13 +++
>  net/core/rtnetlink.c                   |   54 ++++++++++++
>  9 files changed, 294 insertions(+), 15 deletions(-)
>  create mode 100644 drivers/net/bonding/bond_sysfs_slave.c
> 
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>


* [PATCH net-next] net: ftgmac100: use kfree_skb() where appropriate
From: Eric Dumazet @ 2014-01-17  7:38 UTC
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

In order to get correct drop monitor notifications for dropped
packets, we should call kfree_skb() instead of dev_kfree_skb().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/faraday/ftgmac100.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 24d87898209b..c11ecbc98149 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -766,7 +766,7 @@ static void ftgmac100_free_buffers(struct ftgmac100 *priv)
 			continue;
 
 		dma_unmap_single(priv->dev, map, skb_headlen(skb), DMA_TO_DEVICE);
-		dev_kfree_skb(skb);
+		kfree_skb(skb);
 	}
 
 	dma_free_coherent(priv->dev, sizeof(struct ftgmac100_descs),
@@ -1148,7 +1148,7 @@ static int ftgmac100_hard_start_xmit(struct sk_buff *skb,
 			netdev_dbg(netdev, "tx packet too big\n");
 
 		netdev->stats.tx_dropped++;
-		dev_kfree_skb(skb);
+		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
 
@@ -1159,7 +1159,7 @@ static int ftgmac100_hard_start_xmit(struct sk_buff *skb,
 			netdev_err(netdev, "map socket buffer failed\n");
 
 		netdev->stats.tx_dropped++;
-		dev_kfree_skb(skb);
+		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
 


* [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Matija Glavinic Pecotic @ 2014-01-17  7:01 UTC
  To: linux-sctp; +Cc: Alexander Sverdlin, netdev

The current implementation of the (a)rwnd calculation can lead to severe
performance issues and to associations stalling completely. These problems are
described below, and a solution is proposed that improves lksctp's robustness
in the congestion state.

1) Sudden drop of a_rwnd and incomplete window recovery afterwards

sctp_assoc_rwnd_decrease accounts only for the payload size (sctp data); the
size of the sk_buff, which is charged against the receive buffer, is not
accounted for in rwnd. In theory this should not be a problem, as the actual
buffer size is double the amount requested on the socket (SO_RCVBUF). The
problem is that this scales badly for data smaller than sizeof(struct sk_buff).
E.g. in 4G (LTE) networks, the link interfacing the radio side carries a large
portion of traffic of this size (less than 100B).

An example of a sudden drop and incomplete window recovery is given below. Node
B exhibits the problematic behavior. Node A initiates the association and B is
configured to advertise an rwnd of 10000. A sends messages of size 43B (the size
of a typical sctp message in a 4G (LTE) network). On B, data is left in the
buffer by not reading the socket in userspace.

Let's examine when we hit the pressure state and declare rwnd to be 0 for the
scenario with the parameters stated above (rwnd == 10000, chunk size == 43,
each chunk is sent in a separate sctp packet).

The logic is implemented in sctp_assoc_rwnd_decrease:

socket_buffer (see below) is the maximum size which can be held in the socket
buffer (sk_rcvbuf). currently_alloced is the amount of data currently allocated
(rx_count).

A simple expression shows after how many packets, for the parameters stated
above, we enter the pressure state.

We start with the condition which has to be met in order to enter the pressure
state:

	socket_buffer < currently_alloced;

currently_alloced is the total size of the sctp packets received so far and not
yet delivered to userspace. x is the number of chunks/packets (since there is no
bundling, and each chunk is delivered in a separate packet, we can treat each
chunk as an sctp packet and, what is important here, as having its own sk_buff):

	socket_buffer < x*each_sctp_packet;

each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
twice the initially requested size of the socket buffer, which in the case of
sctp is twice the a_rwnd requested:

	2*rwnd < x*(payload+sizeof(struct sk_buff));

sizeof(struct sk_buff) is 190 (3.13.0-rc4+). As stated above, rwnd is 10000 and
each payload size is 43:

	20000 < x(43+190);

	x > 20000/233;

	x ~> 84;
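
The estimate can be reproduced with a few lines of C. This is only a sketch of
the arithmetic above: packets_until_pressure() is a hypothetical helper, and
the 190-byte overhead is the sizeof(struct sk_buff) quoted above, not
necessarily the exact amount the kernel charges per packet. Evaluated exactly,
the inequality gives 86 packets, in the same range as the ~84 estimated here.

```c
/* Smallest number of queued packets whose total charge exceeds the socket
 * buffer, i.e. the smallest x with 2*rwnd < x*(payload + skb_overhead).
 * payload + skb_overhead must be non-zero.
 */
static unsigned int packets_until_pressure(unsigned int rwnd,
					   unsigned int payload,
					   unsigned int skb_overhead)
{
	unsigned int socket_buffer = 2 * rwnd; /* kernel doubles the request */
	unsigned int per_packet = payload + skb_overhead;

	return socket_buffer / per_packet + 1;
}
```

For example, packets_until_pressure(10000, 43, 190) evaluates the case
discussed above.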

After ~84 messages, the pressure state is entered and a 0 rwnd is advertised
while only 84*43B ~= 3612B of sctp data has been received. This is why an
external observer notices a sudden drop from 6474 to 0, as shown in the
following example:

IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
<...>
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]

--> Sudden drop

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

At this point, rwnd_press stores the current rwnd value so that it can later be
restored in sctp_assoc_rwnd_increase. This however doesn't happen, as the
condition to start slowly increasing rwnd until rwnd_press is returned to rwnd
is never met. It is not met because rwnd, after hitting 0, must first reach
rwnd_press by adding the amount which is read from userspace. Let us observe the
values in the above example. The initial a_rwnd is 10000, pressure was hit when
rwnd was ~6500, and the amount of actual sctp data waiting to be delivered to
userspace is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase is
credited only with the sctp data, which is ~3500. The condition is never met,
and when userspace has read all the data, rwnd stays at 3569.

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

--> At this point userspace read everything, rwnd recovered only to 3569

IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

Reproduction is straightforward: it is enough for the sender to send packets of
size less than sizeof(struct sk_buff) and for the receiver to keep them in its
buffers.
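
The failed recovery can also be reproduced outside the kernel with a small
model of the existing accounting. This is a sketch under stated assumptions,
not the kernel code: struct rwnd_model and the model_* functions are
hypothetical, each packet is assumed to charge payload + 190 bytes against the
socket buffer, and pathmtu is assumed to be 1500. With a 233-byte charge the
model enters pressure on the 86th packet and, once everything is read back,
rwnd reaches only 86*43 = 3698, well short of the initial 10000; the exact
figures in a live trace depend on the real per-packet charge.

```c
#include <stdint.h>

/* Userspace model of the pre-patch accounting. Field names follow
 * struct sctp_association; everything else is illustrative.
 */
struct rwnd_model {
	uint32_t rwnd;       /* current receive window */
	uint32_t rwnd_over;  /* bytes by which rwnd was overrun */
	uint32_t rwnd_press; /* window withheld while under pressure */
	uint32_t pathmtu;    /* assumed path MTU */
	int rx_count;        /* bytes charged to the socket buffer */
	int sk_rcvbuf;       /* socket buffer limit */
};

/* Mirrors the logic of sctp_assoc_rwnd_decrease. */
static void model_rwnd_decrease(struct rwnd_model *m, uint32_t len)
{
	int over = m->rx_count >= m->sk_rcvbuf;

	if (m->rwnd >= len) {
		m->rwnd -= len;
		if (over) {
			m->rwnd_press += m->rwnd;
			m->rwnd = 0;
		}
	} else {
		m->rwnd_over = len - m->rwnd;
		m->rwnd = 0;
	}
}

/* Mirrors the window arithmetic of sctp_assoc_rwnd_increase (no SACK). */
static void model_rwnd_increase(struct rwnd_model *m, uint32_t len)
{
	if (m->rwnd_over) {
		if (m->rwnd_over >= len) {
			m->rwnd_over -= len;
		} else {
			m->rwnd += len - m->rwnd_over;
			m->rwnd_over = 0;
		}
	} else {
		m->rwnd += len;
	}

	/* Slow recovery starts only once rwnd has reached rwnd_press. */
	if (m->rwnd_press && m->rwnd >= m->rwnd_press) {
		uint32_t change = m->pathmtu < m->rwnd_press ?
				  m->pathmtu : m->rwnd_press;

		m->rwnd += change;
		m->rwnd_press -= change;
	}
}

/* Queue packets until rwnd drops to 0, then let userspace read them all.
 * Returns the rwnd the association ends up with.
 */
static uint32_t model_recovered_rwnd(uint32_t init_rwnd, uint32_t payload,
				     int per_packet_charge)
{
	struct rwnd_model m = {
		.rwnd = init_rwnd,
		.pathmtu = 1500,
		.sk_rcvbuf = 2 * (int)init_rwnd,
	};
	unsigned int queued = 0;

	while (m.rwnd) {
		m.rx_count += per_packet_charge; /* charged before accounting */
		model_rwnd_decrease(&m, payload);
		queued++;
	}
	while (queued--) {
		m.rx_count -= per_packet_charge;
		model_rwnd_increase(&m, payload);
	}
	return m.rwnd;
}
```

Calling model_recovered_rwnd(10000, 43, 233) exercises the scenario described
in this section.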

2) Minute size window for associations sharing the same socket buffer

In case multiple associations share the same socket, and the same socket buffer
(sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
of the associations can permanently drop the rwnd of the other association(s).

The situation is typically observed as one association suddenly having its rwnd
dropped to the size of the last packet received and never recovering beyond that
point. Different scenarios lead to it, but all have in common that one of the
associations (let it be the association from 1)) has nearly depleted the socket
buffer, and the other association charges the socket buffer with just enough to
trigger the pressure. This association enters the pressure state, sets
rwnd_press and announces a 0 rwnd.
When data is read by userspace, a situation similar to 1) occurs: rwnd increases
only by the size read by userspace, but rwnd_press is so high that the
association doesn't have enough credit to reach rwnd_press and restore the
previous state. This case is a special case of 1), and a worse one, as in the
worst case there is only one packet in the buffer by whose size rwnd will be
increased. The consequence is an association which has a very low maximum rwnd
('minute size', in our case down to 43B - the size of the packet which caused
the pressure) and is as such unusable.

The scenario happened frequently in the field and in labs after a congestion
state (link breaks, different probabilities of packet drop, packet reordering)
and with scenario 1) preceding it. A deterministic scenario for reproduction is
given here:

From node A establish two associations on the same socket, with rcvbuf_policy
set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
repeat the scenario from 1), that is, bring it down to 0 and restore it. Observe
scenario 1). Use a small payload size (here we use 43). Once rwnd is
'recovered', bring it down close to 0, so that just one more packet would close
it. As a consequence, association number 2 is able to receive (at least) one
more packet, which brings it into the pressure state. E.g. if association 2 had
an rwnd of 10000, the packet received was 43, and we enter pressure at this
point, rwnd_press will be 9957. Once the payload is delivered to userspace, rwnd
will increase by 43, but the conditions to restore rwnd to its original state,
just as in 1), will never be satisfied.

--> Association 1, between A.y and B.12345

IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.55915: sctp (1) [COOKIE ACK]

--> Association 2, between A.z and B.12346

IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
IP B.12346 > A.55915: sctp (1) [COOKIE ACK]

--> Deplete socket buffer by sending messages of size 43B over association 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]

<...>

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]

--> Sudden drop on 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Here userspace read, rwnd 'recovered' to 3698, now deplete again using
    association 1 so there is place in buffer for only one more packet

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

--> Socket buffer is almost depleted, but there is space for one more packet,
    send it over association 2, size 43B

IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Immediate drop

IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Read everything from the socket, both associations recover up to the maximum
    rwnd they are capable of reaching; note that association 1 recovered up to
    3698, and association 2 recovered only to 43

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

A careful reader might wonder why it is necessary to reproduce 1) prior to
reproducing 2). It is simply easier to observe when to send the packet over
association 2 which will push the association into the pressure state.

Proposed solution:

Both problems share the same root cause, which is the improper scaling of the
socket buffer with rwnd. A solution in which sizeof(sk_buff) is taken into
account while calculating rwnd is not possible, due to the fact that there is no
linear relationship between the amount of data charged in increase/decrease and
the IP packet in which the payload arrived. Even if such a solution were
followed, the complexity of the code would increase. Given the nature of the
current rwnd handling, the slow increase (in sctp_assoc_rwnd_increase) of rwnd
after the pressure state is entered is rational, but it gives the sender a false
representation of the current buffer space. Furthermore, it implements an
additional congestion control mechanism which is defined by the implementation,
and not by the standard.

The proposed solution simplifies the whole algorithm, keeping in mind the
definition from the RFC:

o  Receiver Window (rwnd): This gives the sender an indication of the space
   available in the receiver's inbound buffer.

The core of the proposed solution is given by these lines:

sctp_assoc_rwnd_account:
	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
	else
		asoc->rwnd = 0;

We advertise to the sender (half of) the actual space we have. 'Half' is in
parentheses because it depends on whether you regard the size of the socket
buffer as SO_RCVBUF or as twice that amount, i.e. the size as visible from
userspace or from kernelspace, respectively.
In this way the sender is given a good approximation of our buffer space,
regardless of the buffer policy - we always advertise what we have. The proposed
solution fixes the described problems and removes the need for the rwnd
restoration algorithm. Finally, as the proposed solution is a simplification,
some lines of code, along with some bytes in struct sctp_association, are
saved.
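
The effect of the new rule can be checked in isolation. The function below is
a hypothetical userspace stand-in for the window computation in
sctp_assoc_rwnd_account; the sample values assume sk_rcvbuf == 20000 and a
233-byte charge per queued packet, as in the scenario from 1).

```c
#include <stdint.h>

/* Advertise half of whatever space is left in the socket buffer. */
static uint32_t model_rwnd_account(int sk_rcvbuf, int rx_count)
{
	if (sk_rcvbuf - rx_count > 0)
		return (uint32_t)(sk_rcvbuf - rx_count) >> 1;
	return 0;
}
```

With an empty buffer the window is the full 10000. It shrinks smoothly as
packets queue up (85 packets leave 195 bytes free, so 97 is advertised), it is
0 once the buffer is exhausted, and it returns to 10000 as soon as userspace
has read everything, since the value depends only on the current buffer state.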

Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>

---

--- net-next.orig/net/sctp/associola.c
+++ net-next/net/sctp/associola.c
@@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
 	return false;
 }

-/* Increase asoc's rwnd by len and send any window update SACK if needed. */
-void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
+/* Account asoc's rwnd for the approximated state in the buffer,
+ * and check whether SACK needs to be sent.
+ */
+void sctp_assoc_rwnd_account(struct sctp_association *asoc, int check_sack)
 {
+	int rx_count;
 	struct sctp_chunk *sack;
 	struct timer_list *timer;

-	if (asoc->rwnd_over) {
-		if (asoc->rwnd_over >= len) {
-			asoc->rwnd_over -= len;
-		} else {
-			asoc->rwnd += (len - asoc->rwnd_over);
-			asoc->rwnd_over = 0;
-		}
-	} else {
-		asoc->rwnd += len;
-	}
+	if (asoc->ep->rcvbuf_policy)
+		rx_count = atomic_read(&asoc->rmem_alloc);
+	else
+		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);

-	/* If we had window pressure, start recovering it
-	 * once our rwnd had reached the accumulated pressure
-	 * threshold.  The idea is to recover slowly, but up
-	 * to the initial advertised window.
-	 */
-	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
-		int change = min(asoc->pathmtu, asoc->rwnd_press);
-		asoc->rwnd += change;
-		asoc->rwnd_press -= change;
-	}
+	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
+		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
+	else
+		asoc->rwnd = 0;

-	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
-		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
-		 asoc->a_rwnd);
+	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
+		 __func__, asoc, asoc->rwnd, rx_count,
+		 asoc->base.sk->sk_rcvbuf);

 	/* Send a window update SACK if the rwnd has increased by at least the
 	 * minimum of the association's PMTU and half of the receive buffer.
 	 * The algorithm used is similar to the one described in
 	 * Section 4.2.3.3 of RFC 1122.
 	 */
-	if (sctp_peer_needs_update(asoc)) {
+	if (check_sack && sctp_peer_needs_update(asoc)) {
 		asoc->a_rwnd = asoc->rwnd;

 		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
@@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
 	}
 }

-/* Decrease asoc's rwnd by len. */
-void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
-{
-	int rx_count;
-	int over = 0;
-
-	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
-		pr_debug("%s: association:%p has asoc->rwnd:%u, "
-			 "asoc->rwnd_over:%u!\n", __func__, asoc,
-			 asoc->rwnd, asoc->rwnd_over);
-
-	if (asoc->ep->rcvbuf_policy)
-		rx_count = atomic_read(&asoc->rmem_alloc);
-	else
-		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
-
-	/* If we've reached or overflowed our receive buffer, announce
-	 * a 0 rwnd if rwnd would still be positive.  Store the
-	 * the potential pressure overflow so that the window can be restored
-	 * back to original value.
-	 */
-	if (rx_count >= asoc->base.sk->sk_rcvbuf)
-		over = 1;
-
-	if (asoc->rwnd >= len) {
-		asoc->rwnd -= len;
-		if (over) {
-			asoc->rwnd_press += asoc->rwnd;
-			asoc->rwnd = 0;
-		}
-	} else {
-		asoc->rwnd_over = len - asoc->rwnd;
-		asoc->rwnd = 0;
-	}
-
-	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
-		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
-		 asoc->rwnd_press);
-}

 /* Build the bind address list for the association based on info from the
  * local endpoint and the remote peer.
--- net-next.orig/include/net/sctp/structs.h
+++ net-next/include/net/sctp/structs.h
@@ -1653,17 +1653,6 @@ struct sctp_association {
 	/* This is the last advertised value of rwnd over a SACK chunk. */
 	__u32 a_rwnd;

-	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
-	 * to slop over a maximum of the association's frag_point.
-	 */
-	__u32 rwnd_over;
-
-	/* Keeps treack of rwnd pressure.  This happens when we have
-	 * a window, but not recevie buffer (i.e small packets).  This one
-	 * is releases slowly (1 PMTU at a time ).
-	 */
-	__u32 rwnd_press;
-
 	/* This is the sndbuf size in use for the association.
 	 * This corresponds to the sndbuf size for the association,
 	 * as specified in the sk->sndbuf.
@@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
 __u32 sctp_association_get_next_tsn(struct sctp_association *);

 void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
-void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
-void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
+void sctp_assoc_rwnd_account(struct sctp_association *, int);
 void sctp_assoc_set_primary(struct sctp_association *,
 			    struct sctp_transport *);
 void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
--- net-next.orig/net/sctp/sm_statefuns.c
+++ net-next/net/sctp/sm_statefuns.c
@@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
 	 * PMTU.  In cases, such as loopback, this might be a rather
 	 * large spill over.
 	 */
-	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
+	if ((!chunk->data_accepted) && (!asoc->rwnd ||
 	    (datalen > asoc->rwnd + asoc->frag_point))) {

 		/* If this is the next TSN, consider reneging to make
--- net-next.orig/net/sctp/socket.c
+++ net-next/net/sctp/socket.c
@@ -2097,7 +2097,7 @@ static int sctp_recvmsg(struct kiocb *io
 		 * rwnd is updated when the event is freed.
 		 */
 		if (!sctp_ulpevent_is_notification(event))
-			sctp_assoc_rwnd_increase(event->asoc, copied);
+			sctp_assoc_rwnd_account(event->asoc, 1);
 		goto out;
 	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
 		   (event->msg_flags & MSG_EOR))
--- net-next.orig/net/sctp/ulpevent.c
+++ net-next/net/sctp/ulpevent.c
@@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
 	skb = sctp_event2skb(event);
 	/* Set the owner and charge rwnd for bytes received.  */
 	sctp_ulpevent_set_owner(event, asoc);
-	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
+	sctp_assoc_rwnd_account(asoc, 0);

 	if (!skb->data_len)
 		return;
@@ -1035,7 +1035,7 @@ static void sctp_ulpevent_release_data(s
 	}

 done:
-	sctp_assoc_rwnd_increase(event->asoc, len);
+	sctp_assoc_rwnd_account(event->asoc, 1);
 	sctp_ulpevent_release_owner(event);
 }


* Re: [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
From: Veaceslav Falico @ 2014-01-17  6:57 UTC
  To: Jay Vosburgh; +Cc: netdev, Andy Gospodarek, David S. Miller
In-Reply-To: <21868.1389911939@death.nxdomain>

On Thu, Jan 16, 2014 at 02:38:59PM -0800, Jay Vosburgh wrote:
...snip...
>	I think the bottom line here is pretty simple:
>
>	Using the ARP monitor with the loadbalance modes is not a common
>configuration in my experience, and making it work is tricky.  However,
>anyone using it today will be relying on the current behavior, which we
>therefore must not change.

Yep, agreed. It might be against the documentation, these use cases might
be weird/illogical - but they (kind of) work, and we both agree that this
change might break them, so it's definitely a no go.

OTOH, I'd still like to help people who have some kind of broadcast traffic
(STP, CDP, some routing etc.) running over network and keeping their slaves
up (and those that cannot/don't want to use arp_validate=3).

What do you think about this*? It's on top of this series, extends
arp_validate to (not) filter out ARPs on not-validated slaves and permits
it to be used in non-AB mode (also, we don't need that bond->lock, we're
always under RCU).

*:

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index a4d925e..7223ef4 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -279,19 +279,45 @@ arp_validate
  
  	none or 0
  
-		No validation is performed.  This is the default.
+		No validation is performed.  This is the default. Any arriving
+		traffic (arp or non-arp) is considered a proof that the slave
+		is up.
  
  	active or 1
  
-		Validation is performed only for the active slave.
+		Validation is performed only for the active slave. Only ARPs
+		that arrive from any arp_ip_target are considered legit. The
+		backup slave still does no validation (as if arp_validate=0).
  
  	backup or 2
  
-		Validation is performed only for backup slaves.
+		Validation is performed only for backup slaves. Only ARPs
+		that arrive from any arp_ip_target are considered legit. The
+		active slave still has no validation (as if arp_validate=0).
  
  	all or 3
  
-		Validation is performed for all slaves.
+		Validation is performed for all slaves. Only ARPs
+		that arrive from any arp_ip_target are considered legit.
+
+	arp or 4
+
+		Any arp packet is accepted as a proof that any slave is up,
+		but no IP-based validation is made.
+
+	active_arp or 5
+
+		Validation is performed only for the active slave. Only ARPs
+		that arrive from any arp_ip_target are considered legit. The
+		backup slave validates only arp packets, but doesn't check the
+		source (as if arp_validate=4).
+
+	backup_arp or 6
+
+		Validation is performed only for backup slaves. Only ARPs
+		that arrive from any arp_ip_target are considered legit. The
+		active slave validates only arp packets, but doesn't check the
+		source (as if arp_validate=4).
  
  	For the active slave, the validation checks ARP replies to
  	confirm that they were generated by an arp_ip_target.  Since
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 0f613ae..2ef1d5a 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -246,6 +246,9 @@ const struct bond_parm_tbl arp_validate_tbl[] = {
  {	"active",		BOND_ARP_VALIDATE_ACTIVE},
  {	"backup",		BOND_ARP_VALIDATE_BACKUP},
  {	"all",			BOND_ARP_VALIDATE_ALL},
+{	"arp",			BOND_ARP_VALIDATE_ARP},
+{	"active_arp",		BOND_ARP_VALIDATE_ACTIVE_ARP},
+{	"backup_arp",		BOND_ARP_VALIDATE_BACKUP_ARP},
  {	NULL,			-1},
  };
  
@@ -2284,16 +2287,15 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
  	struct arphdr *arp = (struct arphdr *)skb->data;
  	unsigned char *arp_ptr;
  	__be32 sip, tip;
-	int alen;
-
-	if (skb->protocol != __cpu_to_be16(ETH_P_ARP))
-		return RX_HANDLER_ANOTHER;
-
-	read_lock(&bond->lock);
+	int alen, is_arp = skb->protocol == __cpu_to_be16(ETH_P_ARP);
  
  	if (!slave_do_arp_validate(bond, slave)) {
-		slave->last_arp_rx = jiffies;
-		goto out_unlock;
+		if ((slave_do_arp_validate_only(bond, slave) && is_arp) ||
+		    !slave_do_arp_validate_only(bond, slave))
+			slave->last_arp_rx = jiffies;
+		return RX_HANDLER_ANOTHER;
+	} else if (!is_arp) {
+		return RX_HANDLER_ANOTHER;
  	}
  
  	alen = arp_hdr_len(bond->dev);
@@ -2349,7 +2351,6 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
  		bond_validate_arp(bond, slave, tip, sip);
  
  out_unlock:
-	read_unlock(&bond->lock);
  	if (arp != (struct arphdr *)skb->data)
  		kfree(arp);
  	return RX_HANDLER_ANOTHER;
@@ -4181,10 +4182,6 @@ static int bond_check_params(struct bond_params *params)
  	}
  
  	if (arp_validate) {
-		if (bond_mode != BOND_MODE_ACTIVEBACKUP) {
-			pr_err("arp_validate only supported in active-backup mode\n");
-			return -EINVAL;
-		}
  		if (!arp_interval) {
  			pr_err("arp_validate requires arp_interval\n");
  			return -EINVAL;
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 1bab20e..9d6d231 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -434,12 +434,6 @@ int bond_option_arp_validate_set(struct bonding *bond, int arp_validate)
  		return -EINVAL;
  	}
  
-	if (bond->params.mode != BOND_MODE_ACTIVEBACKUP) {
-		pr_err("%s: arp_validate only supported in active-backup mode.\n",
-		       bond->dev->name);
-		return -EINVAL;
-	}
-
  	pr_info("%s: setting arp_validate to %s (%d).\n",
  		bond->dev->name, arp_validate_tbl[arp_validate].modename,
  		arp_validate);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 9f07af1..19eb023 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -319,6 +319,11 @@ static inline bool bond_is_active_slave(struct slave *slave)
  #define BOND_ARP_VALIDATE_BACKUP	(1 << BOND_STATE_BACKUP)
  #define BOND_ARP_VALIDATE_ALL		(BOND_ARP_VALIDATE_ACTIVE | \
  					 BOND_ARP_VALIDATE_BACKUP)
+#define BOND_ARP_VALIDATE_ARP		(BOND_ARP_VALIDATE_ALL + 1) /* damn... */
+#define BOND_ARP_VALIDATE_ACTIVE_ARP	(BOND_ARP_VALIDATE_ACTIVE | \
+					 BOND_ARP_VALIDATE_ARP)
+#define BOND_ARP_VALIDATE_BACKUP_ARP	(BOND_ARP_VALIDATE_BACKUP | \
+					 BOND_ARP_VALIDATE_ARP)
  
  static inline int slave_do_arp_validate(struct bonding *bond,
  					struct slave *slave)
@@ -326,6 +331,12 @@ static inline int slave_do_arp_validate(struct bonding *bond,
  	return bond->params.arp_validate & (1 << bond_slave_state(slave));
  }
  
+static inline int slave_do_arp_validate_only(struct bonding *bond,
+					     struct slave *slave)
+{
+	return bond->params.arp_validate & BOND_ARP_VALIDATE_ARP;
+}
+
  /* Get the oldest arp which we've received on this slave for bond's
   * arp_targets.
   */
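
To make the new flag arithmetic easier to follow, here is a minimal
userspace sketch of the decision the patched bond_arp_rcv() makes before
validation. It assumes BOND_STATE_ACTIVE is 0 and BOND_STATE_BACKUP is 1,
so that BOND_ARP_VALIDATE_ARP works out to 4; the shortened names
(VALIDATE_*, do_validate, arp_only, counts_unvalidated) are illustrative
and not part of the patch.

```c
#include <stdbool.h>

/* Assumed values: BOND_STATE_ACTIVE == 0 and BOND_STATE_BACKUP == 1. */
enum { STATE_ACTIVE = 0, STATE_BACKUP = 1 };

#define VALIDATE_ACTIVE (1 << STATE_ACTIVE)                 /* 1 */
#define VALIDATE_BACKUP (1 << STATE_BACKUP)                 /* 2 */
#define VALIDATE_ALL    (VALIDATE_ACTIVE | VALIDATE_BACKUP) /* 3 */
#define VALIDATE_ARP    (VALIDATE_ALL + 1)                  /* 4 */

/* Mirrors slave_do_arp_validate(): is validation on for this slave state? */
static bool do_validate(int arp_validate, int slave_state)
{
	return arp_validate & (1 << slave_state);
}

/* Mirrors slave_do_arp_validate_only(): is the "ARP only" bit set? */
static bool arp_only(int arp_validate)
{
	return arp_validate & VALIDATE_ARP;
}

/* True when a frame refreshes last_arp_rx without going through validation. */
static bool counts_unvalidated(int arp_validate, int slave_state, bool is_arp)
{
	if (do_validate(arp_validate, slave_state))
		return false;
	/* (only && is_arp) || !only  reduces to  !only || is_arp */
	return !arp_only(arp_validate) || is_arp;
}
```

With arp_validate unset, any frame counts; with only the "arp" bit set,
only ARP frames count; with "active_arp", a backup slave counts ARP
frames only while the active slave goes through full validation.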

>
>	-J

^ permalink raw reply related

* Re: [Xen-devel] [PATCH net-next v2] xen-netfront: clean up code in xennet_release_rx_bufs
From: annie li @ 2014-01-17  6:58 UTC (permalink / raw)
  To: David Vrabel
  Cc: wei.liu2, ian.campbell, netdev, xen-devel, andrew.bennieston,
	davem
In-Reply-To: <52D8CCE4.9010804@oracle.com>


On 2014/1/17 14:25, annie li wrote:
>
> On 2014/1/16 19:10, David Vrabel wrote:
>> On 15/01/14 23:57, Annie Li wrote:
>>> This patch implements two things:
>>>
>>> * release grant reference and skb for rx path, this fixes resource
>>> leaking.
>>> * clean up grant transfer code kept from old netfront(2.6.18) which
>>> grants pages for access/map and transfer. But grant transfer is
>>> deprecated in current netfront, so remove corresponding release code
>>> for transfer.
>>>
>>> gnttab_end_foreign_access_ref may fail when the grant entry is
>>> currently used for reading or writing. But this patch does not cover
>>> this and improvement for this failure may be implemented in a
>>> separate patch.
>> I don't think replacing a resource leak with a security bug is a good 
>> idea.
>>
>> If you would prefer not to fix the gnttab_end_foreign_access() call, I
>> think you can fix this in netfront by taking a reference to the page
>> before calling gnttab_end_foreign_access().  This will ensure the page
>> isn't freed until the subsequent kfree_skb(), or the gref is released by
>> the foreign domain (whichever is later).
>
> Taking a reference to the page before calling
> gnttab_end_foreign_access() delays the free until kfree_skb().
> Simply adding put_page() before kfree_skb() is no different from
> using gnttab_end_foreign_access_ref(): the pages will be freed by
> kfree_skb(), and gnttab_handle_deferred() will then hit a problem
> when it frees pages that have already been freed.
>
> So put_page() is required in gnttab_end_foreign_access(); this
> ensures the free is done by either kfree_skb() or
> gnttab_handle_deferred(). This involves changes in
> blkfront/pcifront/tpmfront (just like your patch), and ensures the
> page is released when the grant reference is ended.
>
> Another solution I am considering is calling
> gnttab_end_foreign_access() with the page parameter as NULL, so that
> gnttab_end_foreign_access() only ends the grant reference and the
> page is released by kfree_skb().

Correction: the page parameter above should be 0UL, not NULL.

Thanks
Annie

^ permalink raw reply

* [PATCH net-next v2 2/2] bonding: add netlink attributes to slave link dev
From: Scott Feldman @ 2014-01-17  6:57 UTC (permalink / raw)
  To: vfalico, fubar, andy; +Cc: netdev, roopa, shm, dingtianhong

If link is IFF_SLAVE, extend link dev netlink attributes to include
slave attributes with new IFLA_SLAVE nest.  Add netlink notification
(RTM_NEWLINK) when slave status changes from backup to active, or
vice versa.

Adds new ndo_get_slave op to net_device_ops to fill skb with IFLA_SLAVE
attributes.  Currently only used by bonding driver, but could be
used by other aggregating devices with slaves.
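
The notify-on-change behaviour described above can be sketched in
userspace as follows. This is only a model under stated assumptions:
struct slave_model and its notifications counter are hypothetical
stand-ins for struct slave and for the rtmsg_ifinfo(RTM_NEWLINK, ...)
calls, and set_active()/set_backup() mirror the shape of
bond_set_active_slave()/bond_set_backup_slave() in the diff below.

```c
#include <stdbool.h>

struct slave_model {
	bool backup;        /* mirrors slave->backup */
	int  notifications; /* counts would-be RTM_NEWLINK messages */
};

/* Notify only when the slave actually leaves the backup state. */
static void set_active(struct slave_model *s)
{
	if (s->backup) {
		s->backup = false;
		s->notifications++;
	}
}

/* Notify only when the slave actually enters the backup state. */
static void set_backup(struct slave_model *s)
{
	if (!s->backup) {
		s->backup = true;
		s->notifications++;
	}
}
```

Calling set_active() twice in a row on a backup slave produces a single
notification, which is the point of guarding the assignment: listeners
see one message per real transition rather than one per call.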

Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
---
 drivers/net/bonding/bond_main.c    |    1 +
 drivers/net/bonding/bond_netlink.c |   36 ++++++++++++++++++++++++
 drivers/net/bonding/bonding.h      |   11 ++++++-
 include/linux/netdevice.h          |    5 +++
 include/uapi/linux/if_link.h       |   13 +++++++++
 net/core/rtnetlink.c               |   54 ++++++++++++++++++++++++++++++++++++
 6 files changed, 118 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index df85cec..3220b48 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3883,6 +3883,7 @@ static const struct net_device_ops bond_netdev_ops = {
 #endif
 	.ndo_add_slave		= bond_enslave,
 	.ndo_del_slave		= bond_release,
+	.ndo_get_slave		= bond_get_slave,
 	.ndo_fix_features	= bond_fix_features,
 };
 
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index 555c783..21c6488 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -22,6 +22,42 @@
 #include <linux/reciprocal_div.h>
 #include "bonding.h"
 
+int bond_get_slave(struct net_device *slave_dev, struct sk_buff *skb)
+{
+	struct slave *slave = bond_slave_get_rtnl(slave_dev);
+	const struct aggregator *agg;
+
+	if (nla_put_u8(skb, IFLA_SLAVE_STATE, bond_slave_state(slave)))
+		goto nla_put_failure;
+
+	if (nla_put_u8(skb, IFLA_SLAVE_MII_STATUS, slave->link))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, IFLA_SLAVE_LINK_FAILURE_COUNT,
+			slave->link_failure_count))
+		goto nla_put_failure;
+
+	if (nla_put(skb, IFLA_SLAVE_PERM_HWADDR,
+		    slave_dev->addr_len, slave->perm_hwaddr))
+		goto nla_put_failure;
+
+	if (nla_put_u16(skb, IFLA_SLAVE_QUEUE_ID, slave->queue_id))
+		goto nla_put_failure;
+
+	if (slave->bond->params.mode == BOND_MODE_8023AD) {
+		agg = SLAVE_AD_INFO(slave).port.aggregator;
+		if (agg)
+			if (nla_put_u16(skb, IFLA_SLAVE_AD_AGGREGATOR_ID,
+					agg->aggregator_identifier))
+				goto nla_put_failure;
+	}
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
 static const struct nla_policy bond_policy[IFLA_BOND_MAX + 1] = {
 	[IFLA_BOND_MODE]		= { .type = NLA_U8 },
 	[IFLA_BOND_ACTIVE_SLAVE]	= { .type = NLA_U32 },
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 309757d..8a935f8 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -285,12 +285,18 @@ static inline bool bond_is_lb(const struct bonding *bond)
 
 static inline void bond_set_active_slave(struct slave *slave)
 {
-	slave->backup = 0;
+	if (slave->backup) {
+		slave->backup = 0;
+		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_KERNEL);
+	}
 }
 
 static inline void bond_set_backup_slave(struct slave *slave)
 {
-	slave->backup = 1;
+	if (!slave->backup) {
+		slave->backup = 1;
+		rtmsg_ifinfo(RTM_NEWLINK, slave->dev, 0, GFP_KERNEL);
+	}
 }
 
 static inline int bond_slave_state(struct slave *slave)
@@ -426,6 +432,7 @@ int bond_sysfs_slave_add(struct slave *slave);
 void bond_sysfs_slave_del(struct slave *slave);
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev);
 int bond_release(struct net_device *bond_dev, struct net_device *slave_dev);
+int bond_get_slave(struct net_device *slave_dev, struct sk_buff *skb);
 int bond_xmit_hash(struct bonding *bond, struct sk_buff *skb, int count);
 int bond_parse_parm(const char *mode_arg, const struct bond_parm_tbl *tbl);
 int bond_parm_tbl_lookup(int mode, const struct bond_parm_tbl *tbl);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d7668b88..22c8698 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -908,6 +908,9 @@ struct netdev_phys_port_id {
  * int (*ndo_del_slave)(struct net_device *dev, struct net_device *slave_dev);
  *	Called to release previously enslaved netdev.
  *
+ * int (*ndo_get_slave)(struct net_device *slave_dev, struct sk_buff *skb);
+ *	Called to fill netlink skb with slave info.
+ *
  *      Feature/offload setting functions.
  * netdev_features_t (*ndo_fix_features)(struct net_device *dev,
  *		netdev_features_t features);
@@ -1080,6 +1083,8 @@ struct net_device_ops {
 						 struct net_device *slave_dev);
 	int			(*ndo_del_slave)(struct net_device *dev,
 						 struct net_device *slave_dev);
+	int			(*ndo_get_slave)(struct net_device *slave_dev,
+						 struct sk_buff *skb);
 	netdev_features_t	(*ndo_fix_features)(struct net_device *dev,
 						    netdev_features_t features);
 	int			(*ndo_set_features)(struct net_device *dev,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 3e6bd3c..ba2f3bf 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -144,6 +144,7 @@ enum {
 	IFLA_NUM_RX_QUEUES,
 	IFLA_CARRIER,
 	IFLA_PHYS_PORT_ID,
+	IFLA_SLAVE,
 	__IFLA_MAX
 };
 
@@ -368,6 +369,18 @@ enum {
 
 #define IFLA_BOND_AD_INFO_MAX	(__IFLA_BOND_AD_INFO_MAX - 1)
 
+enum {
+	IFLA_SLAVE_STATE,
+	IFLA_SLAVE_MII_STATUS,
+	IFLA_SLAVE_LINK_FAILURE_COUNT,
+	IFLA_SLAVE_PERM_HWADDR,
+	IFLA_SLAVE_QUEUE_ID,
+	IFLA_SLAVE_AD_AGGREGATOR_ID,
+	__IFLA_SLAVE_MAX,
+};
+
+#define IFLA_SLAVE_MAX	(__IFLA_SLAVE_MAX - 1)
+
 /* SR-IOV virtual function management section */
 
 enum {
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e6e7d58..4f85de7 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -721,6 +721,28 @@ static size_t rtnl_port_size(const struct net_device *dev)
 		return port_self_size;
 }
 
+static size_t rtnl_bond_slave_size(const struct net_device *dev)
+{
+	struct net_device *bond;
+	size_t slave_size =
+		nla_total_size(sizeof(struct nlattr)) +	/* IFLA_SLAVE */
+		nla_total_size(1) +	/* IFLA_SLAVE_STATE */
+		nla_total_size(1) +	/* IFLA_SLAVE_MII_STATUS */
+		nla_total_size(4) +	/* IFLA_SLAVE_LINK_FAILURE_COUNT */
+		nla_total_size(MAX_ADDR_LEN) +	/* IFLA_SLAVE_PERM_HWADDR */
+		nla_total_size(2) +	/* IFLA_SLAVE_QUEUE_ID */
+		nla_total_size(2) +	/* IFLA_SLAVE_AD_AGGREGATOR_ID */
+		0;
+
+	if (netif_is_bond_slave((struct net_device *)dev)) {
+		bond = netdev_master_upper_dev_get((struct net_device *)dev);
+		if (bond && bond->netdev_ops->ndo_get_slave)
+			return slave_size;
+	}
+
+	return 0;
+}
+
 static noinline size_t if_nlmsg_size(const struct net_device *dev,
 				     u32 ext_filter_mask)
 {
@@ -750,6 +772,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_port_size(dev) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
 	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
 	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
+	       + rtnl_bond_slave_size(dev) /* IFLA_SLAVE */
 	       + nla_total_size(MAX_PHYS_PORT_ID_LEN); /* IFLA_PHYS_PORT_ID */
 }
 
@@ -847,6 +870,34 @@ static int rtnl_phys_port_id_fill(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static int rtnl_bond_slave_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	struct net_device *bond;
+	struct nlattr *nest;
+	int err;
+
+	if (!netif_is_bond_slave(dev))
+		return 0;
+
+	bond = netdev_master_upper_dev_get(dev);
+	if (!bond || !bond->netdev_ops->ndo_get_slave)
+		return 0;
+
+	nest = nla_nest_start(skb, IFLA_SLAVE);
+	if (!nest)
+		return -EMSGSIZE;
+
+	err = bond->netdev_ops->ndo_get_slave(dev, skb);
+	if (err) {
+		nla_nest_cancel(skb, nest);
+		return (err == -EMSGSIZE) ? err : 0;
+	}
+
+	nla_nest_end(skb, nest);
+
+	return 0;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			    int type, u32 pid, u32 seq, u32 change,
 			    unsigned int flags, u32 ext_filter_mask)
@@ -1001,6 +1052,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (rtnl_port_fill(skb, dev))
 		goto nla_put_failure;
 
+	if (rtnl_bond_slave_fill(skb, dev))
+		goto nla_put_failure;
+
 	if (dev->rtnl_link_ops) {
 		if (rtnl_link_fill(skb, dev) < 0)
 			goto nla_put_failure;

^ permalink raw reply related

* [PATCH net-next v2 1/2] bonding: add sysfs /slave dir for bond slave devices.
From: Scott Feldman @ 2014-01-17  6:57 UTC (permalink / raw)
  To: vfalico, fubar, andy; +Cc: netdev, roopa, shm, dingtianhong

Add sub-directory under /sys/class/net/<interface>/slave with
read-only attributes for slave.  Directory only appears when
<interface> is a slave.

$ tree /sys/class/net/eth2/slave/
/sys/class/net/eth2/slave/
├── ad_aggregator_id
├── link_failure_count
├── mii_status
├── perm_hwaddr
├── queue_id
└── state

$ cat /sys/class/net/eth2/slave/*
2
0
up
40:02:10:ef:06:01
0
active
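
For readers who want to consume these nodes from a program rather than
with cat, a minimal userspace sketch follows. The helper name
read_slave_attr() and its base-directory parameter are illustrative
assumptions, not part of the patch; only the
/sys/class/net/<interface>/slave/<attribute> layout comes from the
description above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read one attribute from <base>/<iface>/slave/<attr> into buf, stripping
 * the trailing newline. Returns 0 on success, -1 if the file cannot be
 * read (for example, when <iface> is not a bond slave).
 */
static int read_slave_attr(const char *base, const char *iface,
			   const char *attr, char *buf, size_t len)
{
	char path[256];
	FILE *f;
	int n;

	if (len == 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/%s/slave/%s", base, iface, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Each node holds a single line; drop its newline. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

On the system shown above, calling read_slave_attr("/sys/class/net",
"eth2", "state", buf, sizeof(buf)) would be expected to leave "active"
in buf. Because the slave directory only exists while the interface is
enslaved, a -1 return doubles as a cheap "not a slave" check.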

Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
---
 drivers/net/bonding/Makefile           |    2 
 drivers/net/bonding/bond_main.c        |   27 ++++++
 drivers/net/bonding/bond_procfs.c      |   12 ---
 drivers/net/bonding/bond_sysfs_slave.c |  144 ++++++++++++++++++++++++++++++++
 drivers/net/bonding/bonding.h          |    4 +
 5 files changed, 176 insertions(+), 13 deletions(-)
 create mode 100644 drivers/net/bonding/bond_sysfs_slave.c

diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index 5a5d720..6f4e808 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -4,7 +4,7 @@
 
 obj-$(CONFIG_BONDING) += bonding.o
 
-bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o bond_netlink.o bond_options.o
+bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o
 
 proc-$(CONFIG_PROC_FS) += bond_procfs.o
 bonding-objs += $(proc-y)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f00dd45..df85cec 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -466,6 +466,22 @@ static void bond_update_speed_duplex(struct slave *slave)
 	return;
 }
 
+const char *bond_slave_link_status(s8 link)
+{
+	switch (link) {
+	case BOND_LINK_UP:
+		return "up";
+	case BOND_LINK_FAIL:
+		return "going down";
+	case BOND_LINK_DOWN:
+		return "down";
+	case BOND_LINK_BACK:
+		return "going back";
+	default:
+		return "unknown";
+	}
+}
+
 /*
  * if <dev> supports MII link status reporting, check its link status.
  *
@@ -1576,6 +1592,12 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		goto err_unregister;
 	}
 
+	res = bond_sysfs_slave_add(new_slave);
+	if (res) {
+		pr_debug("Error %d calling bond_sysfs_slave_add\n", res);
+		goto err_upper_unlink;
+	}
+
 	bond->slave_cnt++;
 	bond_compute_features(bond);
 	bond_set_carrier(bond);
@@ -1595,6 +1617,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	return 0;
 
 /* Undo stages on error */
+err_upper_unlink:
+	bond_upper_dev_unlink(bond_dev, slave_dev);
+
 err_unregister:
 	netdev_rx_handler_unregister(slave_dev);
 
@@ -1687,6 +1712,8 @@ static int __bond_release_one(struct net_device *bond_dev,
 	/* release the slave from its bond */
 	bond->slave_cnt--;
 
+	bond_sysfs_slave_del(slave);
+
 	bond_upper_dev_unlink(bond_dev, slave_dev);
 	/* unregister rx_handler early so bond_handle_frame wouldn't be called
 	 * for this slave anymore.
diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c
index fb868d6..8515b344 100644
--- a/drivers/net/bonding/bond_procfs.c
+++ b/drivers/net/bonding/bond_procfs.c
@@ -159,18 +159,6 @@ static void bond_info_show_master(struct seq_file *seq)
 	}
 }
 
-static const char *bond_slave_link_status(s8 link)
-{
-	static const char * const status[] = {
-		[BOND_LINK_UP] = "up",
-		[BOND_LINK_FAIL] = "going down",
-		[BOND_LINK_DOWN] = "down",
-		[BOND_LINK_BACK] = "going back",
-	};
-
-	return status[link];
-}
-
 static void bond_info_show_slave(struct seq_file *seq,
 				 const struct slave *slave)
 {
diff --git a/drivers/net/bonding/bond_sysfs_slave.c b/drivers/net/bonding/bond_sysfs_slave.c
new file mode 100644
index 0000000..7cb97de
--- /dev/null
+++ b/drivers/net/bonding/bond_sysfs_slave.c
@@ -0,0 +1,144 @@
+/*	Sysfs attributes of bond slaves
+ *
+ *      Copyright (c) 2014 Scott Feldman <sfeldma@cumulusnetworks.com>
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/capability.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+
+#include "bonding.h"
+
+struct slave_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct slave *, char *);
+};
+
+#define SLAVE_ATTR(_name, _mode, _show)				\
+const struct slave_attribute slave_attr_##_name = {		\
+	.attr = {.name = __stringify(_name),			\
+		 .mode = _mode },				\
+	.show	= _show,					\
+};
+#define SLAVE_ATTR_RO(_name) \
+	SLAVE_ATTR(_name, S_IRUGO, _name##_show)
+
+static ssize_t state_show(struct slave *slave, char *buf)
+{
+	switch (bond_slave_state(slave)) {
+	case BOND_STATE_ACTIVE:
+		return sprintf(buf, "active\n");
+	case BOND_STATE_BACKUP:
+		return sprintf(buf, "backup\n");
+	default:
+		return sprintf(buf, "UNKNOWN\n");
+	}
+}
+static SLAVE_ATTR_RO(state);
+
+static ssize_t mii_status_show(struct slave *slave, char *buf)
+{
+	return sprintf(buf, "%s\n", bond_slave_link_status(slave->link));
+}
+static SLAVE_ATTR_RO(mii_status);
+
+static ssize_t link_failure_count_show(struct slave *slave, char *buf)
+{
+	return sprintf(buf, "%d\n", slave->link_failure_count);
+}
+static SLAVE_ATTR_RO(link_failure_count);
+
+static ssize_t perm_hwaddr_show(struct slave *slave, char *buf)
+{
+	return sprintf(buf, "%pM\n", slave->perm_hwaddr);
+}
+static SLAVE_ATTR_RO(perm_hwaddr);
+
+static ssize_t queue_id_show(struct slave *slave, char *buf)
+{
+	return sprintf(buf, "%d\n", slave->queue_id);
+}
+static SLAVE_ATTR_RO(queue_id);
+
+static ssize_t ad_aggregator_id_show(struct slave *slave, char *buf)
+{
+	const struct aggregator *agg;
+
+	if (slave->bond->params.mode == BOND_MODE_8023AD) {
+		agg = SLAVE_AD_INFO(slave).port.aggregator;
+		if (agg)
+			return sprintf(buf, "%d\n",
+				       agg->aggregator_identifier);
+	}
+
+	return sprintf(buf, "N/A\n");
+}
+static SLAVE_ATTR_RO(ad_aggregator_id);
+
+static const struct slave_attribute *slave_attrs[] = {
+	&slave_attr_state,
+	&slave_attr_mii_status,
+	&slave_attr_link_failure_count,
+	&slave_attr_perm_hwaddr,
+	&slave_attr_queue_id,
+	&slave_attr_ad_aggregator_id,
+	NULL
+};
+
+#define to_slave_attr(_at) container_of(_at, struct slave_attribute, attr)
+#define to_slave(obj)	container_of(obj, struct slave, kobj)
+
+static ssize_t slave_show(struct kobject *kobj,
+			  struct attribute *attr, char *buf)
+{
+	struct slave_attribute *slave_attr = to_slave_attr(attr);
+	struct slave *slave = to_slave(kobj);
+
+	return slave_attr->show(slave, buf);
+}
+
+const struct sysfs_ops slave_sysfs_ops = {
+	.show = slave_show,
+};
+
+static struct kobj_type slave_ktype = {
+#ifdef CONFIG_SYSFS
+	.sysfs_ops = &slave_sysfs_ops,
+#endif
+};
+
+int bond_sysfs_slave_add(struct slave *slave)
+{
+	const struct slave_attribute **a;
+	int err;
+
+	err = kobject_init_and_add(&slave->kobj, &slave_ktype,
+				   &(slave->dev->dev.kobj), "slave");
+	if (err)
+		return err;
+
+	for (a = slave_attrs; *a; ++a) {
+		err = sysfs_create_file(&slave->kobj, &((*a)->attr));
+		if (err) {
+			kobject_del(&slave->kobj);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+void bond_sysfs_slave_del(struct slave *slave)
+{
+	const struct slave_attribute **a;
+
+	for (a = slave_attrs; *a; ++a)
+		sysfs_remove_file(&slave->kobj, &((*a)->attr));
+
+	kobject_del(&slave->kobj);
+}
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 955dc48..309757d 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -203,6 +203,7 @@ struct slave {
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	struct netpoll *np;
 #endif
+	struct kobject kobj;
 };
 
 /*
@@ -421,6 +422,8 @@ int bond_create(struct net *net, const char *name);
 int bond_create_sysfs(struct bond_net *net);
 void bond_destroy_sysfs(struct bond_net *net);
 void bond_prepare_sysfs_group(struct bonding *bond);
+int bond_sysfs_slave_add(struct slave *slave);
+void bond_sysfs_slave_del(struct slave *slave);
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev);
 int bond_release(struct net_device *bond_dev, struct net_device *slave_dev);
 int bond_xmit_hash(struct bonding *bond, struct sk_buff *skb, int count);
@@ -469,6 +472,7 @@ int bond_option_lacp_rate_set(struct bonding *bond, int lacp_rate);
 int bond_option_ad_select_set(struct bonding *bond, int ad_select);
 struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond);
 struct net_device *bond_option_active_slave_get(struct bonding *bond);
+const char *bond_slave_link_status(s8 link);
 
 struct bond_net {
 	struct net *		net;	/* Associated network namespace */

^ permalink raw reply related

* [PATCH net-next v2 0/2] bonding: add slave netlink and sysfs support
From: Scott Feldman @ 2014-01-17  6:57 UTC (permalink / raw)
  To: vfalico, fubar, andy; +Cc: netdev, roopa, shm, dingtianhong

v2:

  - Address review comment from Ding (and Veaceslav): handle kobj cleanup
    if sysfs_create_file() fails when adding slave attribute nodes.

v1:

  The following series adds bonding slave netlink and sysfs interfaces.
  Slave interfaces get a new IFLA_SLAVE set of netlink attributes, along
  with RTM_NEWLINK notification when slave's active status changes.  The
  sysfs interface adds read-only nodes for slave attributes under a /slave
  dir, similar to how bond interfaces get a /bonding dir for bonding
  attributes.

---

Scott Feldman (2):
      bonding: add sysfs /slave dir for bond slave devices.
      bonding: add netlink attributes to slave link dev


 drivers/net/bonding/Makefile           |    2 
 drivers/net/bonding/bond_main.c        |   28 ++++++
 drivers/net/bonding/bond_netlink.c     |   36 ++++++++
 drivers/net/bonding/bond_procfs.c      |   12 ---
 drivers/net/bonding/bond_sysfs_slave.c |  144 ++++++++++++++++++++++++++++++++
 drivers/net/bonding/bonding.h          |   15 +++
 include/linux/netdevice.h              |    5 +
 include/uapi/linux/if_link.h           |   13 +++
 net/core/rtnetlink.c                   |   54 ++++++++++++
 9 files changed, 294 insertions(+), 15 deletions(-)
 create mode 100644 drivers/net/bonding/bond_sysfs_slave.c

-- 
Signature

^ permalink raw reply

* Re: [Patch 2/2 net-next v6] ixgbe: set driver_max_VFs should be done before enabling SRIOV
From: ethan zhao @ 2014-01-17  6:51 UTC (permalink / raw)
  To: Aaron Brown; +Cc: davem, netdev, gospo, sassmann, ethan.kernel
In-Reply-To: <1389930065-3330-3-git-send-email-aaron.f.brown@intel.com>

Great job. Thanks a lot.

Ethan

On 2014/1/17 11:41, Aaron Brown wrote:
> From: "ethan.zhao" <ethan.zhao@oracle.com>
>
> Commit 43dc4e01 ("Limit number of reported VFs to device specific
> value") does not work: pci_sriov_set_totalvfs() always returns -EBUSY
> because the VFs are already enabled.
>
> ixgbe_enable_sriov()
>          pci_enable_sriov()
>                  sriov_enable()
>                  {
>                  ... ..
>                  iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
>                  pci_cfg_access_lock(dev);
>                  ... ...
>                  }
>
> pci_sriov_set_totalvfs()
> {
> ... ...
> if (dev->sriov->ctrl & PCI_SRIOV_CTRL_VFE)
>                  return -EBUSY;
> ...
> }
>
> So driver_max_VFs should be set with pci_sriov_set_totalvfs() before
> the VFs are enabled with ixgbe_enable_sriov().
>
> V2: revised for net-next tree.
>
> Signed-off-by: Ethan Zhao <ethan.kernel@gmail.com>
> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
> ---
>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 3fd4d3f..61d985c 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -8019,8 +8019,8 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>   	/* Mailbox */
>   	ixgbe_init_mbx_params_pf(hw);
>   	memcpy(&hw->mbx.ops, ii->mbx_ops, sizeof(hw->mbx.ops));
> -	ixgbe_enable_sriov(adapter);
>   	pci_sriov_set_totalvfs(pdev, IXGBE_MAX_VFS_DRV_LIMIT);
> +	ixgbe_enable_sriov(adapter);
>   skip_sriov:
>   
>   #endif
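
The ordering problem in the quoted commit message can be reproduced with
a small userspace model. Everything here is a sketch under assumptions:
struct sriov_model, set_totalvfs() and enable_sriov() are hypothetical
stand-ins for the PCI core state, pci_sriov_set_totalvfs() and the VF
enable step inside sriov_enable(), and the limit of 63 used in the note
below is just an example value.

```c
#include <errno.h>

#define SRIOV_CTRL_VFE 0x01 /* stand-in for PCI_SRIOV_CTRL_VFE */

struct sriov_model {
	unsigned int ctrl;           /* mirrors dev->sriov->ctrl */
	unsigned int driver_max_vfs; /* 0 means no driver limit set */
};

/* Models pci_sriov_set_totalvfs(): refuses once VFs are enabled. */
static int set_totalvfs(struct sriov_model *m, unsigned int numvfs)
{
	if (m->ctrl & SRIOV_CTRL_VFE)
		return -EBUSY;
	m->driver_max_vfs = numvfs;
	return 0;
}

/* Models the relevant effect of sriov_enable(): sets the VF enable bit. */
static void enable_sriov(struct sriov_model *m)
{
	m->ctrl |= SRIOV_CTRL_VFE;
}
```

Calling enable_sriov() first makes a later set_totalvfs(&m, 63) return
-EBUSY and leaves the limit unset, which is the old probe order; calling
set_totalvfs() first succeeds, which is the order the patch restores.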

^ permalink raw reply

* [PATCH RFC 6/6] ARM: sun7i: cubietruck: enable bluetooth module
From: Chen-Yu Tsai @ 2014-01-17  6:47 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: Chen-Yu Tsai, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Maxime Ripard,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1389941251-32692-1-git-send-email-wens-jdAy2FN1RRM@public.gmane.org>

The CubieTruck has an AMPAK AP6210 WiFi+Bluetooth module. The
Bluetooth part is a BCM20710 device connected to UART2 in the
A20 SoC.

The IC also requires a 32.768 kHz low-power clock input for
proper auto-detection of the main clock, and power enable and
wake signals via GPIO.

Signed-off-by: Chen-Yu Tsai <wens-jdAy2FN1RRM@public.gmane.org>
---
 arch/arm/boot/dts/sun7i-a20-cubietruck.dts | 37 ++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/arch/arm/boot/dts/sun7i-a20-cubietruck.dts b/arch/arm/boot/dts/sun7i-a20-cubietruck.dts
index c8b3ea9..f172a8f 100644
--- a/arch/arm/boot/dts/sun7i-a20-cubietruck.dts
+++ b/arch/arm/boot/dts/sun7i-a20-cubietruck.dts
@@ -78,6 +78,20 @@
 					allwinner,drive = <0>;
 					allwinner,pull = <2>;
 			};
+
+			bt_pwr_pin: bt_pwr_pin@0 {
+				allwinner,pins = "PH18";
+				allwinner,function = "gpio_out";
+				allwinner,drive = <0>;
+				allwinner,pull = <0>;
+			};
+
+			bt_wake_pin: bt_wake_pin@0 {
+				allwinner,pins = "PH24";
+				allwinner,function = "gpio_out";
+				allwinner,drive = <0>;
+				allwinner,pull = <0>;
+			};
 		};
 
 		uart0: serial@01c28000 {
@@ -86,6 +100,12 @@
 			status = "okay";
 		};
 
+		uart2: serial@01c28800 {
+			pinctrl-names = "default";
+			pinctrl-0 = <&uart2_pins_a>;
+			status = "okay";
+		};
+
 		gmac: ethernet@01c50000 {
 			pinctrl-names = "default";
 			pinctrl-0 = <&gmac_pins_rgmii>;
@@ -157,4 +177,21 @@
 			gpio = <&pio 7 3 0>;
 		};
 	};
+
+	rfkill-switches {
+		compatible = "simple-bus";
+		pinctrl-names = "default";
+
+		rfkill_bt {
+			compatible = "rfkill-gpio";
+			pinctrl-names = "default";
+			pinctrl-0 = <&bt_pwr_pin>, <&clk_out_a_pins>;
+			rfkill-name = "bt";
+			rfkill-type = <2>;
+			bt_shutdown-gpios = <0>, <&pio 7 18 0>; /* PH18 */
+			bt_reset-gpios = <&pio 7 24 0>; /* PH24 */
+			clocks = <&clk_out_a>;
+			clock-frequency = <32768>;
+		};
+	};
 };
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH RFC 5/6] net: rfkill: gpio: add clock-frequency device tree property
From: Chen-Yu Tsai @ 2014-01-17  6:47 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: Chen-Yu Tsai, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Maxime Ripard,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1389941251-32692-1-git-send-email-wens-jdAy2FN1RRM@public.gmane.org>

Some devices, such as Broadcom Bluetooth devices, require a specific
clock rate for the clock tied to the rfkill device. Add clock-frequency
property so we can specify this from the device tree.

Signed-off-by: Chen-Yu Tsai <wens-jdAy2FN1RRM@public.gmane.org>
---
 Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt | 2 ++
 net/rfkill/rfkill-gpio.c                                 | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt b/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt
index 8a07ea4..8b8db0a 100644
--- a/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt
+++ b/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt
@@ -13,6 +13,7 @@ NAME_reset-gpios, or both, must be defined.
 
 Optional properties:
 - clocks		: phandle to clock to enable/disable
+- clock-frequency	: clock rate to set for the given clock
 
 Example:
 
@@ -23,4 +24,5 @@ Example:
 		bluetooth_shutdown-gpios = <0>, <&pio 7 18 0>;
 		bluetooth_reset-gpios = <&pio 7 24 0>;
 		clocks = <&clk_out_a>;
+		clock-frequency = <32768>;
 	};
diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
index 48381a8..3092681 100644
--- a/net/rfkill/rfkill-gpio.c
+++ b/net/rfkill/rfkill-gpio.c
@@ -40,6 +40,7 @@ struct rfkill_gpio_data {
 	char			*reset_name;
 	char			*shutdown_name;
 	struct clk		*clk;
+	u32			clk_frequency;
 
 	bool			clk_enabled;
 };
@@ -92,6 +93,7 @@ static int rfkill_gpio_dt_probe(struct device *dev,
 	rfkill->name = np->name;
 	of_property_read_string(np, "rfkill-name", &rfkill->name);
 	of_property_read_u32(np, "rfkill-type", &rfkill->type);
+	of_property_read_u32(np, "clock-frequency", &rfkill->clk_frequency);
 
 	return 0;
 }
@@ -138,6 +140,8 @@ static int rfkill_gpio_probe(struct platform_device *pdev)
 	snprintf(rfkill->shutdown_name, len + 10, "%s_shutdown", rfkill->name);
 
 	rfkill->clk = devm_clk_get(&pdev->dev, clk_name);
+	if (!IS_ERR(rfkill->clk) && rfkill->clk_frequency > 0)
+		clk_set_rate(rfkill->clk, rfkill->clk_frequency);
 
 	gpio = devm_gpiod_get_index(&pdev->dev, rfkill->reset_name, 0);
 	if (!IS_ERR(gpio)) {
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH RFC 4/6] net: rfkill: gpio: add device tree support
From: Chen-Yu Tsai @ 2014-01-17  6:47 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: Chen-Yu Tsai, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Maxime Ripard,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1389941251-32692-1-git-send-email-wens-jdAy2FN1RRM@public.gmane.org>

Signed-off-by: Chen-Yu Tsai <wens-jdAy2FN1RRM@public.gmane.org>
---
 .../devicetree/bindings/rfkill/rfkill-gpio.txt     | 26 ++++++++++++++++++++++
 net/rfkill/rfkill-gpio.c                           | 23 +++++++++++++++++++
 2 files changed, 49 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt

diff --git a/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt b/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt
new file mode 100644
index 0000000..8a07ea4
--- /dev/null
+++ b/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt
@@ -0,0 +1,26 @@
+GPIO controlled RFKILL devices
+
+Required properties:
+- compatible	: Must be "rfkill-gpio".
+- rfkill-name	: Name of RFKILL device
+- rfkill-type	: Type of RFKILL device: 1 for WiFi, 2 for Bluetooth
+- NAME_shutdown-gpios	: GPIO phandle to shutdown control
+			  (phandle must be the second)
+- NAME_reset-gpios	: GPIO phandle to reset control
+
+NAME must match the rfkill-name property. NAME_shutdown-gpios or
+NAME_reset-gpios, or both, must be defined.
+
+Optional properties:
+- clocks		: phandle to clock to enable/disable
+
+Example:
+
+	rfkill_bt: rfkill@0 {
+		compatible = "rfkill-gpio";
+		rfkill-name = "bluetooth";
+		rfkill-type = <2>;
+		bluetooth_shutdown-gpios = <0>, <&pio 7 18 0>;
+		bluetooth_reset-gpios = <&pio 7 24 0>;
+		clocks = <&clk_out_a>;
+	};
diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
index 3084fa3..48381a8 100644
--- a/net/rfkill/rfkill-gpio.c
+++ b/net/rfkill/rfkill-gpio.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/acpi.h>
 #include <linux/gpio/consumer.h>
+#include <linux/of_gpio.h>
 
 #include <linux/rfkill-gpio.h>
 
@@ -83,6 +84,18 @@ static int rfkill_gpio_acpi_probe(struct device *dev,
 	return 0;
 }
 
+static int rfkill_gpio_dt_probe(struct device *dev,
+				struct rfkill_gpio_data *rfkill)
+{
+	struct device_node *np = dev->of_node;
+
+	rfkill->name = np->name;
+	of_property_read_string(np, "rfkill-name", &rfkill->name);
+	of_property_read_u32(np, "rfkill-type", &rfkill->type);
+
+	return 0;
+}
+
 static int rfkill_gpio_probe(struct platform_device *pdev)
 {
 	struct rfkill_gpio_platform_data *pdata = pdev->dev.platform_data;
@@ -100,6 +113,10 @@ static int rfkill_gpio_probe(struct platform_device *pdev)
 		ret = rfkill_gpio_acpi_probe(&pdev->dev, rfkill);
 		if (ret)
 			return ret;
+	} else if (pdev->dev.of_node) {
+		ret = rfkill_gpio_dt_probe(&pdev->dev, rfkill);
+		if (ret)
+			return ret;
 	} else if (pdata) {
 		clk_name = pdata->power_clk_name;
 		rfkill->name = pdata->name;
@@ -189,6 +206,11 @@ static const struct acpi_device_id rfkill_acpi_match[] = {
 	{ },
 };
 
+static const struct of_device_id rfkill_of_match[] = {
+	{ .compatible = "rfkill-gpio", },
+	{},
+};
+
 static struct platform_driver rfkill_gpio_driver = {
 	.probe = rfkill_gpio_probe,
 	.remove = rfkill_gpio_remove,
@@ -196,6 +218,7 @@ static struct platform_driver rfkill_gpio_driver = {
 		.name = "rfkill_gpio",
 		.owner = THIS_MODULE,
 		.acpi_match_table = ACPI_PTR(rfkill_acpi_match),
+		.of_match_table = of_match_ptr(rfkill_of_match),
 	},
 };
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH RFC 3/6] net: rfkill: gpio: fix reversed clock enable state
From: Chen-Yu Tsai @ 2014-01-17  6:47 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: Chen-Yu Tsai, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Maxime Ripard,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1389941251-32692-1-git-send-email-wens-jdAy2FN1RRM@public.gmane.org>

rfkill-gpio has clk_enabled = blocked, which is true when rfkill
blocks the device. This results in calling clock enable/disable at
the wrong time. Reversing the value fixes this.
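
As a rough illustration of why the inverted flag matters, here is a toy
userspace model of the bookkeeping in rfkill_gpio_set_power(). The names
struct toy_rfkill, toy_set_power() and invert_bug are made up for this
sketch, and the clock is reduced to a counter of enable calls minus
disable calls; it is not the driver code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the driver state; not the real struct rfkill_gpio_data. */
struct toy_rfkill {
	bool clk_enabled;	/* what the driver believes about the clock */
	int clk_refs;		/* enable calls minus disable calls */
	bool invert_bug;	/* true: old "clk_enabled = blocked" behaviour */
};

/* Same shape as rfkill_gpio_set_power(), minus the GPIO writes. */
static void toy_set_power(struct toy_rfkill *rk, bool blocked)
{
	assert(rk);

	if (blocked) {
		if (rk->clk_enabled)
			rk->clk_refs--;		/* stands for clk_disable_unprepare() */
	} else {
		if (!rk->clk_enabled)
			rk->clk_refs++;		/* stands for clk_prepare_enable() */
	}

	/* The one line the patch changes. */
	rk->clk_enabled = rk->invert_bug ? blocked : !blocked;
}
```

In this model, unblocking and then blocking leaves the counter at 1 with
the old assignment (the clock stays on while the device is blocked) and
at 0 with the fixed one. Unblocking twice in a row enables the clock
twice with the old assignment and once with the fixed one.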

Signed-off-by: Chen-Yu Tsai <wens-jdAy2FN1RRM@public.gmane.org>
---
 net/rfkill/rfkill-gpio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
index c7081b7..3084fa3 100644
--- a/net/rfkill/rfkill-gpio.c
+++ b/net/rfkill/rfkill-gpio.c
@@ -59,7 +59,7 @@ static int rfkill_gpio_set_power(void *data, bool blocked)
 		gpiod_set_value(rfkill->shutdown_gpio, 1);
 	}
 
-	rfkill->clk_enabled = blocked;
+	rfkill->clk_enabled = !blocked;
 
 	return 0;
 }
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH RFC 2/6] net: rfkill: gpio: use clk_prepare_enable/clk_disable_unprepare
From: Chen-Yu Tsai @ 2014-01-17  6:47 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: Chen-Yu Tsai, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Maxime Ripard,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1389941251-32692-1-git-send-email-wens-jdAy2FN1RRM@public.gmane.org>

rfkill-gpio calls clk_enable() without first calling clk_prepare(),
resulting in a warning and no effect. Switch to clk_prepare_enable()
and clk_disable_unprepare().

Signed-off-by: Chen-Yu Tsai <wens-jdAy2FN1RRM@public.gmane.org>
---
 net/rfkill/rfkill-gpio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
index 97ec12a..c7081b7 100644
--- a/net/rfkill/rfkill-gpio.c
+++ b/net/rfkill/rfkill-gpio.c
@@ -51,10 +51,10 @@ static int rfkill_gpio_set_power(void *data, bool blocked)
 		gpiod_set_value(rfkill->shutdown_gpio, 0);
 		gpiod_set_value(rfkill->reset_gpio, 0);
 		if (!IS_ERR(rfkill->clk) && rfkill->clk_enabled)
-			clk_disable(rfkill->clk);
+			clk_disable_unprepare(rfkill->clk);
 	} else {
 		if (!IS_ERR(rfkill->clk) && !rfkill->clk_enabled)
-			clk_enable(rfkill->clk);
+			clk_prepare_enable(rfkill->clk);
 		gpiod_set_value(rfkill->reset_gpio, 1);
 		gpiod_set_value(rfkill->shutdown_gpio, 1);
 	}
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH RFC 1/6] net: rfkill: gpio: fix gpio name buffer size off by 1
From: Chen-Yu Tsai @ 2014-01-17  6:47 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: Chen-Yu Tsai, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Maxime Ripard,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1389941251-32692-1-git-send-email-wens-jdAy2FN1RRM@public.gmane.org>

snprintf should be passed the complete size of the buffer, including
the space for '\0'. The previous code resulted in the *_reset and
*_shutdown strings being truncated.
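
The truncation can be reproduced in userspace with a minimal sketch.
format_reset_name() is a hypothetical helper written for this note; it
only passes the given size to snprintf() the same way the driver does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "<name>_reset" into buf, handing `size` to snprintf() as the
 * buffer size. Returns the number of characters actually stored.
 */
static size_t format_reset_name(char *buf, size_t size, const char *name)
{
	assert(buf && name && size > 0);

	/* snprintf() stores at most size - 1 characters plus the '\0'. */
	snprintf(buf, size, "%s_reset", name);
	return strlen(buf);
}
```

For name = "bluetooth" (len = 9), a size of len + 6 stores
"bluetooth_rese", while len + 7 stores the full "bluetooth_reset".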

Signed-off-by: Chen-Yu Tsai <wens-jdAy2FN1RRM@public.gmane.org>
---
 net/rfkill/rfkill-gpio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
index bd2a5b9..97ec12a 100644
--- a/net/rfkill/rfkill-gpio.c
+++ b/net/rfkill/rfkill-gpio.c
@@ -117,8 +117,8 @@ static int rfkill_gpio_probe(struct platform_device *pdev)
 	if (!rfkill->shutdown_name)
 		return -ENOMEM;
 
-	snprintf(rfkill->reset_name, len + 6 , "%s_reset", rfkill->name);
-	snprintf(rfkill->shutdown_name, len + 9, "%s_shutdown", rfkill->name);
+	snprintf(rfkill->reset_name, len + 7 , "%s_reset", rfkill->name);
+	snprintf(rfkill->shutdown_name, len + 10, "%s_shutdown", rfkill->name);
 
 	rfkill->clk = devm_clk_get(&pdev->dev, clk_name);
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH RFC 0/6] net: rfkill: gpio: Add device tree support
From: Chen-Yu Tsai @ 2014-01-17  6:47 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: Chen-Yu Tsai, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Maxime Ripard,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA

Hi everyone,

This patch series adds device tree support to rfkill-gpio, and
fixes some issues I ran into. This is so we can define and control
RF devices through the device tree, such as the Broadcom BCM20710
UART-based Bluetooth device found on the CubieTruck.

The CubieTruck uses a non-default clock rate oscillator for the
BCM20710 device. As the datasheet states, a precise 32.768 kHz
low power clock must be provided at power on for the device to
detect the correct clock rate of the main oscillator. Hence the
need for the "clock-frequency" property.

The device tree bindings aren't pretty. They are the result of how
gpiod_find was implemented: of_gpiod_find includes con_id in the
DT property name; acpi_gpiod_find ignores it and only uses the index.
A more elegant DT binding would mean splitting the gpio lookup code
path in rfkill-gpio, which would be more like rfkill-gpio prior to
the descriptor-based GPIO patch.
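
To make the naming concrete, here is a small userspace sketch of how the
con_id ends up in the DT property name. It assumes the DT lookup builds
the name as "<con_id>-gpios", which is what the NAME_shutdown-gpios and
NAME_reset-gpios properties in the binding imply; sketch_dt_gpio_prop()
is a hypothetical helper, not a gpiolib function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the property name a DT lookup would search for, given the
 * con_id passed by the driver (e.g. "bluetooth_shutdown").
 * Returns the length snprintf() reports.
 */
static int sketch_dt_gpio_prop(char *buf, size_t size, const char *con_id)
{
	assert(buf && size > 0);

	if (con_id)
		return snprintf(buf, size, "%s-gpios", con_id);

	/* No con_id: fall back to the plain "gpios" property. */
	return snprintf(buf, size, "gpios");
}
```

With rfkill-name = "bluetooth", the driver's "%s_shutdown" name becomes
the property "bluetooth_shutdown-gpios". The ACPI path would ignore the
name and use only the index.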

I am aware there is a need for similar functionality for SDIO devices,
which the CubieTruck has as well. A mail thread [1] started yesterday
indicated that generic SDIO DT support was the way to go. I don't know
if that could be applied to UART-based devices though.

[1] http://www.spinics.net/lists/arm-kernel/msg301182.html

The series depends on

    [PATCH v3 2/6] net: rfkill: gpio: convert to descriptor-based GPIO interface

which has been applied through the GPIO tree.

The last patch depends on

    ARM: dts: sun7i: add pin muxing options for UART2

which I sent earlier this week.

Comments, please?

Cheers,
ChenYu

Chen-Yu Tsai (6):
  net: rfkill: gpio: fix gpio name buffer size off by 1
  net: rfkill: gpio: use clk_prepare_enable/clk_disable_unprepare
  net: rfkill: gpio: fix reversed clock enable state
  net: rfkill: gpio: add device tree support
  net: rfkill: gpio: add clock-frequency device tree property
  ARM: sun7i: cubietruck: enable bluetooth module

 .../devicetree/bindings/rfkill/rfkill-gpio.txt     | 28 ++++++++++++++++
 arch/arm/boot/dts/sun7i-a20-cubietruck.dts         | 37 ++++++++++++++++++++++
 net/rfkill/rfkill-gpio.c                           | 37 +++++++++++++++++++---
 3 files changed, 97 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt

-- 
1.8.5.2

^ permalink raw reply

* Re: Fwd: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Jason Wang @ 2014-01-17  6:45 UTC (permalink / raw)
  To: Stefan Hajnoczi, Tom Herbert
  Cc: Zhi Yong Wu, Linux Netdev List, Eric Dumazet, David S. Miller,
	Zhi Yong Wu, Michael S. Tsirkin, Rusty Russell
In-Reply-To: <20140117052229.GE16061@stefanha-thinkpad.redhat.com>

On 01/17/2014 01:22 PM, Stefan Hajnoczi wrote:
> On Thu, Jan 16, 2014 at 09:12:29AM -0800, Tom Herbert wrote:
>> On Thu, Jan 16, 2014 at 12:52 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Thu, Jan 16, 2014 at 04:34:10PM +0800, Zhi Yong Wu wrote:
>>>> CC: stefanha, MST, Rusty Russel
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Jason Wang <jasowang@redhat.com>
>>>> Date: Thu, Jan 16, 2014 at 12:23 PM
>>>> Subject: Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
>>>> To: Zhi Yong Wu <zwu.kernel@gmail.com>
>>>> Cc: netdev@vger.kernel.org, therbert@google.com, edumazet@google.com,
>>>> davem@davemloft.net, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>>>
>>>>
>>>> On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
>>>>> From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
>>>>>
>>>>> Hi, folks
>>>>>
>>>>> This patchset tries to integrate aRFS support into virtio_net. In this case,
>>>>> aRFS will be used to select the RX queue. Although it is still an RFC and
>>>>> isn't tested, it is being posted as early as possible to make sure it is
>>>>> heading in the right direction. Any comments are appreciated, thanks.
>>>>>
>>>>> If anyone is interested in playing with it, you can get this patchset from my
>>>>> dev git on github:
>>>>>    git://github.com/wuzhy/kernel.git virtnet_rfs
>>>>>
>>>>> Zhi Yong Wu (3):
>>>>>    virtio_pci: Introduce one new config api vp_get_vq_irq()
>>>>>    virtio_net: Introduce one dummy function virtnet_filter_rfs()
>>>>>    virtio-net: Add accelerated RFS support
>>>>>
>>>>>   drivers/net/virtio_net.c      |   67 ++++++++++++++++++++++++++++++++++++++++-
>>>>>   drivers/virtio/virtio_pci.c   |   11 +++++++
>>>>>   include/linux/virtio_config.h |   12 +++++++
>>>>>   3 files changed, 89 insertions(+), 1 deletions(-)
>>>>>
>>>> Please run get_maintainer.pl before sending the patch. You should
>>>> at least cc the virtio maintainers/list for this.
>>>>
>>>> The core aRFS method is a no-op in this RFC, which leaves little in
>>>> this series to discuss. You should at least describe the big picture
>>>> in the cover letter. I suggest you either post an RFC that runs and
>>>> produces the expected result, or just start a thread for the design
>>>> discussion.
>>>>
>>>> This method has also been discussed before: search for "[net-next
>>>> RFC PATCH 5/5] virtio-net: flow director support" in the netdev
>>>> archive for a very old prototype I implemented. It works, and it
>>>> looks like most of this RFC was already done there.
>>>>
>>>> A basic question is whether we need this at all; not all the mq cards
>>>> use aRFS (see ixgbe ATR). Does it bring extra overhead? For virtio,
>>>> we want to reduce vmexits as much as possible, but this aRFS seems to
>>>> introduce a lot more of them. Making a complex interface just for a
>>>> virtual device may not be good; a simple method may work for most
>>>> cases.
>>>>
>>>> We really should consider offloading this to a real NIC. VMDq and L2
>>>> forwarding offload may help in this case.
>> Adding flow director support would be a good step, Zhi's patches for
>> support in tun have been merged, so support in virtio-net would be a
>> good follow on. But, flow-director does have some limitations and
>> performance issues of it's own (forced pairing between TX and RX
>> queues, lookup on every TX packet). In the case of virtualization,
>> aRFS, RSS, ntuple filtering, LRO, etc. can be implemented as software
>> emulations and so far seems to be wins in most cases. Extending these
>> down into the stack so that they can leverage HW mechanisms is a good
>> goal for best performance. It's probably generally true that most of
>> the offloads commonly available for NICs we'll want in virtualization
>> path. Of course, we need to demonstrate that they provide real
>> performance benefit in this use case.
>>
>> I believe tying in aRFS (or flow director) into a real aRFS is just a
>> matter of programming the RFS table properly. This is not the complex
>> side of the interface, I believe this already works with the tun
>> patches.
>>
>>> Zhi Yong and I had an IRC chat.  I wanted to post my questions on the
>>> list - it's still the same concern I had in the old email thread that
>>> Jason mentioned.
>>>
>>> In order for virtio-net aRFS to make sense there needs to be an overall
>>> plan for pushing flow mapping information down to the physical NIC.
>>> That's the only way to actually achieve the benefit of steering:
>>> processing the packet on the CPU where the application is running.
>>>
>> I don't think this is necessarily true. Per flow steering amongst
>> virtual queues should be beneficial in itself. virtio-net can leverage
>> RFS or aRFS where it's available.
> I guess we need to see benchmark results :)
>
>>> If it's not possible or too hard to implement aRFS down the entire
>>> stack, we won't be able to process the packet on the right CPU.
>>> Then we might as well not bother with aRFS and just distribute uniformly
>>> across the rx virtqueues.
>>>
>>> Please post an outline of how rx packets will be steered up the stack so
>>> we can discuss whether aRFS can bring any benefit.
>>>
>> 1. The aRFS interface for the guest to specify which virtual queue to
>> receive a packet on is fairly straightforward.
>> 2. To hook into RFS, we need to match the virtual queue to the real
>> CPU it will be processed on, and then program the RFS table for that flow
>> and CPU.
>> 3. NIC aRFS keys off the RFS tables so it can program the HW with the
>> correct queue for the CPU.
> There are a lot of details that are not yet worked out:
>
> If you want to implement aRFS down the vhost_net + macvtap path
> (probably easiest?) how will Step 2 work?  Do the necessary kernel
> interfaces exist to take the flow information in vhost_net, give them to
> macvtap, and finally push them down to the physical NIC?
>
> Not sure if aRFS will work down the full stack with vhost_net + tap +
> bridge.  Any ideas?

It actually works: tun records the flow-to-CPU mapping through RFS
when it receives a packet from vhost_net, and the host NIC driver uses
this information to program the hardware flow director so that the
interrupt goes directly to the CPU where vhost_net is running.
>
> At the QEMU level it is currently pointless to implement virtio-net aRFS
> emulation since the QEMU global mutex is taken and virtio-net emulation
> is not multi-threaded.

I don't think QEMU is the proper layer for this. It should be done in
tun/macvtap, which lets us cooperate more closely with the kernel.
>
> I think aRFS is a good thing, we just need to see performance results
> and know that this won't be a dead end after merging changes to
> virtio-net and the virtio specification.
>
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Fwd: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Jason Wang @ 2014-01-17  6:36 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Stefan Hajnoczi, Zhi Yong Wu, Linux Netdev List, Eric Dumazet,
	David S. Miller, Zhi Yong Wu, Michael S. Tsirkin, Rusty Russell
In-Reply-To: <CA+mtBx8U1sYzH--QDnLnmSLpisn0DZPSdewUPCEhQkjTMmvb6w@mail.gmail.com>

On 01/17/2014 01:08 PM, Tom Herbert wrote:
> On Thu, Jan 16, 2014 at 7:26 PM, Jason Wang <jasowang@redhat.com> wrote:
>> On 01/17/2014 01:12 AM, Tom Herbert wrote:
>>> On Thu, Jan 16, 2014 at 12:52 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>> On Thu, Jan 16, 2014 at 04:34:10PM +0800, Zhi Yong Wu wrote:
>>>>> CC: stefanha, MST, Rusty Russel
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Jason Wang <jasowang@redhat.com>
>>>>> Date: Thu, Jan 16, 2014 at 12:23 PM
>>>>> Subject: Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
>>>>> To: Zhi Yong Wu <zwu.kernel@gmail.com>
>>>>> Cc: netdev@vger.kernel.org, therbert@google.com, edumazet@google.com,
>>>>> davem@davemloft.net, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>>>>
>>>>>
>>>>> On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
>>>>>> From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
>>>>>>
>>>>>> Hi, folks
>>>>>>
>>>>>> This patchset tries to integrate aRFS support into virtio_net. In this case,
>>>>>> aRFS will be used to select the RX queue. Although it is still an RFC and
>>>>>> isn't tested, it is being posted as early as possible to make sure it is
>>>>>> heading in the right direction. Any comments are appreciated, thanks.
>>>>>>
>>>>>> If anyone is interested in playing with it, you can get this patchset from my
>>>>>> dev git on github:
>>>>>>    git://github.com/wuzhy/kernel.git virtnet_rfs
>>>>>>
>>>>>> Zhi Yong Wu (3):
>>>>>>    virtio_pci: Introduce one new config api vp_get_vq_irq()
>>>>>>    virtio_net: Introduce one dummy function virtnet_filter_rfs()
>>>>>>    virtio-net: Add accelerated RFS support
>>>>>>
>>>>>>   drivers/net/virtio_net.c      |   67 ++++++++++++++++++++++++++++++++++++++++-
>>>>>>   drivers/virtio/virtio_pci.c   |   11 +++++++
>>>>>>   include/linux/virtio_config.h |   12 +++++++
>>>>>>   3 files changed, 89 insertions(+), 1 deletions(-)
>>>>>>
>>>>> Please run get_maintainer.pl before sending the patch. You should
>>>>> at least cc the virtio maintainers/list for this.
>>>>>
>>>>> The core aRFS method is a no-op in this RFC, which leaves little in
>>>>> this series to discuss. You should at least describe the big picture
>>>>> in the cover letter. I suggest you either post an RFC that runs and
>>>>> produces the expected result, or just start a thread for the design
>>>>> discussion.
>>>>>
>>>>> This method has also been discussed before: search for "[net-next
>>>>> RFC PATCH 5/5] virtio-net: flow director support" in the netdev
>>>>> archive for a very old prototype I implemented. It works, and it
>>>>> looks like most of this RFC was already done there.
>>>>>
>>>>> A basic question is whether we need this at all; not all the mq cards
>>>>> use aRFS (see ixgbe ATR). Does it bring extra overhead? For virtio,
>>>>> we want to reduce vmexits as much as possible, but this aRFS seems to
>>>>> introduce a lot more of them. Making a complex interface just for a
>>>>> virtual device may not be good; a simple method may work for most
>>>>> cases.
>>>>>
>>>>> We really should consider offloading this to a real NIC. VMDq and L2
>>>>> forwarding offload may help in this case.
>>> Adding flow director support would be a good step, Zhi's patches for
>>> support in tun have been merged, so support in virtio-net would be a
>>> good follow on. But, flow-director does have some limitations and
>>> performance issues of its own (forced pairing between TX and RX
>>> queues, lookup on every TX packet).
>> True. But the pairing was designed to work without involving the guest,
>> since we really want to reduce vmexits from the guest. And the lookup on
>> every TX packet could be relaxed to every N packets. But I agree that
>> exposing the API to the guest may bring a lot of flexibility.
>>> In the case of virtualization,
>>> aRFS, RSS, ntuple filtering, LRO, etc. can be implemented as software
>>> emulations and so far seems to be wins in most cases. Extending these
>>> down into the stack so that they can leverage HW mechanisms is a good
>>> goal for best performance. It's probably generally true that most of
>>> the offloads commonly available for NICs we'll want in virtualization
>>> path. Of course, we need to demonstrate that they provide real
>>> performance benefit in this use case.
>> Yes, we need a prototype to see how much it can help.
>>> I believe tying in aRFS (or flow director) into a real aRFS is just a
>>> matter of programming the RFS table properly. This is not the complex
>>> side of the interface, I believe this already works with the tun
>>> patches.
>> Right, what we may need is:
>>
>> - new tun ioctls that let qemu add or remove a flow
>> - a new virtqueue command that lets the guest driver add or remove a flow
>> (btw, the current control virtqueue is really slow; we may need to improve it)
>> - an agreement between host and guest to use the same hash method, or just
>> computing a software hash in the host and passing it to the guest (which
>> needs an extra API)
> The model to get RX hash from a device is well known, the guest can
> use that to reflect information about a flow back to the host, and for
> performance we might piggyback RX queue selection on the TX
> descriptors of a flow. Probably some limitations with real HW, but I
> assume would have less issues in SW.

It may work, but it may require extending the current virtio-net TX
descriptor or an extra API such as the vnet header.
>
> IMO, if we have a flow state on the host we should *never* need to
> perform any hash computation on TX (a host is not a switch :-) ), we
> may want to have some mirrored flow state in the kernel for these
> flows which are indexed by the hash provided in TX.

The problem is that the host may have several different types of cards,
so it is not guaranteed that they provide the same rxhash.
>
>> - change guest driver to use aRFS
>>
>> Some of the above has been implemented in my old RFC.
> Looks pretty similar to Zhi's tun work. Are you planning to refresh
> those patches?

I do plan to. But there's another concern:

During my testing (and in tests by some IBM engineers in the past), we
found it is better for a single vhost thread to handle both rx and tx
for a single flow. Using two different vhost threads to handle a flow
may hurt performance in most cases. That's why we currently enforce the
pairing of rx and tx in tun. But it looks like aRFS can't guarantee
this. If we want to enforce this pairing through XPS/irq affinity,
there's no need for aRFS.
>
>>>> Zhi Yong and I had an IRC chat.  I wanted to post my questions on the
>>>> list - it's still the same concern I had in the old email thread that
>>>> Jason mentioned.
>>>>
>>>> In order for virtio-net aRFS to make sense there needs to be an overall
>>>> plan for pushing flow mapping information down to the physical NIC.
>>>> That's the only way to actually achieve the benefit of steering:
>>>> processing the packet on the CPU where the application is running.
>>>>
>>> I don't think this is necessarily true. Per flow steering amongst
>>> virtual queues should be beneficial in itself. virtio-net can leverage
>>> RFS or aRFS where it's available.
>>>
>>>> If it's not possible or too hard to implement aRFS down the entire
>>>> stack, we won't be able to process the packet on the right CPU.
>>>> Then we might as well not bother with aRFS and just distribute uniformly
>>>> across the rx virtqueues.
>>>>
>>>> Please post an outline of how rx packets will be steered up the stack so
>>>> we can discuss whether aRFS can bring any benefit.
>>>>
>>> 1. The aRFS interface for the guest to specify which virtual queue to
>>> receive a packet on is fairly straightforward.
>>> 2. To hook into RFS, we need to match the virtual queue to the real
>>> CPU it will be processed on, and then program the RFS table for that flow
>>> and CPU.
>>> 3. NIC aRFS keys off the RFS tables so it can program the HW with the
>>> correct queue for the CPU.
>>>
>>>> Stefan

^ permalink raw reply

* Re: [Xen-devel] [PATCH net-next v2] xen-netfront: clean up code in xennet_release_rx_bufs
From: annie li @ 2014-01-17  6:25 UTC (permalink / raw)
  To: David Vrabel
  Cc: wei.liu2, ian.campbell, netdev, xen-devel, andrew.bennieston,
	davem
In-Reply-To: <52D7BE19.2010009@citrix.com>

On 2014/1/16 19:10, David Vrabel wrote:
> On 15/01/14 23:57, Annie Li wrote:
>> This patch implements two things:
>>
>> * release grant reference and skb for rx path, this fixes a resource leak.
>> * clean up grant transfer code kept from old netfront(2.6.18) which grants
>> pages for access/map and transfer. But grant transfer is deprecated in current
>> netfront, so remove corresponding release code for transfer.
>>
>> gnttab_end_foreign_access_ref may fail when the grant entry is currently used
>> for reading or writing. But this patch does not cover this and improvement for
>> this failure may be implemented in a separate patch.
> I don't think replacing a resource leak with a security bug is a good idea.
>
> If you would prefer not to fix the gnttab_end_foreign_access() call, I
> think you can fix this in netfront by taking a reference to the page
> before calling gnttab_end_foreign_access().  This will ensure the page
> isn't freed until the subsequent kfree_skb(), or the gref is released by
> the foreign domain (whichever is later).

Taking a reference to the page before calling
gnttab_end_foreign_access() delays the free until kfree_skb().
Simply adding put_page() before kfree_skb() is no different from using
gnttab_end_foreign_access_ref(): the pages will be freed by
kfree_skb(), and gnttab_handle_deferred() will then hit a problem when
it frees pages that have already been freed.

So put_page() is required in gnttab_end_foreign_access(); this ensures
the page is freed by either kfree_skb() or gnttab_handle_deferred(). This
involves changes in blkfront/pcifront/tpmfront (just like your patch),
and it ensures the page is released when the grant reference is ended.

Another solution I am considering is to call gnttab_end_foreign_access()
with the page parameter set to NULL; gnttab_end_foreign_access() then
only ends the grant reference, and the page is released by kfree_skb().

Thanks
Annie

^ permalink raw reply

* [PATCH net-next v6 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size
From: Michael Dalton @ 2014-01-17  6:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eric Dumazet, Rusty Russell, Michael S. Tsirkin,
	Jason Wang, Ben Hutchings, virtualization, Michael Dalton
In-Reply-To: <1389939810-14998-1-git-send-email-mwdalton@google.com>

Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.
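
For readers who want to see what value the attribute reports, here is a
userspace sketch of the computation in get_mergeable_buf_len(). All four
constants are assumptions for illustration (4 KiB pages, a 12-byte
mergeable header, a 1518-byte GOOD_PACKET_LEN and a 256-byte
MERGEABLE_BUFFER_ALIGN); the real values come from the kernel headers
and the architecture. sketch_mergeable_buf_len() is a made-up name.

```c
#include <assert.h>

/* Assumed values, for illustration only. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_HDR_LEN		12u	/* sizeof(struct virtio_net_hdr_mrg_rxbuf) */
#define SKETCH_GOOD_PACKET_LEN	1518u	/* Ethernet header + VLAN tag + 1500 */
#define SKETCH_BUFFER_ALIGN	256u	/* must be a power of two */

static unsigned int sketch_mergeable_buf_len(unsigned int avg_pkt_len)
{
	const unsigned int lo = SKETCH_GOOD_PACKET_LEN;
	const unsigned int hi = SKETCH_PAGE_SIZE - SKETCH_HDR_LEN;
	unsigned int len;

	assert(lo <= hi);

	/* clamp_t(): keep the average packet length within [lo, hi]. */
	len = avg_pkt_len < lo ? lo : (avg_pkt_len > hi ? hi : avg_pkt_len);
	len += SKETCH_HDR_LEN;

	/* ALIGN(): round up to the next multiple of the alignment. */
	return (len + SKETCH_BUFFER_ALIGN - 1) & ~(SKETCH_BUFFER_ALIGN - 1);
}
```

With these assumed values, an average of 0 yields 1536, an average of
2000 yields 2048, and anything above 4084 yields 4096.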

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v3->v4: Remove seqcount due to EWMA changes in patch 5.
        Add missing Suggested-By. 

 drivers/net/virtio_net.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dacd43b..d75f8ed 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -600,18 +600,25 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
 	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
 	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
 	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
 
@@ -1584,6 +1591,33 @@ err:
 	return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+		struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct ewma *avg;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	avg = &vi->rq[queue_index].mrg_avg_pkt_len;
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+	__ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+	NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+	.attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
@@ -1698,6 +1732,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (err)
 		goto free_stats;
 
+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v6 5/6] lib: Ensure EWMA does not store wrong intermediate values
From: Michael Dalton @ 2014-01-17  6:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eric Dumazet, Rusty Russell, Michael S. Tsirkin,
	Jason Wang, Ben Hutchings, virtualization, Michael Dalton
In-Reply-To: <1389939810-14998-1-git-send-email-mwdalton@google.com>

To ensure ewma_read() without a lock returns a valid but possibly
out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
intermediate wrong values from being written to avg->internal.
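
The update can be tried out in userspace with the sketch below. It
assumes that weight and factor are stored as log2 values and that
ewma_read() returns internal >> factor; the struct and function names
carry a _sketch suffix because they are stand-ins, and the volatile
casts only approximate what ACCESS_ONCE does in the kernel.

```c
#include <assert.h>

/* Userspace stand-in for struct ewma; factor and weight are log2 values. */
struct ewma_sketch {
	unsigned long internal;
	unsigned long factor;
	unsigned long weight;
};

static struct ewma_sketch *ewma_sketch_add(struct ewma_sketch *avg,
					   unsigned long val)
{
	unsigned long internal;

	assert(avg);

	/* Read the shared value once, into a local snapshot. */
	internal = *(volatile unsigned long *)&avg->internal;

	/* Compute from the snapshot and publish the result with one store. */
	*(volatile unsigned long *)&avg->internal = internal ?
		(((internal << avg->weight) - internal) +
			(val << avg->factor)) >> avg->weight :
		(val << avg->factor);
	return avg;
}

/* Lockless reader: sees either the old or the new average. */
static unsigned long ewma_sketch_read(const struct ewma_sketch *avg)
{
	return *(const volatile unsigned long *)&avg->internal >> avg->factor;
}
```

With factor = 10 and weight = 3 (that is, 1024 and 8), adding 1000 to an
empty average reads back 1000, and then adding 2000 reads back 1125,
which is 7/8 of the old average plus 1/8 of the new sample.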

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 lib/average.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/average.c b/lib/average.c
index 99a67e6..114d1be 100644
--- a/lib/average.c
+++ b/lib/average.c
@@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
  */
 struct ewma *ewma_add(struct ewma *avg, unsigned long val)
 {
-	avg->internal = avg->internal  ?
-		(((avg->internal << avg->weight) - avg->internal) +
+	unsigned long internal = ACCESS_ONCE(avg->internal);
+
+	ACCESS_ONCE(avg->internal) = internal ?
+		(((internal << avg->weight) - internal) +
 			(val << avg->factor)) >> avg->weight :
 		(val << avg->factor);
 	return avg;
-- 
1.8.5.2
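
The pattern this patch applies can be illustrated with a minimal userspace
sketch: read the shared value once into a local, compute on the local, and
publish the result with a single store, so a lockless reader sees either the
old or the new average and never a half-computed one. The names sketch_ewma,
sketch_ewma_add and sketch_ewma_read are hypothetical, the volatile casts stand
in for ACCESS_ONCE, and the shifts are passed directly as log2 values, which is
an assumption of this sketch rather than the kernel interface.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical userspace model of the average; not the kernel structure. */
struct sketch_ewma {
	unsigned long internal;      /* scaled running average */
	unsigned int factor_shift;   /* log2 of the scaling factor */
	unsigned int weight_shift;   /* log2 of the weight */
};

static void sketch_ewma_init(struct sketch_ewma *avg,
			     unsigned int factor_shift,
			     unsigned int weight_shift)
{
	/* Shifts must stay below the width of unsigned long. */
	assert(factor_shift < sizeof(unsigned long) * CHAR_BIT);
	assert(weight_shift < sizeof(unsigned long) * CHAR_BIT);
	avg->internal = 0;
	avg->factor_shift = factor_shift;
	avg->weight_shift = weight_shift;
}

static void sketch_ewma_add(struct sketch_ewma *avg, unsigned long val)
{
	/* Single volatile load: snapshot the shared value exactly once. */
	unsigned long internal = *(volatile unsigned long *)&avg->internal;
	unsigned long next;

	if (internal)
		next = (((internal << avg->weight_shift) - internal) +
			(val << avg->factor_shift)) >> avg->weight_shift;
	else
		next = val << avg->factor_shift;

	/* Single volatile store: only the final result is ever written. */
	*(volatile unsigned long *)&avg->internal = next;
}

static unsigned long sketch_ewma_read(const struct sketch_ewma *avg)
{
	return *(const volatile unsigned long *)&avg->internal >>
		avg->factor_shift;
}
```

With a weight of 64 (weight_shift 6), one large sample moves the average by
roughly 1/64 of the difference, which is why a short burst of large packets
barely changes the result.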


* [PATCH net-next v6 4/6] net-sysfs: add support for device-specific rx queue sysfs attributes
From: Michael Dalton @ 2014-01-17  6:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eric Dumazet, Rusty Russell, Michael S. Tsirkin,
	Jason Wang, Ben Hutchings, virtualization, Michael Dalton
In-Reply-To: <1389939810-14998-1-git-send-email-mwdalton@google.com>

Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v4->v5: Handle sysfs_create_group failure. Call sysfs_remove_group when
        removing a RX queue kobj if a device-specific group exists.
v3->v4: Simplify by removing loop in get_netdev_rx_queue_index.

 include/linux/netdevice.h | 35 +++++++++++++++++++++++++++++----
 net/core/dev.c            | 12 ++++++------
 net/core/net-sysfs.c      | 50 +++++++++++++++++++++++++++--------------------
 3 files changed, 66 insertions(+), 31 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d7668b88..e985231 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu *rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 			 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */
 
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
 	struct rps_map __rcu		*rps_map;
 	struct rps_dev_flow_table __rcu	*rps_flow_table;
+#endif
 	struct kobject			kobj;
 	struct net_device		*dev;
 } ____cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
 
 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
 						   unicast) */
 
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	struct netdev_rx_queue	*_rx;
 
 	/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
 	const struct attribute_group *sysfs_groups[4];
+	/* space for optional per-rx queue attributes */
+	const struct attribute_group *sysfs_rx_queue_group;
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
@@ -2375,7 +2390,7 @@ static inline bool netif_is_multiqueue(const struct net_device *dev)
 
 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2394,7 +2409,7 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 					   from_dev->real_num_tx_queues);
 	if (err)
 		return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	return netif_set_real_num_rx_queues(to_dev,
 					    from_dev->real_num_rx_queues);
 #else
@@ -2402,6 +2417,18 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #endif
 }
 
+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+		struct netdev_rx_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int index = queue - dev->_rx;
+
+	BUG_ON(index >= dev->num_rx_queues);
+	return index;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
 int netif_get_num_default_rss_queues(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index f87bedd..288df62 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2083,7 +2083,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  *	netif_set_real_num_rx_queues - set actual number of RX queues used
  *	@dev: Network device
@@ -5764,7 +5764,7 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
 	unsigned int i, count = dev->num_rx_queues;
@@ -6309,7 +6309,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	if (rxqs < 1) {
 		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
 		return NULL;
@@ -6365,7 +6365,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (netif_alloc_netdev_queues(dev))
 		goto free_all;
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	dev->num_rx_queues = rxqs;
 	dev->real_num_rx_queues = rxqs;
 	if (netif_alloc_rx_queues(dev))
@@ -6385,7 +6385,7 @@ free_all:
 free_pcpu:
 	free_percpu(dev->pcpu_refcnt);
 	netif_free_tx_queues(dev);
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	kfree(dev->_rx);
 #endif
 
@@ -6410,7 +6410,7 @@ void free_netdev(struct net_device *dev)
 	release_net(dev_net(dev));
 
 	netif_free_tx_queues(dev);
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	kfree(dev->_rx);
 #endif
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 49843bf..7eeadee 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -498,17 +498,7 @@ static struct attribute_group wireless_group = {
 #define net_class_groups	NULL
 #endif /* CONFIG_SYSFS */
 
-#ifdef CONFIG_RPS
-/*
- * RX queue sysfs structures and functions.
- */
-struct rx_queue_attribute {
-	struct attribute attr;
-	ssize_t (*show)(struct netdev_rx_queue *queue,
-	    struct rx_queue_attribute *attr, char *buf);
-	ssize_t (*store)(struct netdev_rx_queue *queue,
-	    struct rx_queue_attribute *attr, const char *buf, size_t len);
-};
+#ifdef CONFIG_SYSFS
 #define to_rx_queue_attr(_attr) container_of(_attr,		\
     struct rx_queue_attribute, attr)
 
@@ -543,6 +533,7 @@ static const struct sysfs_ops rx_queue_sysfs_ops = {
 	.store = rx_queue_attr_store,
 };
 
+#ifdef CONFIG_RPS
 static ssize_t show_rps_map(struct netdev_rx_queue *queue,
 			    struct rx_queue_attribute *attribute, char *buf)
 {
@@ -718,16 +709,20 @@ static struct rx_queue_attribute rps_cpus_attribute =
 static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
 	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
 	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
+#endif /* CONFIG_RPS */
 
 static struct attribute *rx_queue_default_attrs[] = {
+#ifdef CONFIG_RPS
 	&rps_cpus_attribute.attr,
 	&rps_dev_flow_table_cnt_attribute.attr,
+#endif
 	NULL
 };
 
 static void rx_queue_release(struct kobject *kobj)
 {
 	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+#ifdef CONFIG_RPS
 	struct rps_map *map;
 	struct rps_dev_flow_table *flow_table;
 
@@ -743,6 +738,7 @@ static void rx_queue_release(struct kobject *kobj)
 		RCU_INIT_POINTER(queue->rps_flow_table, NULL);
 		call_rcu(&flow_table->rcu, rps_dev_flow_table_release);
 	}
+#endif
 
 	memset(kobj, 0, sizeof(*kobj));
 	dev_put(queue->dev);
@@ -763,25 +759,36 @@ static int rx_queue_add_kobject(struct net_device *net, int index)
 	kobj->kset = net->queues_kset;
 	error = kobject_init_and_add(kobj, &rx_queue_ktype, NULL,
 	    "rx-%u", index);
-	if (error) {
-		kobject_put(kobj);
-		return error;
+	if (error)
+		goto exit;
+
+	if (net->sysfs_rx_queue_group) {
+		error = sysfs_create_group(kobj, net->sysfs_rx_queue_group);
+		if (error)
+			goto exit;
 	}
 
 	kobject_uevent(kobj, KOBJ_ADD);
 	dev_hold(queue->dev);
 
 	return error;
+exit:
+	kobject_put(kobj);
+	return error;
 }
-#endif /* CONFIG_RPS */
+#endif /* CONFIG_SYSFS */
 
 int
 net_rx_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
 {
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	int i;
 	int error = 0;
 
+#ifndef CONFIG_RPS
+	if (!net->sysfs_rx_queue_group)
+		return 0;
+#endif
 	for (i = old_num; i < new_num; i++) {
 		error = rx_queue_add_kobject(net, i);
 		if (error) {
@@ -790,8 +797,12 @@ net_rx_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
 		}
 	}
 
-	while (--i >= new_num)
+	while (--i >= new_num) {
+		if (net->sysfs_rx_queue_group)
+			sysfs_remove_group(&net->_rx[i].kobj,
+					   net->sysfs_rx_queue_group);
 		kobject_put(&net->_rx[i].kobj);
+	}
 
 	return error;
 #else
@@ -1155,9 +1166,6 @@ static int register_queue_kobjects(struct net_device *net)
 	    NULL, &net->dev.kobj);
 	if (!net->queues_kset)
 		return -ENOMEM;
-#endif
-
-#ifdef CONFIG_RPS
 	real_rx = net->real_num_rx_queues;
 #endif
 	real_tx = net->real_num_tx_queues;
@@ -1184,7 +1192,7 @@ static void remove_queue_kobjects(struct net_device *net)
 {
 	int real_rx = 0, real_tx = 0;
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	real_rx = net->real_num_rx_queues;
 #endif
 	real_tx = net->real_num_tx_queues;
-- 
1.8.5.2
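
The loop-free get_netdev_rx_queue_index() in this patch relies on the RX queues
living in one contiguous array, so the index is just a pointer difference. A
minimal userspace sketch of that idea follows; the sketch_dev and
sketch_rx_queue types are hypothetical stand-ins, and assert() takes the place
of BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

struct sketch_dev;

/* Hypothetical stand-in for an RX queue: it only points back at its device. */
struct sketch_rx_queue {
	struct sketch_dev *dev;
};

/* Hypothetical stand-in for a device owning a contiguous queue array. */
struct sketch_dev {
	struct sketch_rx_queue *rx;
	unsigned int num_rx_queues;
};

static unsigned int sketch_rx_queue_index(const struct sketch_rx_queue *queue)
{
	const struct sketch_dev *dev = queue->dev;
	/* Element-wise distance from the start of the array. */
	ptrdiff_t index = queue - dev->rx;

	assert(index >= 0 && (size_t)index < dev->num_rx_queues);
	return (unsigned int)index;
}
```

The lookup is constant time and needs no extra per-queue index field, at the
cost of assuming the queue pointer really belongs to the device's array.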


* [PATCH net-next v6 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance
From: Michael Dalton @ 2014-01-17  6:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eric Dumazet, Rusty Russell, Michael S. Tsirkin,
	Jason Wang, Ben Hutchings, virtualization, Michael Dalton
In-Reply-To: <1389939810-14998-1-git-send-email-mwdalton@google.com>

Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receive buffer size by choosing
the packet buffer size based on an EWMA of the recent packet sizes
for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.
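
As a rough illustration of the sizing rule, the following userspace sketch
clamps the average packet length to the range described above, adds the header,
and rounds up to the buffer alignment. All constants are assumptions of the
sketch: a 4096-byte page, a 12-byte mergeable rx header, a 1518-byte
GOOD_PACKET_LEN and a 256-byte alignment. The sketch_ names are hypothetical.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SIZE       4096u
#define SKETCH_HDR_LEN         12u    /* mergeable rx buffer header */
#define SKETCH_GOOD_PACKET_LEN 1518u  /* 14 + 4 + 1500 */
#define SKETCH_BUFFER_ALIGN    256u   /* must be a power of two */

/* Round len up to the next multiple of SKETCH_BUFFER_ALIGN. */
static unsigned int sketch_align(unsigned int len)
{
	return (len + SKETCH_BUFFER_ALIGN - 1) & ~(SKETCH_BUFFER_ALIGN - 1);
}

/* Pick a buffer length from the average packet length seen on the queue. */
static unsigned int sketch_mergeable_buf_len(unsigned long avg_pkt_len)
{
	const unsigned int lo = SKETCH_GOOD_PACKET_LEN;
	const unsigned int hi = SKETCH_PAGE_SIZE - SKETCH_HDR_LEN;
	unsigned int payload;

	if (avg_pkt_len < lo)
		payload = lo;
	else if (avg_pkt_len > hi)
		payload = hi;
	else
		payload = (unsigned int)avg_pkt_len;

	return sketch_align(SKETCH_HDR_LEN + payload);
}
```

Under these assumptions a queue that mostly sees MTU-sized packets gets
1536-byte buffers, while a queue averaging a page or more per packet gets
4096-byte buffers.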

These optimizations interact positively with recent commit
ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks are the average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v5->v6: Fix merge conflict. Subtract 1 before encoding the scaled truesize
        for a mergeable buffer ctx to support 64KB PAGE_SIZE.
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
        base address and truesize into an unsigned long by requiring a
        minimum packet size alignment of 256. Permit attempts to fill
        an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
        mergeable receive buffers. Remove all truesize approximation. Never
        try to fill a full RX ring (required for metadata ring in v2).
 drivers/net/virtio_net.c | 100 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 75 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5ee71dc..dacd43b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-                                L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -75,6 +83,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -216,6 +227,24 @@ static void skb_xmit_done(struct virtqueue *vq)
 	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+	return (truesize + 1) * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+	unsigned int size = truesize / MERGEABLE_BUFFER_ALIGN;
+	return (unsigned long)buf | (size - 1);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int offset,
@@ -324,31 +353,33 @@ err:
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
 					 struct receive_queue *rq,
-					 void *buf,
+					 unsigned long ctx,
 					 unsigned int len)
 {
+	void *buf = mergeable_ctx_to_buf_address(ctx);
 	struct skb_vnet_hdr *hdr = buf;
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
+
 	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
 		goto err_skb;
-
 	while (--num_buf) {
 		int num_skb_frags;
 
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers out of %d missing\n",
 				 dev->name, num_buf, hdr->mhdr.num_buffers);
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
 
+		buf = mergeable_ctx_to_buf_address(ctx);
 		page = virt_to_head_page(buf);
 
 		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
@@ -365,7 +396,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
-		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+		truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
@@ -382,19 +413,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 	}
 
+	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
 	return head_skb;
 
 err_skb:
 	put_page(page);
 	while (--num_buf) {
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 dev->name, num_buf);
 			dev->stats.rx_length_errors++;
 			break;
 		}
-		page = virt_to_head_page(buf);
+		page = virt_to_head_page(mergeable_ctx_to_buf_address(ctx));
 		put_page(page);
 	}
 err_buf:
@@ -414,17 +446,20 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
-		if (vi->mergeable_rx_bufs)
-			put_page(virt_to_head_page(buf));
-		else if (vi->big_packets)
+		if (vi->mergeable_rx_bufs) {
+			unsigned long ctx = (unsigned long)buf;
+			void *base = mergeable_ctx_to_buf_address(ctx);
+			put_page(virt_to_head_page(base));
+		} else if (vi->big_packets) {
 			give_pages(rq, buf);
-		else
+		} else {
 			dev_kfree_skb(buf);
+		}
 		return;
 	}
 
 	if (vi->mergeable_rx_bufs)
-		skb = receive_mergeable(dev, rq, buf, len);
+		skb = receive_mergeable(dev, rq, (unsigned long)buf, len);
 	else if (vi->big_packets)
 		skb = receive_big(dev, rq, buf, len);
 	else
@@ -567,25 +602,36 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
+	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
+	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
+	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
+				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
+
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	ctx = mergeable_buf_to_ctx(buf, len);
 	get_page(alloc_frag->page);
-	len = MERGE_BUFFER_LEN;
 	alloc_frag->offset += len;
 	hole = alloc_frag->size - alloc_frag->offset;
-	if (hole < MERGE_BUFFER_LEN) {
+	if (hole < len) {
+		/* To avoid internal fragmentation, if there is very likely not
+		 * enough space for another buffer, add the remaining space to
+		 * the current buffer. This extra space is not included in
+		 * the truesize stored in ctx.
+		 */
 		len += hole;
 		alloc_frag->offset += hole;
 	}
 
 	sg_init_one(rq->sg, buf, len);
-	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
+	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, (void *)ctx, gfp);
 	if (err < 0)
 		put_page(virt_to_head_page(buf));
 
@@ -1385,12 +1431,15 @@ static void free_unused_bufs(struct virtnet_info *vi)
 		struct virtqueue *vq = vi->rq[i].vq;
 
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (vi->mergeable_rx_bufs)
-				put_page(virt_to_head_page(buf));
-			else if (vi->big_packets)
+			if (vi->mergeable_rx_bufs) {
+				unsigned long ctx = (unsigned long)buf;
+				void *base = mergeable_ctx_to_buf_address(ctx);
+				put_page(virt_to_head_page(base));
+			} else if (vi->big_packets) {
 				give_pages(&vi->rq[i], buf);
-			else
+			} else {
 				dev_kfree_skb(buf);
+			}
 		}
 	}
 }
@@ -1498,6 +1547,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
+		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
 	}
 
-- 
1.8.5.2
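
The context encoding used by this patch can be sketched in userspace as
follows. Because every buffer is aligned to 256 bytes, the low 8 bits of its
address are zero and can carry the truesize in units of 256 bytes, stored minus
one so that a 64KB truesize still fits. The sketch_ names and the fixed
256-byte alignment are assumptions of this sketch; uintptr_t is used in place
of unsigned long for portability.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed alignment; must be a power of two. */
#define SKETCH_BUFFER_ALIGN 256u

/* Pack an aligned buffer address and its truesize into one word. */
static uintptr_t sketch_buf_to_ctx(void *buf, unsigned int truesize)
{
	unsigned int size = truesize / SKETCH_BUFFER_ALIGN;

	/* The low bits of the address must be free for the size field. */
	assert(((uintptr_t)buf & (SKETCH_BUFFER_ALIGN - 1)) == 0);
	/* The truesize must be a nonzero multiple of the alignment that fits. */
	assert(truesize % SKETCH_BUFFER_ALIGN == 0);
	assert(size >= 1 && size <= SKETCH_BUFFER_ALIGN);

	return (uintptr_t)buf | (size - 1);
}

/* Recover the buffer address by clearing the size bits. */
static void *sketch_ctx_to_buf_address(uintptr_t ctx)
{
	return (void *)(ctx & ~(uintptr_t)(SKETCH_BUFFER_ALIGN - 1));
}

/* Recover the truesize from the low bits (stored as size - 1). */
static unsigned int sketch_ctx_to_buf_truesize(uintptr_t ctx)
{
	unsigned int size = (unsigned int)(ctx & (SKETCH_BUFFER_ALIGN - 1));

	return (size + 1) * SKETCH_BUFFER_ALIGN;
}
```

Storing size - 1 rather than size is what lets the 8-bit field cover the whole
range from 256 bytes up to 65536 bytes.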
