* Re: [Xen-devel] [PATCH net-next v2] xen-netfront: clean up code in xennet_release_rx_bufs
From: Wei Liu @ 2014-01-17 14:02 UTC (permalink / raw)
To: annie li
Cc: Wei Liu, David Vrabel, ian.campbell, netdev, xen-devel,
andrew.bennieston, davem
In-Reply-To: <52D922DD.2060407@oracle.com>
On Fri, Jan 17, 2014 at 08:32:29PM +0800, annie li wrote:
>
> On 2014-1-17 20:08, Wei Liu wrote:
> >On Fri, Jan 17, 2014 at 02:25:40PM +0800, annie li wrote:
> >>On 2014/1/16 19:10, David Vrabel wrote:
> >>>On 15/01/14 23:57, Annie Li wrote:
> >>>>This patch implements two things:
> >>>>
> >>>>* release grant reference and skb for rx path, this fixes resource leaking.
> >>>>* clean up grant transfer code kept from old netfront(2.6.18) which grants
> >>>>pages for access/map and transfer. But grant transfer is deprecated in current
> >>>>netfront, so remove corresponding release code for transfer.
> >>>>
> >>>>gnttab_end_foreign_access_ref may fail when the grant entry is currently used
> >>>>for reading or writing. But this patch does not cover this and improvement for
> >>>>this failure may be implemented in a separate patch.
> >>>I don't think replacing a resource leak with a security bug is a good idea.
> >>>
> >>>If you would prefer not to fix the gnttab_end_foreign_access() call, I
> >>>think you can fix this in netfront by taking a reference to the page
> >>>before calling gnttab_end_foreign_access(). This will ensure the page
> >>>isn't freed until the subsequent kfree_skb(), or the gref is released by
> >>>the foreign domain (whichever is later).
> >>Taking a reference to the page before calling
> >>gnttab_end_foreign_access() delays the free work until kfree_skb().
> >>Simply adding put_page before kfree_skb() does not make things
> >>different from gnttab_end_foreign_access_ref(), and the pages will
> >>be freed by kfree_skb(); a problem will then be hit in
> >>gnttab_handle_deferred() when freeing pages which have already been freed.
> >>
> >I think David's idea is:
> >
> > get_page
> > gnttab_end_foreign_access
> > kfree_skb
> >
> >The get_page is to offset put_page in gnttab_end_foreign_access. You
> >don't need to put page before kfree_skb.
>
> Yes, this is what I described below about David's patch.
>
> >>So put_page is required in gnttab_end_foreign_access(), this will
> >>ensure either free is taken by kfree_skb or gnttab_handle_deferred.
> >>This involves changes in blkfront/pcifront/tpmfront(just like your
> >>patch), this way ensures the page is released when the ref is ended.
>
> But this would have some issues in the netfront tx path. Netfront ends all
What issue with tx path? Your patch only touches rx skbs, doesn't it?
> grant reference of one skb first and then release this skb. If the
> gnttab_end_foreign_access_ref fails in gnttab_end_foreign_access(),
> this frag page and corresponding grant reference will be put in
> entry and release work will be done in the timer routine. If some
I understand up to this point.
> frag pages of one skb are freed in this timer routine, then
> dev_kfree_skb_irq will free pages which have been freed.
Why is dev_kfree_skb_irq involved? It is used in tx path not rx path.
Even if we look at dev_kfree_skb_irq, it calls __kfree_skb for dropped
packet eventually, which should do the right thing if we don't mess up
ref counts.
Wei.
> So I prefer following way I mentioned, suggestions?
>
> >>Another solution I am thinking is calling
> >>gnttab_end_foreign_access() with page parameter as NULL, then
> >>gnttab_end_foreign_access will only do ending grant reference work
> >>and releasing page work is done by kfree_skb().
>
> Thanks
> Annie
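
The ordering discussed above (get_page, then gnttab_end_foreign_access, then kfree_skb) can be modelled in user space. This is only a toy sketch: struct toy_page and every toy_* function are hypothetical stand-ins, not kernel APIs, and it assumes, as stated in the thread, that gnttab_end_foreign_access drops one page reference and kfree_skb drops the reference held by the skb.

```c
#include <assert.h>

/* Toy stand-in for struct page: a reference count and a free counter. */
struct toy_page {
    int refcount;
    int times_freed;
};

static void toy_get_page(struct toy_page *p)
{
    p->refcount++;
}

static void toy_put_page(struct toy_page *p)
{
    assert(p->refcount > 0);        /* a put on a freed page is a bug */
    if (--p->refcount == 0)
        p->times_freed++;
}

/* Models gnttab_end_foreign_access(): drops one page reference. */
static void toy_end_foreign_access(struct toy_page *p)
{
    toy_put_page(p);
}

/* Models kfree_skb(): drops the reference held by the skb frag. */
static void toy_kfree_skb(struct toy_page *p)
{
    toy_put_page(p);
}

/* Runs the proposed release sequence; returns how often the page was freed. */
static int toy_release_rx_buf(struct toy_page *p)
{
    toy_get_page(p);                /* offsets the put in end_foreign_access */
    toy_end_foreign_access(p);
    toy_kfree_skb(p);
    return p->times_freed;
}
```

Starting from a page whose only reference is the one held by the skb, the sequence frees the page exactly once; without the extra toy_get_page() the second put would hit an already freed page.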
^ permalink raw reply
* Re: [PATCH v4 net-next 2/4] sh_eth: Add support for r7s72100
From: Sergei Shtylyov @ 2014-01-17 14:05 UTC (permalink / raw)
To: Simon Horman
Cc: David S. Miller, netdev, linux-sh, linux-arm-kernel, Magnus Damm
In-Reply-To: <20140117061301.GD16455@verge.net.au>
Hello.
On 17-01-2014 10:13, Simon Horman wrote:
>>>>>>> This is a fast ethernet controller.
>>>>>>> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
>>>>>> [...]
>>>>>>> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
>>>>>>> index 4b38533..cc6d4af 100644
>>>>>>> --- a/drivers/net/ethernet/renesas/sh_eth.c
>>>>>>> +++ b/drivers/net/ethernet/renesas/sh_eth.c
>>>>>>> @@ -190,6 +190,59 @@ static const u16 sh_eth_offset_fast_rcar[SH_ETH_MAX_REGISTER_OFFSET] = {
>>>> [...]
>>>>>>> @@ -701,6 +762,35 @@ static struct sh_eth_cpu_data r8a7740_data = {
>>>>>>> .shift_rd0 = 1,
>>>>>>> };
>>>>>>>
>>>>>>> +/* R7S72100 */
>>>>>>> +static struct sh_eth_cpu_data r7s72100_data = {
>>>>>>> + .chip_reset = sh_eth_chip_reset,
>>>>>>> + .set_duplex = sh_eth_set_duplex,
>>>>>>> +
>>>>>>> + .register_type = SH_ETH_REG_FAST_RZ,
>>>>>>> +
>>>>>>> + .ecsr_value = ECSR_ICD,
>>>>>>> + .ecsipr_value = ECSIPR_ICDIP,
>>>>>>> + .eesipr_value = 0xff7f009f,
>>>>>>> +
>>>>>>> + .tx_check = EESR_TC1 | EESR_FTC,
>>>>>>> + .eesr_err_check = EESR_TWB1 | EESR_TWB | EESR_TABT | EESR_RABT |
>>>>>>> + EESR_RFE | EESR_RDE | EESR_RFRMER | EESR_TFE |
>>>>>>> + EESR_TDE | EESR_ECI,
>>>>>>> + .fdr_value = 0x0000070f,
>>>>>>> + .rmcr_value = RMCR_RNC,
>>>>>>> +
>>>>>>> + .apr = 1,
>>>>>>> + .mpr = 1,
>>>>>>> + .tpauser = 1,
>>>>>>> + .hw_swap = 1,
>>>>>>> + .rpadir = 1,
>>>>>>> + .rpadir_value = 2 << 16,
>>>>>>> + .no_trimd = 1,
>>>>>>> + .tsu = 1,
>>>>>>> + .shift_rd0 = 1,
>>>>>> Perhaps this field should be renamed to something talking about
>>>>>> check summing support (since bits 0..15 of RD0 contain a frame check
>>>>>> sum for those SoCs). Or maybe it should be just merged with the
>>>>>> 'hw_crc' field...
>>>>> I have no feelings about that one way or another.
>>>> Do you happen to have R8A7740 manual by chance? If so, does it
>>>> talk about RX check summing support and using RD0 for that?
>>> Yes and yes.
>>> I have taken a quick look and the documentation for RX checksumming on the
>>> R8A7740 appears to be very similar if not the same as that of the R7S72100.
>>> In particular both refer to using the bottom 16 bits of RD0 as
>>> containing the packet checksum.
>> OK, now if you had SH7734 manual to completely confirm that check
>> sum is stored in the same place there... most probably it is, of
>> course, and we should merge 'hw_crc' and 'shift_rd0' into a single
>> field.
> Unfortunately I don't have access to that manual.
Anyway, we also need Gen2 manuals accepting the fact that checksumming is
also supported (they also set 'shift_rd0' field) and giving the mapping of CSMR...
WBR, Sergei
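
For illustration, the RD0 layout mentioned above (frame checksum in bits 0..15) could be split as follows. This is a sketch only: struct toy_rd0 and toy_rd0_split() are hypothetical names, not part of sh_eth, and the meaning of the upper 16 bits is not described in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_rd0 {
    uint16_t upper;  /* bits 16..31; contents not described in the thread */
    uint16_t csum;   /* bits 0..15; frame checksum according to the thread */
};

/* Split a raw 32-bit RD0 word into its two 16-bit halves. */
static void toy_rd0_split(uint32_t rd0, struct toy_rd0 *out)
{
    assert(out != NULL);
    out->upper = (uint16_t)(rd0 >> 16);
    out->csum = (uint16_t)(rd0 & 0xffffu);
}
```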
^ permalink raw reply
* Re: [PATCH net] net: core: orphan frags before queuing to slow qdisc
From: Eric Dumazet @ 2014-01-17 14:28 UTC (permalink / raw)
To: Jason Wang; +Cc: davem, netdev, linux-kernel, Michael S. Tsirkin
In-Reply-To: <1389951734-13234-1-git-send-email-jasowang@redhat.com>
On Fri, 2014-01-17 at 17:42 +0800, Jason Wang wrote:
> Many qdiscs can queue a packet for a long time, this will lead to an issue
> with zerocopy skb. It means the frags will not be orphaned in an expected
> short time, this breaks the assumption that virtio-net will transmit the
> packet in time.
>
> So if guest packets are queued through such a qdisc and hit the
> limit of max pending packets for virtio/vhost, all packets that
> go to another destination from the guest will also be blocked.
>
> A case for reproducing the issue:
>
> - Boot two VMs and connect them to the same bridge kvmbr.
> - Setup tbf with a very low rate/burst on eth0 which is a port of kvmbr.
> - Let VM1 send lots of packets through eth0
> - After a while, VM1 is unable to send any packets out since the number of
> pending packets (queued to tbf) exceeds the limit of vhost/virtio
So what's the problem? If the limit is low, you cannot send packets.
Solution : increase the limit, or tell the vm to lower its rate.
Oh wait, are you bitten because you did some prior skb_orphan() to allow
the vm to send unlimited number of skbs ???
>
> Solve this issue by orphaning the frags before queuing it to a slow qdisc (the
> one without TCQ_F_CAN_BYPASS).
Why does orphaning only the frags solve the problem? A skb without zerocopy
frags should also be blocked for a while.
Seriously, lets admit this zero copy stuff is utterly broken.
TCQ_F_CAN_BYPASS is not enough. Some NIC have separate queues with
strict priorities.
It seems to me that you are pushing to use FIFO (the only qdisc setting
TCQ_F_CAN_BYPASS), by adding yet another test in fast path (I do not
know how we can still call it a fast path), while we already have smart
qdisc to avoid the inherent HOL and unfairness problems of FIFO.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> net/core/dev.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ce469e..1209774 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2700,6 +2700,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> contended = qdisc_is_running(q);
> if (unlikely(contended))
> spin_lock(&q->busylock);
> + if (!(q->flags & TCQ_F_CAN_BYPASS) &&
> + unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
> + kfree_skb(skb);
> + rc = NET_XMIT_DROP;
> + goto out;
> + }
Are you aware that copying stuff takes time ?
If yes, why is it done after taking the busylock spinlock ?
>
> spin_lock(root_lock);
> if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> @@ -2739,6 +2745,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> }
> }
> spin_unlock(root_lock);
> +out:
> if (unlikely(contended))
> spin_unlock(&q->busylock);
> return rc;
^ permalink raw reply
* Re: [PATCH v2] sch_htb: let skb->priority refer to non-leaf class
From: Eric Dumazet @ 2014-01-17 14:35 UTC (permalink / raw)
To: Harry Mason; +Cc: Jamal Hadi Salim, linux-netdev
In-Reply-To: <1389953999.4698.18.camel@azathoth.dev.smoothwall.net>
On Fri, 2014-01-17 at 10:19 +0000, Harry Mason wrote:
> If the class in skb->priority is not a leaf, apply filters from the
> selected class, not the qdisc. This lets netfilter or user space
> partially classify the packet.
>
> Signed-off-by: Harry Mason <harry.mason@smoothwall.net>
> ---
>
> On Thu, 2014-01-16 at 08:25 -0800, Eric Dumazet wrote:
> > On Thu, 2014-01-16 at 14:45 +0000, Harry Mason wrote:
> >
> >> + /* Start with inner filter chain if a non-leaf class is selected */
> >> + if (cl)
> >> + tcf = cl->filter_list;
> >> + else
> >> + tcf = q->filter_list;
> >
> > Could this break some existing htb setups ?
>
> I think it is unlikely. Setting skb->priority to a non-leaf class would
> be equivalent to setting it to the base qdisc. In theory an application
> might rely on this if it expects the classes to be dynamic, but adding
> a filter could restore the old behaviour.
>
Problem is : Your patch is one patch among thousands of patches, and
people will install new kernels without knowing this could have an
impact on their setup and might discover the problems too late
(after some failure)
> To me this is intuitively how it should behave, and reproduces what would
> happen if a tc filter instead of netfilter had first assigned the
> non-leaf class.
This is definitely a patch for net-next, not net tree.
>
> > Also we test cl being NULL at line 222, it would be nice to not
> > test it again...
>
> Updated below.
>
> net/sched/sch_htb.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index 717b210..8073d92 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -219,11 +219,15 @@ static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch,
> if (skb->priority == sch->handle)
> return HTB_DIRECT; /* X:0 (direct flow) selected */
> cl = htb_find(skb->priority, sch);
> - if (cl && cl->level == 0)
> - return cl;
> + if (cl) {
> + if (cl->level == 0)
> + return cl;
> + /* Start with inner filter chain if a non-leaf class is selected */
> + tcf = cl->filter_list;
> + } else
> + tcf = q->filter_list;
>
} else {
tcf = q->filter_list;
}
(Documentation/CodingStyle line 169)
> *qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> - tcf = q->filter_list;
> while (tcf && (result = tc_classify(skb, tcf, &res)) >= 0) {
> #ifdef CONFIG_NET_CLS_ACT
> switch (result) {
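
The chain selection proposed in the patch above can be sketched outside the kernel. This is a simplified model under stated assumptions: the toy_* types and toy_pick_chain() are hypothetical and only mirror the branch structure of htb_classify(), not its real data structures.

```c
#include <assert.h>
#include <stddef.h>

struct toy_filter {
    int id;
};

struct toy_class {
    int level;                        /* 0 means leaf class */
    struct toy_filter *filter_list;
};

struct toy_qdisc {
    struct toy_filter *filter_list;
};

/*
 * Returns the filter chain classification should start from, or NULL when
 * cl is a leaf (the caller would then use cl directly, without filters).
 */
static struct toy_filter *toy_pick_chain(const struct toy_qdisc *q,
                                         const struct toy_class *cl)
{
    assert(q != NULL);
    if (cl) {
        if (cl->level == 0)
            return NULL;              /* leaf: no filters consulted */
        return cl->filter_list;       /* non-leaf: start at the inner chain */
    } else {
        return q->filter_list;        /* no class found: start at the qdisc */
    }
}
```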
^ permalink raw reply
* RE: PROBLEM: usbnet / ax88179_178a: Panic in usb_hcd_map_urb_for_dma
From: David Laight @ 2014-01-17 14:46 UTC (permalink / raw)
To: 'Ming Lei'
Cc: Bjørn Mork, Thomas Kear, Ben Hutchings, netdev,
linux-usb@vger.kernel.org
In-Reply-To: <CACVXFVOGxWe5+o0hLROiPN43OCDekV2Ovz1yXr95H1m+yHGn4w@mail.gmail.com>
From: Ming Lei
> On Mon, Jan 13, 2014 at 9:26 PM, David Laight <David.Laight@aculab.com> wrote:
> >>
> >> I believe all processing use the urb->num_sgs field to limit the number
> >> of entries. Common interfaces like dma_map_sg() and for_each_sg() limit
> >> their processing to "nents" entries, and the USB code use the value of
> >> urb->num_sgs for this parameter.
> >
> > Which mostly means that the sg_xxx functions are doing a whole load
> > of unnecessary instructions and memory accesses...
> >
> > This probably has a lot to do with the significant difference in the
> > cpu use for the usb3 and 'normal' ethernet interfaces.
> >
> > While each bit doesn't seem significant, they soon add up.
>
> If you plan to remove the 'nents' parameter, I am wondering if it is
> a good idea, because sg_nents() should be heavier. Not to mention that
> sometimes the callers just want to map/unmap part of the entries.
I was thinking of using a simple address/length array without all
the extra fields and flags 'overpunched' in the low address bits.
I'm not even sure the current use is strictly correct.
The field names of the scatterlist have a page address and offset
but the buffers passed to xhci (at least by usbnet) can span multiple
pages - they are physically contiguous.
IIRC some traces of requests for USB disks show a separate SG entry
for each 4k page - even for both virtually and physically adjacent
pages.
David
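
A minimal sketch of the kind of flat address/length array described above. The struct and function names are hypothetical and are not an existing kernel interface; the sketch only assumes that each entry describes one physically contiguous buffer, which may span several pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat descriptor: one physically contiguous buffer. */
struct toy_seg {
    uint64_t addr;   /* address of the buffer */
    uint32_t len;    /* length in bytes; may span several pages */
};

/* Sum the lengths of all segments; no per-entry flags need decoding. */
static uint64_t toy_total_len(const struct toy_seg *segs, size_t nsegs)
{
    uint64_t total = 0;

    assert(segs != NULL || nsegs == 0);
    for (size_t i = 0; i < nsegs; i++)
        total += segs[i].len;
    return total;
}
```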
^ permalink raw reply
* Multicast packets receiving problem on linux since version 3.10.1
From: Andrey Dmitrov @ 2014-01-17 15:10 UTC (permalink / raw)
To: netdev; +Cc: Konstantin Ushakov, Alexandra N. Kossovsky, Yurij Plotnikov
[-- Attachment #1: Type: text/plain, Size: 3118 bytes --]
Greetings,
there is a problem with receiving multicast packets on linux-3.10.1 and
newer. It's reproducible with two hosts (host_A, host_B), with eth3@host_A
directly connected to eth3@host_B. Two VLANs and one multicast group are
used.
Each side opens two sockets and binds them to different VLAN interfaces.
host_A arranges both sockets to receive multicast packets of the group.
Then the first socket is removed from the group. After this the second socket
can no longer receive multicast packets of the group, which are sent by the
second host. See the example below.
host_A:
Two VLANs 999 and 1001 are added on eth3:
eth3.999 10.208.14.1
eth3.1001 10.208.15.1
1. socket(SOCK_DGRAM) -> 3
2. setsockopt(3, SOL_SOCKET, SO_REUSEADDR, 1) -> 0
3. setsockopt(3, IPPROTO_IP, MCAST_JOIN_GROUP, {229.17.88.168,
eth3.999}) -> 0
4. setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, eth3.999) -> 0
5. bind(3, 0.0.0.0:29214)->0
6. socket(SOCK_DGRAM) -> 4
7. setsockopt(4, SOL_SOCKET, SO_REUSEADDR, 1) -> 0
8. setsockopt(4, IPPROTO_IP, MCAST_JOIN_GROUP, {229.17.88.168,
eth3.1001}) -> 0
9. setsockopt(4, SOL_SOCKET, SO_BINDTODEVICE, eth3.1001) -> 0
10. bind(4, 0.0.0.0:29214) -> 0
11. poll({{3, POLLIN}, {4, POLLIN}}, 2, 30000) -> 2
12. recv(3, buf, 100) -> 100
13. recv(4, buf, 99) -> 99
14. setsockopt(3, IPPROTO_IP, MCAST_LEAVE_GROUP, {229.17.88.168,
eth3.999}) -> 0
15. poll({{3, POLLIN}, {4, POLLIN}}, 2, 30000) -> 0
Socket 4 does not receive the multicast packet on linux 3.10 and newer. But
it receives the packet with older linux versions. Probably the packet is
filtered by NIC, tcpdump does not see it.
host_B:
Two VLANs 999 and 1001 are added on eth3:
eth3.999 10.208.14.2
eth3.1001 10.208.15.2
16. socket(SOCK_DGRAM) -> 3
17. bind(3, 10.208.14.2:29219) -> 0
18. socket(SOCK_DGRAM) -> 4
19. bind(4, 10.208.15.2:29219) -> 0
20. sendto(3, 229.17.88.168:29214, buf, 100) -> 100
21. sendto(4, 229.17.88.168:29214, buf, 99) -> 99
Continue after socket 3 on host_A has left the group (step 14).
22. sendto(3, 229.17.88.168:29214, buf, 100) -> 100
23. sendto(4, 229.17.88.168:29214, buf, 99) -> 99
Note that if I replace step #14 with:
> setsockopt(4, IPPROTO_IP, MCAST_LEAVE_GROUP, {229.17.88.168,
eth3.1001}) -> 0
and remove the second socket from the group (instead of the first one), then
the first socket will receive its packet.
Find attached client and server C programs that reproduce the problem. The
client logs received packets. As stated above, in the good case it will log 3
packets and in the bad one only 2. Use the following command lines to start
the client and server:
host_A:
sudo ip link add link eth3 name eth3.999 type vlan id 999
sudo ifconfig eth3.999 10.208.14.1/24
sudo ip link add link eth3 name eth3.1001 type vlan id 1001
sudo ifconfig eth3.1001 10.208.15.1/24
gcc mcast_client.c -o cl
sudo ./cl
host_B:
sudo ip link add link eth3 name eth3.999 type vlan id 999
sudo ifconfig eth3.999 10.208.14.2/24
sudo ip link add link eth3 name eth3.1001 type vlan id 1001
sudo ifconfig eth3.1001 10.208.15.2/24
gcc mcast_serv.c -o serv
./serv
Thanks in advance,
Andrey Dmitrov
[-- Attachment #2: mcast_client.c --]
[-- Type: text/x-csrc, Size: 4162 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <net/if.h>

/* Print a message to stderr and abort. */
#define ERROR(_line...) \
    do {                        \
        fprintf(stderr, _line); \
        fprintf(stderr, "\n");  \
        assert(0);              \
    } while (0)

/* Same as ERROR(), but close the local 'sock' first. */
#define ERROR_CL(_line...) \
    do {                   \
        close(sock);       \
        ERROR(_line);      \
    } while (0)

#define WARN(_line...) \
    do {               \
        printf(_line); \
        printf("\n");  \
    } while (0)

#define MCAST_GROUP "229.17.88.168"
#define PORT 12345
#define VLAN1 "999"
#define VLAN2 "1001"
#define VLAN_IF_STR(_vlan) "eth3." _vlan
#define VLAN_IF1 VLAN_IF_STR(VLAN1)
#define VLAN_IF2 VLAN_IF_STR(VLAN2)

/* Open a UDP socket, join MCAST_GROUP on ifname, bind to the device and PORT. */
static int
init_socket(const char *ifname)
{
    int sock;
    int val = 1;
    struct group_req req;
    struct sockaddr_in addr;
    struct ifreq ifr;

    if ((sock = socket(PF_INET, SOCK_DGRAM, 0)) < 0)
        ERROR("Can't open socket: %s", strerror(errno));
    if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val)) != 0)
        ERROR_CL("Can't set SO_REUSEADDR for socket: %s", strerror(errno));

    /* Join the multicast group on the given interface. */
    memset(&req, 0, sizeof(req));
    if ((req.gr_interface = if_nametoindex(ifname)) == 0)
        ERROR_CL("Wrong interface index: %s", strerror(errno));
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    if (inet_pton(AF_INET, MCAST_GROUP, &addr.sin_addr) != 1)
        ERROR_CL("Can't convert mcast group address: %s", strerror(errno));
    memcpy(&req.gr_group, &addr, sizeof(addr));
    if (setsockopt(sock, IPPROTO_IP, MCAST_JOIN_GROUP, &req,
                   sizeof(req)) != 0)
        ERROR_CL("Can't join multicast group: %s", strerror(errno));

    /* Bind the socket to the VLAN interface. */
    memset(&ifr, 0, sizeof(struct ifreq));
    snprintf(ifr.ifr_name, sizeof(ifr.ifr_name), "%s", ifname);
    if (ioctl(sock, SIOCGIFINDEX, &ifr) != 0)
        ERROR_CL("Can't get interface index with ioctl: %s", strerror(errno));
    if (setsockopt(sock, SOL_SOCKET, SO_BINDTODEVICE, &ifr,
                   sizeof(struct ifreq)) != 0)
        ERROR_CL("SO_BINDTODEVICE failed: %s", strerror(errno));

    /* Bind to the wildcard address and the group port. */
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        ERROR_CL("bind failed: %s", strerror(errno));
    return sock;
}

/* Remove sock from MCAST_GROUP on ifname; returns 0 on success, -1 on error. */
static int
leave_group(int sock, const char *ifname)
{
    struct group_req req;
    struct sockaddr_in addr;

    memset(&req, 0, sizeof(req));
    if ((req.gr_interface = if_nametoindex(ifname)) == 0)
    {
        WARN("Wrong interface index: %s", strerror(errno));
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    if (inet_pton(AF_INET, MCAST_GROUP, &addr.sin_addr) != 1)
    {
        WARN("Can't convert mcast group address: %s", strerror(errno));
        return -1;
    }
    memcpy(&req.gr_group, &addr, sizeof(addr));
    if (setsockopt(sock, IPPROTO_IP, MCAST_LEAVE_GROUP, &req, sizeof(req)) != 0)
    {
        WARN("Can't leave multicast group: %s", strerror(errno));
        return -1;
    }
    return 0;
}

int
main(void)
{
    int sock1;
    int sock2;
    char buf[100];
    int len;
    int rc = EXIT_SUCCESS;

    sock1 = init_socket(VLAN_IF1);
    sock2 = init_socket(VLAN_IF2);

    /* First round: both sockets are group members. */
    if ((len = recv(sock1, buf, sizeof(buf), 0)) < 0)
        WARN("Failed to receive packets: %s", strerror(errno));
    WARN("packet length %d", len);
    if ((len = recv(sock2, buf, sizeof(buf), 0)) < 0)
        WARN("Failed to receive packets: %s", strerror(errno));
    WARN("packet length %d", len);

    if (leave_group(sock1, VLAN_IF1) != 0)
    {
        WARN("Failed to leave group");
        rc = EXIT_FAILURE;
        goto cleanup;
    }

    /* Second round: only sock2 is still a group member. */
    if ((len = recv(sock2, buf, sizeof(buf), 0)) < 0)
        WARN("Failed to receive packets: %s", strerror(errno));
    WARN("packet length %d", len);
    if ((len = recv(sock1, buf, sizeof(buf), 0)) < 0)
        WARN("Failed to receive packets: %s", strerror(errno));
    WARN("packet length %d", len);

cleanup:
    close(sock1);
    close(sock2);
    return rc;
}
[-- Attachment #3: mcast_serv.c --]
[-- Type: text/x-csrc, Size: 2678 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Print a message to stderr and abort. */
#define ERROR(_line...) \
    do {                        \
        fprintf(stderr, _line); \
        fprintf(stderr, "\n");  \
        assert(0);              \
    } while (0)

/* Same as ERROR(), but close the local 'sock' first. */
#define ERROR_CL(_line...) \
    do {                   \
        close(sock);       \
        ERROR(_line);      \
    } while (0)

#define WARN(_line...) \
    do {               \
        printf(_line); \
        printf("\n");  \
    } while (0)

#define MCAST_GROUP "229.17.88.168"
#define GROUP_PORT 12345
#define LOCAL_PORT 23456
#define LOCAL_ADDR1 "10.208.14.2"
#define LOCAL_ADDR2 "10.208.15.2"
#define VLAN1 "999"
#define VLAN2 "1001"
#define VLAN_IF_STR(_vlan) "eth3." _vlan
#define VLAN_IF1 VLAN_IF_STR(VLAN1)
#define VLAN_IF2 VLAN_IF_STR(VLAN2)

/* Open a UDP socket that sends multicast via local_addr and is bound to it. */
static int
init_socket_send(const char *ifname, const char *local_addr)
{
    int sock;
    struct sockaddr_in addr;
    struct in_addr ifaddr;

    if ((sock = socket(PF_INET, SOCK_DGRAM, 0)) < 0)
        ERROR("Can't open socket: %s", strerror(errno));
    WARN("if %s, addr %s", ifname, local_addr);

    /* Select the outgoing interface for multicast by its local address. */
    if (inet_aton(local_addr, &ifaddr) == 0)
        ERROR_CL("inet_aton failed: %s", strerror(errno));
    if (setsockopt(sock, SOL_IP, IP_MULTICAST_IF, &ifaddr, sizeof(ifaddr)) != 0)
        ERROR_CL("IP_MULTICAST_IF failed: %s", strerror(errno));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(LOCAL_PORT);
    if (inet_pton(AF_INET, local_addr, &addr.sin_addr) != 1)
        ERROR_CL("Can't convert local address: %s", strerror(errno));
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        ERROR_CL("bind failed: %s", strerror(errno));
    return sock;
}

int
main(void)
{
    int sock1;
    int sock2;
    char buf[100];
    struct sockaddr_in addr;

    /* Send a defined payload instead of uninitialized stack contents. */
    memset(buf, 0, sizeof(buf));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(GROUP_PORT);
    if (inet_pton(AF_INET, MCAST_GROUP, &addr.sin_addr) != 1)
        ERROR("Can't convert mcast group address: %s", strerror(errno));

    sock1 = init_socket_send(VLAN_IF1, LOCAL_ADDR1);
    sock2 = init_socket_send(VLAN_IF2, LOCAL_ADDR2);

    /* First round: one packet per VLAN; lengths differ to tell them apart. */
    if (sendto(sock1, buf, 99, 0, (struct sockaddr *)&addr, sizeof(addr)) != 99 ||
        sendto(sock2, buf, 98, 0, (struct sockaddr *)&addr, sizeof(addr)) != 98)
        WARN("send failed: %s", strerror(errno));

    /* Give the client time to leave the group on the first socket. */
    sleep(1);

    /* Second round. */
    if (sendto(sock1, buf, 97, 0, (struct sockaddr *)&addr, sizeof(addr)) != 97 ||
        sendto(sock2, buf, 96, 0, (struct sockaddr *)&addr, sizeof(addr)) != 96)
        WARN("send failed: %s", strerror(errno));

    close(sock1);
    close(sock2);
    return EXIT_SUCCESS;
}
^ permalink raw reply
* [PATCH 17/41] net: Replace __this_cpu_inc in route.c with raw_cpu_inc
From: Christoph Lameter @ 2014-01-17 15:18 UTC (permalink / raw)
To: Tejun Heo
Cc: akpm, rostedt, linux-kernel, Ingo Molnar, Peter Zijlstra,
Thomas Gleixner, netdev, davem, edumazet
In-Reply-To: <20140117151812.770437629@linux.com>
[-- Attachment #1: preempt_rt_cache_stat --]
[-- Type: text/plain, Size: 3614 bytes --]
[Patch depends on another patch in this series that introduces raw_cpu_ops]
The RT_CACHE_STAT_INC macro triggers the new preemption checks
for __this_cpu ops.
I do not see any other synchronization that would allow the use
of a __this_cpu operation here. However, in commit
dbd2915ce87e811165da0717f8e159276ebb803e Andrew justifies
the use of raw_smp_processor_id() here because "we do not care"
about races. In the past we agreed that the price of disabling
interrupts here to get consistent counters would be too high.
These counters may be inaccurate due to race conditions.
The use of __this_cpu op improves the situation already from what commit
dbd2915ce87e811165da0717f8e159276ebb803e did since the single instruction
emitted on x86 does not allow the race to occur anymore. However,
non x86 platforms could still experience a race here.
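
The race mentioned above can be illustrated with a deterministic user-space model of an increment that is split into load, add and store. This is only a sketch: toy_counter and toy_racy_inc() are hypothetical, and the model shows just the lost-update effect, not preemption or CPU migration themselves.

```c
#include <assert.h>

/* Simulated counter slot. */
static int toy_counter;

/*
 * Simulates a non-atomic increment split into load / add / store, with
 * 'interleaved' complete increments from another context landing between
 * the load and the store. Returns the final counter value.
 */
static int toy_racy_inc(int start, int interleaved)
{
    int loaded;

    assert(interleaved >= 0);
    toy_counter = start;
    loaded = toy_counter;             /* load the old value */
    toy_counter += interleaved;       /* another context runs here */
    toy_counter = loaded + 1;         /* add and store using the stale value */
    return toy_counter;
}
```

With no interleaving the counter ends at start + 1 as expected; with any interleaved increments those updates are overwritten, which is the kind of inaccuracy the commit message accepts for these statistics.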
Trace:
[ 1277.189084] __this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
[ 1277.189085] caller is __this_cpu_preempt_check+0x38/0x60
[ 1277.189086] CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF 3.12.0-rc4+ #187
[ 1277.189087] Hardware name: FUJITSU CELSIUS W530 Power/D3227-A1, BIOS V4.6.5.4 R1.10.0 for D3227-A1x 09/16/2013
[ 1277.189088] 0000000000000001 ffff8807ef78fa00 ffffffff816d5a57 ffff8807ef78ffd8
[ 1277.189089] ffff8807ef78fa30 ffffffff8137359c ffff8807ef78fba0 ffff88079f822b40
[ 1277.189091] 0000000020000000 ffff8807ee32c800 ffff8807ef78fa70 ffffffff813735f8
[ 1277.189093] Call Trace:
[ 1277.189094] [<ffffffff816d5a57>] dump_stack+0x4e/0x82
[ 1277.189096] [<ffffffff8137359c>] check_preemption_disabled+0xec/0x110
[ 1277.189097] [<ffffffff813735f8>] __this_cpu_preempt_check+0x38/0x60
[ 1277.189098] [<ffffffff81610d65>] __ip_route_output_key+0x575/0x8c0
[ 1277.189100] [<ffffffff816110d7>] ip_route_output_flow+0x27/0x70
[ 1277.189101] [<ffffffff81616c80>] ? ip_copy_metadata+0x1a0/0x1a0
[ 1277.189102] [<ffffffff81640b15>] udp_sendmsg+0x825/0xa20
[ 1277.189104] [<ffffffff811b4aa9>] ? do_sys_poll+0x449/0x5d0
[ 1277.189105] [<ffffffff8164c695>] inet_sendmsg+0x85/0xc0
[ 1277.189106] [<ffffffff815c6e3c>] sock_sendmsg+0x9c/0xd0
[ 1277.189108] [<ffffffff813735f8>] ? __this_cpu_preempt_check+0x38/0x60
[ 1277.189109] [<ffffffff815c7550>] ? move_addr_to_kernel+0x40/0xa0
[ 1277.189111] [<ffffffff815c71ec>] ___sys_sendmsg+0x37c/0x390
[ 1277.189112] [<ffffffff8136613a>] ? string.isra.3+0x3a/0xd0
[ 1277.189113] [<ffffffff8136613a>] ? string.isra.3+0x3a/0xd0
[ 1277.189115] [<ffffffff81367b54>] ? vsnprintf+0x364/0x650
[ 1277.189116] [<ffffffff81367ee9>] ? snprintf+0x39/0x40
[ 1277.189118] [<ffffffff813735f8>] ? __this_cpu_preempt_check+0x38/0x60
[ 1277.189119] [<ffffffff815c7ff9>] __sys_sendmsg+0x49/0x90
[ 1277.189121] [<ffffffff815c8052>] SyS_sendmsg+0x12/0x20
[ 1277.189122] [<ffffffff816e4fd3>] tracesys+0xe1/0xe6
Cc: netdev@vger.kernel.org
Cc: davem@davemloft.net
Cc: edumazet@google.com
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Index: linux/net/ipv4/route.c
===================================================================
--- linux.orig/net/ipv4/route.c 2013-12-02 16:07:51.964573250 -0600
+++ linux/net/ipv4/route.c 2013-12-02 16:07:51.954573526 -0600
@@ -197,7 +197,7 @@ const __u8 ip_tos2prio[16] = {
EXPORT_SYMBOL(ip_tos2prio);
static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);
-#define RT_CACHE_STAT_INC(field) __this_cpu_inc(rt_cache_stat.field)
+#define RT_CACHE_STAT_INC(field) raw_cpu_inc(rt_cache_stat.field)
#ifdef CONFIG_PROC_FS
static void *rt_cache_seq_start(struct seq_file *seq, loff_t *pos)
^ permalink raw reply
* [PATCH 24/41] net: Replace get_cpu_var through this_cpu_ptr
From: Christoph Lameter @ 2014-01-17 15:18 UTC (permalink / raw)
To: Tejun Heo
Cc: akpm, rostedt, linux-kernel, Ingo Molnar, Peter Zijlstra,
Thomas Gleixner, David S. Miller, netdev, Eric Dumazet
In-Reply-To: <20140117151812.770437629@linux.com>
[-- Attachment #1: this_net --]
[-- Type: text/plain, Size: 8883 bytes --]
[Patch depends on another patch in this series that introduces raw_cpu_ops]
Replace uses of get_cpu_var for address calculation with this_cpu_ptr.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Index: linux/net/core/dev.c
===================================================================
--- linux.orig/net/core/dev.c 2013-12-02 16:07:45.264759422 -0600
+++ linux/net/core/dev.c 2013-12-02 16:07:45.254759699 -0600
@@ -2130,7 +2130,7 @@ static inline void __netif_reschedule(st
unsigned long flags;
local_irq_save(flags);
- sd = &__get_cpu_var(softnet_data);
+ sd = this_cpu_ptr(&softnet_data);
q->next_sched = NULL;
*sd->output_queue_tailp = q;
sd->output_queue_tailp = &q->next_sched;
@@ -2152,7 +2152,7 @@ void dev_kfree_skb_irq(struct sk_buff *s
unsigned long flags;
local_irq_save(flags);
- sd = &__get_cpu_var(softnet_data);
+ sd = this_cpu_ptr(&softnet_data);
skb->next = sd->completion_queue;
sd->completion_queue = skb;
raise_softirq_irqoff(NET_TX_SOFTIRQ);
@@ -3122,7 +3122,7 @@ static void rps_trigger_softirq(void *da
static int rps_ipi_queued(struct softnet_data *sd)
{
#ifdef CONFIG_RPS
- struct softnet_data *mysd = &__get_cpu_var(softnet_data);
+ struct softnet_data *mysd = this_cpu_ptr(&softnet_data);
if (sd != mysd) {
sd->rps_ipi_next = mysd->rps_ipi_list;
@@ -3149,7 +3149,7 @@ static bool skb_flow_limit(struct sk_buf
if (qlen < (netdev_max_backlog >> 1))
return false;
- sd = &__get_cpu_var(softnet_data);
+ sd = this_cpu_ptr(&softnet_data);
rcu_read_lock();
fl = rcu_dereference(sd->flow_limit);
@@ -3291,7 +3291,7 @@ EXPORT_SYMBOL(netif_rx_ni);
static void net_tx_action(struct softirq_action *h)
{
- struct softnet_data *sd = &__get_cpu_var(softnet_data);
+ struct softnet_data *sd = this_cpu_ptr(&softnet_data);
if (sd->completion_queue) {
struct sk_buff *clist;
@@ -3711,7 +3711,7 @@ EXPORT_SYMBOL(netif_receive_skb);
static void flush_backlog(void *arg)
{
struct net_device *dev = arg;
- struct softnet_data *sd = &__get_cpu_var(softnet_data);
+ struct softnet_data *sd = this_cpu_ptr(&softnet_data);
struct sk_buff *skb, *tmp;
rps_lock(sd);
@@ -4157,7 +4157,7 @@ void __napi_schedule(struct napi_struct
unsigned long flags;
local_irq_save(flags);
- ____napi_schedule(&__get_cpu_var(softnet_data), n);
+ ____napi_schedule(this_cpu_ptr(&softnet_data), n);
local_irq_restore(flags);
}
EXPORT_SYMBOL(__napi_schedule);
@@ -4285,7 +4285,7 @@ EXPORT_SYMBOL(netif_napi_del);
static void net_rx_action(struct softirq_action *h)
{
- struct softnet_data *sd = &__get_cpu_var(softnet_data);
+ struct softnet_data *sd = this_cpu_ptr(&softnet_data);
unsigned long time_limit = jiffies + 2;
int budget = netdev_budget;
void *have;
Index: linux/net/core/drop_monitor.c
===================================================================
--- linux.orig/net/core/drop_monitor.c 2013-12-02 16:07:45.264759422 -0600
+++ linux/net/core/drop_monitor.c 2013-12-02 16:07:45.254759699 -0600
@@ -147,7 +147,7 @@ static void trace_drop_common(struct sk_
unsigned long flags;
local_irq_save(flags);
- data = &__get_cpu_var(dm_cpu_data);
+ data = this_cpu_ptr(&dm_cpu_data);
spin_lock(&data->lock);
dskb = data->skb;
Index: linux/net/core/skbuff.c
===================================================================
--- linux.orig/net/core/skbuff.c 2013-12-02 16:07:45.264759422 -0600
+++ linux/net/core/skbuff.c 2013-12-02 16:07:45.254759699 -0600
@@ -371,7 +371,7 @@ static void *__netdev_alloc_frag(unsigne
unsigned long flags;
local_irq_save(flags);
- nc = &__get_cpu_var(netdev_alloc_cache);
+ nc = this_cpu_ptr(&netdev_alloc_cache);
if (unlikely(!nc->frag.page)) {
refill:
for (order = NETDEV_FRAG_PAGE_MAX_ORDER; ;) {
Index: linux/net/ipv4/tcp_output.c
===================================================================
--- linux.orig/net/ipv4/tcp_output.c 2013-12-02 16:07:45.264759422 -0600
+++ linux/net/ipv4/tcp_output.c 2013-12-02 16:07:45.254759699 -0600
@@ -815,7 +815,7 @@ void tcp_wfree(struct sk_buff *skb)
/* queue this socket to tasklet queue */
local_irq_save(flags);
- tsq = &__get_cpu_var(tsq_tasklet);
+ tsq = this_cpu_ptr(&tsq_tasklet);
list_add(&tp->tsq_node, &tsq->head);
tasklet_schedule(&tsq->tasklet);
local_irq_restore(flags);
Index: linux/net/ipv6/syncookies.c
===================================================================
--- linux.orig/net/ipv6/syncookies.c 2013-12-02 16:07:45.264759422 -0600
+++ linux/net/ipv6/syncookies.c 2013-12-02 16:07:45.254759699 -0600
@@ -67,7 +67,7 @@ static u32 cookie_hash(const struct in6_
net_get_random_once(syncookie6_secret, sizeof(syncookie6_secret));
- tmp = __get_cpu_var(ipv6_cookie_scratch);
+ tmp = this_cpu_ptr(ipv6_cookie_scratch);
/*
* we have 320 bits of information to hash, copy in the remaining
Index: linux/net/rds/ib_rdma.c
===================================================================
--- linux.orig/net/rds/ib_rdma.c 2013-12-02 16:07:45.264759422 -0600
+++ linux/net/rds/ib_rdma.c 2013-12-02 16:07:45.254759699 -0600
@@ -267,7 +267,7 @@ static inline struct rds_ib_mr *rds_ib_r
unsigned long *flag;
preempt_disable();
- flag = &__get_cpu_var(clean_list_grace);
+ flag = this_cpu_ptr(&clean_list_grace);
set_bit(CLEAN_LIST_BUSY_BIT, flag);
ret = llist_del_first(&pool->clean_list);
if (ret)
Index: linux/include/net/netfilter/nf_conntrack.h
===================================================================
--- linux.orig/include/net/netfilter/nf_conntrack.h 2013-12-02 16:07:45.264759422 -0600
+++ linux/include/net/netfilter/nf_conntrack.h 2013-12-02 16:07:45.254759699 -0600
@@ -235,7 +235,7 @@ extern s32 (*nf_ct_nat_offset)(const str
DECLARE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
static inline struct nf_conn *nf_ct_untracked_get(void)
{
- return &__raw_get_cpu_var(nf_conntrack_untracked);
+ return raw_cpu_ptr(&nf_conntrack_untracked);
}
void nf_ct_untracked_status_or(unsigned long bits);
Index: linux/include/net/snmp.h
===================================================================
--- linux.orig/include/net/snmp.h 2013-12-02 16:07:45.264759422 -0600
+++ linux/include/net/snmp.h 2013-12-02 16:07:45.254759699 -0600
@@ -170,7 +170,7 @@ struct linux_xfrm_mib {
#define SNMP_ADD_STATS64_BH(mib, field, addend) \
do { \
- __typeof__(*mib[0]) *ptr = __this_cpu_ptr((mib)[0]); \
+ __typeof__(*mib[0]) *ptr = raw_cpu_ptr((mib)[0]); \
u64_stats_update_begin(&ptr->syncp); \
ptr->mibs[field] += addend; \
u64_stats_update_end(&ptr->syncp); \
@@ -192,7 +192,7 @@ struct linux_xfrm_mib {
#define SNMP_UPD_PO_STATS64_BH(mib, basefield, addend) \
do { \
__typeof__(*mib[0]) *ptr; \
- ptr = __this_cpu_ptr((mib)[0]); \
+ ptr = raw_cpu_ptr((mib)[0]); \
u64_stats_update_begin(&ptr->syncp); \
ptr->mibs[basefield##PKTS]++; \
ptr->mibs[basefield##OCTETS] += addend; \
Index: linux/net/ipv4/route.c
===================================================================
--- linux.orig/net/ipv4/route.c 2013-12-02 16:07:45.264759422 -0600
+++ linux/net/ipv4/route.c 2013-12-02 16:09:22.000000000 -0600
@@ -1306,7 +1306,7 @@ static bool rt_cache_route(struct fib_nh
if (rt_is_input_route(rt)) {
p = (struct rtable **)&nh->nh_rth_input;
} else {
- p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
+ p = (struct rtable **)raw_cpu_ptr(nh->nh_pcpu_rth_output);
}
orig = *p;
@@ -1932,7 +1932,7 @@ static struct rtable *__mkroute_output(c
do_cache = false;
goto add;
}
- prth = __this_cpu_ptr(nh->nh_pcpu_rth_output);
+ prth = raw_cpu_ptr(nh->nh_pcpu_rth_output);
}
rth = rcu_dereference(*prth);
if (rt_cache_valid(rth)) {
Index: linux/net/ipv4/tcp.c
===================================================================
--- linux.orig/net/ipv4/tcp.c 2013-12-02 16:07:45.264759422 -0600
+++ linux/net/ipv4/tcp.c 2013-12-02 16:07:45.254759699 -0600
@@ -2981,7 +2981,7 @@ struct tcp_md5sig_pool *tcp_get_md5sig_p
local_bh_disable();
p = ACCESS_ONCE(tcp_md5sig_pool);
if (p)
- return __this_cpu_ptr(p);
+ return raw_cpu_ptr(p);
local_bh_enable();
return NULL;
Index: linux/net/ipv4/syncookies.c
===================================================================
--- linux.orig/net/ipv4/syncookies.c 2013-12-02 16:07:45.264759422 -0600
+++ linux/net/ipv4/syncookies.c 2013-12-02 16:07:45.254759699 -0600
@@ -40,7 +40,7 @@ static u32 cookie_hash(__be32 saddr, __b
net_get_random_once(syncookie_secret, sizeof(syncookie_secret));
- tmp = __get_cpu_var(ipv4_cookie_scratch);
+ tmp = this_cpu_ptr(ipv4_cookie_scratch);
memcpy(tmp + 4, syncookie_secret[c], sizeof(syncookie_secret[c]));
tmp[0] = (__force u32)saddr;
tmp[1] = (__force u32)daddr;
* Re: [PATCH V3 net-next 1/3] ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET
From: Florent Fourcot @ 2014-01-17 15:18 UTC (permalink / raw)
To: netdev, Hannes Frederic Sowa
In-Reply-To: <52D91F37.8060602@enst-bretagne.fr>
>>
>> I am not sure here, do you write the flow_label on the listening socket?
>>
>
> Hum, yes. You are right, this is bad.
>
> One alternative is not so simple. Perhaps could we store the flowlabel
> in inet_request_sock? It will be available in the tcp_v6_send_synack
> function, even in case of retransmission.
Ok, it is possible to use ireq->pktopts, without adding any memory
overhead for other users. I will send a V4.
Thanks Hannes,
Florent.
* Re: [Xen-devel] [PATCH net-next v2] xen-netfront: clean up code in xennet_release_rx_bufs
From: David Vrabel @ 2014-01-17 15:40 UTC (permalink / raw)
To: Wei Liu; +Cc: annie li, ian.campbell, netdev, xen-devel, andrew.bennieston,
davem
In-Reply-To: <20140117120810.GA11681@zion.uk.xensource.com>
On 17/01/14 12:08, Wei Liu wrote:
> On Fri, Jan 17, 2014 at 02:25:40PM +0800, annie li wrote:
>>
>> On 2014/1/16 19:10, David Vrabel wrote:
>>> On 15/01/14 23:57, Annie Li wrote:
>>>> This patch implements two things:
>>>>
>>>> * release grant reference and skb for rx path, this fixes resource leaking.
>>>> * clean up grant transfer code kept from old netfront(2.6.18) which grants
>>>> pages for access/map and transfer. But grant transfer is deprecated in current
>>>> netfront, so remove corresponding release code for transfer.
>>>>
>>>> gnttab_end_foreign_access_ref may fail when the grant entry is currently used
>>>> for reading or writing. But this patch does not cover this and improvement for
>>>> this failure may be implemented in a separate patch.
>>> I don't think replacing a resource leak with a security bug is a good idea.
>>>
>>> If you would prefer not to fix the gnttab_end_foreign_access() call, I
>>> think you can fix this in netfront by taking a reference to the page
>>> before calling gnttab_end_foreign_access(). This will ensure the page
>>> isn't freed until the subsequent kfree_skb(), or the gref is released by
>>> the foreign domain (whichever is later).
>>
>> Taking a reference to the page before calling
>> gnttab_end_foreign_access() delays the free work until kfree_skb().
>> Simply adding put_page before kfree_skb() does not make things
>> different from gnttab_end_foreign_access_ref(), and the pages will
>> be freed by kfree_skb(), problem will be hit in
>> gnttab_handle_deferred() when freeing pages which already be freed.
>>
>
> I think David's idea is:
>
> get_page
> gnttab_end_foreign_access
> kfree_skb
>
> The get_page is to offset put_page in gnttab_end_foreign_access. You
> don't need to put page before kfree_skb.
Yes.
David
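
The ordering confirmed above (get_page, then gnttab_end_foreign_access, then kfree_skb) can be checked with a small userspace model of the page reference count. This is only a sketch: every toy_* name is hypothetical, the deferred-release path of gnttab_end_foreign_access is not modelled, and the assumption that the grant call and the skb free each drop exactly one reference is taken from the discussion in this thread rather than from the kernel source.

```c
/* Userspace toy model, NOT kernel code. All toy_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	int refcount;	/* stands in for the struct page reference count */
	bool freed;	/* set once the count drops to zero */
};

/* Models get_page(): take one extra reference. */
static void toy_get_page(struct toy_page *p)
{
	assert(!p->freed);
	p->refcount++;
}

/* Models put_page(): drop one reference and free the page on zero. */
static void toy_put_page(struct toy_page *p)
{
	assert(!p->freed && p->refcount > 0);
	if (--p->refcount == 0)
		p->freed = true;
}

/* Models gnttab_end_foreign_access(): assumed to drop one page reference. */
static void toy_end_foreign_access(struct toy_page *p)
{
	toy_put_page(p);
}

/* Models kfree_skb(): drops the reference the skb holds on its page. */
static void toy_kfree_skb(struct toy_page *p)
{
	toy_put_page(p);
}

/*
 * Release one rx buffer in the suggested order. Returns true if the page
 * survived the grant release and was freed only by the final skb free.
 */
static bool toy_release_rx_buf(struct toy_page *p)
{
	bool alive_before_kfree;

	toy_get_page(p);		/* offsets the put in the next call */
	toy_end_foreign_access(p);
	alive_before_kfree = !p->freed;
	toy_kfree_skb(p);
	return alive_before_kfree && p->freed;
}
```

Starting from a page whose only reference is held by the skb, the extra get_page keeps the count above zero across the grant release, so the page is freed once and no second put_page is needed before kfree_skb.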
* Re: [Xen-devel] [PATCH net-next v2] xen-netfront: clean up code in xennet_release_rx_bufs
From: annie li @ 2014-01-17 15:43 UTC (permalink / raw)
To: Wei Liu
Cc: David Vrabel, ian.campbell, netdev, xen-devel, andrew.bennieston,
davem
In-Reply-To: <20140117140246.GB11681@zion.uk.xensource.com>
On 2014-1-17 22:02, Wei Liu wrote:
> On Fri, Jan 17, 2014 at 08:32:29PM +0800, annie li wrote:
>> On 2014-1-17 20:08, Wei Liu wrote:
>>> On Fri, Jan 17, 2014 at 02:25:40PM +0800, annie li wrote:
>>>> On 2014/1/16 19:10, David Vrabel wrote:
>>>>> On 15/01/14 23:57, Annie Li wrote:
>>>>>> This patch implements two things:
>>>>>>
>>>>>> * release grant reference and skb for rx path, this fixes resource leaking.
>>>>>> * clean up grant transfer code kept from old netfront(2.6.18) which grants
>>>>>> pages for access/map and transfer. But grant transfer is deprecated in current
>>>>>> netfront, so remove corresponding release code for transfer.
>>>>>>
>>>>>> gnttab_end_foreign_access_ref may fail when the grant entry is currently used
>>>>>> for reading or writing. But this patch does not cover this and improvement for
>>>>>> this failure may be implemented in a separate patch.
>>>>> I don't think replacing a resource leak with a security bug is a good idea.
>>>>>
>>>>> If you would prefer not to fix the gnttab_end_foreign_access() call, I
>>>>> think you can fix this in netfront by taking a reference to the page
>>>>> before calling gnttab_end_foreign_access(). This will ensure the page
>>>>> isn't freed until the subsequent kfree_skb(), or the gref is released by
>>>>> the foreign domain (whichever is later).
>>>> Taking a reference to the page before calling
>>>> gnttab_end_foreign_access() delays the free work until kfree_skb().
>>>> Simply adding put_page before kfree_skb() does not make things
>>>> different from gnttab_end_foreign_access_ref(), and the pages will
>>>> be freed by kfree_skb(), problem will be hit in
>>>> gnttab_handle_deferred() when freeing pages which already be freed.
>>>>
>>> I think David's idea is:
>>>
>>> get_page
>>> gnttab_end_foreign_access
>>> kfree_skb
>>>
>>> The get_page is to offset put_page in gnttab_end_foreign_access. You
>>> don't need to put page before kfree_skb.
>> Yes, this is what I described as following about David's patch.
>>
>>>> So put_page is required in gnttab_end_foreign_access(), this will
>>>> ensure either free is taken by kfree_skb or gnttab_handle_deferred.
>>>> This involves changes in blkfront/pcifront/tpmfront(just like your
>>>> patch), this way ensure page is released when ref is end.
>> But this would have some issues in the netfront tx path. Netfront ends all
> What issue with tx path? Your patch only touches rx skbs, doesn't it?
No, I am trying to implement two patches. One is my original patch, which
fixes the rx leak; the other improves gnttab_end_foreign_access and would
involve not only the tx path but also blkfront/pcifront/tpmfront, since
they call gnttab_end_foreign_access in their source code.
>
>> grant reference of one skb first and then release this skb. If the
>> gnttab_end_foreign_access_ref fails in gnttab_end_foreign_access(),
>> this frag page and corresponding grant reference will be put in
>> entry and release work will be done in the timer routine. If some
> I understand up to this point.
>
>> frag pages of one skb is free in this timer routine, then
>> dev_kfree_skb_irq will free pages which have been freed.
> Why is dev_kfree_skb_irq involved? It is used in tx path not rx path.
This is involved in the second patch, as David suggested. It ensures the
page is released only when grant access has ended, and avoids the situation
where the page is released but the grant reference is still mapped.
> Even if we look at dev_kfree_skb_irq, it calls __kfree_skb for dropped
> packet eventually, which should do the right thing if we don't mess up
> ref counts.
I think you are right, I mixed it up with get_skb just now. Either
__kfree_skb or gnttab_end_foreign_access() does the free work.
Thanks
Annie
>
> Wei.
>
>> So I prefer following way I mentioned, suggestions?
>>
>>>> Another solution I am thinking is calling
>>>> gnttab_end_foreign_access() with page parameter as NULL, then
>>>> gnttab_end_foreign_access will only do ending grant reference work
>>>> and releasing page work is done by kfree_skb().
>> Thanks
>> Annie
* [PATCH] [PATCH net-next v2] net: stmmac: fix NULL pointer dereference in stmmac_get_tx_hwtstamp
From: Bruce Liu @ 2014-01-17 15:47 UTC (permalink / raw)
To: peppe.cavallaro; +Cc: netdev, linux-kernel, Bruce Liu
When timestamping is enabled, stmmac_tx_clean calls
stmmac_get_tx_hwtstamp to get the tx timestamp.
But the skb can be NULL there: when a frame spans more than one
descriptor, the tx_skbuff[] entry of its last descriptor is NULL.
To fix the issue, change the code:
- Store the TX skb in the tx_skbuff[] entry of the frame's last segment.
- Check that skb is not NULL in stmmac_get_tx_hwtstamp.
Signed-off-by: Bruce Liu <damuzi000@gmail.com>
---
drivers/net/ethernet/stmicro/stmmac/chain_mode.c | 3 +-
drivers/net/ethernet/stmicro/stmmac/ring_mode.c | 2 +-
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 35 +++++++++++----------
3 files changed, 21 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
index d234ab5..72d282b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
@@ -51,6 +51,7 @@ static unsigned int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
priv->hw->desc->prepare_tx_desc(desc, 1, bmax, csum, STMMAC_CHAIN_MODE);
while (len != 0) {
+ priv->tx_skbuff[entry] = NULL;
entry = (++priv->cur_tx) % txsize;
desc = priv->dma_tx + entry;
@@ -62,7 +63,6 @@ static unsigned int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
priv->hw->desc->prepare_tx_desc(desc, 0, bmax, csum,
STMMAC_CHAIN_MODE);
priv->hw->desc->set_tx_owner(desc);
- priv->tx_skbuff[entry] = NULL;
len -= bmax;
i++;
} else {
@@ -73,7 +73,6 @@ static unsigned int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
priv->hw->desc->prepare_tx_desc(desc, 0, len, csum,
STMMAC_CHAIN_MODE);
priv->hw->desc->set_tx_owner(desc);
- priv->tx_skbuff[entry] = NULL;
len = 0;
}
}
diff --git a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
index 1ef9d8a..a96c7c2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
@@ -58,6 +58,7 @@ static unsigned int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
priv->hw->desc->prepare_tx_desc(desc, 1, bmax, csum,
STMMAC_RING_MODE);
wmb();
+ priv->tx_skbuff[entry] = NULL;
entry = (++priv->cur_tx) % txsize;
if (priv->extend_desc)
@@ -73,7 +74,6 @@ static unsigned int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
STMMAC_RING_MODE);
wmb();
priv->hw->desc->set_tx_owner(desc);
- priv->tx_skbuff[entry] = NULL;
} else {
desc->des2 = dma_map_single(priv->device, skb->data,
nopaged_len, DMA_TO_DEVICE);
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 797b56a..5cf52ad 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -332,7 +332,7 @@ static void stmmac_get_tx_hwtstamp(struct stmmac_priv *priv,
return;
/* exit if skb doesn't support hw tstamp */
- if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS)))
+ if (likely(!skb || !(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS)))
return;
if (priv->adv_ts)
@@ -1161,21 +1161,24 @@ static void dma_free_tx_skbufs(struct stmmac_priv *priv)
int i;
for (i = 0; i < priv->dma_tx_size; i++) {
- if (priv->tx_skbuff[i] != NULL) {
- struct dma_desc *p;
- if (priv->extend_desc)
- p = &((priv->dma_etx + i)->basic);
- else
- p = priv->dma_tx + i;
+ struct dma_desc *p;
- if (priv->tx_skbuff_dma[i])
- dma_unmap_single(priv->device,
- priv->tx_skbuff_dma[i],
- priv->hw->desc->get_tx_len(p),
- DMA_TO_DEVICE);
+ if (priv->extend_desc)
+ p = &((priv->dma_etx + i)->basic);
+ else
+ p = priv->dma_tx + i;
+
+ if (priv->tx_skbuff_dma[i]) {
+ dma_unmap_single(priv->device,
+ priv->tx_skbuff_dma[i],
+ priv->hw->desc->get_tx_len(p),
+ DMA_TO_DEVICE);
+ priv->tx_skbuff_dma[i] = 0;
+ }
+
+ if (priv->tx_skbuff[i] != NULL) {
dev_kfree_skb_any(priv->tx_skbuff[i]);
priv->tx_skbuff[i] = NULL;
- priv->tx_skbuff_dma[i] = 0;
}
}
}
@@ -1844,8 +1847,6 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
first = desc;
- priv->tx_skbuff[entry] = skb;
-
/* To program the descriptors according to the size of the frame */
if (priv->mode == STMMAC_RING_MODE) {
is_jumbo = priv->hw->ring->is_jumbo_frm(skb->len,
@@ -1873,6 +1874,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
int len = skb_frag_size(frag);
+ priv->tx_skbuff[entry] = NULL;
entry = (++priv->cur_tx) % txsize;
if (priv->extend_desc)
desc = (struct dma_desc *)(priv->dma_etx + entry);
@@ -1882,7 +1884,6 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
desc->des2 = skb_frag_dma_map(priv->device, frag, 0, len,
DMA_TO_DEVICE);
priv->tx_skbuff_dma[entry] = desc->des2;
- priv->tx_skbuff[entry] = NULL;
priv->hw->desc->prepare_tx_desc(desc, 0, len, csum_insertion,
priv->mode);
wmb();
@@ -1890,6 +1891,8 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
wmb();
}
+ priv->tx_skbuff[entry] = skb;
+
/* Finalize the latest segment. */
priv->hw->desc->close_tx_desc(desc);
--
1.7.9.5
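
The bookkeeping change in the patch above can be illustrated with a small userspace model of the tx ring. It is a sketch under simplifying assumptions, not driver code: all toy_* names and the ring size are hypothetical, DMA mapping and descriptor ownership are left out, and the clean routine calls the timestamp helper for every descriptor, relying on the NULL check, whereas the driver only does so on a frame's last segment.

```c
/* Userspace toy model of the tx_skbuff[] bookkeeping, NOT driver code. */
#include <assert.h>
#include <stddef.h>

#define TOY_TX_RING_SIZE 8	/* hypothetical ring size */

struct toy_skb {
	int wants_hwtstamp;	/* stands in for SKBTX_IN_PROGRESS */
	int tstamp_taken;
};

struct toy_priv {
	struct toy_skb *tx_skbuff[TOY_TX_RING_SIZE];
	unsigned int cur_tx;	/* next descriptor to fill */
	unsigned int dirty_tx;	/* next descriptor to clean */
};

/*
 * Queue one frame that occupies nsegs descriptors. Only the last
 * descriptor records the skb; earlier ones are set to NULL.
 * Returns the ring index of the last descriptor.
 */
static unsigned int toy_xmit(struct toy_priv *priv, struct toy_skb *skb,
			     unsigned int nsegs)
{
	unsigned int entry = priv->cur_tx % TOY_TX_RING_SIZE;
	unsigned int i;

	assert(nsegs >= 1 && nsegs <= TOY_TX_RING_SIZE);
	for (i = 1; i < nsegs; i++) {
		priv->tx_skbuff[entry] = NULL;
		entry = (++priv->cur_tx) % TOY_TX_RING_SIZE;
	}
	priv->tx_skbuff[entry] = skb;
	priv->cur_tx++;
	return entry;
}

/* Mirrors the guard added to stmmac_get_tx_hwtstamp: tolerate NULL. */
static void toy_get_tx_hwtstamp(struct toy_skb *skb)
{
	if (!skb || !skb->wants_hwtstamp)
		return;
	skb->tstamp_taken = 1;
}

/* Walk every queued descriptor; returns how many skbs were released. */
static unsigned int toy_tx_clean(struct toy_priv *priv)
{
	unsigned int released = 0;

	while (priv->dirty_tx != priv->cur_tx) {
		unsigned int entry = priv->dirty_tx % TOY_TX_RING_SIZE;
		struct toy_skb *skb = priv->tx_skbuff[entry];

		toy_get_tx_hwtstamp(skb);
		if (skb) {
			priv->tx_skbuff[entry] = NULL;
			released++;
		}
		priv->dirty_tx++;
	}
	return released;
}
```

With the skb recorded on the last descriptor, the clean pass finds a valid pointer exactly where the frame completes, and the NULL check covers the intermediate descriptors.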
* Re: [PATCH v2] ipv6: send Change Status Report after DAD is completed
From: Hannes Frederic Sowa @ 2014-01-17 16:02 UTC (permalink / raw)
To: Flavio Leitner; +Cc: netdev, Hideaki YOSHIFUJI
In-Reply-To: <1389907679-15346-1-git-send-email-fbl@redhat.com>
On Thu, Jan 16, 2014 at 07:27:59PM -0200, Flavio Leitner wrote:
> The RFC 3810 defines two type of messages for multicast
> listeners. The "Current State Report" message, as the name
> implies, refreshes the *current* state to the querier.
> Since the querier sends Query messages periodically, there
> is no need to retransmit the report.
>
> On the other hand, any change should be reported immediately
> using "State Change Report" messages. Since it's an event
> triggered by a change and that it can be affected by packet
> loss, the rfc states it should be retransmitted [RobVar] times
> to make sure routers will receive timely.
>
> Currently, we are sending "Current State Reports" after
> DAD is completed. Before that, we send messages using
> unspecified address (::) which should be silently discarded
> by routers.
>
> This patch changes to send "State Change Report" messages
> after DAD is completed fixing the behavior to be RFC compliant
> and also to pass TAHI IPv6 testsuite.
>
> Signed-off-by: Flavio Leitner <fbl@redhat.com>
I don't see any obvious problems and looks spec conformant, thanks!
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
* Re: Fwd: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Tom Herbert @ 2014-01-17 16:03 UTC (permalink / raw)
To: Jason Wang
Cc: Stefan Hajnoczi, Zhi Yong Wu, Linux Netdev List, Eric Dumazet,
David S. Miller, Zhi Yong Wu, Michael S. Tsirkin, Rusty Russell
In-Reply-To: <52D8CF65.1090100@redhat.com>
On Thu, Jan 16, 2014 at 10:36 PM, Jason Wang <jasowang@redhat.com> wrote:
> On 01/17/2014 01:08 PM, Tom Herbert wrote:
>> On Thu, Jan 16, 2014 at 7:26 PM, Jason Wang <jasowang@redhat.com> wrote:
>>> On 01/17/2014 01:12 AM, Tom Herbert wrote:
>>>> On Thu, Jan 16, 2014 at 12:52 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> On Thu, Jan 16, 2014 at 04:34:10PM +0800, Zhi Yong Wu wrote:
>>>>>> CC: stefanha, MST, Rusty Russel
>>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> From: Jason Wang <jasowang@redhat.com>
>>>>>> Date: Thu, Jan 16, 2014 at 12:23 PM
>>>>>> Subject: Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
>>>>>> To: Zhi Yong Wu <zwu.kernel@gmail.com>
>>>>>> Cc: netdev@vger.kernel.org, therbert@google.com, edumazet@google.com,
>>>>>> davem@davemloft.net, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>>>>>
>>>>>>
>>>>>> On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
>>>>>>> From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
>>>>>>>
>>>>>>> HI, folks
>>>>>>>
>>>>>>> The patchset is trying to integrate aRFS support to virtio_net. In this case,
>>>>>>> aRFS will be used to select the RX queue. To make sure that it's going ahead
>>>>>>> in the correct direction, although it is still one RFC and isn't tested, it's
>>>>>>> post out ASAP. Any comment are appreciated, thanks.
>>>>>>>
>>>>>>> If anyone is interested in playing with it, you can get this patchset from my
>>>>>>> dev git on github:
>>>>>>> git://github.com/wuzhy/kernel.git virtnet_rfs
>>>>>>>
>>>>>>> Zhi Yong Wu (3):
>>>>>>> virtio_pci: Introduce one new config api vp_get_vq_irq()
>>>>>>> virtio_net: Introduce one dummy function virtnet_filter_rfs()
>>>>>>> virtio-net: Add accelerated RFS support
>>>>>>>
>>>>>>> drivers/net/virtio_net.c | 67 ++++++++++++++++++++++++++++++++++++++++-
>>>>>>> drivers/virtio/virtio_pci.c | 11 +++++++
>>>>>>> include/linux/virtio_config.h | 12 +++++++
>>>>>>> 3 files changed, 89 insertions(+), 1 deletions(-)
>>>>>>>
>>>>>> Please run get_maintainer.pl before sending the patch. You'd better
>>>>>> at least cc virtio maintainer/list for this.
>>>>>>
>>>>>> The core aRFS method is a noop in this RFC which make this series no
>>>>>> much sense to discuss. You should at least mention the big picture
>>>>>> here in the cover letter. I suggest you should post a RFC which can
>>>>>> run and has expected result or you can just raise a thread for the
>>>>>> design discussion.
>>>>>>
>>>>>> And this method has been discussed before, you can search "[net-next
>>>>>> RFC PATCH 5/5] virtio-net: flow director support" in netdev archive
>>>>>> for a very old prototype implemented by me. It can work and looks like
>>>>>> most of this RFC have already done there.
>>>>>>
>>>>>> A basic question is whether or not we need this, not all the mq cards
>>>>>> use aRFS (see ixgbe ATR). And whether or not it can bring extra
>>>>>> overheads? For virtio, we want to reduce the vmexits as much as
>>>>>> possible but this aRFS seems introduce a lot of more of this. Making a
>>>>>> complex interfaces just for an virtual device may not be good, simple
>>>>>> method may works for most of the cases.
>>>>>>
>>>>>> We really should consider to offload this to real nic. VMDq and L2
>>>>>> forwarding offload may help in this case.
>>>> Adding flow director support would be a good step, Zhi's patches for
>>>> support in tun have been merged, so support in virtio-net would be a
>>>> good follow on. But, flow-director does have some limitations and
>>>> performance issues of its own (forced pairing between TX and RX
>>>> queues, lookup on every TX packet).
>>> True. But the pairing was designed to work without guest involving since
>>> we really want to reduce the vmexits from guest. And lookup on every TX
>>> packets could be released to every N packets. But I agree exposing the
>>> API to guest may bring lots of flexibility.
>>>> In the case of virtualization,
>>>> aRFS, RSS, ntuple filtering, LRO, etc. can be implemented as software
>>>> emulations and so far seems to be wins in most cases. Extending these
>>>> down into the stack so that they can leverage HW mechanisms is a good
>>>> goal for best performance. It's probably generally true that most of
>>>> the offloads commonly available for NICs we'll want in virtualization
>>>> path. Of course, we need to demonstrate that they provide real
>>>> performance benefit in this use case.
>>> Yes, we need a prototype to see how much it can help.
>>>> I believe tying in aRFS (or flow director) into a real aRFS is just a
>>>> matter of programming the RFS table properly. This is not the complex
>>>> side of the interface, I believe this already works with the tun
>>>> patches.
>>> Right, what we may needs is
>>>
>>> - exposing new tun ioctls for qemu adding or removing a flow
>>> - new virtqueue command for guest driver to adding or removing a flow
>>> (btw, current control virtqueue is really slow, we may need to improve it).
>>> - an agreement of host and guest to use the same hash method, or just
>>> compute software hash in host and pass it to guest (which needs extra
>>> API to do)
>> The model to get RX hash from a device is well known, the guest can
>> use that to reflect information about a flow back to the host, and for
>> performance we might piggyback RX queue selection on the TX
>> descriptors of a flow. Probably some limitations with real HW, but I
>> assume would have less issues in SW.
>
> It may work but may need extending the current virtio-net TX descriptor
> or extra API such as vnet header.
>>
>> IMO, if we have a flow state on the host we should *never* need to
>> perform any hash computation on TX (a host is not a switch :-) ), we
>> may want to have some mirrored flow state in the kernel for these
>> flows which are indexed by the hash provided in TX.
>
> The problem is host may have several different type cards, so the it was
> not guaranteed that they can provide the same rxhash.
Right, that property has to be taken into account with all the uses of
rxhash: we can never compute a hash on the host with the expectation
that it will match a HW rxhash (this was a problem with the original
tun flow mapping code). What is done is to cache the rxhash in the flow
state (e.g. TCP). This is the value that is used to program the RFS
table and can be sent back to the device to match the flow. With a
guest, the HW hash can be propagated all the way to the guest flow
state; aRFS is one way to reflect this back to the kernel and
potentially all the way to the device.
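
The idea of caching the device-provided rxhash and using it as the key of a steering table might be sketched as follows. This is a userspace illustration only: the toy_* names, the table size and the "no CPU" marker are hypothetical, and the table simply overwrites on collision. The point is that the hash is stored and replayed, never recomputed on the host.

```c
/* Userspace sketch of a hash-indexed steering table, NOT kernel code. */
#include <assert.h>
#include <stdint.h>

#define TOY_FLOW_TABLE_SIZE 256		/* hypothetical, must be a power of two */
#define TOY_NO_CPU 0xffff		/* hypothetical "no entry" marker */

struct toy_flow_table {
	uint16_t cpu[TOY_FLOW_TABLE_SIZE];
};

/* Mark every slot as unprogrammed. */
static void toy_flow_table_init(struct toy_flow_table *t)
{
	unsigned int i;

	assert((TOY_FLOW_TABLE_SIZE & (TOY_FLOW_TABLE_SIZE - 1)) == 0);
	for (i = 0; i < TOY_FLOW_TABLE_SIZE; i++)
		t->cpu[i] = TOY_NO_CPU;
}

/*
 * Record where the consumer of a flow runs. rxhash is the value the
 * device reported on receive and that was cached in the flow state.
 */
static void toy_record_flow(struct toy_flow_table *t, uint32_t rxhash,
			    uint16_t cpu)
{
	t->cpu[rxhash & (TOY_FLOW_TABLE_SIZE - 1)] = cpu;
}

/* Look up the CPU for an incoming packet carrying the same rxhash. */
static uint16_t toy_steer(const struct toy_flow_table *t, uint32_t rxhash)
{
	return t->cpu[rxhash & (TOY_FLOW_TABLE_SIZE - 1)];
}
```

Because both the record and the lookup use the hash exactly as the device produced it, the scheme does not depend on different NICs agreeing on a hash function.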
>>
>>> - change guest driver to use aRFS
>>>
>>> Some of the above has been implemented in my old RFC.
>> Looks pretty similar to Zhi's tun work. Are you planning to refresh
>> those patches?
>
> I have the plan. But there's another concern:
>
> During my testing ( and also tested by some IBM engineers in the past),
> we find it's better for a single vhost thread to handle both rx and tx
> for a single flow. Using two different vhost threads to handle a flow
> may damage the performance in most of the cases. That's why we enforce
> the pairing of rx and tx in tun currently. But looks like aRFS can't
> guarantee this. If we want to enforce this paring through XPS/irq
> affinity, there's no need for aRFS.
>>
>>>>> Zhi Yong and I had an IRC chat. I wanted to post my questions on the
>>>>> list - it's still the same concern I had in the old email thread that
>>>>> Jason mentioned.
>>>>>
>>>>> In order for virtio-net aRFS to make sense there needs to be an overall
>>>>> plan for pushing flow mapping information down to the physical NIC.
>>>>> That's the only way to actually achieve the benefit of steering:
>>>>> processing the packet on the CPU where the application is running.
>>>>>
>>>> I don't think this is necessarily true. Per flow steering amongst
>>>> virtual queues should be beneficial in itself. virtio-net can leverage
>>>> RFS or aRFS where it's available.
>>>>
>>>>> If it's not possible or too hard to implement aRFS down the entire
>>>>> stack, we won't be able to process the packet on the right CPU.
>>>>> Then we might as well not bother with aRFS and just distribute uniformly
>>>>> across the rx virtqueues.
>>>>>
>>>>> Please post an outline of how rx packets will be steered up the stack so
>>>>> we can discuss whether aRFS can bring any benefit.
>>>>>
>>>> 1. The aRFS interface for the guest to specify which virtual queue to
>>>> receive a packet on is fairly straight forward.
>>>> 2. To hook into RFS, we need to match the virtual queue to the real
>>>> CPU it will processed on, and then program the RFS table for that flow
>>>> and CPU.
>>>> 3. NIC aRFS keys off the RFS tables so it can program the HW with the
>>>> correct queue for the CPU.
>>>>
>>>>> Stefan
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* [PATCH net-next v3] net: introduce SO_BPF_EXTENSIONS
From: Michal Sekletar @ 2014-01-17 16:09 UTC (permalink / raw)
To: netdev; +Cc: Michal Sekletar, David Miller
For user space packet capturing libraries such as libpcap, there's
currently only one way to check which BPF extensions are supported
by the kernel, that is, commit aa1113d9f85d ("net: filter: return
-EINVAL if BPF_S_ANC* operation is not supported"). For querying all
extensions at once this might be rather inconvenient.
Therefore, this patch introduces a new option which can be used as
an argument for getsockopt(), and allows one to obtain information
about which BPF extensions are supported by the current kernel.
As David Miller suggests, we do not need to define any bits right
now and the status quo can just return 0 in order to state that this
version supports SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET. Later
additions to BPF extensions need to add their bits to the
bpf_tell_extensions() function, as documented in the comment.
Signed-off-by: Michal Sekletar <msekleta@redhat.com>
Cc: David Miller <davem@davemloft.net>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
---
Dave, this will create a minor merge conflict with the net
tree when net-next is merged into Linus' tree. The parisc
socket.h currently has 0x4048 for SO_MAX_PACING_RATE; the merge
should take the newly submitted 0x4028.
Thanks!
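
From user space, the new option could be queried roughly as sketched below. The helper name query_bpf_extensions is hypothetical, and the fallback value 48 is an assumption for builds whose headers do not yet define SO_BPF_EXTENSIONS: it is the number this patch assigns on most architectures, while parisc and sparc use different values.

```c
/* Sketch: ask the kernel which BPF extensions it supports. */
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_BPF_EXTENSIONS
/* Assumed fallback; parisc (0x4029) and sparc (0x0032) differ. */
#define SO_BPF_EXTENSIONS 48
#endif

/*
 * Store the kernel's answer in *ext and return 0, or return a negative
 * errno value (e.g. -ENOPROTOOPT on kernels without this option).
 */
static int query_bpf_extensions(int *ext)
{
	int fd, val = 0, err = 0;
	socklen_t len = sizeof(val);

	assert(ext != NULL);

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -errno;

	if (getsockopt(fd, SOL_SOCKET, SO_BPF_EXTENSIONS, &val, &len) < 0)
		err = -errno;	/* saved before close() can overwrite errno */
	else
		*ext = val;

	close(fd);
	return err;
}
```

A capture library would call this once per process and fall back to probing individual extensions when the call fails, since an error here only says that the option itself is unknown to the running kernel.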
arch/alpha/include/uapi/asm/socket.h | 2 ++
arch/avr32/include/uapi/asm/socket.h | 2 ++
arch/cris/include/uapi/asm/socket.h | 2 ++
arch/frv/include/uapi/asm/socket.h | 2 ++
arch/ia64/include/uapi/asm/socket.h | 2 ++
arch/m32r/include/uapi/asm/socket.h | 2 ++
arch/mips/include/uapi/asm/socket.h | 2 ++
arch/mn10300/include/uapi/asm/socket.h | 2 ++
arch/parisc/include/uapi/asm/socket.h | 2 ++
arch/powerpc/include/uapi/asm/socket.h | 2 ++
arch/s390/include/uapi/asm/socket.h | 2 ++
arch/sparc/include/uapi/asm/socket.h | 2 ++
arch/xtensa/include/uapi/asm/socket.h | 2 ++
include/linux/filter.h | 11 +++++++++++
include/uapi/asm-generic/socket.h | 2 ++
net/core/sock.c | 4 ++++
16 files changed, 43 insertions(+)
diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index e3a1491..3de1394 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -85,4 +85,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index cbf902e..6e6cd15 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -78,4 +78,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
index 13829aa..ed94e5e 100644
--- a/arch/cris/include/uapi/asm/socket.h
+++ b/arch/cris/include/uapi/asm/socket.h
@@ -80,6 +80,8 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 5d42997..ca2c6e6 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -78,5 +78,7 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index c25302f..a1b49ba 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 5296665..6c9a24b 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -78,4 +78,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 0df9787..a14baa2 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -96,4 +96,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index 71dedca..6aa3ce1 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -78,4 +78,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index f33113a..a586f61 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -77,4 +77,6 @@
#define SO_MAX_PACING_RATE 0x4048
+#define SO_BPF_EXTENSIONS 0x4029
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index fa69832..a9c3e2e 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -85,4 +85,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index c286c2e..e031332 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -84,4 +84,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 0f21e9a..54d9608 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -74,6 +74,8 @@
#define SO_MAX_PACING_RATE 0x0031
+#define SO_BPF_EXTENSIONS 0x0032
+
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
#define SO_SECURITY_ENCRYPTION_TRANSPORT 0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 7db5c22..39acec0 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -89,4 +89,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* _XTENSA_SOCKET_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ff4e40c..1a95a2d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -83,6 +83,17 @@ static inline void bpf_jit_free(struct sk_filter *fp)
#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
#endif
+static inline int bpf_tell_extensions(void)
+{
+ /* When adding new BPF extension it is necessary to enumerate
+ * it here, so userspace software which wants to know what is
+ * supported can do so by inspecting return value of this
+ * function
+ */
+
+ return 0;
+}
+
enum {
BPF_S_RET_K = 1,
BPF_S_RET_A,
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 38f14d0..ea0796b 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -80,4 +80,6 @@
#define SO_MAX_PACING_RATE 47
+#define SO_BPF_EXTENSIONS 48
+
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index b3f7ee3..0c127dc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1167,6 +1167,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
v.val = sock_flag(sk, SOCK_FILTER_LOCKED);
break;
+ case SO_BPF_EXTENSIONS:
+ v.val = bpf_tell_extensions();
+ break;
+
case SO_SELECT_ERR_QUEUE:
v.val = sock_flag(sk, SOCK_SELECT_ERR_QUEUE);
break;
--
1.8.4.2
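
For readers who want to try the new option from userspace, a minimal sketch follows. The helper name query_bpf_extensions() is hypothetical, and the fallback value 48 is only the number this patch assigns on most architectures (parisc and sparc differ), so the definition from the system headers should win whenever it exists.

```c
#include <sys/socket.h>

/* Fallback for older userspace headers. 48 is the value this patch uses
 * on most architectures; parisc and sparc use different numbers, so the
 * definition from the system headers is preferred when it exists. */
#ifndef SO_BPF_EXTENSIONS
#define SO_BPF_EXTENSIONS 48
#endif

/* Hypothetical helper: stores the kernel's answer in *ext and returns 0,
 * or returns -1 with errno set by getsockopt() if the query fails
 * (for example on a kernel that does not know the option). */
int query_bpf_extensions(int fd, int *ext)
{
	socklen_t len = sizeof(*ext);

	return getsockopt(fd, SOL_SOCKET, SO_BPF_EXTENSIONS, ext, &len);
}
```

With this patch alone the returned value would be 0, since bpf_tell_extensions() does not enumerate any extension yet.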
* [PATCH V4 net-next 2/3] ipv6: add a flag to get the flow label used remotely
From: Florent Fourcot @ 2014-01-17 16:15 UTC (permalink / raw)
To: netdev; +Cc: Florent Fourcot
In-Reply-To: <1389975305-12477-1-git-send-email-florent.fourcot@enst-bretagne.fr>
This information is already available via IPV6_FLOWINFO of
IPV6_2292PKTOPTIONS, followed by filtering to extract the flow label.
It is nevertheless probably more logical and easier for users to expose
it here, so that both sent and received flow label values can be
controlled with the IPV6_FLOWLABEL_MGR option.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
---
include/net/ipv6.h | 3 ++-
include/uapi/linux/in6.h | 1 +
net/ipv6/ip6_flowlabel.c | 8 +++++++-
net/ipv6/ipv6_sockglue.c | 5 ++++-
4 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6d80f51..78d3d51 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -253,7 +253,8 @@ struct ipv6_txoptions *fl6_merge_options(struct ipv6_txoptions *opt_space,
struct ipv6_txoptions *fopt);
void fl6_free_socklist(struct sock *sk);
int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen);
-int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq);
+int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq,
+ int flags);
int ip6_flowlabel_init(void);
void ip6_flowlabel_cleanup(void);
diff --git a/include/uapi/linux/in6.h b/include/uapi/linux/in6.h
index 02c0cd6..633b93c 100644
--- a/include/uapi/linux/in6.h
+++ b/include/uapi/linux/in6.h
@@ -86,6 +86,7 @@ struct in6_flowlabel_req {
#define IPV6_FL_F_CREATE 1
#define IPV6_FL_F_EXCL 2
#define IPV6_FL_F_REFLECT 4
+#define IPV6_FL_F_REMOTE 8
#define IPV6_FL_S_NONE 0
#define IPV6_FL_S_EXCL 1
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index 55823f1..01bf252 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -481,11 +481,17 @@ static inline void fl_link(struct ipv6_pinfo *np, struct ipv6_fl_socklist *sfl,
spin_unlock_bh(&ip6_sk_fl_lock);
}
-int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq)
+int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq,
+ int flags)
{
struct ipv6_pinfo *np = inet6_sk(sk);
struct ipv6_fl_socklist *sfl;
+ if (flags & IPV6_FL_F_REMOTE) {
+ freq->flr_label = np->rcv_flowinfo & IPV6_FLOWLABEL_MASK;
+ return 0;
+ }
+
if (np->repflow) {
freq->flr_label = np->flow_label;
return 0;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 2855b00..7024a87 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -1221,6 +1221,7 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
case IPV6_FLOWLABEL_MGR:
{
struct in6_flowlabel_req freq;
+ int flags;
if (len < sizeof(freq))
return -EINVAL;
@@ -1232,9 +1233,11 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
return -EINVAL;
len = sizeof(freq);
+ flags = freq.flr_flags;
+
memset(&freq, 0, sizeof(freq));
- val = ipv6_flowlabel_opt_get(sk, &freq);
+ val = ipv6_flowlabel_opt_get(sk, &freq, flags);
if (val < 0)
return val;
--
1.8.5.2
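
The masking step in ipv6_flowlabel_opt_get() can be illustrated in userspace with the sketch below. EX_FLOWLABEL_MASK and ex_flow_label() are hypothetical names; the mask is assumed to match the kernel's IPV6_FLOWLABEL_MASK (the low 20 bits of the first header word, in network byte order). Unlike the patch, which leaves flr_label in network order, the sketch converts the result to host order for readability.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Low 20 bits of the first IPv6 header word, in network byte order.
 * Assumed to mirror the kernel's IPV6_FLOWLABEL_MASK; redefined under a
 * hypothetical name so the sketch builds in userspace. */
#define EX_FLOWLABEL_MASK htonl(0x000FFFFFu)

/* flowinfo_be: first 32 bits of the IPv6 header as seen on the wire
 * (version, traffic class, flow label).
 * Returns the 20-bit flow label in host byte order. */
uint32_t ex_flow_label(uint32_t flowinfo_be)
{
	return ntohl(flowinfo_be & EX_FLOWLABEL_MASK);
}
```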
* [PATCH V4 net-next 1/3] ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET
From: Florent Fourcot @ 2014-01-17 16:15 UTC (permalink / raw)
To: netdev; +Cc: Florent Fourcot
With this option, the socket will reply with the flow label value read
on received packets.
The goal is to have a connection with the same flow label in both
directions of the communication.
Changelog of V4:
* Do not erase the flow label on the listening socket. Use pktopts to
store the received value
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
---
include/linux/ipv6.h | 1 +
include/uapi/linux/in6.h | 1 +
net/ipv6/ip6_flowlabel.c | 21 +++++++++++++++++++++
net/ipv6/tcp_ipv6.c | 12 +++++++++++-
4 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 7e1ded0..1084304 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -191,6 +191,7 @@ struct ipv6_pinfo {
/* sockopt flags */
__u16 recverr:1,
sndflow:1,
+ repflow:1,
pmtudisc:3,
ipv6only:1,
srcprefs:3, /* 001: prefer temporary address
diff --git a/include/uapi/linux/in6.h b/include/uapi/linux/in6.h
index f94f1d0..02c0cd6 100644
--- a/include/uapi/linux/in6.h
+++ b/include/uapi/linux/in6.h
@@ -85,6 +85,7 @@ struct in6_flowlabel_req {
#define IPV6_FL_F_CREATE 1
#define IPV6_FL_F_EXCL 2
+#define IPV6_FL_F_REFLECT 4
#define IPV6_FL_S_NONE 0
#define IPV6_FL_S_EXCL 1
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index cbc9351..55823f1 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -486,6 +486,11 @@ int ipv6_flowlabel_opt_get(struct sock *sk, struct in6_flowlabel_req *freq)
struct ipv6_pinfo *np = inet6_sk(sk);
struct ipv6_fl_socklist *sfl;
+ if (np->repflow) {
+ freq->flr_label = np->flow_label;
+ return 0;
+ }
+
rcu_read_lock_bh();
for_each_sk_fl_rcu(np, sfl) {
@@ -527,6 +532,15 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen)
switch (freq.flr_action) {
case IPV6_FL_A_PUT:
+ if (freq.flr_flags & IPV6_FL_F_REFLECT) {
+ if (sk->sk_protocol != IPPROTO_TCP)
+ return -ENOPROTOOPT;
+ if (!np->repflow)
+ return -ESRCH;
+ np->flow_label = 0;
+ np->repflow = 0;
+ return 0;
+ }
spin_lock_bh(&ip6_sk_fl_lock);
for (sflp = &np->ipv6_fl_list;
(sfl = rcu_dereference(*sflp))!=NULL;
@@ -567,6 +581,13 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen)
return -ESRCH;
case IPV6_FL_A_GET:
+ if (freq.flr_flags & IPV6_FL_F_REFLECT) {
+ if (sk->sk_protocol != IPPROTO_TCP)
+ return -ENOPROTOOPT;
+ np->repflow = 1;
+ return 0;
+ }
+
if (freq.flr_label & ~IPV6_FLOWLABEL_MASK)
return -EINVAL;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index ffd5fa8..47d71ff 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -483,6 +483,9 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
&ireq->ir_v6_rmt_addr);
fl6->daddr = ireq->ir_v6_rmt_addr;
+ if (np->repflow && (ireq->pktopts != NULL))
+ fl6->flowlabel = ip6_flowlabel(ipv6_hdr(ireq->pktopts));
+
skb_set_queue_mapping(skb, queue_mapping);
err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
err = net_xmit_eval(err);
@@ -1013,7 +1016,8 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!isn) {
if (ipv6_opt_accepted(sk, skb) ||
np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim) {
+ np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
+ np->repflow) {
atomic_inc(&skb->users);
ireq->pktopts = skb;
}
@@ -1138,6 +1142,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
newnp->mcast_oif = inet6_iif(skb);
newnp->mcast_hops = ipv6_hdr(skb)->hop_limit;
newnp->rcv_flowinfo = ip6_flowinfo(ipv6_hdr(skb));
+ if (np->repflow)
+ newnp->flow_label = ip6_flowlabel(ipv6_hdr(skb));
/*
* No need to charge this sock to the relevant IPv6 refcnt debug socks count
@@ -1218,6 +1224,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
newnp->mcast_oif = inet6_iif(skb);
newnp->mcast_hops = ipv6_hdr(skb)->hop_limit;
newnp->rcv_flowinfo = ip6_flowinfo(ipv6_hdr(skb));
+ if (np->repflow)
+ newnp->flow_label = ip6_flowlabel(ipv6_hdr(skb));
/* Clone native IPv6 options from listening socket (if any)
@@ -1429,6 +1437,8 @@ ipv6_pktoptions:
np->mcast_hops = ipv6_hdr(opt_skb)->hop_limit;
if (np->rxopt.bits.rxflow || np->rxopt.bits.rxtclass)
np->rcv_flowinfo = ip6_flowinfo(ipv6_hdr(opt_skb));
+ if (np->repflow)
+ np->flow_label = ip6_flowlabel(ipv6_hdr(opt_skb));
if (ipv6_opt_accepted(sk, opt_skb)) {
skb_set_owner_r(opt_skb, sk);
opt_skb = xchg(&np->pktoptions, opt_skb);
--
1.8.5.2
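
The enable/disable behaviour this patch adds can be modelled outside the kernel as a small state machine. This is only a sketch: struct ex_sock and the ex_* functions are hypothetical stand-ins for the ipv6_pinfo fields and the IPV6_FL_A_GET / IPV6_FL_A_PUT branches in the diff above, and the single receive hook simplifies the several places in tcp_ipv6.c where the label is copied.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the socket state the patch touches. */
struct ex_sock {
	bool is_tcp;          /* sk->sk_protocol == IPPROTO_TCP */
	bool repflow;         /* reflection enabled */
	uint32_t flow_label;  /* last label copied from a received packet */
};

/* IPV6_FL_A_GET with IPV6_FL_F_REFLECT: turn reflection on. */
int ex_reflect_get(struct ex_sock *sk)
{
	if (!sk->is_tcp)
		return -ENOPROTOOPT;
	sk->repflow = true;
	return 0;
}

/* IPV6_FL_A_PUT with IPV6_FL_F_REFLECT: turn reflection off. */
int ex_reflect_put(struct ex_sock *sk)
{
	if (!sk->is_tcp)
		return -ENOPROTOOPT;
	if (!sk->repflow)
		return -ESRCH;
	sk->flow_label = 0;
	sk->repflow = false;
	return 0;
}

/* Receive path: remember the peer's label only while reflecting. */
void ex_on_receive(struct ex_sock *sk, uint32_t peer_label)
{
	if (sk->repflow)
		sk->flow_label = peer_label;
}
```

The model makes the error cases easy to see: non-TCP sockets get -ENOPROTOOPT, and a PUT without a prior GET returns -ESRCH.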
* [PATCH v4 net-next 3/3] ipv6: add flowlabel_consistency sysctl
From: Florent Fourcot @ 2014-01-17 16:15 UTC (permalink / raw)
To: netdev; +Cc: Florent Fourcot
In-Reply-To: <1389975305-12477-1-git-send-email-florent.fourcot@enst-bretagne.fr>
With the introduction of IPV6_FL_F_REFLECT, there is no guarantee of
flow label uniqueness. This patch introduces a new sysctl to protect the
old behaviour, enabled by default.
Changelog of V3:
* rename ip6_flowlabel_consistency to flowlabel_consistency
* use net_info_ratelimited()
* checkpatch cleanups
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
---
Documentation/networking/ip-sysctl.txt | 8 ++++++++
include/net/netns/ipv6.h | 1 +
net/ipv6/af_inet6.c | 1 +
net/ipv6/ip6_flowlabel.c | 7 +++++++
net/ipv6/sysctl_net_ipv6.c | 8 ++++++++
5 files changed, 25 insertions(+)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c97932c..5de0374 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1118,6 +1118,14 @@ bindv6only - BOOLEAN
Default: FALSE (as specified in RFC3493)
+flowlabel_consistency - BOOLEAN
+ Protect the consistency (and unicity) of flow label.
+ You have to disable it to use IPV6_FL_F_REFLECT flag on the
+ flow label manager.
+ TRUE: enabled
+ FALSE: disabled
+ Default: TRUE
+
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
echo reply
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 592fecd..21edaf1 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -27,6 +27,7 @@ struct netns_sysctl_ipv6 {
int ip6_rt_gc_elasticity;
int ip6_rt_mtu_expires;
int ip6_rt_min_advmss;
+ int flowlabel_consistency;
int icmpv6_time;
int anycast_src_echo_reply;
};
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index c921d5d..d935889 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -775,6 +775,7 @@ static int __net_init inet6_net_init(struct net *net)
net->ipv6.sysctl.bindv6only = 0;
net->ipv6.sysctl.icmpv6_time = 1*HZ;
+ net->ipv6.sysctl.flowlabel_consistency = 1;
atomic_set(&net->ipv6.rt_genid, 0);
err = ipv6_init_mibs(net);
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index 01bf252..dfa41bb 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -588,8 +588,15 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen)
case IPV6_FL_A_GET:
if (freq.flr_flags & IPV6_FL_F_REFLECT) {
+ struct net *net = sock_net(sk);
+ if (net->ipv6.sysctl.flowlabel_consistency) {
+ net_info_ratelimited("Can not set IPV6_FL_F_REFLECT if flowlabel_consistency sysctl is enable\n");
+ return -EPERM;
+ }
+
if (sk->sk_protocol != IPPROTO_TCP)
return -ENOPROTOOPT;
+
np->repflow = 1;
return 0;
}
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index b51b268..7f405a1 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -31,6 +31,13 @@ static struct ctl_table ipv6_table_template[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+ {
+ .procname = "flowlabel_consistency",
+ .data = &init_net.ipv6.sysctl.flowlabel_consistency,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
{ }
};
@@ -59,6 +66,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
goto out;
ipv6_table[0].data = &net->ipv6.sysctl.bindv6only;
ipv6_table[1].data = &net->ipv6.sysctl.anycast_src_echo_reply;
+ ipv6_table[2].data = &net->ipv6.sysctl.flowlabel_consistency;
ipv6_route_table = ipv6_route_sysctl_init(net);
if (!ipv6_route_table)
--
1.8.5.2
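
An application could check the sysctl before asking for IPV6_FL_F_REFLECT, rather than discovering -EPERM afterwards. A minimal sketch, assuming the knob appears under the usual net/ipv6 directory in procfs; ex_read_int_file() and the path macro are hypothetical names, not part of this patch.

```c
#include <stdio.h>

/* Path assumed from the procname in this patch and the usual net/ipv6
 * sysctl directory. */
#define EX_FLOWLABEL_CONSISTENCY_PATH \
	"/proc/sys/net/ipv6/flowlabel_consistency"

/* Hypothetical helper: reads one integer from a procfs-style file.
 * Returns 0 and stores the value in *out, or -1 if the file cannot be
 * opened or does not start with an integer. */
int ex_read_int_file(const char *path, int *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass EX_FLOWLABEL_CONSISTENCY_PATH and treat a nonzero value as "reflection is refused"; a return of -1 most likely means the kernel predates this sysctl.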
* Re: [PATCH RFC 4/6] net: rfkill: gpio: add device tree support
From: Arnd Bergmann @ 2014-01-17 16:47 UTC (permalink / raw)
To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
Cc: Chen-Yu Tsai, Johannes Berg, David S. Miller,
devicetree-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
linux-sunxi-/JYPxA39Uh5TLH3MbocFFw,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Maxime Ripard
In-Reply-To: <1389941251-32692-5-git-send-email-wens-jdAy2FN1RRM@public.gmane.org>
On Friday 17 January 2014, Chen-Yu Tsai wrote:
> diff --git a/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt b/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt
> new file mode 100644
> index 0000000..8a07ea4
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/rfkill/rfkill-gpio.txt
> @@ -0,0 +1,26 @@
> +GPIO controlled RFKILL devices
> +
> +Required properties:
> +- compatible : Must be "rfkill-gpio".
> +- rfkill-name : Name of RFKILL device
> +- rfkill-type : Type of RFKILL device: 1 for WiFi, 2 for BlueTooth
> +- NAME_shutdown-gpios : GPIO phandle to shutdown control
> + (phandle must be the second)
> +- NAME_reset-gpios : GPIO phandle to reset control
> +
> +NAME must match the rfkill-name property. NAME_shutdown-gpios or
> +NAME_reset-gpios, or both, must be defined.
> +
I don't understand this part. Why do you include the name in the
gpios property, rather than just hardcoding the property strings
to "shutdown-gpios" and "reset-gpios"?
The description of the "rfkill-name" property seems to suggest
that you can only have one logical RFKILL device per device node,
so the names would not be ambiguous.
Arnd
* Re: [RFC PATCH net-next 3/3] virtio-net: Add accelerated RFS support
From: Zhi Yong Wu @ 2014-01-17 16:54 UTC (permalink / raw)
To: Ben Hutchings, Stefan Hajnoczi
Cc: Linux Netdev List, Tom Herbert, Eric Dumazet, David S. Miller,
Zhi Yong Wu
In-Reply-To: <1389914188.11912.146.camel@bwh-desktop.uk.level5networks.com>
On Fri, Jan 17, 2014 at 7:16 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Fri, 2014-01-17 at 06:00 +0800, Zhi Yong Wu wrote:
>> On Fri, Jan 17, 2014 at 5:31 AM, Ben Hutchings
>> <bhutchings@solarflare.com> wrote:
>> > On Wed, 2014-01-15 at 22:20 +0800, Zhi Yong Wu wrote:
>> > [...]
>> >> +static int virtnet_init_rx_cpu_rmap(struct virtnet_info *vi)
>> >> +{
>> >> + int rc = 0;
>> >> +
>> >> +#ifdef CONFIG_RFS_ACCEL
>> >> + struct virtio_device *vdev = vi->vdev;
>> >> + unsigned int irq;
>> >> + int i;
>> >> +
>> >> + if (!vi->affinity_hint_set)
>> >> + goto out;
>> >> +
>> >> + vi->dev->rx_cpu_rmap = alloc_irq_cpu_rmap(vi->max_queue_pairs);
>> >> + if (!vi->dev->rx_cpu_rmap) {
>> >> + rc = -ENOMEM;
>> >> + goto out;
>> >> + }
>> >> +
>> >> + for (i = 0; i < vi->max_queue_pairs; i++) {
>> >> + irq = virtqueue_get_vq_irq(vdev, vi->rq[i].vq);
>> >> + if (irq == -1)
>> >> + goto failed;
>> >
>> > Jumping into an if-statement is confusing. Also do you really want to
>> > return 0 in this case?
>> No, If it fail to get irq, i want it to exit as soon as possible,
>> otherwise it will cause irq_cpu_rmap_add() to be invoked with one
>> incorrect argument irq.
>
> Well currently this goto does result in returning 0, as rc has not been
> changed after its initialisation to 0.
Yes, this goto will result in returning 0. I am not sure whether that is
appropriate, but the current code definitely has a bug here. I
thought that when the NIC does not have MSI-X enabled and
CONFIG_RFS_ACCEL is defined, its aRFS cpu_rmap table would not be
allocated here and aRFS would be ignored automatically, but this will
cause set_rps_cpu() to hit a memory issue.
>
>> By the way, do you have thought about if it makes sense to add aRFS
>> support to virtio_net? For [patch 2/3], what do you think of those
>> missing stuff listed by me?
>> For how indirect table is implemented in sfc NIC, do you have any doc
>> to share with me? thanks.
>
> Going through that list:
>
>> 1.) guest virtio_net driver should have one filter table and its
>> entries can be expired periodically;
>
> In sfc, we keep a count how many entries have been inserted in each NAPI
> context. Whenever the NAPI poll function is about to call
> napi_complete() and the count for that context has reached a trigger
> level, it will scan some quota of filter entries for expiry.
Thanks for sharing.
>
>> 2.) guest virtio_net driver should pass rx queue index and filter
>> info down to the emulated virtio_net NIC in QEMU.
>> 3.) the emulated virtio_net NIC should have its indirect table to
>> store the flow to rx queue mapping.
>> 4.) the emulated virtio_net NIC should classify the rx packet to
>> selected queue by applying the filter.
>
> I think the most efficient way to do this would be to put a hash table
> in some shared memory that both guest and host can read and write. The
> virtio control path would only be used to set up and tear down the
> table. I don't know whether virtio allows for that.
Yes, I agree with you, and I don't know what Stefan thinks of it. CCing
stefanha, who is a virtio expert. Stefan, can you give some advice about
this?
>
> However, to take advantage of ARFS on a physical net driver, it would be
> necessary to send a control request for part 2.
aRFS on a physical net driver? What is this physical net driver? I
thought that in order to enable aRFS, the guest virtio_net driver should
send a control request to its emulated virtio_net NIC.
>
>> 5.) update virtio spec.
>> Do i miss anything? If yes, please correct me.
>> For 3.) and 4.), do you have any doc about how they are implemented in
>> physical NICs? e.g. mlx4_en or sfc, etc.
>
> The Programmer's Reference Manuals for Solarflare controllers are only
> available under NDA. I can describe the hardware filtering briefly, but
> actually I don't think it's very relevant to virtio_net.
>
> There is a typical RSS hash indirection table (128 entries), but for
> ARFS we use a different RX filter table which has 8K entries
> (RX_FILTER_TBL0 on SFC4000/SFC9000 family).
>
> Solarflare controllers support user-level networking, which requires
> perfect filtering to deliver each application's flows into that
> application's dedicated RX queue(s). Lookups in this larger filter
> table are still hash-based, but each entry specifies a TCP/IP or UDP/IP
> 4-tuple or local 2-tuple to match. ARFS uses the 4-tuple type only.
>
> To allow for hash collisions, a secondary hash function generates an
> increment to be added to the initial table index repeatedly for hash
> chaining. There is a control register which tells the controller the
> maximum hash chain length to search for each IP filter type; after this
> it will fall back to checking MAC filters and then default filters.
>
> On the SFC9100 family, filter updates and lookups are implemented by
> firmware and the driver doesn't manage the filter table itself, but I
> know it is still a hash table of perfect filters.
>
> For ARFS, perfect filtering is not needed. I think it would be
> preferable to use a fairly big hash table and make insertion fail in
> case of a collision. Since the backend for virtio_net will do RX queue
> selection in software, the entire table of queue indices should fit into
> its L1 cache.
This is very detailed, thanks a lot for sharing. It is very helpful to
me. I will come back and read it carefully once we reach agreement on
whether aRFS should be integrated into virtio_net and what the big
picture is.
>
> Ben.
>
> --
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
--
Regards,
Zhi Yong Wu
* [PATCH v2 net-next 01/12] bonding: remove bond->lock from bond_arp_rcv
From: Veaceslav Falico @ 2014-01-17 16:58 UTC (permalink / raw)
To: netdev; +Cc: Veaceslav Falico, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389977940-17084-1-git-send-email-vfalico@redhat.com>
We're always called with rcu_read_lock() held (bond_arp_rcv() is only
called from bond_handle_frame(), which is rx_handler and always called
under rcu from __netif_receive_skb_core() ).
The slave active/passive and/or bonding params can change in-flight, however
we don't really care about that - we only modify the last time packet was
received, which is harmless.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---
drivers/net/bonding/bond_main.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f00dd45..479ca56 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2292,8 +2292,6 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
if (skb->protocol != __cpu_to_be16(ETH_P_ARP))
return RX_HANDLER_ANOTHER;
- read_lock(&bond->lock);
-
if (!slave_do_arp_validate(bond, slave))
goto out_unlock;
@@ -2350,7 +2348,6 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
bond_validate_arp(bond, slave, tip, sip);
out_unlock:
- read_unlock(&bond->lock);
if (arp != (struct arphdr *)skb->data)
kfree(arp);
return RX_HANDLER_ANOTHER;
--
1.8.4
* [PATCH v2 net-next 02/12] bonding: permit using arp_validate with non-ab modes
From: Veaceslav Falico @ 2014-01-17 16:58 UTC (permalink / raw)
To: netdev; +Cc: Veaceslav Falico, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389977940-17084-1-git-send-email-vfalico@redhat.com>
Currently it's disabled because it's sometimes hard, in typical configs, to
make it work - because of the nature of how the loadbalance modes work - as
it's hard for the switch to deliver valid arp replies to the correct slaves.
However we still can use arp_validation in loadbalance in several other
configs, for example with arp_validate == 2 for backup with one broadcast
domain, without the switch(es) doing any balancing - this way we'd be (a
bit more) sure that the slave is up.
So, enable it to let users decide which one works/suits them best.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---
Documentation/networking/bonding.txt | 6 +++---
drivers/net/bonding/bond_main.c | 4 ----
drivers/net/bonding/bond_options.c | 6 ------
3 files changed, 3 insertions(+), 13 deletions(-)
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index a4d925e..3620690 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -270,9 +270,9 @@ arp_ip_target
arp_validate
Specifies whether or not ARP probes and replies should be
- validated in the active-backup mode. This causes the ARP
- monitor to examine the incoming ARP requests and replies, and
- only consider a slave to be up if it is receiving the
+ validated in any mode that supports arp monitoring. This causes
+ the ARP monitor to examine the incoming ARP requests and replies,
+ and only consider a slave to be up if it is receiving the
appropriate ARP traffic.
Possible values are:
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 479ca56..cc30618 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4198,10 +4198,6 @@ static int bond_check_params(struct bond_params *params)
}
if (arp_validate) {
- if (bond_mode != BOND_MODE_ACTIVEBACKUP) {
- pr_err("arp_validate only supported in active-backup mode\n");
- return -EINVAL;
- }
if (!arp_interval) {
pr_err("arp_validate requires arp_interval\n");
return -EINVAL;
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 945a666..f1c4fba 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -436,12 +436,6 @@ int bond_option_arp_validate_set(struct bonding *bond, int arp_validate)
return -EINVAL;
}
- if (bond->params.mode != BOND_MODE_ACTIVEBACKUP) {
- pr_err("%s: arp_validate only supported in active-backup mode.\n",
- bond->dev->name);
- return -EINVAL;
- }
-
pr_info("%s: setting arp_validate to %s (%d).\n",
bond->dev->name, arp_validate_tbl[arp_validate].modename,
arp_validate);
--
1.8.4
* [PATCH v2 net-next 03/12] bonding: always update last_arp_rx on packet receive
From: Veaceslav Falico @ 2014-01-17 16:58 UTC (permalink / raw)
To: netdev; +Cc: Veaceslav Falico, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389977940-17084-1-git-send-email-vfalico@redhat.com>
Currently we're updating the last_arp_rx only when we've validated the
packet, however afterwards we use it as 'ANY last packet received', not
only validated ARPs.
Fix this by updating it in case of any packet received. It won't break the
arp_validation=0 because we, anyway, return the correct slave->dev->last_rx in
slave_last_rx().
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---
drivers/net/bonding/bond_main.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index cc30618..909f164 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2289,6 +2289,8 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
__be32 sip, tip;
int alen;
+ slave->last_arp_rx = jiffies;
+
if (skb->protocol != __cpu_to_be16(ETH_P_ARP))
return RX_HANDLER_ANOTHER;
--
1.8.4
* [PATCH v2 net-next 04/12] bonding: always set recv_probe to bond_arp_rcv in arp monitor
From: Veaceslav Falico @ 2014-01-17 16:58 UTC (permalink / raw)
To: netdev; +Cc: Veaceslav Falico, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389977940-17084-1-git-send-email-vfalico@redhat.com>
Currently we only set bond_arp_rcv() if we're using arp_validate, however
this makes us skip updating last_arp_rx if we're not validating incoming
ARPs - thus, if arp_validate is off, last_arp_rx will never be updated.
Fix this by always setting up recv_probe = bond_arp_rcv, even if we're not
using arp_validate.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---
drivers/net/bonding/bond_main.c | 3 +--
drivers/net/bonding/bond_options.c | 12 ++----------
2 files changed, 3 insertions(+), 12 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 909f164..07ae82d 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3077,8 +3077,7 @@ static int bond_open(struct net_device *bond_dev)
if (bond->params.arp_interval) { /* arp interval, in milliseconds. */
queue_delayed_work(bond->wq, &bond->arp_work, 0);
- if (bond->params.arp_validate)
- bond->recv_probe = bond_arp_rcv;
+ bond->recv_probe = bond_arp_rcv;
}
if (bond->params.mode == BOND_MODE_8023AD) {
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index f1c4fba..9d6d231 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -283,13 +283,11 @@ int bond_option_arp_interval_set(struct bonding *bond, int arp_interval)
* is called.
*/
if (!arp_interval) {
- if (bond->params.arp_validate)
- bond->recv_probe = NULL;
+ bond->recv_probe = NULL;
cancel_delayed_work_sync(&bond->arp_work);
} else {
/* arp_validate can be set only in active-backup mode */
- if (bond->params.arp_validate)
- bond->recv_probe = bond_arp_rcv;
+ bond->recv_probe = bond_arp_rcv;
cancel_delayed_work_sync(&bond->mii_work);
queue_delayed_work(bond->wq, &bond->arp_work, 0);
}
@@ -440,12 +438,6 @@ int bond_option_arp_validate_set(struct bonding *bond, int arp_validate)
bond->dev->name, arp_validate_tbl[arp_validate].modename,
arp_validate);
- if (bond->dev->flags & IFF_UP) {
- if (!arp_validate)
- bond->recv_probe = NULL;
- else if (bond->params.arp_interval)
- bond->recv_probe = bond_arp_rcv;
- }
bond->params.arp_validate = arp_validate;
return 0;
--
1.8.4
* [PATCH v2 net-next 05/12] bonding: extend arp_validate to be able to receive unvalidated arp-only traffic
From: Veaceslav Falico @ 2014-01-17 16:58 UTC (permalink / raw)
To: netdev; +Cc: Veaceslav Falico, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389977940-17084-1-git-send-email-vfalico@redhat.com>
Currently we can either receive any traffic as a proof of slave being up,
or only *validated* arp traffic (i.e. with src/dst ip checked).
Add an option to be able to specify if we want to receive non-validated arp
traffic only.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---
drivers/net/bonding/bond_main.c | 3 +++
drivers/net/bonding/bonding.h | 11 +++++++++++
2 files changed, 14 insertions(+)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 07ae82d..532a452 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -246,6 +246,9 @@ const struct bond_parm_tbl arp_validate_tbl[] = {
{ "active", BOND_ARP_VALIDATE_ACTIVE},
{ "backup", BOND_ARP_VALIDATE_BACKUP},
{ "all", BOND_ARP_VALIDATE_ALL},
+{ "arp", BOND_ARP_VALIDATE_ARP},
+{ "active_arp", BOND_ARP_VALIDATE_ACTIVE_ARP},
+{ "backup_arp", BOND_ARP_VALIDATE_BACKUP_ARP},
{ NULL, -1},
};
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 955dc48..1fbbf04 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -318,6 +318,11 @@ static inline bool bond_is_active_slave(struct slave *slave)
#define BOND_ARP_VALIDATE_BACKUP (1 << BOND_STATE_BACKUP)
#define BOND_ARP_VALIDATE_ALL (BOND_ARP_VALIDATE_ACTIVE | \
BOND_ARP_VALIDATE_BACKUP)
+#define BOND_ARP_VALIDATE_ARP (BOND_ARP_VALIDATE_ALL + 1)
+#define BOND_ARP_VALIDATE_ACTIVE_ARP (BOND_ARP_VALIDATE_ACTIVE | \
+ BOND_ARP_VALIDATE_ARP)
+#define BOND_ARP_VALIDATE_BACKUP_ARP (BOND_ARP_VALIDATE_BACKUP | \
+ BOND_ARP_VALIDATE_ARP)
static inline int slave_do_arp_validate(struct bonding *bond,
struct slave *slave)
@@ -325,6 +330,12 @@ static inline int slave_do_arp_validate(struct bonding *bond,
return bond->params.arp_validate & (1 << bond_slave_state(slave));
}
+static inline int slave_do_arp_validate_only(struct bonding *bond,
+ struct slave *slave)
+{
+ return bond->params.arp_validate & BOND_ARP_VALIDATE_ARP;
+}
+
/* Get the oldest arp which we've received on this slave for bond's
* arp_targets.
*/
--
1.8.4
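
The numeric values behind the new table entries can be worked out with the sketch below. The EX_* names are hypothetical copies of the macros in this patch, and the state numbering (active = 0, backup = 1) is assumed from bonding.h. Note that "ALL + 1" yields a separate bit only because ALL happens to be 3.

```c
/* State numbering assumed from bonding.h (active = 0, backup = 1). */
#define EX_STATE_ACTIVE 0
#define EX_STATE_BACKUP 1

#define EX_VALIDATE_ACTIVE     (1 << EX_STATE_ACTIVE)                    /* 1 */
#define EX_VALIDATE_BACKUP     (1 << EX_STATE_BACKUP)                    /* 2 */
#define EX_VALIDATE_ALL        (EX_VALIDATE_ACTIVE | EX_VALIDATE_BACKUP) /* 3 */
#define EX_VALIDATE_ARP        (EX_VALIDATE_ALL + 1)                     /* 4 */
#define EX_VALIDATE_ACTIVE_ARP (EX_VALIDATE_ACTIVE | EX_VALIDATE_ARP)    /* 5 */
#define EX_VALIDATE_BACKUP_ARP (EX_VALIDATE_BACKUP | EX_VALIDATE_ARP)    /* 6 */

/* Mirrors slave_do_arp_validate(): nonzero if ARPs received on a slave
 * in the given state must pass the src/dst IP check. */
int ex_do_validate(int arp_validate, int slave_state)
{
	return arp_validate & (1 << slave_state);
}

/* Mirrors slave_do_arp_validate_only(): nonzero if the ARP-only bit
 * introduced by this patch is set. */
int ex_do_validate_only(int arp_validate)
{
	return arp_validate & EX_VALIDATE_ARP;
}
```

So "arp" (4) requests ARP-only traffic without IP validation on either slave state, while "active_arp" (5) and "backup_arp" (6) combine the ARP-only bit with validation on the active or backup slave respectively.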