* [PATCH] ipvs: Keep skb->sk when allocating headroom on tunnel xmit
From: Calvin Owens @ 2014-11-05 0:37 UTC (permalink / raw)
To: Simon Horman, Julian Anastasov, Wensong Zhang
Cc: lvs-devel, linux-kernel, netdev, agartrell, kernel-team,
Calvin Owens
ip_vs_prepare_tunneled_skb() ignores ->sk when allocating a new
skb, either unconditionally setting ->sk to NULL or allowing
the uninitialized ->sk from a newly allocated skb to leak through
to the caller.
This patch properly copies ->sk and increments its reference count.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
net/netfilter/ipvs/ip_vs_xmit.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 437a366..bd90bf8 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -846,6 +846,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
new_skb = skb_realloc_headroom(skb, max_headroom);
if (!new_skb)
goto error;
+ if (skb->sk)
+ skb_set_owner_w(new_skb, skb->sk);
consume_skb(skb);
skb = new_skb;
}
--
2.1.1
^ permalink raw reply related
* Re: [PATCH] bridge: include in6.h in if_bridge.h for struct in6_addr
From: Cong Wang @ 2014-11-05 0:39 UTC (permalink / raw)
To: Gregory Fong
Cc: Linux Kernel Network Developers, linux-api, LKML, carlos, eblake,
Kumar Gala, Florian Fainelli, David Miller
In-Reply-To: <1415128881-30183-1-git-send-email-gregory.0xf0@gmail.com>
On Tue, Nov 4, 2014 at 11:21 AM, Gregory Fong <gregory.0xf0@gmail.com> wrote:
> if_bridge.h uses struct in6_addr ip6, but wasn't including the in6.h
> header. Thomas Backlund originally sent a patch to do this, but this
> revealed a redefinition issue: https://lkml.org/lkml/2013/1/13/116
>
> The redefinition issue should have been fixed by the following Linux
> commits:
> ee262ad827f89e2dc7851ec2986953b5b125c6bc inet: defines IPPROTO_* needed for module alias generation
> cfd280c91253cc28e4919e349fa7a813b63e71e8 net: sync some IP headers with glibc
>
> and the following glibc commit:
> 6c82a2f8d7c8e21e39237225c819f182ae438db3 Coordinate IPv6 definitions for Linux and glibc
>
> so actually include the header now.
>
> Reported-by: Colin Guthrie <colin@mageia.org>
> Reported-by: Christiaan Welvaart <cjw@daneel.dyndns.org>
> Reported-by: Thomas Backlund <tmb@mageia.org>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Thanks for working on it!
^ permalink raw reply
* [PATCH net] ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs
From: Daniel Borkmann @ 2014-11-05 0:58 UTC (permalink / raw)
To: davem; +Cc: lw1a2.jing, fw, hannes, netdev, Eric Dumazet, David L Stevens
It has been reported that generating an MLD listener report on
devices with large MTUs (e.g. 9000) and a high number of IPv6
addresses can trigger a skb_over_panic():
skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
dev:port1
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:100!
invalid opcode: 0000 [#1] SMP
Modules linked in: ixgbe(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
[...]
Call Trace:
<IRQ>
[<ffffffff80578226>] ? skb_put+0x3a/0x3b
[<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
[<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
[<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
[<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
[<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
[<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
[<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
[<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
[<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
[<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
mld_newpack() skb allocations are usually requested with dev->mtu
in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations")
we have changed the limit in order to be less unreliable to fail.
However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
macros, which determine if we may end up doing an skb_put() for
adding another record. To avoid possible fragmentation, we check
the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
assumption as the actual max allocation size will be much smaller.
The IGMP case doesn't have this issue as commit 57e1ab6eaddc
("igmp: refine skb allocations") stores the allocation size in the
cb[], but therefore takes the MTU check not into account anymore.
Add and use skb_nofrag_tailroom() for both cases.
Reported-by: lw1a2.jing@gmail.com
Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David L Stevens <david.stevens@oracle.com>
---
In skb_nofrag_tailroom(), we could actually omit the !skb->dev check,
but I leave that rather as a possible cleanup item for net-next.
include/linux/netdevice.h | 15 +++++++++++++++
net/ipv4/igmp.c | 6 +-----
net/ipv6/mcast.c | 3 +--
3 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 74fd5d3..e4f4cfa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2262,6 +2262,21 @@ do { \
compute_pseudo(skb, proto)); \
} while (0)
+/**
+ * skb_nofrag_tailroom - bytes at buffer end still fitting into MTU
+ * @skb: buffer to check
+ *
+ * Return the number of bytes of free space at the tail of an sk_buff
+ * that still fit into the device MTU.
+ */
+static inline int skb_nofrag_tailroom(const struct sk_buff *skb)
+{
+ if (!skb->dev)
+ return skb_tailroom(skb);
+
+ return clamp_t(int, skb->dev->mtu - skb->len, 0, skb_tailroom(skb));
+}
+
static inline int dev_hard_header(struct sk_buff *skb, struct net_device *dev,
unsigned short type,
const void *daddr, const void *saddr,
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index fb70e3e..a750dfb 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -318,8 +318,6 @@ igmp_scount(struct ip_mc_list *pmc, int type, int gdeleted, int sdeleted)
return scount;
}
-#define igmp_skb_size(skb) (*(unsigned int *)((skb)->cb))
-
static struct sk_buff *igmpv3_newpack(struct net_device *dev, int size)
{
struct sk_buff *skb;
@@ -341,7 +339,6 @@ static struct sk_buff *igmpv3_newpack(struct net_device *dev, int size)
return NULL;
}
skb->priority = TC_PRIO_CONTROL;
- igmp_skb_size(skb) = size;
rt = ip_route_output_ports(net, &fl4, NULL, IGMPV3_ALL_MCR, 0,
0, 0,
@@ -423,8 +420,7 @@ static struct sk_buff *add_grhead(struct sk_buff *skb, struct ip_mc_list *pmc,
return skb;
}
-#define AVAILABLE(skb) ((skb) ? ((skb)->dev ? igmp_skb_size(skb) - (skb)->len : \
- skb_tailroom(skb)) : 0)
+#define AVAILABLE(skb) ((skb) ? skb_nofrag_tailroom(skb) : 0)
static struct sk_buff *add_grec(struct sk_buff *skb, struct ip_mc_list *pmc,
int type, int gdeleted, int sdeleted)
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 9648de2..1bc18f9 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1690,8 +1690,7 @@ static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
return skb;
}
-#define AVAILABLE(skb) ((skb) ? ((skb)->dev ? (skb)->dev->mtu - (skb)->len : \
- skb_tailroom(skb)) : 0)
+#define AVAILABLE(skb) ((skb) ? skb_nofrag_tailroom(skb) : 0)
static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc,
int type, int gdeleted, int sdeleted, int crsend)
--
1.9.3
^ permalink raw reply related
* Re: [PATCH 02/13] net_sched: introduce qdisc_peek() helper function
From: Cong Wang @ 2014-11-05 1:05 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Linux Kernel Network Developers
In-Reply-To: <20141104104515.2e6433c6@uryu.home.lan>
On Tue, Nov 4, 2014 at 10:45 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 4 Nov 2014 09:56:25 -0800
> Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
>> +static inline void qdisc_warn_nonwc(void *func, struct Qdisc *qdisc)
>> +{
>> + if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
>> + pr_warn("%pf: %s qdisc %X: is non-work-conserving?\n",
>> + func, qdisc->ops->id, qdisc->handle >> 16);
>> + qdisc->flags |= TCQ_F_WARN_NONWC;
>> + }
>> +}
>> +
>
> Inilining this and creating N copies of same message is not a step forward.
Hmm, I think gcc merges same string literals when building Linux kernel?
But I never verify this.
^ permalink raw reply
* Re: [PATCH net] ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs
From: Eric Dumazet @ 2014-11-05 1:06 UTC (permalink / raw)
To: Daniel Borkmann
Cc: davem, lw1a2.jing, fw, hannes, netdev, Eric Dumazet,
David L Stevens
In-Reply-To: <1415149113-32668-1-git-send-email-dborkman@redhat.com>
On Wed, 2014-11-05 at 01:58 +0100, Daniel Borkmann wrote:
> It has been reported that generating an MLD listener report on
> devices with large MTUs (e.g. 9000) and a high number of IPv6
> addresses can trigger a skb_over_panic():
>
> skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
> head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
> dev:port1
> ------------[ cut here ]------------
> kernel BUG at net/core/skbuff.c:100!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: ixgbe(O)
> CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
> [...]
> Call Trace:
> <IRQ>
> [<ffffffff80578226>] ? skb_put+0x3a/0x3b
> [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
> [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
> [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
> [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
> [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
> [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
> [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
> [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
> [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
> [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
>
> mld_newpack() skb allocations are usually requested with dev->mtu
> in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations")
> we have changed the limit in order to be less unreliable to fail.
>
> However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
> macros, which determine if we may end up doing an skb_put() for
> adding another record. To avoid possible fragmentation, we check
> the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
> assumption as the actual max allocation size will be much smaller.
>
> The IGMP case doesn't have this issue as commit 57e1ab6eaddc
> ("igmp: refine skb allocations") stores the allocation size in the
> cb[], but therefore takes the MTU check not into account anymore.
> Add and use skb_nofrag_tailroom() for both cases.
>
> Reported-by: lw1a2.jing@gmail.com
> Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: David L Stevens <david.stevens@oracle.com>
> ---
> In skb_nofrag_tailroom(), we could actually omit the !skb->dev check,
> but I leave that rather as a possible cleanup item for net-next.
Hmm... we have a proliferation of such things.
Could you take a look at sk_stream_alloc_skb(), skb->reserved_tailroom,
and skb_availroom() ?
Thanks !
^ permalink raw reply
* Re: [PATCH v2 net] tcp: zero retrans_stamp if all retrans were acked
From: Eric Dumazet @ 2014-11-05 1:20 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Neal Cardwell, Netdev, Yuchung Cheng, Eric Dumazet
In-Reply-To: <54593C59.6070004@redhat.com>
On Tue, 2014-11-04 at 18:51 -0200, Marcelo Ricardo Leitner wrote:
> And thank you guys for all the assistance on it. Btw, would you send me that
> packetdrill script? I'm curious to see how such corner case could be written
> on it.
One of the script I saw was :
You might have to adapt preconditions (tcp_rmem[]/tcp_wmem[])
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// Set a 10s timeout
+.000 setsockopt(3, SOL_TCP, TCP_USER_TIMEOUT, [10000], 4) = 0
+.000 bind(3, ..., ...) = 0
+.000 listen(3, 1) = 0
+.000 < S 0:0(0) win 32792 <mss 1460,nop,wscale 7>
+.000 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 6>
+.010 < . 1:1(0) ack 1 win 257
+.000 accept(3, ..., ...) = 4
+.000 write(4, ..., 1000) = 1000
+.000 > P. 1:1001(1000) ack 1
+.625 > P. 1:1001(1000) ack 1
+.020 < . 1:1(0) ack 1001 win 257
// Purposely write more after the specified timeout for testing
+11.0 write(4, ..., 1000) = 1000
+.000 > P. 1001:2001(1000) ack 1
+1.25 > P. 1001:2001(1000) ack 1
// socket is killed when the 2nd RTO fires at +2.50 w/o this patch
// so the next write returns ETIMEOUT
+2.60 write(4, ..., 1000) = 1000
^ permalink raw reply
* Re: [PATCH 00/13] net_sched: misc cleanups and improvements
From: Cong Wang @ 2014-11-05 1:25 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim
In-Reply-To: <1415125250.25370.2.camel@edumazet-glaptop2.roam.corp.google.com>
On Tue, Nov 4, 2014 at 10:20 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> If you do not care of writing good changelogs and explain why you want
> all this, why should I care spending time to review all this stuff ?
>
> What is the plan, beyond all this code churn ?
Why $subject isn't obvious?
I don't understand why you are only mean to me (and probably David Taht too).
Sit down and search the history, you will find:
commit c6be2a10ac2f810bdd01e978c93a8ef65b46120b
Merge: 7fd2561 44cc8ed
Author: David S. Miller <davem@davemloft.net>
Date: Wed Oct 29 15:59:43 2014 -0400
Merge branch 'xen-netback-next'
David Vrabel says:
====================
xen-netback: minor cleanups
Two minor xen-netback cleanups originally from Zoltan.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tell me, why this changelog in anyway could be better than me? which
was accepted (of course not by you). You can argue I have 13 patches,
but, there is no big difference between 13 cleanup's and 2 cleanup's,
cleanup's are cleanup's, that's it.
Seriously, think about why it should when it's just cleanup's, be practical.
^ permalink raw reply
* Re: [PATCH 00/13] net_sched: misc cleanups and improvements
From: Eric Dumazet @ 2014-11-05 1:47 UTC (permalink / raw)
To: Cong Wang; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim
In-Reply-To: <CAM_iQpXZBz=Ha2h3mCwnZKa3Etx7Ya9NxAWe_p8T+VjRUs5g0g@mail.gmail.com>
On Tue, 2014-11-04 at 17:25 -0800, Cong Wang wrote:
> Seriously, think about why it should when it's just cleanup's, be practical.
I seriously ask you to not do cleanups then.
Some people are working adding real stuff here, this code changing every
month is slowing them a lot.
^ permalink raw reply
* Re: [0/3] net: Kill skb_copy_datagram_const_iovec
From: Al Viro @ 2014-11-05 1:53 UTC (permalink / raw)
To: David Miller
Cc: herbert, netdev, linux-kernel, bcrl, Steve French, Sage Weil,
Nicholas A. Bellinger
In-Reply-To: <20141104054513.GB7996@ZenIV.linux.org.uk>
On Tue, Nov 04, 2014 at 05:45:13AM +0000, Al Viro wrote:
> Hell knows; I hadn't finished digging through that zoo - got sidetracked back
> then. *IF* all such places either use a throwaway copy or assume that iovec
> gets modified, we can do the following: switch the access to iovecs to
> iov_iter primitives, with the first kind of callers creating a throwaway
> iov_iter and the second just feeding the same iov_iter to e.g.
> kernel_recvmsg(). iovec will remain constant, iov_iter will be advanced.
> Moreover, in a lot of cases of first kind will be able to get rid of
> throwaway iov_iter (and of manually advancing it), effectively converting
> to the second one.
All right, now I _have_ finished that. See the resulting notes below.
TL;DR version: looks like hypothesis above is correct, modulo 2 places,
both buggy - cifs smb_send_kvec() apparently relies on ->sendmsg() leaving the
iovec unchanged and so does of the ceph_tcp_sendmsg() callers
(write_partial_kvec()). For TCP that's not always true. Another apparent
bug caught in process is iscsi iscsit_do_tx_data() - assumes that iovec is
being consumed by sendmsg(). I don't see how that could not be a bug - TCP
sockets can get there and tcp_sendmsg() normally *doesn't* modify the iovec.
Sometimes it does, unfortunately for other two places...
Maintainers Cc'd...
Full version follows:
-----------------------------------------------------------------------------
->sendmsg(): there's such method in struct proto_ops and in struct proto; the
latter is called by (some of) the former, in cases when ->sendmsg() isn't the
same for the entire family.
Instances of proto ->sendmsg() are, on several occasions, called directly; some
of those calls are from another such instance (with unchanged payload). There
are two exceptions to that - in tipc_accept() and tipc_connect() we call such
instances with empty payload. All calls via method are from proto_ops
->sendmsg() instances, payload unchanged.
Instances of proto_ops ->sendmsg() are almost never called directly. All
exceptions are from another such instance with unchanged payload.
There are two places that call proto_ops ->sendmsg() via method -
__sock_sendmsg_nosec() in net/socket.c, and handle_tx() in drivers/vhost/net.c.
The latter appears to be playing somewhat unusual games with passing NULL iocb,
making it impossible to use the former... Everybody else in the kernel goes
through __sock_sendmsg_nosec(), though - it's a chokepoint for sendmsg path.
vhost callsite is somewhat worrying - granted, most of the ->sendmsg()
instances don't give a damn about iocb at all. The rest, though...
E.g. what happens if we do VHOST_NET_SET_BACKEND with backend.fd being an
AF_UNIX socket? AFAICS, if it ever gets to that ->sendmsg() call afterwards,
we'll get an oops when e.g. unix_dgram_sendmsg() calls kiocb_to_siocb(NULL).
The same goes for AF_NETLINK; AF_TIPC is even more interesting, since there
NULL iocb is used as "we are in weird callchain, socket is already locked"
flag (for tipc_{accept,connect}() callsites). What's going on there?
[After having talked with mst: drivers/vhost/net.c checks that it's
AF_PACKET/SOCK_RAW (packet_sendmsg()) *or* comes from tun.c (tun_sendmsg())
or from macvtap.c (macvtap_sendmsg()) and all of those ignore iocb completely]
FWIW, the situation with iocb is
* AF_UNIX and AF_NETLINK use it to get to sock_iocb
* AF_TIPC uses it as a flag - it never dereferences the damn thing and
only compares it with NULL, to determine whether it's in a normal call chain,
or in tipc_accept/tipc_connect one...
* everything else ignores it completely, either directly or after
passing it to something that ignores it (rxrpc has unusually deep chain, but it
still ends up ignoring the sucker).
->recvmsg(): again, present in proto_ops and proto, in the same relationship
to each other. The situation is a bit simpler there: there is only one direct
caller of an instance of struct proto ->recvmsg() and it is in another such
instance, arguments unchanged. All callers via method are in the instances of
struct proto_ops ->recvmsg(), message-related arguments unchanged. No direct
callers of struct proto_ops recvmsg, three call sites via method - regular one
in __sock_recvmsg_nosec() plus two in drivers/vhost/net.c. The latter have
NULL iocb and are saved from oopsing in ->recvmsg() by the same logics that
saves vhost on sendmsg side. iocb is ignored by everything except AF_UNIX
and AF_NETLINK (those use it for sock_iocb) and neither can be reached from
vhost path.
So it boils down to the following: drivers/vhost/net.c aside, everything goes
through __sock_sendmsg_nosec() on the sendmsg side and __sock_recvmsg_nosec()
on recvmsg one.
Call chains leading to __sock_sendmsg_nosec():
__sock_sendmsg_nosec()
<- sock_sendmsg_nosec()
<- ___sys_sendmsg()
<- __sys_sendmsg()
<- sys_compat_sendmsg()
<- sys_sendmsg()
<- __sys_sendmmsg()
<- sys_compat_senmmmsg()
<- sys_sendmmsg()
<- __sock_sendmsg()
<- do_sock_write()
<- sock_aio_write()
== ->aio_write()
<- sock_sendmsg()
<- svc_sendto() [no iovec at all]
<- ___sys_sendmsg() [see above]
<- sys_sendto()
<- kernel_sendmsg()
All syscalls (and there's quite a tangled mess with sys_socketcall,
assorted ARM wrappers, etc.) end up with iovec discarded.
Ditto for ->aio_write() callers - they all free the iovec soon after
->aio_write() returns and never look at it before freeing.
Call chains leading to __sock_recvmsg_nosec():
__sock_recvmsg_nosec()
<- sock_recvmsg_nosec()
<- ___sys_recvmsg()
<- __sys_recvmsg()
<- sys_compat_recvmsg()
<- sys_recvmsg()
<- __sys_recvmmsg()
<- sys_compat_recvmmsg()
<- sys_recvmmsg()
<- __sock_recvmsg()
<- do_sock_read()
<- sock_aio_read()
== ->aio_read()
<- sock_recvmsg() [why is it not static, BTW?]
<- ___sys_recvmsg() [see above]
<- sys_recvfrom()
<- kernel_recvmsg()
Again, both the sycalls and ->aio_read() callers end up discarding
iovec.
All of that leaves us with kernel_{send,recv}msg() as the next-order
chokepoints.
kernel_recvmsg() callers:
drbd drbd_recv_short() - single-element iovec discarded
nbd sock_xmit() - single-element iovec discarded; would be better off
with advancing iov_iter.
isdn l1oip_socket_thread() - single-element iovec discarded
lustre ksocknal_lib_recv_iov() - iovec copied (with unhappy comment),
copy passed to kernel_recvmsg() and discarded
lustre ksocknal_lib_recv_kiov() - ditto. Would be much better off
with bvec-based iov_iter
lustre libcfs_sock_read() - single-element iovec discarded; would be
better off with advancing iov_iter.
iscsi iscsit_do_rx_data() - assumes that iovec is being consumed.
AFAICS, it's guaranteed to be TCP and tcp_recvmsg() appears to act that way,
so it's probably OK...
usbip usbip_recv() - single-element iovec discarded; would be better
off with advancing iov_iter
cifs cifs_readv_from_socket() - iovec copied, copy passed to
kernel_recvmsg() and discarded; *definitely* would be better off with advancing
iov_iter.
dlm receive_from_sock() - 1- or 2-element iovec discarded.
ncpfs _recv() - single-element iovec discarded.
ocfs2 o2net_recv_tcp_msg() - single-element iovec discarded.
ceph ceph_tcp_recvmsg() - single-element iovec discarded; at least one
of the loops using it would be better off with advancing iov_iter.
ipvs ip_vs_receive() - single-element iovec discarded.
sunrpc svc_udp_recvfrom() - no payload (MSG_PEEK, that one)
tipc tipc_receive_from_sock() - single-element iovec discarded.
sunrpc svc_recvfrom() - confusing; looks like iovec is a throwaway
one, though (and we might be better off if we could use an iov_iter of bvec
sort instead).
kernel_sendmsg() callers:
drbd drbd_send() - single-element iovec discarded; would be better
off with advancing iov_iter
nbd sock_xmit() - ditto
isdn l1oip_socket_send() - single-element iovec discarded
iscsi iscsi_sw_tcp_xmit_segment() - single-element iovec discarded
lustre ksocknal_lib_send_iov() - iovec copied (with unhappy comment),
copy passed to kernel_sendmsg() and discarded
lustre ksocknal_lib_send_kiov - ditto. Would be much better off
with bvec iov_iter.
lustre libcfs_sock_write() - single-element iovec discarded; better
off with advancing iov_iter
iscsi iscsit_do_tx_data() - assumes that iovec is being consumed.
I don't see how that could not be a bug - TCP sockets can get there.
usbip stub_send_ret_submit() - iovec is built and discarded;
short write is treated as an error
usbip stub_send_ret_unlink() - ditto
usbip vhci_send_cmd_submit() - ditto
usbip vhci_send_cmd_unlink() - ditto
cifs smb_send_kvec() - apparently relies on ->sendmsg() leaving the
iovec unchanged. Looks like a bug - tcp_sendmsg() might drain iovec in some
cases.
dlm sctp_send_shutdown() - no payload
dlm sctp_init_assoc() - single-element iovec discarded
ncpfs do_send() - single-element iovec discarded
ocfs2 o2net_send_tcp_msg() - callers pass it a throwaway iovec
bnep bnep_send() - single-element iovec discarded
bnep_tx_frame() - short iovec is built and discarded
cmtp_send_frame() - single-element iovec discarded
hidp_send_frame() - single-element iovec discarded
rfcomm_send_frame() - single-element iovec discarded
rfcomm_send_test() - short iovec is built and discarded
core sock_no_sendpage*() - single-element iovec discarded
ipvs ip_vs_send_async() - single-element iovec discarded
rds rds_tcp_sendmsg() - single-element iovec discarded
rxrpc rxrpc_busy() - single-element iovec discarded
rxrpc rxrpc_process_call() - short iovec is built and discarded
rxrpc rxrpc_abort_connection() - short iovec is built and discarded
rxrpc rxrpc_reject_packets() - assumes that sendmsg drains iovec;
may be a bug.
rxrpc rxrpc_send_packet() - single-element iovec discarded
rxrpc rxkad_issue_challenge() - short iovec is built and discarded
rxrpc rxkad_send_response - short iovec is built and discarded
sunrpc xs_send_kvec() - single-element iovec discarded; really asks
or iov_iter...
tipc tipc_send_to_sock() - single-element iovec discarded; for some
reason it seems to believe that short writes never happen...
ceph ceph_tcp_sendmsg() - one caller appears to discard iovec, another
(write_partial_kvec()) apparently assumes that iovec is unchanged by sendmsg.
Not guaranteed to be true for TCP, AFAICAS.
^ permalink raw reply
* Re: [PATCH 7/7] can: m_can: workaround for transmit data less than 4 bytes
From: Dong Aisheng @ 2014-11-05 2:03 UTC (permalink / raw)
To: Marc Kleine-Budde
Cc: linux-can, wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <5458AB65.7000500@pengutronix.de>
On Tue, Nov 04, 2014 at 11:33:09AM +0100, Marc Kleine-Budde wrote:
> On 11/04/2014 10:27 AM, Dong Aisheng wrote:
> >>>> It should be possible to change the for loop to go always to 8, or
> >>>> simply unroll the loop:
> >>>>
> >>>> /* errata description goes here */
> >>>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0), *(u32 *)(cf->data + 0));
> >>>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1), *(u32 *)(cf->data + 4));
> >>>>
> >>>
> >>> Yes, i tried to fix it as follows.
> >>>
> >>> /* FIXME: we meet an IC issue that we have to write the full 8
> >>> * bytes (whatever value for the second word) in Message RAM to
> >>> * avoid bit error for transmit data less than 4 bytes
> >>> */
> >>> if (cf->len <= 4) {
> >>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0),
> >>> *(u32 *)(cf->data + 0));
> >>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1),
> >>> *(u32 *)(cf->data + 4));
> >>> } else {
> >>> for (i = 0; i < cf->len; i += 4)
> >>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(i / 4),
> >>> *(u32 *)(cf->data + i));
> >>>
> >>> Will update the patch.
> >>
> >> Both branches of the above if are doing the same thing, I think you can
> >> replace the while if ... else ... for with this:
> >>
> >
> > Not the same thing.
> > The later one will cover payload up to 64 bytes.
>
> Doh! I'm not used to CAN-FD, yet :) However, I'll apply this fix before
> adding the CAN-FD support.
>
> >> /* errata description goes here */
> >> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0), *(u32 *)(cf->data + 0));
> >> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1), *(u32 *)(cf->data + 4));
> >>
> >> However if writing to DATA(0) and DATA(1) once in the open() function is
> >> enough this code should stay as it is.
> >
> > I tried put them into open() function and the quick test showed it worked.
> >
> > Do you think it's ok to put things into open() function for this issue
> > as follows?
> >
> > diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
> > index 065e4f1..ca55988 100644
> > --- a/drivers/net/can/m_can/m_can.c
> > +++ b/drivers/net/can/m_can/m_can.c
> > @@ -901,6 +901,15 @@ static void m_can_chip_config(struct net_device *dev)
> > /* set bittiming params */
> > m_can_set_bittiming(dev);
> >
> > + /* We meet an IC issue that we have to write the full 8
>
> At least on the *insert SoC name here*, an issue with the Message RAM
> was discovered. Sending CAN frames with dlc less than 4 bytes will lead
> to bit errors, when the first 8 bytes of the Message RAM have not been
> initialized (i.e. written to). To work around this issue, the first 8
> bytes are initialized here.
>
Looks good.
Will do like that.
> > + * bytes (whatever value for the second word) in Message RAM to
> > + * avoid bit error for transmit data less than 4 bytes at the first
> > + * time. By initializing the first 8 bytes of tx buffer before using
> > + * it can avoid such issue.
> > + */
> > + m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0), 0x0);
> > + m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1), 0x0);
> > +
> > m_can_config_endisable(priv, false);
> > }
>
> Can you trigger the issue when sending CAN-FD frames with dlc > 8 && dlc
> < 64?
>
No, i did not see the issue with dlc > 8 && dlc < 64.
Regards
Dong Aisheng
> Marc
>
> --
> Pengutronix e.K. | Marc Kleine-Budde |
> Industrial Linux Solutions | Phone: +49-231-2826-924 |
> Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
> Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
>
^ permalink raw reply
* Re: [PATCH 7/7] can: m_can: workaround for transmit data less than 4 bytes
From: Dong Aisheng @ 2014-11-05 2:07 UTC (permalink / raw)
To: Oliver Hartkopp
Cc: Marc Kleine-Budde, linux-can, wg, varkabhadram, netdev,
linux-arm-kernel
In-Reply-To: <5458D0FA.1040504@hartkopp.net>
On Tue, Nov 04, 2014 at 02:13:30PM +0100, Oliver Hartkopp wrote:
>
>
> On 04.11.2014 11:33, Marc Kleine-Budde wrote:
> >On 11/04/2014 10:27 AM, Dong Aisheng wrote:
>
>
> >>+ /* We meet an IC issue that we have to write the full 8
> >
> >At least on the *insert SoC name here*, an issue with the Message RAM
> >was discovered. Sending CAN frames with dlc less than 4 bytes will lead
> >to bit errors, when the first 8 bytes of the Message RAM have not been
> >initialized (i.e. written to). To work around this issue, the first 8
> >bytes are initialized here.
>
> Yes. Also put the current IP revision (3.0.x) into the comment.
> Did inform the Bosch guys from this issue - or is it already in some errata sheet?
>
Good idea, will add it.
I will try if we can talk Bosch guys about this issue.
Regards
Dong Aisheng
> >
> >>+ * bytes (whatever value for the second word) in Message RAM to
> >>+ * avoid bit error for transmit data less than 4 bytes at the first
> >>+ * time. By initializing the first 8 bytes of tx buffer before using
> >>+ * it can avoid such issue.
> >>+ */
> >>+ m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0), 0x0);
> >>+ m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1), 0x0);
> >>+
> >> m_can_config_endisable(priv, false);
> >> }
> >
> >Can you trigger the issue when sending CAN-FD frames with dlc > 8 && dlc
> >< 64?
>
> Just a nitpick:
>
> DLC can just be 0 .. 15
>
> and the length (struct canfd_frame.len) can be from 0 .. 64
>
> See:
>
> http://lxr.free-electrons.com/source/include/uapi/linux/can.h#L83
>
> That's the reason for all these helpers
>
> http://lxr.free-electrons.com/source/drivers/net/can/dev.c#L36
>
> that hide the evil "DLC" from userspace now and make 'len' a usable
> loop variable as we were able to use the former dlc for classic CAN
> :-)
>
> Regards,
> Oliver
>
^ permalink raw reply
* [PATCH net-next] r8152: disable the tasklet by default
From: Hayes Wang @ 2014-11-05 2:17 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
Let the tasklet only be enabled after open(), and be disabled for
the other situation. The tasklet is only necessary after open() for
tx/rx, so it could be disabled by default.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
drivers/net/usb/r8152.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 8ded08e..fd41675 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -2943,6 +2943,8 @@ static int rtl8152_open(struct net_device *netdev)
netif_warn(tp, ifup, netdev, "intr_urb submit failed: %d\n",
res);
free_all_mem(tp);
+ } else {
+ tasklet_enable(&tp->tl);
}
mutex_unlock(&tp->control);
@@ -2958,6 +2960,7 @@ static int rtl8152_close(struct net_device *netdev)
struct r8152 *tp = netdev_priv(netdev);
int res = 0;
+ tasklet_disable(&tp->tl);
clear_bit(WORK_ENABLE, &tp->flags);
usb_kill_urb(tp->intr_urb);
cancel_delayed_work_sync(&tp->schedule);
@@ -2975,9 +2978,7 @@ static int rtl8152_close(struct net_device *netdev)
*/
rtl_runtime_suspend_enable(tp, false);
- tasklet_disable(&tp->tl);
tp->rtl_ops.down(tp);
- tasklet_enable(&tp->tl);
mutex_unlock(&tp->control);
@@ -3887,12 +3888,15 @@ static int rtl8152_probe(struct usb_interface *intf,
else
device_set_wakeup_enable(&udev->dev, false);
+ tasklet_disable(&tp->tl);
+
netif_info(tp, probe, netdev, "%s\n", DRIVER_VERSION);
return 0;
out1:
usb_set_intfdata(intf, NULL);
+ tasklet_kill(&tp->tl);
out:
free_netdev(netdev);
return ret;
--
1.9.3
^ permalink raw reply related
* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Herbert Xu @ 2014-11-05 2:22 UTC (permalink / raw)
To: Al Viro
Cc: David S. Miller, netdev, Linux Kernel Mailing List,
Benjamin LaHaise
In-Reply-To: <20141104151353.GG7996@ZenIV.linux.org.uk>
On Tue, Nov 04, 2014 at 03:13:55PM +0000, Al Viro wrote:
>
> Think of it that way: every sendmsg/recvmsg path leading to memcpy_fromiovec
> and its friends (including the open-coded ones) would need to be changed
> at some point. Assuming we do not end up passing struct iov_iter * as
> an extra argument through a fairly large part of net/* (and that would
> be prohibitively hard and messy, not to mention the effects on the stack
> footprint, etc.), the most obvious strategy is to have that thing passed
> where msg_iov/msg_iovlen are - in struct msghdr. *IF* we go that way,
> it makes a whole lot of sense to start with a bunch of cleanups that
> will make sense on their own (most of callers of skb_copy_datagram_iovec
> do look like skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied); might
> as well give it an inlined helper) and will reduce the amount of places
> where ->msg_iov is used. With such cleanups standing on their own and
> being splittable from the rest of the queue. And leaving us with fewer
> places in code that deal with ->msg_iov and need to be dealt with.
I think your solution is great. However, I don't see how my four
patches impede in anyway the work that you're doing. I presume
your first patch will make skb_copy_datagram_msg just a wrapper
around skb_copy_datagram_iovec.
Since I'm not removing skb_copy_datagram_iovec (it can't be removed
until all users are gone) you can still do that and when you're
ready to switch over to iov_iter you can just move the wrapper over
to skb_copy_datagram_iter. No?
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH 2/4] tun: Use iovec iterators
From: YOSHIFUJI Hideaki @ 2014-11-05 2:49 UTC (permalink / raw)
To: Herbert Xu, Al Viro, David S. Miller, netdev,
Linux Kernel Mailing List, Benjamin LaHaise
Cc: hideaki.yoshifuji
In-Reply-To: <20141104083721.GA12691@gondor.apana.org.au>
Hi,
Herbert Xu wrote:
> Oops, this patch had a left-over skb_pull which made it broken.
> Here is a fixed version.
>
> tun: Use iovec iterators
>
> This patch removes the use of skb_copy_datagram_const_iovec in
> favour of the iovec iterator-based skb_copy_datagram_iter.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 9dd3746..ff955cdb 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
:
> @@ -1244,23 +1245,25 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> if (tun->flags & TUN_VNET_HDR)
> vnet_hdr_sz = tun->vnet_hdr_sz;
>
> + total = skb->len + vlan_hlen + vnet_hdr_sz;
> +
> if (!(tun->flags & TUN_NO_PI)) {
> - if ((len -= sizeof(pi)) < 0)
> + if (iov_iter_count(iter) < sizeof(pi))
> return -EINVAL;
>
> - if (len < skb->len + vlan_hlen + vnet_hdr_sz) {
> + if (iov_iter_count(iter) < total) {
I guess this should be: sizeof(pi) + total
--
Hideaki Yoshifuji <hideaki.yoshifuji@miraclelinux.com>
Technical Division, MIRACLE LINUX CORPORATION
^ permalink raw reply
* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: David Miller @ 2014-11-05 3:27 UTC (permalink / raw)
To: herbert; +Cc: viro, netdev, linux-kernel, bcrl
In-Reply-To: <20141105022251.GA19136@gondor.apana.org.au>
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 5 Nov 2014 10:22:51 +0800
> On Tue, Nov 04, 2014 at 03:13:55PM +0000, Al Viro wrote:
>>
>> Think of it that way: every sendmsg/recvmsg path leading to memcpy_fromiovec
>> and its friends (including the open-coded ones) would need to be changed
>> at some point. Assuming we do not end up passing struct iov_iter * as
>> an extra argument through a fairly large part of net/* (and that would
>> be prohibitively hard and messy, not to mention the effects on the stack
>> footprint, etc.), the most obvious strategy is to have that thing passed
>> where msg_iov/msg_iovlen are - in struct msghdr. *IF* we go that way,
>> it makes a whole lot of sense to start with a bunch of cleanups that
>> will make sense on their own (most of callers of skb_copy_datagram_iovec
>> do look like skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied); might
>> as well give it an inlined helper) and will reduce the amount of places
>> where ->msg_iov is used. With such cleanups standing on their own and
>> being splittable from the rest of the queue. And leaving us with fewer
>> places in code that deal with ->msg_iov and need to be dealt with.
>
> I think your solution is great. However, I don't see how my four
> patches impede in anyway the work that you're doing. I presume
> your first patch will make skb_copy_datagram_msg just a wrapper
> around skb_copy_datagram_iovec.
>
> Since I'm not removing skb_copy_datagram_iovec (it can't be removed
> until all users are gone) you can still do that and when you're
> ready to switch over to iov_iter you can just move the wrapper over
> to skb_copy_datagram_iter. No?
Agreed, I think both efforts can proceed in parallel.
Al, is this the helper you are talking about?
diff --git a/drivers/isdn/mISDN/socket.c b/drivers/isdn/mISDN/socket.c
index 1be8228..a08057d 100644
--- a/drivers/isdn/mISDN/socket.c
+++ b/drivers/isdn/mISDN/socket.c
@@ -163,7 +163,7 @@ mISDN_sock_recvmsg(struct kiocb *iocb, struct socket *sock,
memcpy(skb_push(skb, MISDN_HEADER_LEN), mISDN_HEAD_P(skb),
MISDN_HEADER_LEN);
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
mISDN_sock_cmsg(sk, msg, skb);
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index 6c9c16d..25234d9 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -981,7 +981,7 @@ static int pppoe_recvmsg(struct kiocb *iocb, struct socket *sock,
if (skb) {
total_len = min_t(size_t, total_len, skb->len);
- error = skb_copy_datagram_iovec(skb, 0, m->msg_iov, total_len);
+ error = skb_copy_datagram_msghdr(skb, m, total_len);
if (error == 0) {
consume_skb(skb);
return total_len;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5ad9675..19fe8cc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -17,6 +17,7 @@
#include <linux/kernel.h>
#include <linux/kmemcheck.h>
#include <linux/compiler.h>
+#include <linux/socket.h>
#include <linux/time.h>
#include <linux/bug.h>
#include <linux/cache.h>
@@ -2637,6 +2638,11 @@ unsigned int datagram_poll(struct file *file, struct socket *sock,
struct poll_table_struct *wait);
int skb_copy_datagram_iovec(const struct sk_buff *from, int offset,
struct iovec *to, int size);
+static inline int skb_copy_datagram_msghdr(const struct sk_buff *from,
+ struct msghdr *msg, int size)
+{
+ return skb_copy_datagram_iovec(from, 0, msg->msg_iov, size);
+}
int skb_copy_and_csum_datagram_iovec(struct sk_buff *skb, int hlen,
struct iovec *iov);
int skb_copy_datagram_from_iovec(struct sk_buff *skb, int offset,
diff --git a/net/atm/common.c b/net/atm/common.c
index 6a76515..7e42bbe 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -554,7 +554,7 @@ int vcc_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
msg->msg_flags |= MSG_TRUNC;
}
- error = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ error = skb_copy_datagram_msghdr(skb, msg, copied);
if (error)
return error;
sock_recv_ts_and_drops(msg, sk, skb);
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index c35c3f4..a91075c 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1634,7 +1634,7 @@ static int ax25_recvmsg(struct kiocb *iocb, struct socket *sock,
msg->msg_flags |= MSG_TRUNC;
}
- skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ skb_copy_datagram_msghdr(skb, msg, copied);
if (msg->msg_name) {
ax25_digi digi;
diff --git a/net/bluetooth/af_bluetooth.c b/net/bluetooth/af_bluetooth.c
index 339c74a..a68dd75 100644
--- a/net/bluetooth/af_bluetooth.c
+++ b/net/bluetooth/af_bluetooth.c
@@ -237,7 +237,7 @@ int bt_sock_recvmsg(struct kiocb *iocb, struct socket *sock,
}
skb_reset_transport_header(skb);
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err == 0) {
sock_recv_ts_and_drops(msg, sk, skb);
@@ -328,7 +328,7 @@ int bt_sock_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
}
chunk = min_t(unsigned int, skb->len, size);
- if (skb_copy_datagram_iovec(skb, 0, msg->msg_iov, chunk)) {
+ if (skb_copy_datagram_msghdr(skb, msg, chunk)) {
skb_queue_head(&sk->sk_receive_queue, skb);
if (!copied)
copied = -EFAULT;
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 115f149..45d4fba 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -878,7 +878,7 @@ static int hci_sock_recvmsg(struct kiocb *iocb, struct socket *sock,
}
skb_reset_transport_header(skb);
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
switch (hci_pi(sk)->channel) {
case HCI_CHANNEL_RAW:
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index 43f750e..67e63b6 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -293,7 +293,7 @@ static int caif_seqpkt_recvmsg(struct kiocb *iocb, struct socket *sock,
copylen = len;
}
- ret = skb_copy_datagram_iovec(skb, 0, m->msg_iov, copylen);
+ ret = skb_copy_datagram_msghdr(skb, m, copylen);
if (ret)
goto out_free;
diff --git a/net/core/sock.c b/net/core/sock.c
index 15e0c67..220c791 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2457,7 +2457,7 @@ int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
msg->msg_flags |= MSG_TRUNC;
copied = len;
}
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto out_free_skb;
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 5ab6627..7ccf58f 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -896,7 +896,7 @@ verify_sock_status:
else if (len < skb->len)
msg->msg_flags |= MSG_TRUNC;
- if (skb_copy_datagram_iovec(skb, 0, msg->msg_iov, len)) {
+ if (skb_copy_datagram_msghdr(skb, msg, len)) {
/* Exception. Bailout! */
len = -EFAULT;
break;
diff --git a/net/ieee802154/dgram.c b/net/ieee802154/dgram.c
index ef2ad8a..7017055 100644
--- a/net/ieee802154/dgram.c
+++ b/net/ieee802154/dgram.c
@@ -324,7 +324,7 @@ static int dgram_recvmsg(struct kiocb *iocb, struct sock *sk,
}
/* FIXME: skip headers if necessary ?! */
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto done;
diff --git a/net/ieee802154/raw.c b/net/ieee802154/raw.c
index 9d1f648..5dd893a 100644
--- a/net/ieee802154/raw.c
+++ b/net/ieee802154/raw.c
@@ -195,7 +195,7 @@ static int raw_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
copied = len;
}
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto done;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c373a9a..d643882 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -424,7 +424,7 @@ int ip_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
msg->msg_flags |= MSG_TRUNC;
copied = len;
}
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto out_free_skb;
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 57f7c98..a6e0197 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -875,7 +875,7 @@ int ping_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
}
/* Don't bother checking the checksum */
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto done;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 739db31..2f4bb30 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -718,7 +718,7 @@ static int raw_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
copied = len;
}
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto done;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39ec0c3..3638679 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1377,7 +1377,7 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
/* XXX -- need to support SO_PEEK_OFF */
skb_queue_walk(&sk->sk_write_queue, skb) {
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, skb->len);
+ err = skb_copy_datagram_msghdr(skb, msg, skb->len);
if (err)
break;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 2cdc383..4bd84fd 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -351,7 +351,7 @@ int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
msg->msg_flags |= MSG_TRUNC;
copied = len;
}
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto out_free_skb;
@@ -445,7 +445,7 @@ int ipv6_recv_rxpmtu(struct sock *sk, struct msghdr *msg, int len,
msg->msg_flags |= MSG_TRUNC;
copied = len;
}
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto out_free_skb;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 075a0fb..5d37aa1 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -486,11 +486,11 @@ static int rawv6_recvmsg(struct kiocb *iocb, struct sock *sk,
}
if (skb_csum_unnecessary(skb)) {
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
} else if (msg->msg_flags&MSG_TRUNC) {
if (__skb_checksum_complete(skb))
goto csum_copy_err;
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
} else {
err = skb_copy_and_csum_datagram_iovec(skb, 0, msg->msg_iov);
if (err == -EINVAL)
diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c
index 92fafd4..54db6dc 100644
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -1396,7 +1396,7 @@ static int irda_recvmsg_dgram(struct kiocb *iocb, struct socket *sock,
copied = size;
msg->msg_flags |= MSG_TRUNC;
}
- skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ skb_copy_datagram_msghdr(skb, msg, copied);
skb_free_datagram(sk, skb);
diff --git a/net/key/af_key.c b/net/key/af_key.c
index 1847ec4..f09a848 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -3654,7 +3654,7 @@ static int pfkey_recvmsg(struct kiocb *kiocb,
}
skb_reset_transport_header(skb);
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto out_free;
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 369a982..2913198 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -528,7 +528,7 @@ static int l2tp_ip_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *m
copied = len;
}
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto done;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 0edb263..8613881 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -672,7 +672,7 @@ static int l2tp_ip6_recvmsg(struct kiocb *iocb, struct sock *sk,
copied = len;
}
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto done;
diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index b704a93..5f3c0f5 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -208,7 +208,7 @@ static int pppol2tp_recvmsg(struct kiocb *iocb, struct socket *sock,
else if (len < skb->len)
msg->msg_flags |= MSG_TRUNC;
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, len);
+ err = skb_copy_datagram_msghdr(skb, msg, len);
if (likely(err == 0))
err = len;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index f1de72d..123cffd 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2401,7 +2401,7 @@ static int netlink_recvmsg(struct kiocb *kiocb, struct socket *sock,
}
skb_reset_transport_header(data_skb);
- err = skb_copy_datagram_iovec(data_skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(data_skb, msg, copied);
if (msg->msg_name) {
DECLARE_SOCKADDR(struct sockaddr_nl *, addr, msg->msg_name);
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index 1b06a1f..6bc3556 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -1167,7 +1167,7 @@ static int nr_recvmsg(struct kiocb *iocb, struct socket *sock,
msg->msg_flags |= MSG_TRUNC;
}
- er = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ er = skb_copy_datagram_msghdr(skb, msg, copied);
if (er < 0) {
skb_free_datagram(sk, skb);
release_sock(sk);
diff --git a/net/nfc/llcp_sock.c b/net/nfc/llcp_sock.c
index 51f077a..e962c07 100644
--- a/net/nfc/llcp_sock.c
+++ b/net/nfc/llcp_sock.c
@@ -832,7 +832,7 @@ static int llcp_sock_recvmsg(struct kiocb *iocb, struct socket *sock,
copied = min_t(unsigned int, rlen, len);
cskb = skb;
- if (skb_copy_datagram_iovec(cskb, 0, msg->msg_iov, copied)) {
+ if (skb_copy_datagram_msghdr(cskb, msg, copied)) {
if (!(flags & MSG_PEEK))
skb_queue_head(&sk->sk_receive_queue, skb);
return -EFAULT;
diff --git a/net/nfc/rawsock.c b/net/nfc/rawsock.c
index 11c3544..4467b2c 100644
--- a/net/nfc/rawsock.c
+++ b/net/nfc/rawsock.c
@@ -269,7 +269,7 @@ static int rawsock_recvmsg(struct kiocb *iocb, struct socket *sock,
copied = len;
}
- rc = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ rc = skb_copy_datagram_msghdr(skb, msg, copied);
skb_free_datagram(sk, skb);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 87d20f4..390b776 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2953,7 +2953,7 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
msg->msg_flags |= MSG_TRUNC;
}
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
if (err)
goto out_free;
diff --git a/net/phonet/datagram.c b/net/phonet/datagram.c
index 290352c..b4835bc 100644
--- a/net/phonet/datagram.c
+++ b/net/phonet/datagram.c
@@ -150,7 +150,7 @@ static int pn_recvmsg(struct kiocb *iocb, struct sock *sk,
copylen = len;
}
- rval = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copylen);
+ rval = skb_copy_datagram_msghdr(skb, msg, copylen);
if (rval) {
rval = -EFAULT;
goto out;
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index 70a547e..f544658 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -1296,7 +1296,7 @@ copy:
else
len = skb->len;
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, len);
+ err = skb_copy_datagram_msghdr(skb, msg, len);
if (!err)
err = (flags & MSG_TRUNC) ? skb->len : len;
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index a85c1a0..b660504 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1249,7 +1249,7 @@ static int rose_recvmsg(struct kiocb *iocb, struct socket *sock,
msg->msg_flags |= MSG_TRUNC;
}
- skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ skb_copy_datagram_msghdr(skb, msg, copied);
if (msg->msg_name) {
struct sockaddr_rose *srose;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 634a2ab..0fca34c 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -2095,7 +2095,7 @@ static int sctp_recvmsg(struct kiocb *iocb, struct sock *sk,
if (copied > len)
copied = len;
- err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ err = skb_copy_datagram_msghdr(skb, msg, copied);
event = sctp_skb2event(skb);
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 5ad4418..fad0702 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1335,7 +1335,7 @@ static int x25_recvmsg(struct kiocb *iocb, struct socket *sock,
/* Currently, each datagram always contains a complete record */
msg->msg_flags |= MSG_EOR;
- rc = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ rc = skb_copy_datagram_msghdr(skb, msg, copied);
if (rc)
goto out_free_dgram;
^ permalink raw reply related
* Re: [PATCH 2/4] tun: Use iovec iterators
From: Herbert Xu @ 2014-11-05 3:41 UTC (permalink / raw)
To: YOSHIFUJI Hideaki
Cc: viro, davem, netdev, linux-kernel, bcrl, hideaki.yoshifuji
In-Reply-To: <5459902C.1010102@miraclelinux.com>
YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com> wrote:
>>
>> - if (len < skb->len + vlan_hlen + vnet_hdr_sz) {
>> + if (iov_iter_count(iter) < total) {
>
> I guess this should be: sizeof(pi) + total
Good catch! Here is a third update:
tun: Use iovec iterators
This patch removes the use of skb_copy_datagram_const_iovec in
favour of the iovec iterator-based skb_copy_datagram_iter.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9dd3746..b4ac4d5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -71,6 +71,7 @@
#include <net/rtnetlink.h>
#include <net/sock.h>
#include <linux/seq_file.h>
+#include <linux/uio.h>
#include <asm/uaccess.h>
@@ -1230,11 +1231,11 @@ static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
static ssize_t tun_put_user(struct tun_struct *tun,
struct tun_file *tfile,
struct sk_buff *skb,
- const struct iovec *iv, int len)
+ struct iov_iter *iter)
{
struct tun_pi pi = { 0, skb->protocol };
- ssize_t total = 0;
- int vlan_offset = 0, copied;
+ ssize_t total;
+ int vlan_offset;
int vlan_hlen = 0;
int vnet_hdr_sz = 0;
@@ -1244,23 +1245,25 @@ static ssize_t tun_put_user(struct tun_struct *tun,
if (tun->flags & TUN_VNET_HDR)
vnet_hdr_sz = tun->vnet_hdr_sz;
+ total = skb->len + vlan_hlen + vnet_hdr_sz;
+
if (!(tun->flags & TUN_NO_PI)) {
- if ((len -= sizeof(pi)) < 0)
+ if (iov_iter_count(iter) < sizeof(pi))
return -EINVAL;
- if (len < skb->len + vlan_hlen + vnet_hdr_sz) {
+ total += sizeof(pi);
+ if (iov_iter_count(iter) < total) {
/* Packet will be striped */
pi.flags |= TUN_PKT_STRIP;
}
- if (memcpy_toiovecend(iv, (void *) &pi, 0, sizeof(pi)))
+ if (copy_to_iter(&pi, sizeof(pi), iter))
return -EFAULT;
- total += sizeof(pi);
}
if (vnet_hdr_sz) {
struct virtio_net_hdr gso = { 0 }; /* no info leak */
- if ((len -= vnet_hdr_sz) < 0)
+ if (iov_iter_count(iter) < vnet_hdr_sz)
return -EINVAL;
if (skb_is_gso(skb)) {
@@ -1299,17 +1302,12 @@ static ssize_t tun_put_user(struct tun_struct *tun,
gso.flags = VIRTIO_NET_HDR_F_DATA_VALID;
} /* else everything is zero */
- if (unlikely(memcpy_toiovecend(iv, (void *)&gso, total,
- sizeof(gso))))
+ if (copy_to_iter(&gso, sizeof(gso), iter))
return -EFAULT;
- total += vnet_hdr_sz;
}
- copied = total;
- len = min_t(int, skb->len + vlan_hlen, len);
- total += skb->len + vlan_hlen;
if (vlan_hlen) {
- int copy, ret;
+ int ret;
struct {
__be16 h_vlan_proto;
__be16 h_vlan_TCI;
@@ -1320,36 +1318,32 @@ static ssize_t tun_put_user(struct tun_struct *tun,
vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
- copy = min_t(int, vlan_offset, len);
- ret = skb_copy_datagram_const_iovec(skb, 0, iv, copied, copy);
- len -= copy;
- copied += copy;
- if (ret || !len)
+ ret = skb_copy_datagram_iter(skb, 0, iter, vlan_offset);
+ if (ret || !iov_iter_count(iter))
goto done;
- copy = min_t(int, sizeof(veth), len);
- ret = memcpy_toiovecend(iv, (void *)&veth, copied, copy);
- len -= copy;
- copied += copy;
- if (ret || !len)
+ ret = copy_to_iter(&veth, sizeof(veth), iter);
+ if (ret || !iov_iter_count(iter))
goto done;
}
- skb_copy_datagram_const_iovec(skb, vlan_offset, iv, copied, len);
+ skb_copy_datagram_iter(skb, vlan_offset, iter, skb->len - vlan_offset);
done:
tun->dev->stats.tx_packets++;
- tun->dev->stats.tx_bytes += len;
+ tun->dev->stats.tx_bytes += skb->len + vlan_hlen;
return total;
}
static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
- const struct iovec *iv, ssize_t len, int noblock)
+ const struct iovec *iv, unsigned long segs,
+ ssize_t len, int noblock)
{
struct sk_buff *skb;
ssize_t ret = 0;
int peeked, err, off = 0;
+ struct iov_iter iter;
tun_debug(KERN_INFO, tun, "tun_do_read\n");
@@ -1362,11 +1356,12 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
/* Read frames from queue */
skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
&peeked, &off, &err);
- if (skb) {
- ret = tun_put_user(tun, tfile, skb, iv, len);
- kfree_skb(skb);
- } else
- ret = err;
+ if (!skb)
+ return ret;
+
+ iov_iter_init(&iter, READ, iv, segs, len);
+ ret = tun_put_user(tun, tfile, skb, &iter);
+ kfree_skb(skb);
return ret;
}
@@ -1387,7 +1382,7 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
goto out;
}
- ret = tun_do_read(tun, tfile, iv, len,
+ ret = tun_do_read(tun, tfile, iv, count, len,
file->f_flags & O_NONBLOCK);
ret = min_t(ssize_t, ret, len);
if (ret > 0)
@@ -1488,7 +1483,7 @@ static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
SOL_PACKET, TUN_TX_TIMESTAMP);
goto out;
}
- ret = tun_do_read(tun, tfile, m->msg_iov, total_len,
+ ret = tun_do_read(tun, tfile, m->msg_iov, m->msg_iovlen, total_len,
flags & MSG_DONTWAIT);
if (ret > total_len) {
m->msg_flags |= MSG_TRUNC;
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related
* Re: [PATCH 02/13] net_sched: introduce qdisc_peek() helper function
From: Herbert Xu @ 2014-11-05 3:49 UTC (permalink / raw)
To: Cong Wang; +Cc: stephen, netdev
In-Reply-To: <CAM_iQpXj6Pk3moYVKzj8YhmPz74fBgvVt4RxRMgW=+-p5MKnEg@mail.gmail.com>
Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Tue, Nov 4, 2014 at 10:45 AM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
>> On Tue, 4 Nov 2014 09:56:25 -0800
>> Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>
>>> +static inline void qdisc_warn_nonwc(void *func, struct Qdisc *qdisc)
>>> +{
>>> + if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
>>> + pr_warn("%pf: %s qdisc %X: is non-work-conserving?\n",
>>> + func, qdisc->ops->id, qdisc->handle >> 16);
>>> + qdisc->flags |= TCQ_F_WARN_NONWC;
>>> + }
>>> +}
>>> +
>>
>> Inilining this and creating N copies of same message is not a step forward.
>
> Hmm, I think gcc merges same string literals when building Linux kernel?
> But I never verify this.
In general you should try to avoid inlining code that's not in
the fast path as that leads to binary code size bloat. As errors
shouldn't be in the fast path this function should be inlined.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Al Viro @ 2014-11-05 3:55 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, linux-kernel, bcrl
In-Reply-To: <20141104.222727.275422581392488793.davem@davemloft.net>
On Tue, Nov 04, 2014 at 10:27:27PM -0500, David Miller wrote:
> Al, is this the helper you are talking about?
Mostly, except that I kept it 4-argument (and used skb_copy_datagram_msg()
for name). Matter of taste - the ones you've missed because of that are
net/appletalk/ddp.c:1761: err = skb_copy_datagram_iovec(skb, offset, msg->msg_iov, copied);
net/ipv4/tcp.c:1836: err = skb_copy_datagram_iovec(skb, offset,
net/ipv4/udp.c:1284: err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),
net/ipv6/udp.c:427: err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),
net/ipx/af_ipx.c:1808: rc = skb_copy_datagram_iovec(skb, sizeof(struct ipxhdr), msg->msg_iov,
net/iucv/af_iucv.c:1358: if (skb_copy_datagram_iovec(cskb, offset, msg->msg_iov, copied)) {
net/llc/af_llc.c:822: int rc = skb_copy_datagram_iovec(skb, offset,
net/rxrpc/ar-recvmsg.c:183: ret = skb_copy_datagram_iovec(skb, offset, msg->msg_iov, copy);
net/tipc/socket.c:1375: res = skb_copy_datagram_iovec(buf, msg_hdr_sz(msg),
net/tipc/socket.c:1476: res = skb_copy_datagram_iovec(buf, msg_hdr_sz(msg) + offset,
net/unix/af_unix.c:1828: err = skb_copy_datagram_iovec(skb, skip, msg->msg_iov, size);
net/unix/af_unix.c:2033: if (skb_copy_datagram_iovec(skb, UNIXCB(skb).consumed + skip,
net/vmw_vsock/vmci_transport.c:1776: err = skb_copy_datagram_iovec(skb, sizeof(*dg), msg->msg_iov,
and back then I decided that 13 more converted instances might be worth keeping
it in 4-argument form...
What do you think of the trick with user_msghdr, BTW?
^ permalink raw reply
* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Al Viro @ 2014-11-05 4:12 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, linux-kernel, bcrl
In-Reply-To: <20141105035536.GO7996@ZenIV.linux.org.uk>
On Wed, Nov 05, 2014 at 03:55:36AM +0000, Al Viro wrote:
> What do you think of the trick with user_msghdr, BTW?
PS: where do you prefer the branches to be based off?
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net#master, mainline,
something else? I can certainly do that as patches over email, the
question is what's best used as base... FWIW, the analysis I've posted
was in 3.18-rc3 and it looks like it ought to be valid in net#master
as well.
^ permalink raw reply
* Re: net: fec: fix regression on i.MX28 introduced by rx_copybreak support
From: Lothar Waßmann @ 2014-11-05 5:21 UTC (permalink / raw)
To: David Miller
Cc: fabio.estevam, Frank.Li, netdev, linux-kernel, rmk+kernel,
linux-arm-kernel
In-Reply-To: <20141104.112858.356827532569349127.davem@davemloft.net>
Hi,
David Miller wrote:
> From: Lothar Waßmann <LW@KARO-electronics.de>
> Date: Tue, 4 Nov 2014 11:29:12 +0100
>
> > Hi David,
> >
> > Lothar Waßmann wrote:
> >> David Miller wrote:
> >> > From: Lothar Waßmann <LW@KARO-electronics.de>
> >> > Date: Thu, 30 Oct 2014 07:51:04 +0100
> >> >
> >> > >> Also, I don't thnk your DIV_ROUND_UP() eliminate for the loop
> >> > >> in swap_buffer() is valid. The whole point is that the current
> >> > >> code handles buffers which have a length which is not a multiple
> >> > >> of 4 properly, after your change it will no longer do so.
> >> > >>
> >> > > Do you really think so?
> >> >
> >> > Yes, because you're rounding down so you'll miss the final
> >> > partial word (if any).
> >> >
> >> Nope. DIV_ROUND_UP() would give '1' as upper bound for lengths from 1 to
> >> 4, '2' for lengths from 5 to 8 and so on.
> >>
> >> The loop with increment 4 and i < len does exactly the same.
> >> Try it for yourself, if you don't believe it.
> >>
> >>
> > Do you still think, the loop without DIV_ROUND_UP() is incorrect,
> > or can this patch be applied?
>
> I haven't had the time to fully re-look into the details, I'm busy
> with many other things at the moment.
>
> But looking at DIV_ROUND_UP() macro it rounds up. It gives an
> upper bound of 4 for any value 1 to 4. Unlike what you claim.
>
You're missing the 'DIV' part of DIV_ROUND_UP().
> Because it goes "(n + (d - 1)) / d"
>
> Which for 'd' of 4 gives:
>
> 1 --> 4
> 2 --> 4
> 3 --> 4
> 4 --> 4
>
'1', not '4'.
The loop has to be done once for each (probably partial) WORD of input
data, not for each BYTE.
Without dividing and incrementing by four the loop
counter will be 0 for the first word which is less than any length > 0.
Thus the loop will be run once for any number of bytes from 1 thru 4.
If incremented by 4 after the first loop, it will be run again for any
length > 5 and so forth.
Lothar Waßmann
--
___________________________________________________________
Ka-Ro electronics GmbH | Pascalstraße 22 | D - 52076 Aachen
Phone: +49 2408 1402-0 | Fax: +49 2408 1402-10
Geschäftsführer: Matthias Kaussen
Handelsregistereintrag: Amtsgericht Aachen, HRB 4996
www.karo-electronics.de | info@karo-electronics.de
___________________________________________________________
^ permalink raw reply
* How to make stack send broadcast ARP request when entry is STALE?
From: Ulf samuelsson @ 2014-11-05 6:48 UTC (permalink / raw)
To: Netdev
Have a problem with an HP router at a certain location, which
is configured to only answer to broadcast ARP requests.
That cannot be changed.
The first ARP request the kernel sends out, is a broadcast request,
which is fine, but after the reply, the kernel sends unicast requests,
which will not get any replies.
The ARP entry will after some time enter STALE state,
and if nothing is done it will time out, and be removed.
This process takes to long, and I have been told that it is
difficult to makes changes that will eventually remove it.
Have tried to change the state from STALE to INCOMPLETE, which failed,
and then tried to change the state to PROBE which also failed.
The stack is only sending out unicasts, and never broadcast.
Is there any way to get the stack to send out a broadcast ARP
without having to wait for the entry to be removed?
I think the recommended behaviour in IPv6 is to send out 3 unicasts
and if all fails, to send out broadcasts.
Anyone know any good literature on how the ARP + neigh state machine works
in the kernel.
I read in Herberts book about the Linux TCP/IP stack and it only discuss how to reply to
ARP requests and not anything on how to generate ARP requests.
Best Regards
Ulf Samuelsson
^ permalink raw reply
* [PATCH net-next 1/2] ipv6: Allow sending packets through tunnels with wildcard endpoints
From: Steffen Klassert @ 2014-11-05 7:02 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Currently we need the IP6_TNL_F_CAP_XMIT capabiltiy to transmit
packets through an ipv6 tunnel. This capability is set when the
tunnel gets configured, based on the tunnel endpoint addresses.
On tunnels with wildcard tunnel endpoints, we need to do the
capabiltiy checking on a per packet basis like it is done in
the receive path.
This patch extends ip6_tnl_xmit_ctl() to take local and remote
addresses as parameters to allow for per packet capabiltiy
checking.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
include/net/ip6_tunnel.h | 3 ++-
net/ipv6/ip6_gre.c | 2 +-
net/ipv6/ip6_tunnel.c | 23 +++++++++++++++--------
net/ipv6/ip6_vti.c | 10 ++++++++--
4 files changed, 26 insertions(+), 12 deletions(-)
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index a5593da..9326c41 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -65,7 +65,8 @@ void ip6_tnl_dst_reset(struct ip6_tnl *t);
void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst);
int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
const struct in6_addr *raddr);
-int ip6_tnl_xmit_ctl(struct ip6_tnl *t);
+int ip6_tnl_xmit_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
+ const struct in6_addr *raddr);
__u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw);
__u32 ip6_tnl_get_cap(struct ip6_tnl *t, const struct in6_addr *laddr,
const struct in6_addr *raddr);
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 12c3c8e..1fcf62e 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -902,7 +902,7 @@ static netdev_tx_t ip6gre_tunnel_xmit(struct sk_buff *skb,
struct net_device_stats *stats = &t->dev->stats;
int ret;
- if (!ip6_tnl_xmit_ctl(t))
+ if (!ip6_tnl_xmit_ctl(t, &t->parms.laddr, &t->parms.raddr))
goto tx_err;
switch (skb->protocol) {
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 8c97cd1..a8f94ff 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -909,24 +909,28 @@ ip6_tnl_addr_conflict(const struct ip6_tnl *t, const struct ipv6hdr *hdr)
return ipv6_addr_equal(&t->parms.raddr, &hdr->saddr);
}
-int ip6_tnl_xmit_ctl(struct ip6_tnl *t)
+int ip6_tnl_xmit_ctl(struct ip6_tnl *t,
+ const struct in6_addr *laddr,
+ const struct in6_addr *raddr)
{
struct __ip6_tnl_parm *p = &t->parms;
int ret = 0;
struct net *net = t->net;
- if (p->flags & IP6_TNL_F_CAP_XMIT) {
+ if ((p->flags & IP6_TNL_F_CAP_XMIT) ||
+ ((p->flags & IP6_TNL_F_CAP_PER_PACKET) &&
+ (ip6_tnl_get_cap(t, laddr, raddr) & IP6_TNL_F_CAP_XMIT))) {
struct net_device *ldev = NULL;
rcu_read_lock();
if (p->link)
ldev = dev_get_by_index_rcu(net, p->link);
- if (unlikely(!ipv6_chk_addr(net, &p->laddr, ldev, 0)))
+ if (unlikely(!ipv6_chk_addr(net, laddr, ldev, 0)))
pr_warn("%s xmit: Local address not yet configured!\n",
p->name);
- else if (!ipv6_addr_is_multicast(&p->raddr) &&
- unlikely(ipv6_chk_addr(net, &p->raddr, NULL, 0)))
+ else if (!ipv6_addr_is_multicast(raddr) &&
+ unlikely(ipv6_chk_addr(net, raddr, NULL, 0)))
pr_warn("%s xmit: Routing loop! Remote address found on this node!\n",
p->name);
else
@@ -977,6 +981,10 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
if (!fl6->flowi6_mark)
dst = ip6_tnl_dst_check(t);
+
+ if (!ip6_tnl_xmit_ctl(t, &fl6->saddr, &fl6->daddr))
+ goto tx_err_link_failure;
+
if (!dst) {
ndst = ip6_route_output(net, NULL, fl6);
@@ -1086,8 +1094,7 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
int err;
tproto = ACCESS_ONCE(t->parms.proto);
- if ((tproto != IPPROTO_IPIP && tproto != 0) ||
- !ip6_tnl_xmit_ctl(t))
+ if (tproto != IPPROTO_IPIP && tproto != 0)
return -1;
if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
@@ -1131,7 +1138,7 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
tproto = ACCESS_ONCE(t->parms.proto);
if ((tproto != IPPROTO_IPV6 && tproto != 0) ||
- !ip6_tnl_xmit_ctl(t) || ip6_tnl_addr_conflict(t, ipv6h))
+ ip6_tnl_addr_conflict(t, ipv6h))
return -1;
offset = ip6_tnl_parse_tlv_enc_lim(skb, skb_network_header(skb));
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index d440bb5..0e8e97e 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -416,6 +416,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
struct net_device_stats *stats = &t->dev->stats;
struct dst_entry *dst = skb_dst(skb);
struct net_device *tdev;
+ struct xfrm_state *x;
int err = -1;
if (!dst)
@@ -429,7 +430,12 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
goto tx_err_link_failure;
}
- if (!vti6_state_check(dst->xfrm, &t->parms.raddr, &t->parms.laddr))
+ x = dst->xfrm;
+ if (!vti6_state_check(x, &t->parms.raddr, &t->parms.laddr))
+ goto tx_err_link_failure;
+
+ if (!ip6_tnl_xmit_ctl(t, (const struct in6_addr *)&x->props.saddr,
+ (const struct in6_addr *)&x->id.daddr))
goto tx_err_link_failure;
tdev = dst->dev;
@@ -484,7 +490,7 @@ vti6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
ipv6h = ipv6_hdr(skb);
if ((t->parms.proto != IPPROTO_IPV6 && t->parms.proto != 0) ||
- !ip6_tnl_xmit_ctl(t) || vti6_addr_conflict(t, ipv6h))
+ vti6_addr_conflict(t, ipv6h))
goto tx_err;
xfrm_decode_session(skb, &fl, AF_INET6);
--
1.9.1
^ permalink raw reply related
* [PATCH net-next 2/2] ip6_tunnel: Add support for wildcard tunnel endpoints.
From: Steffen Klassert @ 2014-11-05 7:03 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20141105070248.GC6390@secunet.com>
This patch adds support for tunnels with local or
remote wildcard endpoints. With this we get a
NBMA tunnel mode like we have it for ipv4 and
sit tunnels.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/ipv6/ip6_tunnel.c | 41 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 40 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a8f94ff..4550d08 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -183,6 +183,7 @@ ip6_tnl_lookup(struct net *net, const struct in6_addr *remote, const struct in6_
unsigned int hash = HASH(remote, local);
struct ip6_tnl *t;
struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
+ struct in6_addr any;
for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
if (ipv6_addr_equal(local, &t->parms.laddr) &&
@@ -190,6 +191,22 @@ ip6_tnl_lookup(struct net *net, const struct in6_addr *remote, const struct in6_
(t->dev->flags & IFF_UP))
return t;
}
+
+ memset(&any, 0, sizeof(any));
+ hash = HASH(&any, local);
+ for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
+ if (ipv6_addr_equal(local, &t->parms.laddr) &&
+ (t->dev->flags & IFF_UP))
+ return t;
+ }
+
+ hash = HASH(remote, &any);
+ for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
+ if (ipv6_addr_equal(remote, &t->parms.raddr) &&
+ (t->dev->flags & IFF_UP))
+ return t;
+ }
+
t = rcu_dereference(ip6n->tnls_wc[0]);
if (t && (t->dev->flags & IFF_UP))
return t;
@@ -979,7 +996,29 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
u8 proto;
int err = -1;
- if (!fl6->flowi6_mark)
+ /* NBMA tunnel */
+ if (ipv6_addr_any(&t->parms.raddr)) {
+ struct in6_addr *addr6;
+ struct neighbour *neigh;
+ int addr_type;
+
+ if (!skb_dst(skb))
+ goto tx_err_link_failure;
+
+ neigh = dst_neigh_lookup(skb_dst(skb),
+ &ipv6_hdr(skb)->daddr);
+ if (!neigh)
+ goto tx_err_link_failure;
+
+ addr6 = (struct in6_addr *)&neigh->primary_key;
+ addr_type = ipv6_addr_type(addr6);
+
+ if (addr_type == IPV6_ADDR_ANY)
+ addr6 = &ipv6_hdr(skb)->daddr;
+
+ memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));
+ neigh_release(neigh);
+ } else if (!fl6->flowi6_mark)
dst = ip6_tnl_dst_check(t);
if (!ip6_tnl_xmit_ctl(t, &fl6->saddr, &fl6->daddr))
--
1.9.1
^ permalink raw reply related
* Re: [PATCH] drivers: net: ethernet: xilinx: xilinx_emaclite: revert the original commit "1db3ddff1602edf2390b7667dcbaa0f71512e3ea"
From: Michal Simek @ 2014-11-05 7:19 UTC (permalink / raw)
To: Chen Gang, michal.simek, soren.brinkmann, davem, sthokal,
manuel.schoelling, f.fainelli, paul.gortmaker, ebiederm
Cc: netdev, linux-kernel@vger.kernel.org
In-Reply-To: <5458E605.6070801@gmail.com>
On 11/04/2014 03:43 PM, Chen Gang wrote:
> Microblaze is a fpga soft core, it can be customized easily, which may
> cause many various hardware version strings.
>
> So the original fix patch based on hard-coded compatible version strings
> is not a good idea (although it is correct for current issue). For it,
> there will be a new solving way soon (which based on the device tree).
>
> The original issue is related with qemu, so can only change the hardware
> version string in qemu for it, then keep the original driver no touch (
> qemu is for virtualization which has much easier life than real world).
>
>
> Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
> ---
> drivers/net/ethernet/xilinx/xilinx_emaclite.c | 1 -
> 1 file changed, 1 deletion(-)
Acked-by: Michal Simek <michal.simek@xilinx.com>
Thanks,
Michal
^ permalink raw reply
* Re: [PATCH] net: eth: realtek: atp: checkpatch errors and warnings corrected
From: Roberto Medina @ 2014-11-05 7:44 UTC (permalink / raw)
To: David Laight, Joe Perches
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1C9E71E4@AcuExch.aculab.com>
On 11/04/2014 11:19 AM, David Laight wrote:
>
> That code (and all the ones below) are gibberish, neither the old or new
> sequences make any sense.
>
Thank you for your feedback I did it again and objdiff doesn't show any
changes (if I'm not mistaken).
> Almost as though you used the wrong instruction set!
>
> David
diff -Nurd
/home/vov/Git/linux-next/.tmp_objdiff/a641d0e/drivers/net/ethernet/realtek/atp.dis
/home/vov/Git/linux-next/.tmp_objdiff/5f19b70/drivers/net/ethernet/realtek/atp.dis
---
/home/vov/Git/linux-next/.tmp_objdiff/a641d0e/drivers/net/ethernet/realtek/atp.dis
2014-11-05 08:38:49.244100507 +0100
+++
/home/vov/Git/linux-next/.tmp_objdiff/5f19b70/drivers/net/ethernet/realtek/atp.dis
2014-11-05 08:39:14.654101606 +0100
@@ -1357,7 +1357,7 @@
Disassembly of section .bss:
-00000000 <num_tx_since_rx.36127>:
+00000000 <num_tx_since_rx.36172>:
: 00 00 add %al,(%eax)
...
Thanks,
Roberto
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox