* Re: [PATCH 2/4] tun: Use iovec iterators
From: YOSHIFUJI Hideaki @ 2014-11-05 2:49 UTC (permalink / raw)
To: Herbert Xu, Al Viro, David S. Miller, netdev,
Linux Kernel Mailing List, Benjamin LaHaise
Cc: hideaki.yoshifuji
In-Reply-To: <20141104083721.GA12691@gondor.apana.org.au>
Hi,
Herbert Xu wrote:
> Oops, this patch had a left-over skb_pull which made it broken.
> Here is a fixed version.
>
> tun: Use iovec iterators
>
> This patch removes the use of skb_copy_datagram_const_iovec in
> favour of the iovec iterator-based skb_copy_datagram_iter.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 9dd3746..ff955cdb 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
:
> @@ -1244,23 +1245,25 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> if (tun->flags & TUN_VNET_HDR)
> vnet_hdr_sz = tun->vnet_hdr_sz;
>
> + total = skb->len + vlan_hlen + vnet_hdr_sz;
> +
> if (!(tun->flags & TUN_NO_PI)) {
> - if ((len -= sizeof(pi)) < 0)
> + if (iov_iter_count(iter) < sizeof(pi))
> return -EINVAL;
>
> - if (len < skb->len + vlan_hlen + vnet_hdr_sz) {
> + if (iov_iter_count(iter) < total) {
I guess this should be: sizeof(pi) + total
--
Hideaki Yoshifuji <hideaki.yoshifuji@miraclelinux.com>
Technical Division, MIRACLE LINUX CORPORATION
^ permalink raw reply
* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Herbert Xu @ 2014-11-05 2:22 UTC (permalink / raw)
To: Al Viro
Cc: David S. Miller, netdev, Linux Kernel Mailing List,
Benjamin LaHaise
In-Reply-To: <20141104151353.GG7996@ZenIV.linux.org.uk>
On Tue, Nov 04, 2014 at 03:13:55PM +0000, Al Viro wrote:
>
> Think of it that way: every sendmsg/recvmsg path leading to memcpy_fromiovec
> and its friends (including the open-coded ones) would need to be changed
> at some point. Assuming we do not end up passing struct iov_iter * as
> an extra argument through a fairly large part of net/* (and that would
> be prohibitively hard and messy, not to mention the effects on the stack
> footprint, etc.), the most obvious strategy is to have that thing passed
> where msg_iov/msg_iovlen are - in struct msghdr. *IF* we go that way,
> it makes a whole lot of sense to start with a bunch of cleanups that
> will make sense on their own (most of callers of skb_copy_datagram_iovec
> do look like skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied); might
> as well give it an inlined helper) and will reduce the amount of places
> where ->msg_iov is used. With such cleanups standing on their own and
> being splittable from the rest of the queue. And leaving us with fewer
> places in code that deal with ->msg_iov and need to be dealt with.
I think your solution is great. However, I don't see how my four
patches impede in anyway the work that you're doing. I presume
your first patch will make skb_copy_datagram_msg just a wrapper
around skb_copy_datagram_iovec.
Since I'm not removing skb_copy_datagram_iovec (it can't be removed
until all users are gone) you can still do that and when you're
ready to switch over to iov_iter you can just move the wrapper over
to skb_copy_datagram_iter. No?
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* [PATCH net-next] r8152: disable the tasklet by default
From: Hayes Wang @ 2014-11-05 2:17 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
Let the tasklet only be enabled after open(), and be disabled for
the other situation. The tasklet is only necessary after open() for
tx/rx, so it could be disabled by default.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
drivers/net/usb/r8152.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 8ded08e..fd41675 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -2943,6 +2943,8 @@ static int rtl8152_open(struct net_device *netdev)
netif_warn(tp, ifup, netdev, "intr_urb submit failed: %d\n",
res);
free_all_mem(tp);
+ } else {
+ tasklet_enable(&tp->tl);
}
mutex_unlock(&tp->control);
@@ -2958,6 +2960,7 @@ static int rtl8152_close(struct net_device *netdev)
struct r8152 *tp = netdev_priv(netdev);
int res = 0;
+ tasklet_disable(&tp->tl);
clear_bit(WORK_ENABLE, &tp->flags);
usb_kill_urb(tp->intr_urb);
cancel_delayed_work_sync(&tp->schedule);
@@ -2975,9 +2978,7 @@ static int rtl8152_close(struct net_device *netdev)
*/
rtl_runtime_suspend_enable(tp, false);
- tasklet_disable(&tp->tl);
tp->rtl_ops.down(tp);
- tasklet_enable(&tp->tl);
mutex_unlock(&tp->control);
@@ -3887,12 +3888,15 @@ static int rtl8152_probe(struct usb_interface *intf,
else
device_set_wakeup_enable(&udev->dev, false);
+ tasklet_disable(&tp->tl);
+
netif_info(tp, probe, netdev, "%s\n", DRIVER_VERSION);
return 0;
out1:
usb_set_intfdata(intf, NULL);
+ tasklet_kill(&tp->tl);
out:
free_netdev(netdev);
return ret;
--
1.9.3
^ permalink raw reply related
* Re: [PATCH 7/7] can: m_can: workaround for transmit data less than 4 bytes
From: Dong Aisheng @ 2014-11-05 2:07 UTC (permalink / raw)
To: Oliver Hartkopp
Cc: Marc Kleine-Budde, linux-can, wg, varkabhadram, netdev,
linux-arm-kernel
In-Reply-To: <5458D0FA.1040504@hartkopp.net>
On Tue, Nov 04, 2014 at 02:13:30PM +0100, Oliver Hartkopp wrote:
>
>
> On 04.11.2014 11:33, Marc Kleine-Budde wrote:
> >On 11/04/2014 10:27 AM, Dong Aisheng wrote:
>
>
> >>+ /* We meet an IC issue that we have to write the full 8
> >
> >At least on the *insert SoC name here*, an issue with the Message RAM
> >was discovered. Sending CAN frames with dlc less than 4 bytes will lead
> >to bit errors, when the first 8 bytes of the Message RAM have not been
> >initialized (i.e. written to). To work around this issue, the first 8
> >bytes are initialized here.
>
> Yes. Also put the current IP revision (3.0.x) into the comment.
> Did inform the Bosch guys from this issue - or is it already in some errata sheet?
>
Good idea, will add it.
I will try if we can talk Bosch guys about this issue.
Regards
Dong Aisheng
> >
> >>+ * bytes (whatever value for the second word) in Message RAM to
> >>+ * avoid bit error for transmit data less than 4 bytes at the first
> >>+ * time. By initializing the first 8 bytes of tx buffer before using
> >>+ * it can avoid such issue.
> >>+ */
> >>+ m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0), 0x0);
> >>+ m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1), 0x0);
> >>+
> >> m_can_config_endisable(priv, false);
> >> }
> >
> >Can you trigger the issue when sending CAN-FD frames with dlc > 8 && dlc
> >< 64?
>
> Just a nitpick:
>
> DLC can just be 0 .. 15
>
> and the length (struct canfd_frame.len) can be from 0 .. 64
>
> See:
>
> http://lxr.free-electrons.com/source/include/uapi/linux/can.h#L83
>
> That's the reason for all these helpers
>
> http://lxr.free-electrons.com/source/drivers/net/can/dev.c#L36
>
> that hide the evil "DLC" from userspace now and make 'len' a usable
> loop variable as we were able to use the former dlc for classic CAN
> :-)
>
> Regards,
> Oliver
>
^ permalink raw reply
* Re: [PATCH 7/7] can: m_can: workaround for transmit data less than 4 bytes
From: Dong Aisheng @ 2014-11-05 2:03 UTC (permalink / raw)
To: Marc Kleine-Budde
Cc: linux-can, wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <5458AB65.7000500@pengutronix.de>
On Tue, Nov 04, 2014 at 11:33:09AM +0100, Marc Kleine-Budde wrote:
> On 11/04/2014 10:27 AM, Dong Aisheng wrote:
> >>>> It should be possible to change the for loop to go always to 8, or
> >>>> simply unroll the loop:
> >>>>
> >>>> /* errata description goes here */
> >>>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0), *(u32 *)(cf->data + 0));
> >>>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1), *(u32 *)(cf->data + 4));
> >>>>
> >>>
> >>> Yes, i tried to fix it as follows.
> >>>
> >>> /* FIXME: we meet an IC issue that we have to write the full 8
> >>> * bytes (whatever value for the second word) in Message RAM to
> >>> * avoid bit error for transmit data less than 4 bytes
> >>> */
> >>> if (cf->len <= 4) {
> >>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0),
> >>> *(u32 *)(cf->data + 0));
> >>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1),
> >>> *(u32 *)(cf->data + 4));
> >>> } else {
> >>> for (i = 0; i < cf->len; i += 4)
> >>> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(i / 4),
> >>> *(u32 *)(cf->data + i));
> >>>
> >>> Will update the patch.
> >>
> >> Both branches of the above if are doing the same thing, I think you can
> >> replace the while if ... else ... for with this:
> >>
> >
> > Not the same thing.
> > The later one will cover payload up to 64 bytes.
>
> Doh! I'm not used to CAN-FD, yet :) However, I'll apply this fix before
> adding the CAN-FD support.
>
> >> /* errata description goes here */
> >> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0), *(u32 *)(cf->data + 0));
> >> m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1), *(u32 *)(cf->data + 4));
> >>
> >> However if writing to DATA(0) and DATA(1) once in the open() function is
> >> enough this code should stay as it is.
> >
> > I tried put them into open() function and the quick test showed it worked.
> >
> > Do you think it's ok to put things into open() function for this issue
> > as follows?
> >
> > diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
> > index 065e4f1..ca55988 100644
> > --- a/drivers/net/can/m_can/m_can.c
> > +++ b/drivers/net/can/m_can/m_can.c
> > @@ -901,6 +901,15 @@ static void m_can_chip_config(struct net_device *dev)
> > /* set bittiming params */
> > m_can_set_bittiming(dev);
> >
> > + /* We meet an IC issue that we have to write the full 8
>
> At least on the *insert SoC name here*, an issue with the Message RAM
> was discovered. Sending CAN frames with dlc less than 4 bytes will lead
> to bit errors, when the first 8 bytes of the Message RAM have not been
> initialized (i.e. written to). To work around this issue, the first 8
> bytes are initialized here.
>
Looks good.
Will do like that.
> > + * bytes (whatever value for the second word) in Message RAM to
> > + * avoid bit error for transmit data less than 4 bytes at the first
> > + * time. By initializing the first 8 bytes of tx buffer before using
> > + * it can avoid such issue.
> > + */
> > + m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(0), 0x0);
> > + m_can_fifo_write(priv, 0, M_CAN_FIFO_DATA(1), 0x0);
> > +
> > m_can_config_endisable(priv, false);
> > }
>
> Can you trigger the issue when sending CAN-FD frames with dlc > 8 && dlc
> < 64?
>
No, i did not see the issue with dlc > 8 && dlc < 64.
Regards
Dong Aisheng
> Marc
>
> --
> Pengutronix e.K. | Marc Kleine-Budde |
> Industrial Linux Solutions | Phone: +49-231-2826-924 |
> Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
> Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
>
^ permalink raw reply
* Re: [0/3] net: Kill skb_copy_datagram_const_iovec
From: Al Viro @ 2014-11-05 1:53 UTC (permalink / raw)
To: David Miller
Cc: herbert, netdev, linux-kernel, bcrl, Steve French, Sage Weil,
Nicholas A. Bellinger
In-Reply-To: <20141104054513.GB7996@ZenIV.linux.org.uk>
On Tue, Nov 04, 2014 at 05:45:13AM +0000, Al Viro wrote:
> Hell knows; I hadn't finished digging through that zoo - got sidetracked back
> then. *IF* all such places either use a throwaway copy or assume that iovec
> gets modified, we can do the following: switch the access to iovecs to
> iov_iter primitives, with the first kind of callers creating a throwaway
> iov_iter and the second just feeding the same iov_iter to e.g.
> kernel_recvmsg(). iovec will remain constant, iov_iter will be advanced.
> Moreover, in a lot of cases of first kind will be able to get rid of
> throwaway iov_iter (and of manually advancing it), effectively converting
> to the second one.
All right, now I _have_ finished that. See the resulting notes below.
TL;DR version: looks like hypothesis above is correct, modulo 2 places,
both buggy - cifs smb_send_kvec() apparently relies on ->sendmsg() leaving the
iovec unchanged and so does of the ceph_tcp_sendmsg() callers
(write_partial_kvec()). For TCP that's not always true. Another apparent
bug caught in process is iscsi iscsit_do_tx_data() - assumes that iovec is
being consumed by sendmsg(). I don't see how that could not be a bug - TCP
sockets can get there and tcp_sendmsg() normally *doesn't* modify the iovec.
Sometimes it does, unfortunately for other two places...
Maintainers Cc'd...
Full version follows:
-----------------------------------------------------------------------------
->sendmsg(): there's such method in struct proto_ops and in struct proto; the
latter is called by (some of) the former, in cases when ->sendmsg() isn't the
same for the entire family.
Instances of proto ->sendmsg() are, on several occasions, called directly; some
of those calls are from another such instance (with unchanged payload). There
are two exceptions to that - in tipc_accept() and tipc_connect() we call such
instances with empty payload. All calls via method are from proto_ops
->sendmsg() instances, payload unchanged.
Instances of proto_ops ->sendmsg() are almost never called directly. All
exceptions are from another such instance with unchanged payload.
There are two places that call proto_ops ->sendmsg() via method -
__sock_sendmsg_nosec() in net/socket.c, and handle_tx() in drivers/vhost/net.c.
The latter appears to be playing somewhat unusual games with passing NULL iocb,
making it impossible to use the former... Everybody else in the kernel goes
through __sock_sendmsg_nosec(), though - it's a chokepoint for sendmsg path.
vhost callsite is somewhat worrying - granted, most of the ->sendmsg()
instances don't give a damn about iocb at all. The rest, though...
E.g. what happens if we do VHOST_NET_SET_BACKEND with backend.fd being an
AF_UNIX socket? AFAICS, if it ever gets to that ->sendmsg() call afterwards,
we'll get an oops when e.g. unix_dgram_sendmsg() calls kiocb_to_siocb(NULL).
The same goes for AF_NETLINK; AF_TIPC is even more interesting, since there
NULL iocb is used as "we are in weird callchain, socket is already locked"
flag (for tipc_{accept,connect}() callsites). What's going on there?
[After having talked with mst: drivers/vhost/net.c checks that it's
AF_PACKET/SOCK_RAW (packet_sendmsg()) *or* comes from tun.c (tun_sendmsg())
or from macvtap.c (macvtap_sendmsg()) and all of those ignore iocb completely]
FWIW, the situation with iocb is
* AF_UNIX and AF_NETLINK use it to get to sock_iocb
* AF_TIPC uses it as a flag - it never dereferences the damn thing and
only compares it with NULL, to determine whether it's in a normal call chain,
or in tipc_accept/tipc_connect one...
* everything else ignores it completely, either directly or after
passing it to something that ignores it (rxrpc has unusually deep chain, but it
still ends up ignoring the sucker).
->recvmsg(): again, present in proto_ops and proto, in the same relationship
to each other. The situation is a bit simpler there: there is only one direct
caller of an instance of struct proto ->recvmsg() and it is in another such
instance, arguments unchanged. All callers via method are in the instances of
struct proto_ops ->recvmsg(), message-related arguments unchanged. No direct
callers of struct proto_ops recvmsg, three call sites via method - regular one
in __sock_recvmsg_nosec() plus two in drivers/vhost/net.c. The latter have
NULL iocb and are saved from oopsing in ->recvmsg() by the same logics that
saves vhost on sendmsg side. iocb is ignored by everything except AF_UNIX
and AF_NETLINK (those use it for sock_iocb) and neither can be reached from
vhost path.
So it boils down to the following: drivers/vhost/net.c aside, everything goes
through __sock_sendmsg_nosec() on the sendmsg side and __sock_recvmsg_nosec()
on recvmsg one.
Call chains leading to __sock_sendmsg_nosec():
__sock_sendmsg_nosec()
<- sock_sendmsg_nosec()
<- ___sys_sendmsg()
<- __sys_sendmsg()
<- sys_compat_sendmsg()
<- sys_sendmsg()
<- __sys_sendmmsg()
<- sys_compat_senmmmsg()
<- sys_sendmmsg()
<- __sock_sendmsg()
<- do_sock_write()
<- sock_aio_write()
== ->aio_write()
<- sock_sendmsg()
<- svc_sendto() [no iovec at all]
<- ___sys_sendmsg() [see above]
<- sys_sendto()
<- kernel_sendmsg()
All syscalls (and there's quite a tangled mess with sys_socketcall,
assorted ARM wrappers, etc.) end up with iovec discarded.
Ditto for ->aio_write() callers - they all free the iovec soon after
->aio_write() returns and never look at it before freeing.
Call chains leading to __sock_recvmsg_nosec():
__sock_recvmsg_nosec()
<- sock_recvmsg_nosec()
<- ___sys_recvmsg()
<- __sys_recvmsg()
<- sys_compat_recvmsg()
<- sys_recvmsg()
<- __sys_recvmmsg()
<- sys_compat_recvmmsg()
<- sys_recvmmsg()
<- __sock_recvmsg()
<- do_sock_read()
<- sock_aio_read()
== ->aio_read()
<- sock_recvmsg() [why is it not static, BTW?]
<- ___sys_recvmsg() [see above]
<- sys_recvfrom()
<- kernel_recvmsg()
Again, both the sycalls and ->aio_read() callers end up discarding
iovec.
All of that leaves us with kernel_{send,recv}msg() as the next-order
chokepoints.
kernel_recvmsg() callers:
drbd drbd_recv_short() - single-element iovec discarded
nbd sock_xmit() - single-element iovec discarded; would be better off
with advancing iov_iter.
isdn l1oip_socket_thread() - single-element iovec discarded
lustre ksocknal_lib_recv_iov() - iovec copied (with unhappy comment),
copy passed to kernel_recvmsg() and discarded
lustre ksocknal_lib_recv_kiov() - ditto. Would be much better off
with bvec-based iov_iter
lustre libcfs_sock_read() - single-element iovec discarded; would be
better off with advancing iov_iter.
iscsi iscsit_do_rx_data() - assumes that iovec is being consumed.
AFAICS, it's guaranteed to be TCP and tcp_recvmsg() appears to act that way,
so it's probably OK...
usbip usbip_recv() - single-element iovec discarded; would be better
off with advancing iov_iter
cifs cifs_readv_from_socket() - iovec copied, copy passed to
kernel_recvmsg() and discarded; *definitely* would be better off with advancing
iov_iter.
dlm receive_from_sock() - 1- or 2-element iovec discarded.
ncpfs _recv() - single-element iovec discarded.
ocfs2 o2net_recv_tcp_msg() - single-element iovec discarded.
ceph ceph_tcp_recvmsg() - single-element iovec discarded; at least one
of the loops using it would be better off with advancing iov_iter.
ipvs ip_vs_receive() - single-element iovec discarded.
sunrpc svc_udp_recvfrom() - no payload (MSG_PEEK, that one)
tipc tipc_receive_from_sock() - single-element iovec discarded.
sunrpc svc_recvfrom() - confusing; looks like iovec is a throwaway
one, though (and we might be better off if we could use an iov_iter of bvec
sort instead).
kernel_sendmsg() callers:
drbd drbd_send() - single-element iovec discarded; would be better
off with advancing iov_iter
nbd sock_xmit() - ditto
isdn l1oip_socket_send() - single-element iovec discarded
iscsi iscsi_sw_tcp_xmit_segment() - single-element iovec discarded
lustre ksocknal_lib_send_iov() - iovec copied (with unhappy comment),
copy passed to kernel_sendmsg() and discarded
lustre ksocknal_lib_send_kiov - ditto. Would be much better off
with bvec iov_iter.
lustre libcfs_sock_write() - single-element iovec discarded; better
off with advancing iov_iter
iscsi iscsit_do_tx_data() - assumes that iovec is being consumed.
I don't see how that could not be a bug - TCP sockets can get there.
usbip stub_send_ret_submit() - iovec is built and discarded;
short write is treated as an error
usbip stub_send_ret_unlink() - ditto
usbip vhci_send_cmd_submit() - ditto
usbip vhci_send_cmd_unlink() - ditto
cifs smb_send_kvec() - apparently relies on ->sendmsg() leaving the
iovec unchanged. Looks like a bug - tcp_sendmsg() might drain iovec in some
cases.
dlm sctp_send_shutdown() - no payload
dlm sctp_init_assoc() - single-element iovec discarded
ncpfs do_send() - single-element iovec discarded
ocfs2 o2net_send_tcp_msg() - callers pass it a throwaway iovec
bnep bnep_send() - single-element iovec discarded
bnep_tx_frame() - short iovec is built and discarded
cmtp_send_frame() - single-element iovec discarded
hidp_send_frame() - single-element iovec discarded
rfcomm_send_frame() - single-element iovec discarded
rfcomm_send_test() - short iovec is built and discarded
core sock_no_sendpage*() - single-element iovec discarded
ipvs ip_vs_send_async() - single-element iovec discarded
rds rds_tcp_sendmsg() - single-element iovec discarded
rxrpc rxrpc_busy() - single-element iovec discarded
rxrpc rxrpc_process_call() - short iovec is built and discarded
rxrpc rxrpc_abort_connection() - short iovec is built and discarded
rxrpc rxrpc_reject_packets() - assumes that sendmsg drains iovec;
may be a bug.
rxrpc rxrpc_send_packet() - single-element iovec discarded
rxrpc rxkad_issue_challenge() - short iovec is built and discarded
rxrpc rxkad_send_response - short iovec is built and discarded
sunrpc xs_send_kvec() - single-element iovec discarded; really asks
or iov_iter...
tipc tipc_send_to_sock() - single-element iovec discarded; for some
reason it seems to believe that short writes never happen...
ceph ceph_tcp_sendmsg() - one caller appears to discard iovec, another
(write_partial_kvec()) apparently assumes that iovec is unchanged by sendmsg.
Not guaranteed to be true for TCP, AFAICAS.
^ permalink raw reply
* Re: [PATCH 00/13] net_sched: misc cleanups and improvements
From: Eric Dumazet @ 2014-11-05 1:47 UTC (permalink / raw)
To: Cong Wang; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim
In-Reply-To: <CAM_iQpXZBz=Ha2h3mCwnZKa3Etx7Ya9NxAWe_p8T+VjRUs5g0g@mail.gmail.com>
On Tue, 2014-11-04 at 17:25 -0800, Cong Wang wrote:
> Seriously, think about why it should when it's just cleanup's, be practical.
I seriously ask you to not do cleanups then.
Some people are working adding real stuff here, this code changing every
month is slowing them a lot.
^ permalink raw reply
* Re: [PATCH 00/13] net_sched: misc cleanups and improvements
From: Cong Wang @ 2014-11-05 1:25 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim
In-Reply-To: <1415125250.25370.2.camel@edumazet-glaptop2.roam.corp.google.com>
On Tue, Nov 4, 2014 at 10:20 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> If you do not care of writing good changelogs and explain why you want
> all this, why should I care spending time to review all this stuff ?
>
> What is the plan, beyond all this code churn ?
Why $subject isn't obvious?
I don't understand why you are only mean to me (and probably David Taht too).
Sit down and search the history, you will find:
commit c6be2a10ac2f810bdd01e978c93a8ef65b46120b
Merge: 7fd2561 44cc8ed
Author: David S. Miller <davem@davemloft.net>
Date: Wed Oct 29 15:59:43 2014 -0400
Merge branch 'xen-netback-next'
David Vrabel says:
====================
xen-netback: minor cleanups
Two minor xen-netback cleanups originally from Zoltan.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tell me, why this changelog in anyway could be better than me? which
was accepted (of course not by you). You can argue I have 13 patches,
but, there is no big difference between 13 cleanup's and 2 cleanup's,
cleanup's are cleanup's, that's it.
Seriously, think about why it should when it's just cleanup's, be practical.
^ permalink raw reply
* Re: [PATCH v2 net] tcp: zero retrans_stamp if all retrans were acked
From: Eric Dumazet @ 2014-11-05 1:20 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Neal Cardwell, Netdev, Yuchung Cheng, Eric Dumazet
In-Reply-To: <54593C59.6070004@redhat.com>
On Tue, 2014-11-04 at 18:51 -0200, Marcelo Ricardo Leitner wrote:
> And thank you guys for all the assistance on it. Btw, would you send me that
> packetdrill script? I'm curious to see how such corner case could be written
> on it.
One of the script I saw was :
You might have to adapt preconditions (tcp_rmem[]/tcp_wmem[])
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// Set a 10s timeout
+.000 setsockopt(3, SOL_TCP, TCP_USER_TIMEOUT, [10000], 4) = 0
+.000 bind(3, ..., ...) = 0
+.000 listen(3, 1) = 0
+.000 < S 0:0(0) win 32792 <mss 1460,nop,wscale 7>
+.000 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 6>
+.010 < . 1:1(0) ack 1 win 257
+.000 accept(3, ..., ...) = 4
+.000 write(4, ..., 1000) = 1000
+.000 > P. 1:1001(1000) ack 1
+.625 > P. 1:1001(1000) ack 1
+.020 < . 1:1(0) ack 1001 win 257
// Purposely write more after the specified timeout for testing
+11.0 write(4, ..., 1000) = 1000
+.000 > P. 1001:2001(1000) ack 1
+1.25 > P. 1001:2001(1000) ack 1
// socket is killed when the 2nd RTO fires at +2.50 w/o this patch
// so the next write returns ETIMEOUT
+2.60 write(4, ..., 1000) = 1000
^ permalink raw reply
* Re: [PATCH net] ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs
From: Eric Dumazet @ 2014-11-05 1:06 UTC (permalink / raw)
To: Daniel Borkmann
Cc: davem, lw1a2.jing, fw, hannes, netdev, Eric Dumazet,
David L Stevens
In-Reply-To: <1415149113-32668-1-git-send-email-dborkman@redhat.com>
On Wed, 2014-11-05 at 01:58 +0100, Daniel Borkmann wrote:
> It has been reported that generating an MLD listener report on
> devices with large MTUs (e.g. 9000) and a high number of IPv6
> addresses can trigger a skb_over_panic():
>
> skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
> head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
> dev:port1
> ------------[ cut here ]------------
> kernel BUG at net/core/skbuff.c:100!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: ixgbe(O)
> CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
> [...]
> Call Trace:
> <IRQ>
> [<ffffffff80578226>] ? skb_put+0x3a/0x3b
> [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
> [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
> [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
> [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
> [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
> [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
> [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
> [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
> [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
> [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
>
> mld_newpack() skb allocations are usually requested with dev->mtu
> in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations")
> we have changed the limit in order to be less unreliable to fail.
>
> However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
> macros, which determine if we may end up doing an skb_put() for
> adding another record. To avoid possible fragmentation, we check
> the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
> assumption as the actual max allocation size will be much smaller.
>
> The IGMP case doesn't have this issue as commit 57e1ab6eaddc
> ("igmp: refine skb allocations") stores the allocation size in the
> cb[], but therefore takes the MTU check not into account anymore.
> Add and use skb_nofrag_tailroom() for both cases.
>
> Reported-by: lw1a2.jing@gmail.com
> Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: David L Stevens <david.stevens@oracle.com>
> ---
> In skb_nofrag_tailroom(), we could actually omit the !skb->dev check,
> but I leave that rather as a possible cleanup item for net-next.
Hmm... we have a proliferation of such things.
Could you take a look at sk_stream_alloc_skb(), skb->reserved_tailroom,
and skb_availroom() ?
Thanks !
^ permalink raw reply
* Re: [PATCH 02/13] net_sched: introduce qdisc_peek() helper function
From: Cong Wang @ 2014-11-05 1:05 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Linux Kernel Network Developers
In-Reply-To: <20141104104515.2e6433c6@uryu.home.lan>
On Tue, Nov 4, 2014 at 10:45 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 4 Nov 2014 09:56:25 -0800
> Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
>> +static inline void qdisc_warn_nonwc(void *func, struct Qdisc *qdisc)
>> +{
>> + if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
>> + pr_warn("%pf: %s qdisc %X: is non-work-conserving?\n",
>> + func, qdisc->ops->id, qdisc->handle >> 16);
>> + qdisc->flags |= TCQ_F_WARN_NONWC;
>> + }
>> +}
>> +
>
> Inilining this and creating N copies of same message is not a step forward.
Hmm, I think gcc merges same string literals when building Linux kernel?
But I never verify this.
^ permalink raw reply
* [PATCH net] ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs
From: Daniel Borkmann @ 2014-11-05 0:58 UTC (permalink / raw)
To: davem; +Cc: lw1a2.jing, fw, hannes, netdev, Eric Dumazet, David L Stevens
It has been reported that generating an MLD listener report on
devices with large MTUs (e.g. 9000) and a high number of IPv6
addresses can trigger a skb_over_panic():
skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
dev:port1
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:100!
invalid opcode: 0000 [#1] SMP
Modules linked in: ixgbe(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
[...]
Call Trace:
<IRQ>
[<ffffffff80578226>] ? skb_put+0x3a/0x3b
[<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
[<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
[<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
[<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
[<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
[<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
[<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
[<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
[<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
[<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
mld_newpack() skb allocations are usually requested with dev->mtu
in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations")
we have changed the limit in order to be less unreliable to fail.
However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
macros, which determine if we may end up doing an skb_put() for
adding another record. To avoid possible fragmentation, we check
the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
assumption as the actual max allocation size will be much smaller.
The IGMP case doesn't have this issue as commit 57e1ab6eaddc
("igmp: refine skb allocations") stores the allocation size in the
cb[], but therefore takes the MTU check not into account anymore.
Add and use skb_nofrag_tailroom() for both cases.
Reported-by: lw1a2.jing@gmail.com
Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David L Stevens <david.stevens@oracle.com>
---
In skb_nofrag_tailroom(), we could actually omit the !skb->dev check,
but I leave that rather as a possible cleanup item for net-next.
include/linux/netdevice.h | 15 +++++++++++++++
net/ipv4/igmp.c | 6 +-----
net/ipv6/mcast.c | 3 +--
3 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 74fd5d3..e4f4cfa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2262,6 +2262,21 @@ do { \
compute_pseudo(skb, proto)); \
} while (0)
+/**
+ * skb_nofrag_tailroom - bytes at buffer end still fitting into MTU
+ * @skb: buffer to check
+ *
+ * Return the number of bytes of free space at the tail of an sk_buff
+ * that still fit into the device MTU.
+ */
+static inline int skb_nofrag_tailroom(const struct sk_buff *skb)
+{
+ if (!skb->dev)
+ return skb_tailroom(skb);
+
+ return clamp_t(int, skb->dev->mtu - skb->len, 0, skb_tailroom(skb));
+}
+
static inline int dev_hard_header(struct sk_buff *skb, struct net_device *dev,
unsigned short type,
const void *daddr, const void *saddr,
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index fb70e3e..a750dfb 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -318,8 +318,6 @@ igmp_scount(struct ip_mc_list *pmc, int type, int gdeleted, int sdeleted)
return scount;
}
-#define igmp_skb_size(skb) (*(unsigned int *)((skb)->cb))
-
static struct sk_buff *igmpv3_newpack(struct net_device *dev, int size)
{
struct sk_buff *skb;
@@ -341,7 +339,6 @@ static struct sk_buff *igmpv3_newpack(struct net_device *dev, int size)
return NULL;
}
skb->priority = TC_PRIO_CONTROL;
- igmp_skb_size(skb) = size;
rt = ip_route_output_ports(net, &fl4, NULL, IGMPV3_ALL_MCR, 0,
0, 0,
@@ -423,8 +420,7 @@ static struct sk_buff *add_grhead(struct sk_buff *skb, struct ip_mc_list *pmc,
return skb;
}
-#define AVAILABLE(skb) ((skb) ? ((skb)->dev ? igmp_skb_size(skb) - (skb)->len : \
- skb_tailroom(skb)) : 0)
+#define AVAILABLE(skb) ((skb) ? skb_nofrag_tailroom(skb) : 0)
static struct sk_buff *add_grec(struct sk_buff *skb, struct ip_mc_list *pmc,
int type, int gdeleted, int sdeleted)
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 9648de2..1bc18f9 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1690,8 +1690,7 @@ static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
return skb;
}
-#define AVAILABLE(skb) ((skb) ? ((skb)->dev ? (skb)->dev->mtu - (skb)->len : \
- skb_tailroom(skb)) : 0)
+#define AVAILABLE(skb) ((skb) ? skb_nofrag_tailroom(skb) : 0)
static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc,
int type, int gdeleted, int sdeleted, int crsend)
--
1.9.3
^ permalink raw reply related
* Re: [PATCH] bridge: include in6.h in if_bridge.h for struct in6_addr
From: Cong Wang @ 2014-11-05 0:39 UTC (permalink / raw)
To: Gregory Fong
Cc: Linux Kernel Network Developers, linux-api, LKML, carlos, eblake,
Kumar Gala, Florian Fainelli, David Miller
In-Reply-To: <1415128881-30183-1-git-send-email-gregory.0xf0@gmail.com>
On Tue, Nov 4, 2014 at 11:21 AM, Gregory Fong <gregory.0xf0@gmail.com> wrote:
> if_bridge.h uses struct in6_addr ip6, but wasn't including the in6.h
> header. Thomas Backlund originally sent a patch to do this, but this
> revealed a redefinition issue: https://lkml.org/lkml/2013/1/13/116
>
> The redefinition issue should have been fixed by the following Linux
> commits:
> ee262ad827f89e2dc7851ec2986953b5b125c6bc inet: defines IPPROTO_* needed for module alias generation
> cfd280c91253cc28e4919e349fa7a813b63e71e8 net: sync some IP headers with glibc
>
> and the following glibc commit:
> 6c82a2f8d7c8e21e39237225c819f182ae438db3 Coordinate IPv6 definitions for Linux and glibc
>
> so actually include the header now.
>
> Reported-by: Colin Guthrie <colin@mageia.org>
> Reported-by: Christiaan Welvaart <cjw@daneel.dyndns.org>
> Reported-by: Thomas Backlund <tmb@mageia.org>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Thanks for working on it!
^ permalink raw reply
* [PATCH] ipvs: Keep skb->sk when allocating headroom on tunnel xmit
From: Calvin Owens @ 2014-11-05 0:37 UTC (permalink / raw)
To: Simon Horman, Julian Anastasov, Wensong Zhang
Cc: lvs-devel, linux-kernel, netdev, agartrell, kernel-team,
Calvin Owens
ip_vs_prepare_tunneled_skb() ignores ->sk when allocating a new
skb, either unconditionally setting ->sk to NULL or allowing
the uninitialized ->sk from a newly allocated skb to leak through
to the caller.
This patch properly copies ->sk and increments its reference count.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
net/netfilter/ipvs/ip_vs_xmit.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 437a366..bd90bf8 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -846,6 +846,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
new_skb = skb_realloc_headroom(skb, max_headroom);
if (!new_skb)
goto error;
+ if (skb->sk)
+ skb_set_owner_w(new_skb, skb->sk);
consume_skb(skb);
skb = new_skb;
}
--
2.1.1
^ permalink raw reply related
* Re: [Patch net-next v2] neigh: remove dynamic neigh table registration support
From: Cong Wang @ 2014-11-05 0:38 UTC (permalink / raw)
To: David Miller; +Cc: Linux Kernel Network Developers
In-Reply-To: <20141104.170251.950859328398615616.davem@davemloft.net>
On Tue, Nov 4, 2014 at 2:02 PM, David Miller <davem@davemloft.net> wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Mon, 3 Nov 2014 10:14:14 -0800
>
>> Currently there are only three neigh tables in the whole kernel:
>> arp table, ndisc table and decnet neigh table. What's more,
>> we don't support registering multiple tables per family.
>> Therefore we can just make these tables statically built-in.
>>
>> Cc: David S. Miller <davem@davemloft.net>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>> v2: remove useless #ifdef's
>> move the assignment to the end of neigh_table_init()
>
> neigh_table_clear should definitely NULL out the slot, otherwise
> we hold in there a pointer to module memory which is about to
> be released.
Good catch! I totally missed neigh_table_clear().
Thanks!
^ permalink raw reply
* Atlas 2.5% End Of Year Promotion Offer!!
From: Atlas Finance Loans @ 2014-11-04 19:54 UTC (permalink / raw)
In-Reply-To: <1785258737.69546.1415130873710.JavaMail.root@ninhbinh.gov.vn>
[-- Attachment #1: Type: text/plain, Size: 0 bytes --]
[-- Attachment #2: Atlas 2.5% Loan Offer.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 12096 bytes --]
^ permalink raw reply
* [PATCH v2 net-next] udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts
From: Rick Jones @ 2014-11-04 23:47 UTC (permalink / raw)
To: netdev; +Cc: davem
From: Rick Jones <rick.jones2@hp.com>
As NIC multicast filtering isn't perfect, and some platforms are
quite content to spew broadcasts, we should not trigger an event
for skb:kfree_skb when we do not have a match for such an incoming
datagram. We do though want to avoid sweeping the matter under the
rug entirely, so increment a suitable statistic.
This incorporates feedback from David L. Stevens, Karl Neiss and Eric
Dumazet.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
---
Noticed __udp4_lib_mcast_deliver showing-up in a perf dropped packet
profile on a system sitting on a network with a bunch of Windows boxes
sending what they are fond of sending.
Verified that the new UDP_MIB_IGNOREDMULTI increments when ignored
datagrams are encountered, but was unable to cross the i's and dot
the t's of perf because the perf built from the tree at the time
wasn't happy in general. Also hit a test system with some netperf
multicast UDP_STREAM and UDP_RR testing but that is the extent of
the testing performed.
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index df40137..30f541b 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -156,6 +156,7 @@ enum
UDP_MIB_RCVBUFERRORS, /* RcvbufErrors */
UDP_MIB_SNDBUFERRORS, /* SndbufErrors */
UDP_MIB_CSUMERRORS, /* InCsumErrors */
+ UDP_MIB_IGNOREDMULTI, /* IgnoredMulti */
__UDP_MIB_MAX
};
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 8e3eb39..5c5450c 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -181,6 +181,7 @@ static const struct snmp_mib snmp4_udp_list[] = {
SNMP_MIB_ITEM("RcvbufErrors", UDP_MIB_RCVBUFERRORS),
SNMP_MIB_ITEM("SndbufErrors", UDP_MIB_SNDBUFERRORS),
SNMP_MIB_ITEM("InCsumErrors", UDP_MIB_CSUMERRORS),
+ SNMP_MIB_ITEM("IgnoredMulti", UDP_MIB_IGNOREDMULTI),
SNMP_MIB_SENTINEL
};
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cd0db54..1215f89 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1647,7 +1647,8 @@ static void udp_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst)
static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
struct udphdr *uh,
__be32 saddr, __be32 daddr,
- struct udp_table *udptable)
+ struct udp_table *udptable,
+ int proto)
{
struct sock *sk, *stack[256 / sizeof(struct sock *)];
struct hlist_nulls_node *node;
@@ -1656,6 +1657,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
int dif = skb->dev->ifindex;
unsigned int count = 0, offset = offsetof(typeof(*sk), sk_nulls_node);
unsigned int hash2 = 0, hash2_any = 0, use_hash2 = (hslot->count > 10);
+ unsigned int inner_flushed = 0;
if (use_hash2) {
hash2_any = udp4_portaddr_hash(net, htonl(INADDR_ANY), hnum) &
@@ -1674,6 +1676,7 @@ start_lookup:
dif, hnum)) {
if (unlikely(count == ARRAY_SIZE(stack))) {
flush_stack(stack, count, skb, ~0);
+ inner_flushed = 1;
count = 0;
}
stack[count++] = sk;
@@ -1695,7 +1698,10 @@ start_lookup:
if (count) {
flush_stack(stack, count, skb, count - 1);
} else {
- kfree_skb(skb);
+ if (!inner_flushed)
+ UDP_INC_STATS_BH(net, UDP_MIB_IGNOREDMULTI,
+ proto == IPPROTO_UDPLITE);
+ consume_skb(skb);
}
return 0;
}
@@ -1780,7 +1786,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
} else {
if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
return __udp4_lib_mcast_deliver(net, skb, uh,
- saddr, daddr, udptable);
+ saddr, daddr, udptable, proto);
sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
}
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 1752cd0..679253d0 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -136,6 +136,7 @@ static const struct snmp_mib snmp6_udp6_list[] = {
SNMP_MIB_ITEM("Udp6RcvbufErrors", UDP_MIB_RCVBUFERRORS),
SNMP_MIB_ITEM("Udp6SndbufErrors", UDP_MIB_SNDBUFERRORS),
SNMP_MIB_ITEM("Udp6InCsumErrors", UDP_MIB_CSUMERRORS),
+ SNMP_MIB_ITEM("Udp6IgnoredMulti", UDP_MIB_IGNOREDMULTI),
SNMP_MIB_SENTINEL
};
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index f6ba535..d80f21e 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -771,7 +771,7 @@ static void udp6_csum_zero_error(struct sk_buff *skb)
*/
static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
const struct in6_addr *saddr, const struct in6_addr *daddr,
- struct udp_table *udptable)
+ struct udp_table *udptable, int proto)
{
struct sock *sk, *stack[256 / sizeof(struct sock *)];
const struct udphdr *uh = udp_hdr(skb);
@@ -781,6 +781,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
int dif = inet6_iif(skb);
unsigned int count = 0, offset = offsetof(typeof(*sk), sk_nulls_node);
unsigned int hash2 = 0, hash2_any = 0, use_hash2 = (hslot->count > 10);
+ int inner_flushed = 0;
if (use_hash2) {
hash2_any = udp6_portaddr_hash(net, &in6addr_any, hnum) &
@@ -803,6 +804,7 @@ start_lookup:
(uh->check || udp_sk(sk)->no_check6_rx)) {
if (unlikely(count == ARRAY_SIZE(stack))) {
flush_stack(stack, count, skb, ~0);
+ inner_flushed = 1;
count = 0;
}
stack[count++] = sk;
@@ -821,7 +823,10 @@ start_lookup:
if (count) {
flush_stack(stack, count, skb, count - 1);
} else {
- kfree_skb(skb);
+ if (!inner_flushed)
+ UDP_INC_STATS_BH(net, UDP_MIB_IGNOREDMULTI,
+ proto == IPPROTO_UDPLITE);
+ consume_skb(skb);
}
return 0;
}
@@ -873,7 +878,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
*/
if (ipv6_addr_is_multicast(daddr))
return __udp6_lib_mcast_deliver(net, skb,
- saddr, daddr, udptable);
+ saddr, daddr, udptable, proto);
/* Unicast */
^ permalink raw reply related
* Re: [PATCH net 2/5] i40e: Implement ndo_gso_check()
From: Jesse Gross @ 2014-11-04 23:45 UTC (permalink / raw)
To: Joe Stringer
Cc: netdev, Sathya Perla, Jeff Kirsher, linux.nics, amirv,
shahed.shaikh, Dept-GELinuxNICDev, Tom Herbert,
Linux Kernel Mailing List
In-Reply-To: <1415138202-1197-3-git-send-email-joestringer@nicira.com>
On Tue, Nov 4, 2014 at 1:56 PM, Joe Stringer <joestringer@nicira.com> wrote:
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index c3a7f4a..21829b5 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> +static bool i40e_gso_check(struct sk_buff *skb, struct net_device *dev)
> +{
> + if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) &&
> + (skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
> + skb->inner_protocol != htons(ETH_P_TEB) ||
> + skb_inner_mac_header(skb) - skb_transport_header(skb) > 64))
> + return false;
I think it may be possible to even support a few more things here.
According to the datasheet here:
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf
This can actually support 64 bytes beyond the tunnel header, which
would make for a total of 80 bytes. It looks like it can also support
IPv4 or IPv6 beyond just Ethernet as the encapsulated protocol.
Intel guys, can you confirm that this is correct?
^ permalink raw reply
* Re: [PATCH 00/20] kselftest install target feature
From: Shuah Khan @ 2014-11-04 23:45 UTC (permalink / raw)
To: Kees Cook
Cc: Greg KH, Andrew Morton, Michal Marek, David S. Miller,
tranmanphong, David Herrmann, Hugh Dickins, bobby.prani,
Eric W. Biederman, Serge E. Hallyn, linux-kbuild, LKML, Linux API,
Network Development
In-Reply-To: <CAGXu5jK9-TpOF4SFX2bkU2UJMXk-3W+-_we0BTxu4Ga+7bYXSQ@mail.gmail.com>
On 11/04/2014 12:22 PM, Kees Cook wrote:
> On Tue, Nov 4, 2014 at 9:10 AM, Shuah Khan <shuahkh@osg.samsung.com> wrote:
>> This patch series adds a new kselftest_install make target
>> to enable selftest install. When make kselftest_install is
>> run, selftests are installed on the system. A new install
>> target is added to selftests Makefile which will install
>> targets for the tests that are specified in INSTALL_TARGETS.
>> During install, a script is generated to run tests that are
>> installed. This script will be installed in the selftest install
>> directory. Individual test Makefiles are changed to add to the
>> script. This will allow new tests to add install and run test
>> commands to the generated kselftest script.
>
> I'm all for making the self tests more available, but I don't think
> this is the right approach. My primary objection is that it creates a
> second way to run tests, and that means any changes and additions need
> to be updated in two places. I'd much rather just maintain the single
> "make" targets instead. Having "make" available on the target device
> doesn't seem too bad to me. Is there a reason that doesn't work for
> your situation?
Kees,
My primary objective is to provide a way to install selftests for a
specific kernel release. This will allow developers to run tests for
a specific release and look for regressions. Adding an install target
will also help support local execution of tests in a virtualized
environments. In some cases such as qemu, it is not practical to
expect the target to have support for "make". Once tests are installed
to be run outside the git environment, we need a master script that
can run the tests. Hence the need for a master script that can run
tests.
We have the ability to run all tests via make kselftest target or
run a specific test using the individual test's run_tests target.
Both of above are necessary to support running tests from the tree.
Embedding run_tests logic in the makefiles doesn't work very well
in the long run.
We also need a way to run them outside tree. I agree with you that
the way I added the script generation, duplicates the code in individual
run_tests targets and that changes/updates need to be made in both
places.
Would you be ok with the approach if I fixed the duplicating
problem? I can address the duplication concern easily.
>
> I would, however, like to see some better standardization of the test
> "framework" that we've got in there already. (For example, some
> failures fail the "make", some don't, there are various reporting
> methods for success/failure depending on the test, etc.)
This is being addressed and I have the framework in linux-kselftest
git next branch at the moment. I do think the above work is part of
addressing the larger framework issues such as being able to run tests
on a target system that might not have "make" support and makes it
easier to use.
thanks,
-- Shuah
--
Shuah Khan
Sr. Linux Kernel Developer
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978
^ permalink raw reply
* [PATCH net-next] net: Convert SEQ_START_TOKEN/seq_printf to seq_puts
From: Joe Perches @ 2014-11-04 23:37 UTC (permalink / raw)
To: netdev
Using a single fixed string is smaller code size than using
a format and many string arguments.
Reduces overall code size a little.
$ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
text data bss dec hex filename
34269 7012 14824 56105 db29 net/ipv4/igmp.o.new
34315 7012 14824 56151 db57 net/ipv4/igmp.o.old
30078 7869 13200 51147 c7cb net/ipv6/mcast.o.new
30105 7869 13200 51174 c7e6 net/ipv6/mcast.o.old
11434 3748 8580 23762 5cd2 net/ipv6/ip6_flowlabel.o.new
11491 3748 8580 23819 5d0b net/ipv6/ip6_flowlabel.o.old
Signed-off-by: Joe Perches <joe@perches.com>
---
Done by using printf with these formats and args
net/ipv4/igmp.c | 6 +-----
net/ipv6/ip6_flowlabel.c | 3 +--
net/ipv6/mcast.c | 6 +-----
3 files changed, 3 insertions(+), 12 deletions(-)
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 3f80513..1e4adae 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -2687,11 +2687,7 @@ static int igmp_mcf_seq_show(struct seq_file *seq, void *v)
struct igmp_mcf_iter_state *state = igmp_mcf_seq_private(seq);
if (v == SEQ_START_TOKEN) {
- seq_printf(seq,
- "%3s %6s "
- "%10s %10s %6s %6s\n", "Idx",
- "Device", "MCA",
- "SRC", "INC", "EXC");
+ seq_puts(seq, "Idx Device MCA SRC INC EXC\n");
} else {
seq_printf(seq,
"%3d %6.6s 0x%08x "
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index c143437..7221021 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -770,8 +770,7 @@ static int ip6fl_seq_show(struct seq_file *seq, void *v)
{
struct ip6fl_iter_state *state = ip6fl_seq_private(seq);
if (v == SEQ_START_TOKEN) {
- seq_printf(seq, "%-5s %-1s %-6s %-6s %-6s %-8s %-32s %s\n",
- "Label", "S", "Owner", "Users", "Linger", "Expires", "Dst", "Opt");
+ seq_puts(seq, "Label S Owner Users Linger Expires Dst Opt\n");
} else {
struct ip6_flowlabel *fl = v;
seq_printf(seq,
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 9648de2..e04f184 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -2823,11 +2823,7 @@ static int igmp6_mcf_seq_show(struct seq_file *seq, void *v)
struct igmp6_mcf_iter_state *state = igmp6_mcf_seq_private(seq);
if (v == SEQ_START_TOKEN) {
- seq_printf(seq,
- "%3s %6s "
- "%32s %32s %6s %6s\n", "Idx",
- "Device", "Multicast Address",
- "Source Address", "INC", "EXC");
+ seq_puts(seq, "Idx Device Multicast Address Source Address INC EXC\n");
} else {
seq_printf(seq,
"%3d %6.6s %pi6 %pi6 %6lu %6lu\n",
^ permalink raw reply related
* [PATCH net-next] fast_hash: avoid indirect function calls
From: Hannes Frederic Sowa @ 2014-11-04 23:23 UTC (permalink / raw)
To: netdev; +Cc: kernel, dborkman, Thomas Graf
By default the arch_fast_hash hashing function pointers are initialized
to jhash(2). If during boot-up a CPU with SSE4.2 is detected they get
updated to the CRC32 ones. This dispatching scheme incurs a function
pointer lookup and indirect call for every hashing operation.
rhashtable as a user of arch_fast_hash e.g. stores pointers to hashing
functions in its structure, too, causing two indirect branches per
hashing operation.
Using alternative_call we can get away with one of those indirect branches.
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
Hi,
I targetted net-next because original implementation went in over netdev@
and networking is the main user.
Would it make sense to start suppressing the generation of local
functions for static inline functions which address is taken?
E.g. we could use extern inline in a few cases (dst_output is often used
as a function pointer but marked static inline). We could mark it as
extern inline and copy&paste the code to a .c file to prevent multiple
copies of machine code for this function. But because of the copy&paste I
did not in this case.
Bye,
Hannes
arch/x86/include/asm/hash.h | 51 ++++++++++++++++++++++++++++++++++++++++-----
arch/x86/lib/hash.c | 29 +++++++++++++++-----------
include/asm-generic/hash.h | 36 ++++++++++++++++++++++++++++++--
include/linux/hash.h | 34 ------------------------------
lib/Makefile | 2 +-
lib/hash.c | 39 ----------------------------------
6 files changed, 98 insertions(+), 93 deletions(-)
delete mode 100644 lib/hash.c
diff --git a/arch/x86/include/asm/hash.h b/arch/x86/include/asm/hash.h
index e8c58f8..a881d78 100644
--- a/arch/x86/include/asm/hash.h
+++ b/arch/x86/include/asm/hash.h
@@ -1,7 +1,48 @@
-#ifndef _ASM_X86_HASH_H
-#define _ASM_X86_HASH_H
+#ifndef __ASM_X86_HASH_H
+#define __ASM_X86_HASH_H
-struct fast_hash_ops;
-extern void setup_arch_fast_hash(struct fast_hash_ops *ops);
+#include <linux/cpufeature.h>
+#include <asm/alternative.h>
-#endif /* _ASM_X86_HASH_H */
+u32 __intel_crc4_2_hash(const void *data, u32 len, u32 seed);
+u32 __intel_crc4_2_hash2(const u32 *data, u32 len, u32 seed);
+
+/*
+ * non-inline versions of jhash so gcc does not need to generate
+ * duplicate code in every object file
+ */
+u32 __jhash(const void *data, u32 len, u32 seed);
+u32 __jhash2(const u32 *data, u32 len, u32 seed);
+
+/*
+ * for documentation of these functions please look into
+ * <include/asm-generic/hash.h>
+ */
+
+static inline u32 arch_fast_hash(const void *data, u32 len, u32 seed)
+{
+ u32 hash;
+
+ alternative_call(__jhash, __intel_crc4_2_hash, X86_FEATURE_XMM4_2,
+#ifdef CONFIG_X86_64
+ "=a" (hash), "D" (data), "S" (len), "d" (seed));
+#else
+ "=a" (hash), "a" (data), "d" (len), "c" (seed));
+#endif
+ return hash;
+}
+
+static inline u32 arch_fast_hash2(const u32 *data, u32 len, u32 seed)
+{
+ u32 hash;
+
+ alternative_call(__jhash2, __intel_crc4_2_hash2, X86_FEATURE_XMM4_2,
+#ifdef CONFIG_X86_64
+ "=a" (hash), "D" (data), "S" (len), "d" (seed));
+#else
+ "=a" (hash), "a" (data), "d" (len), "c" (seed));
+#endif
+ return hash;
+}
+
+#endif /* __ASM_X86_HASH_H */
diff --git a/arch/x86/lib/hash.c b/arch/x86/lib/hash.c
index ff4fa51..e143271 100644
--- a/arch/x86/lib/hash.c
+++ b/arch/x86/lib/hash.c
@@ -31,13 +31,13 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
-#include <linux/hash.h>
-#include <linux/init.h>
-
#include <asm/processor.h>
#include <asm/cpufeature.h>
#include <asm/hash.h>
+#include <linux/hash.h>
+#include <linux/jhash.h>
+
static inline u32 crc32_u32(u32 crc, u32 val)
{
#ifdef CONFIG_AS_CRC32
@@ -48,7 +48,7 @@ static inline u32 crc32_u32(u32 crc, u32 val)
return crc;
}
-static u32 intel_crc4_2_hash(const void *data, u32 len, u32 seed)
+u32 __intel_crc4_2_hash(const void *data, u32 len, u32 seed)
{
const u32 *p32 = (const u32 *) data;
u32 i, tmp = 0;
@@ -71,22 +71,27 @@ static u32 intel_crc4_2_hash(const void *data, u32 len, u32 seed)
return seed;
}
+EXPORT_SYMBOL(__intel_crc4_2_hash);
-static u32 intel_crc4_2_hash2(const u32 *data, u32 len, u32 seed)
+u32 __intel_crc4_2_hash2(const u32 *data, u32 len, u32 seed)
{
- const u32 *p32 = (const u32 *) data;
u32 i;
for (i = 0; i < len; i++)
- seed = crc32_u32(seed, *p32++);
+ seed = crc32_u32(seed, *data++);
return seed;
}
+EXPORT_SYMBOL(__intel_crc4_2_hash2);
-void __init setup_arch_fast_hash(struct fast_hash_ops *ops)
+u32 __jhash(const void *data, u32 len, u32 seed)
{
- if (cpu_has_xmm4_2) {
- ops->hash = intel_crc4_2_hash;
- ops->hash2 = intel_crc4_2_hash2;
- }
+ return jhash(data, len, seed);
+}
+EXPORT_SYMBOL(__jhash);
+
+u32 __jhash2(const u32 *data, u32 len, u32 seed)
+{
+ return jhash2(data, len, seed);
}
+EXPORT_SYMBOL(__jhash2);
diff --git a/include/asm-generic/hash.h b/include/asm-generic/hash.h
index b631284..3c82760 100644
--- a/include/asm-generic/hash.h
+++ b/include/asm-generic/hash.h
@@ -1,9 +1,41 @@
#ifndef __ASM_GENERIC_HASH_H
#define __ASM_GENERIC_HASH_H
-struct fast_hash_ops;
-static inline void setup_arch_fast_hash(struct fast_hash_ops *ops)
+#include <linux/jhash.h>
+
+/**
+ * arch_fast_hash - Caclulates a hash over a given buffer that can have
+ * arbitrary size. This function will eventually use an
+ * architecture-optimized hashing implementation if
+ * available, and trades off distribution for speed.
+ *
+ * @data: buffer to hash
+ * @len: length of buffer in bytes
+ * @seed: start seed
+ *
+ * Returns 32bit hash.
+ */
+static inline u32 arch_fast_hash(const void *data, u32 len, u32 seed)
+{
+ return jhash(data, len, seed);
+}
+
+/**
+ * arch_fast_hash2 - Caclulates a hash over a given buffer that has a
+ * size that is of a multiple of 32bit words. This
+ * function will eventually use an architecture-
+ * optimized hashing implementation if available,
+ * and trades off distribution for speed.
+ *
+ * @data: buffer to hash (must be 32bit padded)
+ * @len: number of 32bit words
+ * @seed: start seed
+ *
+ * Returns 32bit hash.
+ */
+static inline u32 arch_fast_hash2(const u32 *data, u32 len, u32 seed)
{
+ return jhash2(data, len, seed);
}
#endif /* __ASM_GENERIC_HASH_H */
diff --git a/include/linux/hash.h b/include/linux/hash.h
index d0494c3..6e8fb02 100644
--- a/include/linux/hash.h
+++ b/include/linux/hash.h
@@ -84,38 +84,4 @@ static inline u32 hash32_ptr(const void *ptr)
return (u32)val;
}
-struct fast_hash_ops {
- u32 (*hash)(const void *data, u32 len, u32 seed);
- u32 (*hash2)(const u32 *data, u32 len, u32 seed);
-};
-
-/**
- * arch_fast_hash - Caclulates a hash over a given buffer that can have
- * arbitrary size. This function will eventually use an
- * architecture-optimized hashing implementation if
- * available, and trades off distribution for speed.
- *
- * @data: buffer to hash
- * @len: length of buffer in bytes
- * @seed: start seed
- *
- * Returns 32bit hash.
- */
-extern u32 arch_fast_hash(const void *data, u32 len, u32 seed);
-
-/**
- * arch_fast_hash2 - Caclulates a hash over a given buffer that has a
- * size that is of a multiple of 32bit words. This
- * function will eventually use an architecture-
- * optimized hashing implementation if available,
- * and trades off distribution for speed.
- *
- * @data: buffer to hash (must be 32bit padded)
- * @len: number of 32bit words
- * @seed: start seed
- *
- * Returns 32bit hash.
- */
-extern u32 arch_fast_hash2(const u32 *data, u32 len, u32 seed);
-
#endif /* _LINUX_HASH_H */
diff --git a/lib/Makefile b/lib/Makefile
index 7512dc9..04e53dd 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -26,7 +26,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
gcd.o lcm.o list_sort.o uuid.o flex_array.o iovec.o clz_ctz.o \
bsearch.o find_last_bit.o find_next_bit.o llist.o memweight.o kfifo.o \
- percpu-refcount.o percpu_ida.o hash.o rhashtable.o
+ percpu-refcount.o percpu_ida.o rhashtable.o
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
obj-y += kstrtox.o
diff --git a/lib/hash.c b/lib/hash.c
deleted file mode 100644
index fea973f..0000000
--- a/lib/hash.c
+++ /dev/null
@@ -1,39 +0,0 @@
-/* General purpose hashing library
- *
- * That's a start of a kernel hashing library, which can be extended
- * with further algorithms in future. arch_fast_hash{2,}() will
- * eventually resolve to an architecture optimized implementation.
- *
- * Copyright 2013 Francesco Fusco <ffusco@redhat.com>
- * Copyright 2013 Daniel Borkmann <dborkman@redhat.com>
- * Copyright 2013 Thomas Graf <tgraf@redhat.com>
- * Licensed under the GNU General Public License, version 2.0 (GPLv2)
- */
-
-#include <linux/jhash.h>
-#include <linux/hash.h>
-#include <linux/cache.h>
-
-static struct fast_hash_ops arch_hash_ops __read_mostly = {
- .hash = jhash,
- .hash2 = jhash2,
-};
-
-u32 arch_fast_hash(const void *data, u32 len, u32 seed)
-{
- return arch_hash_ops.hash(data, len, seed);
-}
-EXPORT_SYMBOL_GPL(arch_fast_hash);
-
-u32 arch_fast_hash2(const u32 *data, u32 len, u32 seed)
-{
- return arch_hash_ops.hash2(data, len, seed);
-}
-EXPORT_SYMBOL_GPL(arch_fast_hash2);
-
-static int __init hashlib_init(void)
-{
- setup_arch_fast_hash(&arch_hash_ops);
- return 0;
-}
-early_initcall(hashlib_init);
--
1.9.3
^ permalink raw reply related
* Re: [PATCH net-next 3/7] bpf: add array type of eBPF maps
From: Alexei Starovoitov @ 2014-11-04 23:14 UTC (permalink / raw)
To: Daniel Borkmann
Cc: David S. Miller, Ingo Molnar, Andy Lutomirski,
Hannes Frederic Sowa, Eric Dumazet, Linux API,
Network Development, LKML
In-Reply-To: <5458A349.5020401@redhat.com>
On Tue, Nov 4, 2014 at 1:58 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> +
>> + memcpy(array->value + array->elem_size * index, value,
>> array->elem_size);
>
> What would protect this from concurrent updates?
nothing.
that's what I meant in commit log:
- map_update_elem() replaces elements in an non-atomic way
(for atomic updates hashtable type should be used instead)
The array map is like C array of structures.
Nothing protects concurrent access.
It's used in the cases where accuracy is not needed
or when there is no concurrent access.
To compute a histogram of events in tracing the array
of integers is used. Every integer is a counter. Program
increments it (may be without using xadd) and
user space periodically reads it back.
map_update_elem() is called by userspace once
to initialize it if zero-init is not enough.
Programs do lookup() and modify the values.
For array type update() method is used rarely,
delete() is never used and get_next() is needed
for completeness to browse maps through
common map API.
I'm guessing you're asking, because it may feel
that adding a lock() will help to make it more useful? ;)
It's not, since programs cannot take a lock().
^ permalink raw reply
* Re: [PATCH net-next 6/7] bpf: allow eBPF programs to use maps
From: Alexei Starovoitov @ 2014-11-04 23:08 UTC (permalink / raw)
To: Daniel Borkmann
Cc: David S. Miller, Ingo Molnar, Andy Lutomirski,
Hannes Frederic Sowa, Eric Dumazet, Linux API,
Network Development, LKML
In-Reply-To: <5458A17B.7030904@redhat.com>
On Tue, Nov 4, 2014 at 1:50 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> These WARN_ON_ONCE(!rcu_read_lock_held()) seem odd. While I see the point
> that
> you're holding RCU read lock on the lookup, can you elaborate on your RCU
> usage
> here and why it's necessary for delete/update?
>
> I suspect due to the synchronize_rcu() you're using and not using any RCU
> accessors but plain memcpy() e.g. in case of the array ...?
Correct in case of array.
Also hash delete/update() call into lookup() internally
that is using _rcu() helpers...
Future map types might have much more
complex implementations (like LPM), so it helps
to state the rules early.
Another reason is more complex to explain:
A program that intends to access maps has to be one
rcu critical section. So all lookup/update/delete calls
are under rcu_lock_held.
Since programs by themselves cannot have WARN_ON
inside them, I've added WARN_ON in these three
functions that will be called from the programs to make
sure that kernel subsystems don't do (*prog->bpf_func)(...)
without taking rcu_lock if they intend to let programs
access maps.
Having said that in the future we might have a case
for programs that don't call into these functions at all
and execute instructions only. Those won't need
rcu_lock() wrap. I experimented with that for the
patch where I replaced pred-tree walker with eBPF
program. There is no rcu there. And no calls
to map accessors.
Has to be noted, that socket filters use rcu to
protect sk_filter pointer and program itself. So for that
use case we'll keep using rcu for foreseeable future.
For tracing filters I had to add rcu_lock() around
BPF_PROG_RUN() invocation and these WARN_ON
checks saved me a lot of headache, so I prefer to
keep them since they cost nothing when lockdep is off.
^ permalink raw reply
* Re: [PATCH net-next 1/7] bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM command
From: Alexei Starovoitov @ 2014-11-04 23:04 UTC (permalink / raw)
To: Daniel Borkmann
Cc: David S. Miller, Ingo Molnar, Andy Lutomirski,
Hannes Frederic Sowa, Eric Dumazet, Linux API,
Network Development, LKML
In-Reply-To: <54589B89.5000309-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
On Tue, Nov 4, 2014 at 1:25 AM, Daniel Borkmann <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> On 11/04/2014 03:54 AM, Alexei Starovoitov wrote:
>>
>> the current meaning of BPF_MAP_UPDATE_ELEM syscall command is:
>> either update existing map element or create a new one.
>> Initially the plan was to add a new command to handle the case of
>> 'create new element if it didn't exist', but 'flags' style looks
>> cleaner and overall diff is much smaller (more code reused), so add 'flags'
>> attribute to BPF_MAP_UPDATE_ELEM command with the following meaning:
>> enum {
>> BPF_MAP_UPDATE_OR_CREATE = 0, /* add new element or update existing */
>> BPF_MAP_CREATE_ONLY, /* add new element if it didn't exist */
>> BPF_MAP_UPDATE_ONLY /* update existing element */
>> };
>
>
> From you commit message/code I currently don't see an explanation why
> it cannot be done in typical ``flags style'' as various syscalls do,
> i.e. BPF_MAP_UPDATE_OR_CREATE rather represented as ...
>
> BPF_MAP_CREATE | BPF_MAP_UPDATE
>
> Do you expect more than 64 different flags to be passed from user space
> for BPF_MAP?
several reasons:
- preserve flags==0 as default behavior
- avoid holes and extra checks for invalid combinations, so
if (flags > BPF_MAP_UPDATE_ONLY) goto err, is enough.
- it looks much neater when user space uses
BPF_MAP_UPDATE_OR_CREATE instead of ORing bits.
Note this choice doesn't prevent adding bit-like flags
in the future. Today I cannot think of any new flags
for the update() command, but if somebody comes up with
a new selector that can apply to all three combinations,
we can add it as 3rd bit that can be ORed.
Default will stay zero and 'if >' check in older
kernels will seamlessly work with new userspace.
I don't like holes in flags and combinatorial
explosion of bits and checks for them unless
absolutely necessary.
^ permalink raw reply
* Re: Fw: [Bug 82471] New: net/core/dev.c skb_war_bad_offload
From: Tom Herbert @ 2014-11-04 22:57 UTC (permalink / raw)
To: Jesse Brandeburg; +Cc: Stephen Hemminger, NetDEV list, Jesse Brandeburg
In-Reply-To: <CAEuXFEwJeLeCRmNMXVk9naKd8zC2gos7-ORzu7XW=Ja0N-CkEA@mail.gmail.com>
Using vlan and bonding? vlan_dev_hard_start_xmit called. A possible
cause is that bonding interface is out of sync with slave interface
w.r.t. GSO features. Do we know if this worked in 3.14, 3.15?
On Tue, Nov 4, 2014 at 1:59 PM, Jesse Brandeburg
<jesse.brandeburg@gmail.com> wrote:
> I believe this is a regression, as reporters say this worked with 3.13 kernels.
>
> anyone have any idea what is up with this skb_warn_bad_offload with
> the bonding driver? see the bug text
> for a lot more detail. Is it fixed already? This is occurring on top
> of both Intel and Broadcom nics and is
> with 802.3ad bonding enabled, and turning off scatter gather avoids the issue.
>
> On Fri, Aug 15, 2014 at 2:26 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
>
>> Subject: [Bug 82471] New: net/core/dev.c skb_war_bad_offload
>>
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=82471
>>
>> Bug ID: 82471
>> Summary: net/core/dev.c skb_war_bad_offload
>> Product: Networking
>> Version: 2.5
>> Kernel Version: 3.16.1
>> Hardware: x86-64
>> OS: Linux
>> Tree: Mainline
>> Status: NEW
>> Severity: normal
>> Priority: P1
>> Component: IPV4
>> Assignee: shemminger@linux-foundation.org
>> Reporter: vladi@aresgate.net
>> Regression: No
>>
>> Created attachment 146761
>> --> https://bugzilla.kernel.org/attachment.cgi?id=146761&action=edit
>> kernel config
>>
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973225] ------------[
>> cut here ]------------
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973236] WARNING: CPU:
>> 2 PID: 0 at net/core/dev.c:2246 skb_warn_bad_offload+0xc8/0xd5()
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973238] :
>> caps=(0x000000000419fba9, 0x00000000001b583b) len=2962 data_len=2896
>> gso_size=1448 gso_type=1 ip_summed=3
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973239] Modules
>> linked in: ntfs msdos xfs libcrc32c ipmi_devintf intel_rapl
>> x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul
>> ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper
>> cryptd sb_edac edac_core 8021q garp ioatdma stp ipmi_si mrp llc bonding
>> hid_generic ixgbe usbhid hid ahci dca libahci mdio
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973257] CPU: 2 PID: 0
>> Comm: swapper/2 Tainted: G W 3.16.0 #2
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973259] Hardware
>> name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973260]
>> 0000000000000009 ffff88046fd036b0 ffffffff815c4096 ffff88046fd036f8
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973262]
>> ffff88046fd036e8 ffffffff8103f633 ffff880018b7c4e0 ffff8804687df000
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973264]
>> 0000000000000001 0000000000000003 ffffffffa0193320 ffff88046fd03748
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973266] Call Trace:
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973268] <IRQ>
>> [<ffffffff815c4096>] dump_stack+0x45/0x56
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973280]
>> [<ffffffff8103f633>] warn_slowpath_common+0x73/0x90
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973286]
>> [<ffffffff8103f697>] warn_slowpath_fmt+0x47/0x50
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973288]
>> [<ffffffff812edadc>] ? ___ratelimit+0x7c/0xf0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973291]
>> [<ffffffff815c5e19>] skb_warn_bad_offload+0xc8/0xd5
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973294]
>> [<ffffffff814e57fe>] skb_checksum_help+0x16e/0x180
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973297]
>> [<ffffffff814e9ecc>] dev_hard_start_xmit+0x42c/0x4b0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973299]
>> [<ffffffff814ea154>] ? __dev_queue_xmit+0x204/0x440
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973301]
>> [<ffffffff814ea232>] __dev_queue_xmit+0x2e2/0x440
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973302]
>> [<ffffffff814ea39b>] ? dev_queue_xmit+0xb/0x10
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973304]
>> [<ffffffff814ea39b>] dev_queue_xmit+0xb/0x10
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973308]
>> [<ffffffffa012b758>] vlan_dev_hard_start_xmit+0x88/0x100 [8021q]
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973317]
>> [<ffffffff814e9d9a>] dev_hard_start_xmit+0x2fa/0x4b0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973321]
>> [<ffffffff814ea232>] __dev_queue_xmit+0x2e2/0x440
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973323]
>> [<ffffffff814ea39b>] dev_queue_xmit+0xb/0x10
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973325]
>> [<ffffffff814f1192>] neigh_connected_output+0xb2/0xf0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973327]
>> [<ffffffff815192dc>] ip_finish_output+0x4ec/0x890
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973329]
>> [<ffffffff8151ac03>] ip_output+0x53/0x90
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973331]
>> [<ffffffff8151a39b>] ip_local_out_sk+0x2b/0x30
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973333]
>> [<ffffffff8151a6fa>] ip_queue_xmit+0x13a/0x3c0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973335]
>> [<ffffffff815309fa>] tcp_transmit_skb+0x42a/0x8f0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973337]
>> [<ffffffff81530ffa>] tcp_write_xmit+0x13a/0xc00
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973347]
>> [<ffffffff8152f033>] ? tcp_established_options+0x33/0xd0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973350]
>> [<ffffffff81531d09>] __tcp_push_pending_frames+0x29/0xc0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973353]
>> [<ffffffff8152da77>] tcp_rcv_established+0x1f7/0x5e0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973356]
>> [<ffffffff81535fc5>] tcp_v4_do_rcv+0x215/0x4a0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973369]
>> [<ffffffff810651f8>] ? ttwu_do_activate.constprop.64+0x58/0x60
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973374]
>> [<ffffffff81295d31>] ? security_sock_rcv_skb+0x11/0x20
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973377]
>> [<ffffffff815392ad>] tcp_v4_rcv+0x73d/0x7c0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973380]
>> [<ffffffff8106fafc>] ? update_group_capacity+0x16c/0x270
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973386]
>> [<ffffffff812e8300>] ? cpumask_next_and+0x30/0x50
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973388]
>> [<ffffffff81514b50>] ip_local_deliver_finish+0x80/0x1c0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973390]
>> [<ffffffff81515154>] ip_local_deliver+0x34/0x90
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973392]
>> [<ffffffff81514d99>] ip_rcv_finish+0x109/0x350
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973399]
>> [<ffffffff815153d2>] ip_rcv+0x222/0x370
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973403]
>> [<ffffffff814e5eb6>] __netif_receive_skb_core+0x416/0x570
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973407]
>> [<ffffffff814e73f3>] __netif_receive_skb+0x13/0x60
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973410]
>> [<ffffffff814e745e>] netif_receive_skb_internal+0x1e/0x90
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973415]
>> [<ffffffff814e7b40>] napi_gro_receive+0x70/0xa0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973422]
>> [<ffffffffa0134f4c>] ixgbe_clean_rx_irq+0x75c/0xb20 [ixgbe]
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973427]
>> [<ffffffffa0136172>] ixgbe_poll+0x522/0x850 [ixgbe]
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973430]
>> [<ffffffff8105e986>] ? hrtimer_get_next_event+0xb6/0xc0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973437]
>> [<ffffffff814e8dc1>] net_rx_action+0x101/0x1a0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973443]
>> [<ffffffff810430ab>] __do_softirq+0xdb/0x240
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973447]
>> [<ffffffff8104349e>] irq_exit+0xee/0x110
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973450]
>> [<ffffffff81004913>] do_IRQ+0x53/0xf0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973453]
>> [<ffffffff815caaaa>] common_interrupt+0x6a/0x6a
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973454] <EOI>
>> [<ffffffff814af007>] ? cpuidle_enter_state+0x47/0xc0
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973463]
>> [<ffffffff814af132>] cpuidle_enter+0x12/0x20
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973468]
>> [<ffffffff81075fbf>] cpu_startup_entry+0x24f/0x280
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973477]
>> [<ffffffff810905e3>] ? clockevents_config_and_register+0x23/0x30
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973482]
>> [<ffffffff810282be>] start_secondary+0x1be/0x270
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.973486] ---[ end
>> trace de552357488766e8 ]---
>> Aug 6 06:46:50 prod-ent-ceph03.dc2.ec.loc kernel: [29530.974181] ------------[
>> cut here ]------------
>>
>> --
>> You are receiving this mail because:
>> You are the assignee for the bug.
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox