* [PATCH net-next] packet: add sockopt to ignore outgoing packets
From: Vincent Whitchurch @ 2018-08-30 10:01 UTC (permalink / raw)
To: davem; +Cc: netdev, willemb, Vincent Whitchurch
Currently, the only way to ignore outgoing packets on a packet socket is
via the BPF filter. With MSG_ZEROCOPY, packets that are looped into
AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
if the filter run from packet_rcv() would reject them. So the presence
of a packet socket on the interface takes away the benefits of
MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
packets. (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
cloned, but the cost for that is much lower.)
Add a socket option to allow AF_PACKET sockets to ignore outgoing
packets to solve this. Note that the *BSDs already have something
similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
The first intended user is lldpd.
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
---
include/linux/netdevice.h | 1 +
include/uapi/linux/if_packet.h | 1 +
net/core/dev.c | 3 +++
net/packet/af_packet.c | 15 +++++++++++++++
4 files changed, 20 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca5ab98053c8..8ef14d9edc58 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2317,6 +2317,7 @@ static inline struct sk_buff *call_gro_receive_sk(gro_receive_sk_t cb,
struct packet_type {
__be16 type; /* This is really htons(ether_type). */
+ bool ignore_outgoing;
struct net_device *dev; /* NULL is wildcarded here */
int (*func) (struct sk_buff *,
struct net_device *,
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 67b61d91d89b..467b654bd4c7 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -57,6 +57,7 @@ struct sockaddr_ll {
#define PACKET_QDISC_BYPASS 20
#define PACKET_ROLLOVER_STATS 21
#define PACKET_FANOUT_DATA 22
+#define PACKET_IGNORE_OUTGOING 23
#define PACKET_FANOUT_HASH 0
#define PACKET_FANOUT_LB 1
diff --git a/net/core/dev.c b/net/core/dev.c
index 325fc5088370..0addb4f0abfe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1947,6 +1947,9 @@ static inline bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb)
if (!ptype->af_packet_priv || !skb->sk)
return false;
+ if (ptype->ignore_outgoing)
+ return true;
+
if (ptype->id_match)
return ptype->id_match(ptype, skb->sk);
else if ((struct sock *)ptype->af_packet_priv == skb->sk)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5610061e7f2e..37bbafad724a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3805,6 +3805,18 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
return fanout_set_data(po, optval, optlen);
}
+ case PACKET_IGNORE_OUTGOING:
+ {
+ int val;
+
+ if (optlen != sizeof(val))
+ return -EINVAL;
+ if (copy_from_user(&val, optval, sizeof(val)))
+ return -EFAULT;
+
+ po->prot_hook.ignore_outgoing = !!val;
+ return 0;
+ }
case PACKET_TX_HAS_OFF:
{
unsigned int val;
@@ -3928,6 +3940,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
((u32)po->fanout->flags << 24)) :
0);
break;
+ case PACKET_IGNORE_OUTGOING:
+ val = po->prot_hook.ignore_outgoing;
+ break;
case PACKET_ROLLOVER_STATS:
if (!po->rollover)
return -EINVAL;
--
2.11.0
^ permalink raw reply related
* [PATCH] rtl8xxxu: Add rtl8188ctv support
From: Aleksei Mamlin @ 2018-08-30 14:05 UTC (permalink / raw)
To: Jes Sorensen
Cc: Kalle Valo, David S . Miller, linux-wireless, netdev,
linux-kernel, Aleksei Mamlin
The Realtek rtl8188ctv (0x0bda:0x018a) is a highly integrated single-chip
WLAN USB2.0 network interface controller.
Currently rtl8188ctv is supported by rtlwifi driver.
It is similar to the rtl8188cus(0x0bda:0x818a) and uses the same config.
Signed-off-by: Aleksei Mamlin <mamlinav@gmail.com>
---
drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
index 505ab1b055ff..73f6fc0d4a01 100644
--- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
+++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
@@ -6231,6 +6231,8 @@ static const struct usb_device_id dev_table[] = {
{USB_DEVICE_AND_INTERFACE_INFO(0x2001, 0x3308, 0xff, 0xff, 0xff),
.driver_info = (unsigned long)&rtl8192cu_fops},
/* Currently untested 8188 series devices */
+{USB_DEVICE_AND_INTERFACE_INFO(USB_VENDOR_ID_REALTEK, 0x018a, 0xff, 0xff, 0xff),
+ .driver_info = (unsigned long)&rtl8192cu_fops},
{USB_DEVICE_AND_INTERFACE_INFO(USB_VENDOR_ID_REALTEK, 0x8191, 0xff, 0xff, 0xff),
.driver_info = (unsigned long)&rtl8192cu_fops},
{USB_DEVICE_AND_INTERFACE_INFO(USB_VENDOR_ID_REALTEK, 0x8170, 0xff, 0xff, 0xff),
--
2.16.4
^ permalink raw reply related
* Re: [PATCH 04/15] soc: octeontx2: Add mailbox support infra
From: Arnd Bergmann @ 2018-08-30 13:56 UTC (permalink / raw)
To: Sunil Kovvuri
Cc: Linux Kernel Mailing List, Olof Johansson, Linux ARM, linux-soc,
amakarov, sgoutham, lbartosik, Networking, David Miller
In-Reply-To: <CA+sq2Cdn6ut3kdTWoOgCayO2evQrD9TjBaUpVUR3_rWZFxYeig@mail.gmail.com>
On Tue, Aug 28, 2018 at 3:23 PM Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
>
> On Tue, Aug 28, 2018 at 6:22 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Tue, Aug 28, 2018 at 2:48 PM Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
> > >
> > > On Tue, Aug 28, 2018 at 5:33 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > On Tue, Aug 28, 2018 at 12:57 PM <sunil.kovvuri@gmail.com> wrote:
> > > > >
> > > > > From: Aleksey Makarov <amakarov@marvell.com>
> > > > >
> > > > > This patch adds mailbox support infrastructure APIs.
> > > > > Each RVU device has a dedicated 64KB mailbox region
> > > > > shared with it's peer for communication. RVU AF has
> > > > > a separate mailbox region shared with each of RVU PFs
> > > > > and a RVU PF has a separate region shared with each of
> > > > > it's VF.
> > > > >
> > > > > These set of APIs are used by this driver (RVU AF) and
> > > > > other RVU PF/VF drivers eg netdev, crypto e.t.c.
> > > > >
> > > > > Signed-off-by: Aleksey Makarov <amakarov@marvell.com>
> > > > > Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
> > > > > Signed-off-by: Lukasz Bartosik <lbartosik@marvell.com>
> > > >
> > > > Why does this driver not use the drivers/mailbox/ infrastructure?
> > > >
> > > This is a common administrative software driver which will be handling requests
> > > from kernel drivers and as well as drivers in userspace applications.
> > > We had to keep mailbox communication infrastructure same across all usages.
> >
> > Can you explain more about the usage of userspace applications
> > and what interface you plan to use into the kernel?
>
> Any PCI device here irrespective in what domain (kernel or userspace)
> they are in
> use common mailbox communication. Which is
> # Write a mailbox msg (format is agreed between all parties) into
> shared (between AF and other PF/VFs)
> memory region and trigger a interrupt to admin function.
> # Admin function processes the msg and puts reply in the same memory
> region and trigger
> IRQ to the requesting device. If the device has a driver instance
> in kernel then it uses
> IRQ and userspace applications does polling on the IRQ status bit.
Ok, so the mailbox here is a communication mechanism between
two device drivers that may run on the same kernel, or in different
instances (user space, virtual machine, ...), but each driver
only talks to the mailbox visible in its own device, right?
What is the purpose of the exported interface then? Is this
just an abstraction so each of the drivers can talk to its own
mailbox using a set of common helper functions?
Arnd
^ permalink raw reply
* Re: [PATCH 00/15] soc: octeontx2: Add RVU admin function driver
From: Andrew Lunn @ 2018-08-30 13:26 UTC (permalink / raw)
To: Sunil Kovvuri
Cc: Arnd Bergmann, LKML, olof, LAKML, linux-soc, Sunil Goutham,
Linux Netdev List, David S. Miller
In-Reply-To: <CA+sq2CfbpQKas7TbQfrcw=Drpb7yO_gqH31iXabjNEQEG4xYrQ@mail.gmail.com>
> > > My feeling overall is that we need a review from the network driver
> > > folks more than the arm-soc team etc, and that maybe the driver
> > > as a whole should go into drivers/net/ethernet.
> >
> > This driver doesn't handle any network IO and moreever this driver has to handle
> > configuration requests from crypto driver as well. There will be
> > separate network and
> > crypto drivers which will be upstreamed into drivers/net/ethernet and
> > drivers/crypto.
> > And in future silicons there will be different types of functional
> > blocks which will be
> > added into this resource virtualization unit (RVU). Hence i thought
> > this driver is not a
> > right fit in drivers/net/ethernet.
Hi Sunil
Do you have a git branch for everything? I would like to look at the
actual Ethernet driver, and the full API this driver exports to other
drivers.
I think there real question here is, do you have split between this
driver and the actual device drivers in the right place? For me, link
up/down detection should be in the Ethernet driver, since it is not
shared with the crypto driver.
Thanks
Andrew
^ permalink raw reply
* [PATCH] staging: fsl-dpaa2/ethsw: Fix uninitialized variables
From: Ioana Radulescu @ 2018-08-30 13:17 UTC (permalink / raw)
To: gregkh, netdev; +Cc: devel, linux-kernel, dan.carpenter, ioana.ciornei
Functions port_vlans_add() and port_vlans_del() could,
in theory, return an uninitialized variable. Fix this
by initializing the variable in question at declaration.
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
---
drivers/staging/fsl-dpaa2/ethsw/ethsw.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
index ecdd3d8..c1616c3 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
+++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
@@ -717,7 +717,7 @@ static int port_vlans_add(struct net_device *netdev,
struct switchdev_trans *trans)
{
struct ethsw_port_priv *port_priv = netdev_priv(netdev);
- int vid, err;
+ int vid, err = 0;
if (netif_is_bridge_master(vlan->obj.orig_dev))
return -EOPNOTSUPP;
@@ -872,7 +872,7 @@ static int port_vlans_del(struct net_device *netdev,
const struct switchdev_obj_port_vlan *vlan)
{
struct ethsw_port_priv *port_priv = netdev_priv(netdev);
- int vid, err;
+ int vid, err = 0;
if (netif_is_bridge_master(vlan->obj.orig_dev))
return -EOPNOTSUPP;
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net-next 1/5] pppoe: fix PPPOEIOCSFWD compat handling
From: Guillaume Nault @ 2018-08-30 13:09 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Paul Mackerras, linux-ppp, Networking, mitch, mostrows, jchapman,
xeb, David Miller, Al Viro, y2038 Mailman List,
Linux Kernel Mailing List
In-Reply-To: <CAK8P3a1WSa5yPWXYGieAc0+Xk=CAy_G-RhNNcQBsGQGOU-8WfA@mail.gmail.com>
On Thu, Aug 30, 2018 at 01:54:48PM +0200, Arnd Bergmann wrote:
> On Thu, Aug 30, 2018 at 1:04 PM Guillaume Nault <g.nault@alphalink.fr> wrote:
> >
> > On Wed, Aug 29, 2018 at 04:03:26PM +0200, Arnd Bergmann wrote:
> > > Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
> > > linux-2.5.69 along with hundreds of other commands, but was always broken
> > > sincen only the structure is compatible, but the command number is not,
> > > due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
> > > sockaddr_pppox)), which is different on 64-bit architectures.
> > >
> > And the implementation was broken until 2016 (see 29e73269aa4d ("pppoe:
> > fix reference counting in PPPoE proxy")), and nobody ever noticed. I
> > should probably have removed this ioctl entirely instead of fixing it.
> > Clearly, it has never been used.
> >
> > If you think it's worth fixing (as opposed to dropping this ioctl or
> > its compat mode), then,
> > Acked-by: Guillaume Nault <g.nault@alphalink.fr>
>
> I don't care much, but fixing it seems seems easier than coming
> up with a convincing rationale for dropping.
>
> I'll update the changelog text to include your additional background
> information though, unless someone else prefers to have it dropped.
>
Sounds good. Thanks.
^ permalink raw reply
* Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
From: Björn Töpel @ 2018-08-30 9:05 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
Alexander Duyck, ast, Jesper Dangaard Brouer, Netdev,
Brandeburg, Jesse, Singhai, Anjali, peter.waskiewicz.jr,
Björn Töpel, michael.lundkvist, Willem de Bruijn,
John Fastabend, Jakub Kicinski, neerav.parikh, MykytaI Iziumtsev,
Francois Ozog
In-Reply-To: <1c26d5c6-cd37-b022-fe34-1ca48ada598f@iogearbox.net>
Den ons 29 aug. 2018 kl 18:12 skrev Daniel Borkmann <daniel@iogearbox.net>:
>
> On 08/28/2018 02:44 PM, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > This patch set introduces zero-copy AF_XDP support for Intel's i40e
> > driver. In the first preparatory patch we also add support for
> > XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> > redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> > set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> > Brouer for reviewing earlier versions of this patch set.
> >
> > The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> > that in the interest of time, to get an AF_XDP zero-copy implementation
> > out there for people to try, some code paths have been copied from the
> > XDP path to the zero-copy path. It is out goal to merge the two paths
> > in later patch sets.
> >
> > In contrast to the implementation from beginning of June, this patch
> > set does not require any extra HW queues for AF_XDP zero-copy
> > TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> > AF_XDP zero-copy TX.
> >
> > Jeff, given that most of changes are in i40e, it is up to you how you
> > would like to route these patches. The set is tagged bpf-next, but
> > if taking it via the Intel driver tree is easier, let us know.
> >
> > We have run some benchmarks on a dual socket system with two Broadwell
> > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> > cores which gives a total of 28, but only two cores are used in these
> > experiments. One for TR/RX and one for the user space application. The
> > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> > 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> > memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> > NIC is Intel I40E 40Gbit/s using the i40e driver.
> >
> > Below are the results in Mpps of the I40E NIC benchmark runs for 64
> > and 1500 byte packets, generated by a commercial packet generator HW
> > outputing packets at full 40 Gbit/s line rate. The results are with
> > retpoline and all other spectre and meltdown fixes, so these results
> > are not comparable to the ones from the zero-copy patch set in June.
> >
> > AF_XDP performance 64 byte packets.
> > Benchmark XDP_SKB XDP_DRV XDP_DRV with zerocopy
> > rxdrop 2.6 8.2 15.0
> > txpush 2.2 - 21.9
> > l2fwd 1.7 2.3 11.3
> >
> > AF_XDP performance 1500 byte packets:
> > Benchmark XDP_SKB XDP_DRV XDP_DRV with zerocopy
> > rxdrop 2.0 3.3 3.3
> > l2fwd 1.3 1.7 3.1
> >
> > XDP performance on our system as a base line:
> >
> > 64 byte packets:
> > XDP stats CPU pps issue-pps
> > XDP-RX CPU 16 18.4M 0
> >
> > 1500 byte packets:
> > XDP stats CPU pps issue-pps
> > XDP-RX CPU 16 3.3M 0
> >
> > The structure of the patch set is as follows:
> >
> > Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> > Patches 2-4: Preparatory patches to common xsk and net code
> > Patches 5-7: Preparatory patches to i40e driver code for RX
> > Patch 8: i40e zero-copy support for RX
> > Patch 9: Preparatory patch to i40e driver code for TX
> > Patch 10: i40e zero-copy support for TX
> > Patch 11: Add flags to sample application to force zero-copy/copy mode
> >
> > We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> >
> >
> > Magnus & Björn
> >
> > Björn Töpel (8):
> > xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
> > xdp: export xdp_rxq_info_unreg_mem_model
> > xsk: expose xdp_umem_get_{data,dma} to drivers
> > i40e: added queue pair disable/enable functions
> > i40e: refactor Rx path for re-use
> > i40e: move common Rx functions to i40e_txrx_common.h
> > i40e: add AF_XDP zero-copy Rx support
> > samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
> >
> > Magnus Karlsson (3):
> > net: add napi_if_scheduled_mark_missed
> > i40e: move common Tx functions to i40e_txrx_common.h
> > i40e: add AF_XDP zero-copy Tx support
>
> Thanks for working on this, LGTM! Are you also planning to get ixgbe
> out after that?
>
Yes, the plan is to get ixgbe out as the next driver, but we'll focus
on the existing i40e issues first.
Thanks,
Björn
> For the series:
>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>
> Thanks,
> Daniel
^ permalink raw reply
* Re: [PATCH 0/4] clk-pmc-atom + r8169: Add ether_clk handling to fix suspend issues
From: Andy Shevchenko @ 2018-08-30 8:43 UTC (permalink / raw)
To: Hans de Goede
Cc: David S . Miller, Heiner Kallweit, Michael Turquette,
Stephen Boyd, Irina Tirdea, netdev, Johannes Stezenbach,
Carlo Caione, linux-clk
In-Reply-To: <e3be83dc-67c7-fdaa-1dd9-0a4aa3d10584@redhat.com>
On Wed, Aug 29, 2018 at 07:06:09PM +0200, Hans de Goede wrote:
> > What a nice stuff, thanks!
> >
> > Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>
> Thank you (and also thank you for the other reviews)
>
> > Btw, you probably better to refer to
> > https://bugzilla.kernel.org/show_bug.cgi?id=196861 which is dedicated for S0ix
> > issue.
>
> Ok, I've added a link to that. I've also added:
>
> Reported-by: Johannes Stezenbach <js@sig21.net>
>
> To honor Johannes for figuring out that leaving the clocks enabled
> was a problem in the first place.
>
> This will all be included in v2 of the series when I send it out.
Forgot to mention, instead of Irina, which is not longer working for Intel for
more than a year, put Pierre's address there.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* Re: [PATCH 0/3] IB/ipoib: Use dev_port to disambiguate port numbers
From: Arseny Maslennikov @ 2018-08-30 8:42 UTC (permalink / raw)
To: Leon Romanovsky; +Cc: linux-rdma, Doug Ledford, Jason Gunthorpe, netdev
In-Reply-To: <20180830054330.GC2796@mtr-leonro.mtl.com>
[-- Attachment #1: Type: text/plain, Size: 1805 bytes --]
On Thu, Aug 30, 2018 at 08:43:30AM +0300, Leon Romanovsky wrote:
> On Wed, Aug 29, 2018 at 12:01:14AM +0300, Arseny Maslennikov wrote:
> > Pre-3.15 userspace had trouble distinguishing different ports
> > of a NIC on a single PCI bus/device/function. To solve this,
> > a sysfs field `dev_port' was introduced quite a while ago
> > (commit v3.14-rc3-739-g3f85944fe207), and some relevant device
> > drivers were fixed to use it, but not in case of IPoIB.
> >
> > The convention for some reason never got documented in the kernel, but
> > was immediately adopted by userspace (notably udev[1][2], biosdevname[3])
> >
> > 3/3 documents the sysfs field — that's why I'm CC-ing netdev.
> >
> > This series was tested on and applies to 4.19-rc1.
> >
> > [1] https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html
> > [2] https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html
> > [3] https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38
> >
> > Arseny Maslennikov (3):
> > IB/ipoib: Use dev_port to expose network interface port numbers
> > IB/ipoib: Stop using dev_id to expose port numbers
>
> I completely agree with previous Yuval's comment, it makes no sense to
> start separate commits for every line.
>
> Please decide what is best and right behavior and do it, instead of
> pushing it up to be the maintainer's problem.
>
No problem; will squash those two in v2, then.
> Thanks
>
> > Documentation/ABI: document /sys/class/net/*/dev_port
> >
> > Documentation/ABI/testing/sysfs-class-net | 10 ++++++++++
> > drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
> > 2 files changed, 11 insertions(+), 1 deletion(-)
> >
> > --
> > 2.18.0
> >
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: mlx5 driver loading failing on v4.19 / net-next / bpf-next
From: Tariq Toukan @ 2018-08-30 8:35 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Saeed Mahameed, netdev@vger.kernel.org
Cc: Eran Ben Elisha
In-Reply-To: <20180829170358.5d822db8@redhat.com>
On 29/08/2018 6:05 PM, Jesper Dangaard Brouer wrote:
> Hi Saeed,
>
> I'm having issues loading mlx5 driver on v4.19 kernels (tested both
> net-next and bpf-next), while kernel v4.18 seems to work. It happens
> with a Mellanox ConnectX-5 NIC (and also a CX4-Lx but I removed that
> from the system now).
>
Hi Jesper,
Thanks for your report!
We are working to analyze and debug the issue.
> One pain point is very long boot-time, caused by some timeout code in
> the driver. The kernel console log (dmesg) says:
>
> [ 5.763330] mlx5_core 0000:03:00.0: firmware version: 16.22.1002
> [ 5.769367] mlx5_core 0000:03:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8 GT/s x16 link at 0000:00:02.0 (capable of 252.048 Gb/s with 16 GT/s x16 link)
>
> (...) other drivers loading
>
> [ 66.816635] mlx5_core 0000:03:00.0: wait_func:964:(pid 112): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource
> [ 66.828123] mlx5_core 0000:03:00.0: enable hca failed
> [ 66.845516] mlx5_core 0000:03:00.0: mlx5_load_one failed with error code -110
> [ 66.852802] mlx5_core: probe of 0000:03:00.0 failed with error -110
>
> [ 66.859347] mlx5_core 0000:03:00.1: firmware version: 16.22.1002
> [ 66.865388] mlx5_core 0000:03:00.1: 126.016 Gb/s available PCIe bandwidth, limited by 8 GT/s x16 link at 0000:00:02.0 (capable of 252.048 Gb/s with 16 GT/s x16 link)
>
> [ 125.787395] XFS (sda3): Mounting V5 Filesystem
> [ 125.848509] XFS (sda3): Ending clean mount
> [ 127.984784] mlx5_core 0000:03:00.1: wait_func:964:(pid 5): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource
> [ 127.996090] mlx5_core 0000:03:00.1: enable hca failed
> [ 128.013819] mlx5_core 0000:03:00.1: mlx5_load_one failed with error code -110
> [ 128.021076] mlx5_core: probe of 0000:03:00.1 failed with error -110
>
>
> Do you have any idea what could be causing this?
>
We'll update regarding any progress.
Thanks!
^ permalink raw reply
* [PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames
From: Björn Töpel @ 2018-08-30 8:09 UTC (permalink / raw)
To: bjorn.topel, magnus.karlsson, magnus.karlsson, ast, daniel,
netdev
Cc: Björn Töpel, brouer
From: Björn Töpel <bjorn.topel@intel.com>
Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
not include XDP meta data in the data buffers copied out to the user
application.
In this commit, we check if meta data is available, and if so, it is
prepended to the frame.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
net/xdp/xsk.c | 37 ++++++++++++++++++++++++++++---------
1 file changed, 28 insertions(+), 9 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4e937cd7c17d..817e4cee1540 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
{
- void *buffer;
+ void *to_buf, *from_buf;
+ u32 metalen;
u64 addr;
int err;
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
- len > xs->umem->chunk_size_nohr) {
+ len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
addr += xs->umem->headroom;
- buffer = xdp_umem_get_data(xs->umem, addr);
- memcpy(buffer, xdp->data, len);
+ if (xdp_data_meta_unsupported(xdp)) {
+ from_buf = xdp->data;
+ metalen = 0;
+ } else {
+ from_buf = xdp->data_meta;
+ metalen = xdp->data - xdp->data_meta;
+ }
+
+ to_buf = xdp_umem_get_data(xs->umem, addr);
+ memcpy(to_buf, from_buf, len + metalen);
+ addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
@@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs)
int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
{
- u32 len = xdp->data_end - xdp->data;
- void *buffer;
+ u32 metalen, len = xdp->data_end - xdp->data;
+ void *to_buf, *from_buf;
u64 addr;
int err;
@@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
return -EINVAL;
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
- len > xs->umem->chunk_size_nohr) {
+ len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
addr += xs->umem->headroom;
- buffer = xdp_umem_get_data(xs->umem, addr);
- memcpy(buffer, xdp->data, len);
+ if (xdp_data_meta_unsupported(xdp)) {
+ from_buf = xdp->data;
+ metalen = 0;
+ } else {
+ from_buf = xdp->data_meta;
+ metalen = xdp->data - xdp->data_meta;
+ }
+
+ to_buf = xdp_umem_get_data(xs->umem, addr);
+ memcpy(to_buf, from_buf, len + metalen);
+ addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
--
2.17.1
^ permalink raw reply related
* Re: [PATCH] ieee802154: mcr20a: read out of bounds in mcr20a_set_channel()
From: Dan Carpenter @ 2018-08-30 7:54 UTC (permalink / raw)
To: Xue Liu
Cc: alex. aring, Stefan Schmidt, David S. Miller, linux-wpan - ML,
netdev, kernel-janitors
In-Reply-To: <CAJuUDwtjRi0HnR95uaFp4zviSC7Yrs68o8jsK1nY6jWu6909bg@mail.gmail.com>
On Wed, Aug 29, 2018 at 07:50:51PM +0200, Xue Liu wrote:
> Hello Dan,
>
>
> On Wed, 29 Aug 2018 at 16:49, Dan Carpenter <dan.carpenter@oracle.com>
> wrote:
>
> > The "channel" variable can be any u8 value. We need to make sure we
> > don't read outside of the PLL_INT[] or PLL_FRAC[] arrays.
> >
> I think the “channel” variable can not be any u8 value. This values is
> already checked before set_channel function is called.
> https://github.com/torvalds/linux/blob/master/net/ieee802154/nl802154.c#L978
Oh... Hm... I should have reviewed more carefully. This patch isn't
correct. But there may still be a bug. What Smatch is worried about is
the other call tree:
ieee802154_start_req() calls (struct ieee802154_mlme_ops)->start_req
-> mac802154_mlme_start_req()
-> mac802154_dev_set_page_channel()
-> drv_set_channel() calls local->ops->set_channel(&local->hw, page, channel);
-> mcr20a_set_channel()
So maybe we could move the check from nl802154_set_channel() to
drv_set_channel() so channel is checked on both call trees.
regards,
dan carpenter
^ permalink raw reply
* Re: [PATCH net-next 1/5] pppoe: fix PPPOEIOCSFWD compat handling
From: Arnd Bergmann @ 2018-08-30 11:54 UTC (permalink / raw)
To: g.nault
Cc: Paul Mackerras, linux-ppp, Networking, mitch, mostrows, jchapman,
xeb, David Miller, Al Viro, y2038 Mailman List,
Linux Kernel Mailing List
In-Reply-To: <20180830110453.GC1473@alphalink.fr>
On Thu, Aug 30, 2018 at 1:04 PM Guillaume Nault <g.nault@alphalink.fr> wrote:
>
> On Wed, Aug 29, 2018 at 04:03:26PM +0200, Arnd Bergmann wrote:
> > Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
> > linux-2.5.69 along with hundreds of other commands, but was always broken
> > sincen only the structure is compatible, but the command number is not,
> > due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
> > sockaddr_pppox)), which is different on 64-bit architectures.
> >
> And the implementation was broken until 2016 (see 29e73269aa4d ("pppoe:
> fix reference counting in PPPoE proxy")), and nobody ever noticed. I
> should probably have removed this ioctl entirely instead of fixing it.
> Clearly, it has never been used.
>
> If you think it's worth fixing (as opposed to dropping this ioctl or
> its compat mode), then,
> Acked-by: Guillaume Nault <g.nault@alphalink.fr>
I don't care much, but fixing it seems seems easier than coming
up with a convincing rationale for dropping.
I'll update the changelog text to include your additional background
information though, unless someone else prefers to have it dropped.
Arnd
^ permalink raw reply
* Re: [PATCH net-next 5/5] ppp: handle PPPIOCGIDLE for 64-bit time_t
From: Arnd Bergmann @ 2018-08-30 11:47 UTC (permalink / raw)
To: g.nault
Cc: Paul Mackerras, linux-ppp, Networking, mitch, mostrows, jchapman,
xeb, David Miller, Al Viro, y2038 Mailman List,
Linux Kernel Mailing List, Karsten Keil, open list:DOCUMENTATION
In-Reply-To: <20180830110601.GA19534@alphalink.fr>
On Thu, Aug 30, 2018 at 1:06 PM Guillaume Nault <g.nault@alphalink.fr> wrote:
> On Wed, Aug 29, 2018 at 04:03:30PM +0200, Arnd Bergmann wrote:
> > @@ -743,10 +744,17 @@ static long ppp_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > err = 0;
> > break;
> >
> > - case PPPIOCGIDLE:
> > - idle.xmit_idle = (jiffies - ppp->last_xmit) / HZ;
> > - idle.recv_idle = (jiffies - ppp->last_recv) / HZ;
> > - if (copy_to_user(argp, &idle, sizeof(idle)))
> > + case PPPIOCGIDLE32:
> > + idle32.xmit_idle = (jiffies - ppp->last_xmit) / HZ;
> > + idle32.recv_idle = (jiffies - ppp->last_recv) / HZ;
> > + if (copy_to_user(argp, &idle32, sizeof(idle32)))
> >
> Missing 'break;'
>
> > + err = 0;
> > + break;
> > +
> > + case PPPIOCGIDLE64:
> > + idle64.xmit_idle = (jiffies - ppp->last_xmit) / HZ;
> > + idle64.recv_idle = (jiffies - ppp->last_recv) / HZ;
> > + if (copy_to_user(argp, &idle32, sizeof(idle32)))
> >
> I guess you meant 'idle64' instead of 'idle32'.
Good catch, fixing up both now.
Thanks for the review!
Arnd
^ permalink raw reply
* Re: [PATCH v5 1/4] gpiolib: Pass bitmaps, not integer arrays, to get/set array
From: Geert Uytterhoeven @ 2018-08-30 7:40 UTC (permalink / raw)
To: jmkrzyszt
Cc: Linus Walleij, Jonathan Corbet, Miguel Ojeda Sandonis,
peter.korsgaard, Peter Rosin, Ulf Hansson, Andrew Lunn,
Florian Fainelli, David S. Miller, Dominik Brodowski, Greg KH,
Kishon Vijay Abraham I, Lars-Peter Clausen, Michael Hennerich,
Jonathan Cameron, Hartmut Knaack, Peter Meerwald, Jiri Slaby,
Willy Tarreau, open list:DOCUMENTATION
In-Reply-To: <20180829204900.19390-2-jmkrzyszt@gmail.com>
Hi Janusz,
On Wed, Aug 29, 2018 at 10:48 PM Janusz Krzysztofik <jmkrzyszt@gmail.com> wrote:
> Most users of get/set array functions iterate consecutive bits of data,
> usually a single integer, while processing array of results obtained
> from, or building an array of values to be passed to those functions.
> Save time wasted on those iterations by changing the functions' API to
> accept bitmaps.
>
> All current users are updated as well.
>
> More benefits from the change are expected as soon as planned support
> for accepting/passing those bitmaps directly from/to respective GPIO
> chip callbacks if applicable is implemented.
>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com>
> Cc: Peter Korsgaard <peter.korsgaard@barco.com>
> Cc: Peter Rosin <peda@axentia.se>
> Cc: Andrew Lunn <andrew@lunn.ch>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Dominik Brodowski <linux@dominikbrodowski.net>
> Cc: Kishon Vijay Abraham I <kishon@ti.com>
> Cc: Lars-Peter Clausen <lars@metafoo.de>
> Cc: Michael Hennerich <Michael.Hennerich@analog.com>
> Cc: Jonathan Cameron <jic23@kernel.org>
> Cc: Hartmut Knaack <knaack.h@gmx.de>
> Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Jiri Slaby <jslaby@suse.com>
> Signed-off-by: Janusz Krzysztofik <jmkrzyszt@gmail.com>
> Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Thanks for your patch!
> --- a/drivers/auxdisplay/hd44780.c
> +++ b/drivers/auxdisplay/hd44780.c
> @@ -62,20 +62,19 @@ static void hd44780_strobe_gpio(struct hd44780 *hd)
> /* write to an LCD panel register in 8 bit GPIO mode */
> static void hd44780_write_gpio8(struct hd44780 *hd, u8 val, unsigned int rs)
> {
> - int values[10]; /* for DATA[0-7], RS, RW */
> - unsigned int i, n;
> + unsigned long value_bitmap[1]; /* for DATA[0-7], RS, RW */
> + unsigned int n;
>
> - for (i = 0; i < 8; i++)
> - values[PIN_DATA0 + i] = !!(val & BIT(i));
> - values[PIN_CTRL_RS] = rs;
> + value_bitmap[0] = val;
> + __assign_bit(PIN_CTRL_RS, value_bitmap, rs);
> n = 9;
> if (hd->pins[PIN_CTRL_RW]) {
> - values[PIN_CTRL_RW] = 0;
> + __clear_bit(PIN_CTRL_RW, value_bitmap);
The clearing is not needed, as this has been done by 'value_bitmap[0] = val;'
> n++;
> }
So the above block can be simplified to:
n = hd->pins[PIN_CTRL_RW] ? 10 : 9;
>
> /* Present the data to the port */
> - gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA0], values);
> + gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA0], value_bitmap);
>
> hd44780_strobe_gpio(hd);
> }
> @@ -83,32 +82,31 @@ static void hd44780_write_gpio8(struct hd44780 *hd, u8 val, unsigned int rs)
> /* write to an LCD panel register in 4 bit GPIO mode */
> static void hd44780_write_gpio4(struct hd44780 *hd, u8 val, unsigned int rs)
> {
> - int values[10]; /* for DATA[0-7], RS, RW, but DATA[0-3] is unused */
> - unsigned int i, n;
> + /* for DATA[0-7], RS, RW, but DATA[0-3] is unused */
This comment is not correct, as the low bits will be used.
/* DATA[4-7], RS, RW */
> + unsigned long value_bitmap[1];
> + unsigned int n;
>
> /* High nibble + RS, RW */
> - for (i = 4; i < 8; i++)
> - values[PIN_DATA0 + i] = !!(val & BIT(i));
> - values[PIN_CTRL_RS] = rs;
> + value_bitmap[0] = val;
> + __assign_bit(PIN_CTRL_RS, value_bitmap, rs);
> n = 5;
> if (hd->pins[PIN_CTRL_RW]) {
> - values[PIN_CTRL_RW] = 0;
> + __clear_bit(PIN_CTRL_RW, value_bitmap);
Not needed.
> n++;
> }
Hence:
n = hd->pins[PIN_CTRL_RW] ? 6: 5;
> + value_bitmap[0] >>= PIN_DATA4;
Yuck?!?
Isn't it more readable to just do:
/* High nibble + RS, RW */
value_bitmap[0] = val >> 4;
__assign_bit(4, value_bitmap, rs);
n = hd->pins[PIN_CTRL_RW] ? 6: 5;
>
> /* Present the data to the port */
> - gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA4],
> - &values[PIN_DATA4]);
> + gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA4], value_bitmap);
>
> hd44780_strobe_gpio(hd);
>
> /* Low nibble */
> - for (i = 0; i < 4; i++)
> - values[PIN_DATA4 + i] = !!(val & BIT(i));
> + value_bitmap[0] &= ~((1 << PIN_DATA4) - 1);
> + value_bitmap[0] |= val & ~((1 << PIN_DATA4) - 1);
... and:
/* Low nibble */
value_bitmap[0] &= ~0x0f;
value_bitmap[0] |= val & 0x0f;
>
> /* Present the data to the port */
> - gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA4],
> - &values[PIN_DATA4]);
> + gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA4], value_bitmap);
>
> hd44780_strobe_gpio(hd);
> }
> @@ -155,23 +153,23 @@ static void hd44780_write_cmd_gpio4(struct charlcd *lcd, int cmd)
> /* Send 4-bits of a command to the LCD panel in raw 4 bit GPIO mode */
> static void hd44780_write_cmd_raw_gpio4(struct charlcd *lcd, int cmd)
> {
> - int values[10]; /* for DATA[0-7], RS, RW, but DATA[0-3] is unused */
> + /* for DATA[0-7], RS, RW, but DATA[0-3] is unused */
This comment is not correct, as the low bits will be used.
/* DATA[4-7], RS, RW */
> + unsigned long value_bitmap[1];
> struct hd44780 *hd = lcd->drvdata;
> - unsigned int i, n;
> + unsigned int n;
>
> /* Command nibble + RS, RW */
> - for (i = 0; i < 4; i++)
> - values[PIN_DATA4 + i] = !!(cmd & BIT(i));
> - values[PIN_CTRL_RS] = 0;
> + value_bitmap[0] = cmd << PIN_DATA4;
> + __clear_bit(PIN_CTRL_RS, value_bitmap);
Implied by the assignment above.
> n = 5;
> if (hd->pins[PIN_CTRL_RW]) {
> - values[PIN_CTRL_RW] = 0;
> + __clear_bit(PIN_CTRL_RW, value_bitmap);
> n++;
> }
> + value_bitmap[0] = value_bitmap[0] >> PIN_DATA4;
Hence:
/* Command nibble + RS, RW */
value_bitmap[0] = cmd;
n = hd->pins[PIN_CTRL_RW] ? 6: 5;
>
> /* Present the data to the port */
> - gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA4],
> - &values[PIN_DATA4]);
> + gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA4], value_bitmap);
>
> hd44780_strobe_gpio(hd);
> }
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply
* Re: [PATCHv2] net/rds: RDS is not Radio Data System
From: Sowmini Varadhan @ 2018-08-30 11:32 UTC (permalink / raw)
To: Pavel Machek
Cc: Trivial patch monkey, Netdev list, santosh.shilimkar,
ka-cheong.poon, linux-kernel, David S. Miller
In-Reply-To: <20180830113003.GA29794@amd>
On (08/30/18 13:30), Pavel Machek wrote:
> - tristate "The RDS Protocol"
> + tristate "The Reliable Datagram Sockets Protocol"
Looks good to me.
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
^ permalink raw reply
* [PATCHv2] net/rds: RDS is not Radio Data System
From: Pavel Machek @ 2018-08-30 11:30 UTC (permalink / raw)
To: Trivial patch monkey, Netdev list, sowmini.varadhan,
santosh.shilimkar, ka-cheong.poon, linux-kernel
Cc: David S. Miller
In-Reply-To: <20180830102336.GA21063@amd>
[-- Attachment #1: Type: text/plain, Size: 810 bytes --]
Getting prompt "The RDS Protocol" (RDS) is not too helpful, and it is
easily confused with Radio Data System (which we may want to support
in kernel, too).
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index 41f7556..2738f14 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -1,6 +1,6 @@
config RDS
- tristate "The RDS Protocol"
+ tristate "The Reliable Datagram Sockets Protocol"
depends on INET
---help---
The RDS (Reliable Datagram Sockets) protocol provides reliable,
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply related
* Re: [PATCH net] ebpf: fix bpf_msg_pull_data
From: Daniel Borkmann @ 2018-08-30 7:20 UTC (permalink / raw)
To: Tushar Dave, john.fastabend, ast, davem, netdev; +Cc: sowmini.varadhan
In-Reply-To: <6e4e767a-3e83-c6e2-b5bf-b82ec1770e62@oracle.com>
On 08/30/2018 02:21 AM, Tushar Dave wrote:
> On 08/29/2018 05:07 PM, Tushar Dave wrote:
>> While doing some preliminary testing it is found that bpf helper
>> bpf_msg_pull_data does not calculate the data and data_end offset
>> correctly. Fix it!
>>
>> Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>> net/core/filter.c | 38 +++++++++++++++++++++++++-------------
>> 1 file changed, 25 insertions(+), 13 deletions(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index c25eb36..3eeb3d6 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> BPF_CALL_4(bpf_msg_pull_data,
>> struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
>> {
>> - unsigned int len = 0, offset = 0, copy = 0;
>> + unsigned int len = 0, offset = 0, copy = 0, off = 0;
>> struct scatterlist *sg = msg->sg_data;
>> int first_sg, last_sg, i, shift;
>> unsigned char *p, *to, *from;
>> @@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> i = msg->sg_start;
>> do {
>> len = sg[i].length;
>> - offset += len;
>> if (start < offset + len)
>> break;
>> + offset += len;
>> i++;
>> if (i == MAX_SKB_FRAGS)
>> i = 0;
>> - } while (i != msg->sg_end);
>> + } while (i <= msg->sg_end);
I don't think this condition is correct, msg operates as a scatterlist ring,
so sg_end may very well be < current i when there's a wrap-around in the
traversal ... more below.
>> + /* return error if start is out of range */
>> if (unlikely(start >= offset + len))
>> return -EINVAL;
>> - if (!msg->sg_copy[i] && bytes <= len)
>> - goto out;
>> + /* return error if i is last entry in sglist and end is out of range */
>> + if (msg->sg_copy[i] && end > offset + len)
>> + return -EINVAL;
>> first_sg = i;
>> + /* if i is not last entry in sg list and end (i.e start + bytes) is
>> + * within this sg[i] then goto out and calculate data and data_end
>> + */
>> + if (!msg->sg_copy[i] && end <= offset + len)
>> + goto out;
>> +
>> /* At this point we need to linearize multiple scatterlist
>> * elements or a single shared page. Either way we need to
>> * copy into a linear buffer exclusively owned by BPF. Then
>> @@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> i++;
>> if (i == MAX_SKB_FRAGS)
>> i = 0;
>> - if (bytes < copy)
>> + if (end < copy)
>> break;
>> - } while (i != msg->sg_end);
>> + } while (i <= msg->sg_end);
>> +
>> + /* return error if i is last entry in sglist and end is out of range */
>> + if (i > msg->sg_end && end > offset + copy)
>> + return -EINVAL;
>> +
>> last_sg = i;
>> if (unlikely(copy < end - start))
>> @@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> if (unlikely(!page))
>> return -ENOMEM;
>> p = page_address(page);
>> - offset = 0;
>> i = first_sg;
>> do {
>> from = sg_virt(&sg[i]);
>> len = sg[i].length;
>> - to = p + offset;
>> + to = p + off;
>> memcpy(to, from, len);
>> - offset += len;
>> + off += len;
>> sg[i].length = 0;
>> put_page(sg_page(&sg[i]));
>> i++;
>> if (i == MAX_SKB_FRAGS)
>> i = 0;
>> - } while (i != last_sg);
>> + } while (i < last_sg);
>> sg[first_sg].length = copy;
>> sg_set_page(&sg[first_sg], page, copy, 0);
>> @@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> else
>> move_from = i + shift;
>> - if (move_from == msg->sg_end)
>> + if (move_from > msg->sg_end)
>> break;
>> sg[i] = sg[move_from];
>> @@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>> if (msg->sg_end < 0)
>> msg->sg_end += MAX_SKB_FRAGS;
>> out:
>> - msg->data = sg_virt(&sg[i]) + start - offset;
>> + msg->data = sg_virt(&sg[first_sg]) + start - offset;
>> msg->data_end = msg->data + bytes;
>> return 0;
>>
>
> Please discard this patch. I just noticed that Daniel Borkmann sent some similar fixes for bpf_msg_pull_data.
Yeah I've been looking at these recently as well, please make sure you test
with the below fixes included to see if there's anything left:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=5b24109b0563d45094c470684c1f8cea1af269f8
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=0e06b227c5221dd51b5569de93f3b9f532be4a32
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=2e43f95dd8ee62bc8bf57f2afac37fbd70c8d565
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=a8cf76a9023bc6709b1361d06bb2fae5227b9d68
Thanks,
Daniel
^ permalink raw reply
* Re: [PATCH v5 1/4] gpiolib: Pass bitmaps, not integer arrays, to get/set array
From: Miguel Ojeda @ 2018-08-30 11:10 UTC (permalink / raw)
To: Janusz Krzysztofik
Cc: Linus Walleij, Jonathan Corbet, Peter Korsgaard, Peter Rosin,
Ulf Hansson, Andrew Lunn, Florian Fainelli, David S. Miller,
Dominik Brodowski, Greg Kroah-Hartman, Kishon Vijay Abraham I,
Lars-Peter Clausen, Michael Hennerich, Jonathan Cameron,
Hartmut Knaack, Peter Meerwald-Stadler, Jiri Slaby, Willy Tarreau,
Geert Uytterhoeven, Linux
In-Reply-To: <20180829204900.19390-2-jmkrzyszt@gmail.com>
Hi Janusz,
On Wed, Aug 29, 2018 at 10:48 PM, Janusz Krzysztofik
<jmkrzyszt@gmail.com> wrote:
> Most users of get/set array functions iterate consecutive bits of data,
> usually a single integer, while processing array of results obtained
> from, or building an array of values to be passed to those functions.
> Save time wasted on those iterations by changing the functions' API to
> accept bitmaps.
>
> All current users are updated as well.
>
> More benefits from the change are expected as soon as planned support
> for accepting/passing those bitmaps directly from/to respective GPIO
> chip callbacks if applicable is implemented.
>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com>
> Cc: Peter Korsgaard <peter.korsgaard@barco.com>
> Cc: Peter Rosin <peda@axentia.se>
> Cc: Andrew Lunn <andrew@lunn.ch>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Dominik Brodowski <linux@dominikbrodowski.net>
> Cc: Kishon Vijay Abraham I <kishon@ti.com>
> Cc: Lars-Peter Clausen <lars@metafoo.de>
> Cc: Michael Hennerich <Michael.Hennerich@analog.com>
> Cc: Jonathan Cameron <jic23@kernel.org>
> Cc: Hartmut Knaack <knaack.h@gmx.de>
> Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Jiri Slaby <jslaby@suse.com>
> Signed-off-by: Janusz Krzysztofik <jmkrzyszt@gmail.com>
> Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
> ---
> Documentation/driver-api/gpio/consumer.rst | 22 ++++----
> drivers/auxdisplay/hd44780.c | 52 +++++++++--------
> drivers/bus/ts-nbus.c | 19 ++-----
> drivers/gpio/gpio-max3191x.c | 17 +++---
> drivers/gpio/gpiolib.c | 86 +++++++++++++++--------------
> drivers/gpio/gpiolib.h | 4 +-
> drivers/i2c/muxes/i2c-mux-gpio.c | 8 +--
> drivers/mmc/core/pwrseq_simple.c | 13 ++---
> drivers/mux/gpio.c | 9 +--
> drivers/net/phy/mdio-mux-gpio.c | 3 +-
> drivers/pcmcia/soc_common.c | 11 ++--
> drivers/phy/motorola/phy-mapphone-mdm6600.c | 17 +++---
> drivers/staging/iio/adc/ad7606.c | 9 +--
> drivers/tty/serial/serial_mctrl_gpio.c | 7 ++-
> include/linux/gpio/consumer.h | 18 +++---
> 15 files changed, 140 insertions(+), 155 deletions(-)
>
> diff --git a/Documentation/driver-api/gpio/consumer.rst b/Documentation/driver-api/gpio/consumer.rst
> index aa03f389d41d..ed68042ddccf 100644
> --- a/Documentation/driver-api/gpio/consumer.rst
> +++ b/Documentation/driver-api/gpio/consumer.rst
> @@ -323,29 +323,29 @@ The following functions get or set the values of an array of GPIOs::
>
> int gpiod_get_array_value(unsigned int array_size,
> struct gpio_desc **desc_array,
> - int *value_array);
> + unsigned long *value_bitmap);
> int gpiod_get_raw_array_value(unsigned int array_size,
> struct gpio_desc **desc_array,
> - int *value_array);
> + unsigned long *value_bitmap);
> int gpiod_get_array_value_cansleep(unsigned int array_size,
> struct gpio_desc **desc_array,
> - int *value_array);
> + unsigned long *value_bitmap);
> int gpiod_get_raw_array_value_cansleep(unsigned int array_size,
> struct gpio_desc **desc_array,
> - int *value_array);
> + unsigned long *value_bitmap);
>
> void gpiod_set_array_value(unsigned int array_size,
> struct gpio_desc **desc_array,
> - int *value_array)
> + unsigned long *value_bitmap)
> void gpiod_set_raw_array_value(unsigned int array_size,
> struct gpio_desc **desc_array,
> - int *value_array)
> + unsigned long *value_bitmap)
> void gpiod_set_array_value_cansleep(unsigned int array_size,
> struct gpio_desc **desc_array,
> - int *value_array)
> + unsigned long *value_bitmap)
> void gpiod_set_raw_array_value_cansleep(unsigned int array_size,
> struct gpio_desc **desc_array,
> - int *value_array)
> + unsigned long *value_bitmap)
>
> The array can be an arbitrary set of GPIOs. The functions will try to access
> GPIOs belonging to the same bank or chip simultaneously if supported by the
> @@ -356,8 +356,8 @@ accessed sequentially.
> The functions take three arguments:
> * array_size - the number of array elements
> * desc_array - an array of GPIO descriptors
> - * value_array - an array to store the GPIOs' values (get) or
> - an array of values to assign to the GPIOs (set)
> + * value_bitmap - a bitmap to store the GPIOs' values (get) or
> + a bitmap of values to assign to the GPIOs (set)
>
> The descriptor array can be obtained using the gpiod_get_array() function
> or one of its variants. If the group of descriptors returned by that function
> @@ -366,7 +366,7 @@ the struct gpio_descs returned by gpiod_get_array()::
>
> struct gpio_descs *my_gpio_descs = gpiod_get_array(...);
> gpiod_set_array_value(my_gpio_descs->ndescs, my_gpio_descs->desc,
> - my_gpio_values);
> + my_gpio_value_bitmap);
>
> It is also possible to access a completely arbitrary array of descriptors. The
> descriptors may be obtained using any combination of gpiod_get() and
> diff --git a/drivers/auxdisplay/hd44780.c b/drivers/auxdisplay/hd44780.c
> index f1a42f0f1ded..bbbd6a29bf01 100644
> --- a/drivers/auxdisplay/hd44780.c
> +++ b/drivers/auxdisplay/hd44780.c
> @@ -62,20 +62,19 @@ static void hd44780_strobe_gpio(struct hd44780 *hd)
> /* write to an LCD panel register in 8 bit GPIO mode */
> static void hd44780_write_gpio8(struct hd44780 *hd, u8 val, unsigned int rs)
> {
> - int values[10]; /* for DATA[0-7], RS, RW */
> - unsigned int i, n;
> + unsigned long value_bitmap[1]; /* for DATA[0-7], RS, RW */
(I read your comments in the other email)
I still find this odd, but if everyone is going to have this change
done like this, consistency is better.
> + unsigned int n;
>
> - for (i = 0; i < 8; i++)
> - values[PIN_DATA0 + i] = !!(val & BIT(i));
> - values[PIN_CTRL_RS] = rs;
> + value_bitmap[0] = val;
> + __assign_bit(PIN_CTRL_RS, value_bitmap, rs);
> n = 9;
> if (hd->pins[PIN_CTRL_RW]) {
> - values[PIN_CTRL_RW] = 0;
> + __clear_bit(PIN_CTRL_RW, value_bitmap);
> n++;
> }
>
> /* Present the data to the port */
> - gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA0], values);
> + gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA0], value_bitmap);
>
> hd44780_strobe_gpio(hd);
> }
> @@ -83,32 +82,31 @@ static void hd44780_write_gpio8(struct hd44780 *hd, u8 val, unsigned int rs)
> /* write to an LCD panel register in 4 bit GPIO mode */
> static void hd44780_write_gpio4(struct hd44780 *hd, u8 val, unsigned int rs)
> {
> - int values[10]; /* for DATA[0-7], RS, RW, but DATA[0-3] is unused */
> - unsigned int i, n;
> + /* for DATA[0-7], RS, RW, but DATA[0-3] is unused */
> + unsigned long value_bitmap[1];
Ditto.
> + unsigned int n;
>
> /* High nibble + RS, RW */
> - for (i = 4; i < 8; i++)
> - values[PIN_DATA0 + i] = !!(val & BIT(i));
> - values[PIN_CTRL_RS] = rs;
> + value_bitmap[0] = val;
> + __assign_bit(PIN_CTRL_RS, value_bitmap, rs);
> n = 5;
> if (hd->pins[PIN_CTRL_RW]) {
> - values[PIN_CTRL_RW] = 0;
> + __clear_bit(PIN_CTRL_RW, value_bitmap);
> n++;
> }
> + value_bitmap[0] >>= PIN_DATA4;
>
> /* Present the data to the port */
> - gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA4],
> - &values[PIN_DATA4]);
> + gpiod_set_array_value_cansleep(n, &hd->pins[PIN_DATA4], value_bitmap);
>
> hd44780_strobe_gpio(hd);
>
> /* Low nibble */
> - for (i = 0; i < 4; i++)
> - values[PIN_DATA4 + i] = !!(val & BIT(i));
> + value_bitmap[0] &= ~((1 << PIN_DATA4) - 1);
> + value_bitmap[0] |= val & ~((1 << PIN_DATA4) - 1);
This is still wrong! What I originally meant in my v4 review is that
there is an extra ~ in the second line.
Also, a couple of general comments:
- Please review the list of CCs (I was not CC'd originally, so maybe
there are other maintainers that aren't, either)
- In general, the new code seems harder to read than the original one
(but that is my impression).
Thanks for your effort! :-)
Cheers,
Miguel
^ permalink raw reply
* Re: [PATCH net-next 5/5] ppp: handle PPPIOCGIDLE for 64-bit time_t
From: Guillaume Nault @ 2018-08-30 11:06 UTC (permalink / raw)
To: Arnd Bergmann
Cc: paulus, linux-ppp, netdev, mitch, mostrows, jchapman, xeb, davem,
viro, y2038, linux-kernel, Karsten Keil, linux-doc
In-Reply-To: <20180829140409.833488-5-arnd@arndb.de>
On Wed, Aug 29, 2018 at 04:03:30PM +0200, Arnd Bergmann wrote:
> The ppp_idle structure is defined in terms of __kernel_time_t, which is
> defined as 'long' on all architectures, and this usage is not affected
> by the y2038 problem since it transports a time interval rather than an
> absolute time.
>
> However, the ppp user space defines the same structure as time_t, which
> may be 64-bit wide on new libc versions even on 32-bit architectures.
>
> It's easy enough to just handle both possible structure layouts on
> all architectures, to deal with the possibility that a user space ppp
> implementation comes with its own ppp_idle structure definition, as well
> as to document the fact that the driver is y2038-safe.
>
> Doing this also avoids the need for a special compat mode translation,
> since 32-bit and 64-bit kernels now support the same interfaces.
>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> Documentation/networking/ppp_generic.txt | 2 ++
> drivers/isdn/i4l/isdn_ppp.c | 14 ++++++++---
> drivers/net/ppp/ppp_generic.c | 18 ++++++++++----
> fs/compat_ioctl.c | 31 ------------------------
> include/uapi/linux/ppp-ioctl.h | 2 ++
> include/uapi/linux/ppp_defs.h | 14 +++++++++++
> 6 files changed, 42 insertions(+), 39 deletions(-)
>
> diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt
> index 61daf4b39600..fd563aff5fc9 100644
> --- a/Documentation/networking/ppp_generic.txt
> +++ b/Documentation/networking/ppp_generic.txt
> @@ -378,6 +378,8 @@ an interface unit are:
> CONFIG_PPP_FILTER option is enabled, the set of packets which reset
> the transmit and receive idle timers is restricted to those which
> pass the `active' packet filter.
> + Two versions of this command exist, to deal with user space
> + expecting times as either 32-bit or 64-bit time_t seconds.
>
> * PPPIOCSMAXCID sets the maximum connection-ID parameter (and thus the
> number of connection slots) for the TCP header compressor and
> diff --git a/drivers/isdn/i4l/isdn_ppp.c b/drivers/isdn/i4l/isdn_ppp.c
> index a7b275ea5de1..1f17126c5fa4 100644
> --- a/drivers/isdn/i4l/isdn_ppp.c
> +++ b/drivers/isdn/i4l/isdn_ppp.c
> @@ -543,11 +543,19 @@ isdn_ppp_ioctl(int min, struct file *file, unsigned int cmd, unsigned long arg)
> }
> is->pppcfg = val;
> break;
> - case PPPIOCGIDLE: /* get idle time information */
> + case PPPIOCGIDLE32: /* get idle time information */
> if (lp) {
> - struct ppp_idle pidle;
> + struct ppp_idle32 pidle;
> pidle.xmit_idle = pidle.recv_idle = lp->huptimer;
> - if ((r = set_arg(argp, &pidle, sizeof(struct ppp_idle))))
> + if ((r = set_arg(argp, &pidle, sizeof(struct ppp_idle32))))
> + return r;
> + }
> + break;
> + case PPPIOCGIDLE64: /* get idle time information */
> + if (lp) {
> + struct ppp_idle64 pidle;
> + pidle.xmit_idle = pidle.recv_idle = lp->huptimer;
> + if ((r = set_arg(argp, &pidle, sizeof(struct ppp_idle64))))
> return r;
> }
> break;
> diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
> index 3a7aa2eed415..c8b8aa071140 100644
> --- a/drivers/net/ppp/ppp_generic.c
> +++ b/drivers/net/ppp/ppp_generic.c
> @@ -619,7 +619,8 @@ static long ppp_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> struct ppp_file *pf;
> struct ppp *ppp;
> int err = -EFAULT, val, val2, i;
> - struct ppp_idle idle;
> + struct ppp_idle32 idle32;
> + struct ppp_idle64 idle64;
> struct npioctl npi;
> int unit, cflags;
> struct slcompress *vj;
> @@ -743,10 +744,17 @@ static long ppp_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> err = 0;
> break;
>
> - case PPPIOCGIDLE:
> - idle.xmit_idle = (jiffies - ppp->last_xmit) / HZ;
> - idle.recv_idle = (jiffies - ppp->last_recv) / HZ;
> - if (copy_to_user(argp, &idle, sizeof(idle)))
> + case PPPIOCGIDLE32:
> + idle32.xmit_idle = (jiffies - ppp->last_xmit) / HZ;
> + idle32.recv_idle = (jiffies - ppp->last_recv) / HZ;
> + if (copy_to_user(argp, &idle32, sizeof(idle32)))
>
Missing 'break;'
> + err = 0;
> + break;
> +
> + case PPPIOCGIDLE64:
> + idle64.xmit_idle = (jiffies - ppp->last_xmit) / HZ;
> + idle64.recv_idle = (jiffies - ppp->last_recv) / HZ;
> + if (copy_to_user(argp, &idle32, sizeof(idle32)))
>
I guess you meant 'idle64' instead of 'idle32'.
^ permalink raw reply
* Re: [PATCH net-next 4/5] ppp: move PPPIOCSPASS32/PPPIOCSACTIVE32 to ppp_generic.c
From: Guillaume Nault @ 2018-08-30 11:05 UTC (permalink / raw)
To: Arnd Bergmann
Cc: paulus, linux-ppp, netdev, mitch, mostrows, jchapman, xeb, davem,
viro, y2038, linux-kernel
In-Reply-To: <20180829140409.833488-4-arnd@arndb.de>
On Wed, Aug 29, 2018 at 04:03:29PM +0200, Arnd Bergmann wrote:
> PPPIOCSPASS and PPPIOCSACTIVE are implemented in ppp_generic and isdn_ppp,
> but the latter one doesn't work for compat mode in general, so we can
> move these two into the generic code.
>
> Again, the best implementation I could come up with was to merge
> the compat handling into the regular ppp_ioctl() function and
> treating all ioctl commands as compatible.
>
Acked-by: Guillaume Nault <g.nault@alphalink.fr>
^ permalink raw reply
* Re: [PATCH net-next 3/5] ppp: move PPPIOCSCOMPRESS32 to ppp-generic.c
From: Guillaume Nault @ 2018-08-30 11:04 UTC (permalink / raw)
To: Arnd Bergmann
Cc: paulus, linux-ppp, netdev, mitch, mostrows, jchapman, xeb, davem,
viro, y2038, linux-kernel, Kirill Tkhai
In-Reply-To: <20180829140409.833488-3-arnd@arndb.de>
On Wed, Aug 29, 2018 at 04:03:28PM +0200, Arnd Bergmann wrote:
> PPPIOCSCOMPRESS is only implemented in ppp_generic, so it's best to move
> the compat handling there. My first approach was to keep it in a new
> ppp_compat_ioctl() function, but it turned out to be much simpler to do
> it in the regular ioctl handler, by allowing both structure layouts to
> be handled directly there.
>
Acked-by: Guillaume Nault <g.nault@alphalink.fr>
^ permalink raw reply
* Re: [PATCH net-next 1/5] pppoe: fix PPPOEIOCSFWD compat handling
From: Guillaume Nault @ 2018-08-30 11:04 UTC (permalink / raw)
To: Arnd Bergmann
Cc: paulus, linux-ppp, netdev, mitch, mostrows, jchapman, xeb, davem,
viro, y2038, linux-kernel
In-Reply-To: <20180829140409.833488-1-arnd@arndb.de>
On Wed, Aug 29, 2018 at 04:03:26PM +0200, Arnd Bergmann wrote:
> Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
> linux-2.5.69 along with hundreds of other commands, but was always broken
> sincen only the structure is compatible, but the command number is not,
> due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
> sockaddr_pppox)), which is different on 64-bit architectures.
>
And the implementation was broken until 2016 (see 29e73269aa4d ("pppoe:
fix reference counting in PPPoE proxy")), and nobody ever noticed. I
should probably have removed this ioctl entirely instead of fixing it.
Clearly, it has never been used.
If you think it's worth fixing (as opposed to dropping this ioctl or
its compat mode), then,
Acked-by: Guillaume Nault <g.nault@alphalink.fr>
^ permalink raw reply
* [PATCH v4] 9p: Add refcount to p9_req_t
From: Dominique Martinet @ 2018-08-30 10:52 UTC (permalink / raw)
To: Tomas Bortoli, Eric Van Hensbergen, Latchesar Ionkov
Cc: v9fs-developer, netdev, linux-kernel, syzkaller,
Dominique Martinet
In-Reply-To: <1535518779-28551-1-git-send-email-asmadeus@codewreck.org>
From: Tomas Bortoli <tomasbortoli@gmail.com>
To avoid use-after-free(s), use a refcount to keep track of the
usable references to any instantiated struct p9_req_t.
This commit adds p9_req_put(), p9_req_get() and p9_req_try_get() as
wrappers to kref_put(), kref_get() and kref_get_unless_zero().
These are used by the client and the transports to keep track of
valid requests' references.
p9_free_req() is added back and used as callback by kref_put().
Add SLAB_TYPESAFE_BY_RCU as it ensures that the memory freed by
kmem_cache_free() will not be reused for another type until the rcu
synchronisation period is over, so an address gotten under rcu read
lock is safe to inc_ref() without corrupting random memory while
the lock is held.
Co-developed-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Reported-by: syzbot+467050c1ce275af2a5b8@syzkaller.appspotmail.com
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
---
v3:
- add req put if virtio zc request fails
- add req put if cancelled callback is not defined for virtio
- (incorrectly) add req put in rdma cancelled callback
v4:
- removed rdma's cancelled callback put again
- changed the else if no cancelled callback into actually giving virtio
a callback, xen does not need to call put in that case either because
both function rely on tag_lookup to find the request. trans_fd only
needs to put in cancelled because it also keeps the req in a list around
for cancel.
- add req put for trans xen's request(), I'm not sure why that one was
missing either..
And with that I believe I am done testing all four transports.
I'll do a second round of tests next week just to make sure, but it
should be good enough™
Sorry for the multiple iterations.
include/net/9p/client.h | 14 ++++++++++
net/9p/client.c | 57 ++++++++++++++++++++++++++++++++++++-----
net/9p/trans_fd.c | 11 +++++++-
net/9p/trans_rdma.c | 1 +
net/9p/trans_virtio.c | 26 ++++++++++++++++---
net/9p/trans_xen.c | 1 +
6 files changed, 98 insertions(+), 12 deletions(-)
diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index 735f3979d559..947a570307a6 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -94,6 +94,7 @@ enum p9_req_status_t {
struct p9_req_t {
int status;
int t_err;
+ struct kref refcount;
wait_queue_head_t wq;
struct p9_fcall tc;
struct p9_fcall rc;
@@ -233,6 +234,19 @@ int p9_client_lock_dotl(struct p9_fid *fid, struct p9_flock *flock, u8 *status);
int p9_client_getlock_dotl(struct p9_fid *fid, struct p9_getlock *fl);
void p9_fcall_fini(struct p9_fcall *fc);
struct p9_req_t *p9_tag_lookup(struct p9_client *, u16);
+
+static inline void p9_req_get(struct p9_req_t *r)
+{
+ kref_get(&r->refcount);
+}
+
+static inline int p9_req_try_get(struct p9_req_t *r)
+{
+ return kref_get_unless_zero(&r->refcount);
+}
+
+int p9_req_put(struct p9_req_t *r);
+
void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status);
int p9_parse_header(struct p9_fcall *, int32_t *, int8_t *, int16_t *, int);
diff --git a/net/9p/client.c b/net/9p/client.c
index 7942c0bfcc5b..aeeb6d8515d4 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -310,6 +310,18 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size)
if (tag < 0)
goto free;
+ /* Init ref to two because in the general case there is one ref
+ * that is put asynchronously by a writer thread, one ref
+ * temporarily given by p9_tag_lookup and put by p9_client_cb
+ * in the recv thread, and one ref put by p9_tag_remove in the
+ * main thread. The only exception is virtio that does not use
+ * p9_tag_lookup but does not have a writer thread either
+ * (the write happens synchronously in the request/zc_request
+ * callback), so p9_client_cb eats the second ref there
+ * as the pointer is duplicated directly by virtqueue_add_sgs()
+ */
+ refcount_set(&req->refcount.refcount, 2);
+
return req;
free:
@@ -333,10 +345,21 @@ struct p9_req_t *p9_tag_lookup(struct p9_client *c, u16 tag)
struct p9_req_t *req;
rcu_read_lock();
+again:
req = idr_find(&c->reqs, tag);
- /* There's no refcount on the req; a malicious server could cause
- * us to dereference a NULL pointer
- */
+ if (req) {
+ /* We have to be careful with the req found under rcu_read_lock
+ * Thanks to SLAB_TYPESAFE_BY_RCU we can safely try to get the
+ * ref again without corrupting other data, then check again
+ * that the tag matches once we have the ref
+ */
+ if (!p9_req_try_get(req))
+ goto again;
+ if (req->tc.tag != tag) {
+ p9_req_put(req);
+ goto again;
+ }
+ }
rcu_read_unlock();
return req;
@@ -350,7 +373,7 @@ EXPORT_SYMBOL(p9_tag_lookup);
*
* Context: Any context.
*/
-static void p9_tag_remove(struct p9_client *c, struct p9_req_t *r)
+static int p9_tag_remove(struct p9_client *c, struct p9_req_t *r)
{
unsigned long flags;
u16 tag = r->tc.tag;
@@ -359,11 +382,23 @@ static void p9_tag_remove(struct p9_client *c, struct p9_req_t *r)
spin_lock_irqsave(&c->lock, flags);
idr_remove(&c->reqs, tag);
spin_unlock_irqrestore(&c->lock, flags);
+ return p9_req_put(r);
+}
+
+static void p9_req_free(struct kref *ref)
+{
+ struct p9_req_t *r = container_of(ref, struct p9_req_t, refcount);
p9_fcall_fini(&r->tc);
p9_fcall_fini(&r->rc);
kmem_cache_free(p9_req_cache, r);
}
+int p9_req_put(struct p9_req_t *r)
+{
+ return kref_put(&r->refcount, p9_req_free);
+}
+EXPORT_SYMBOL(p9_req_put);
+
/**
* p9_tag_cleanup - cleans up tags structure and reclaims resources
* @c: v9fs client struct
@@ -379,7 +414,9 @@ static void p9_tag_cleanup(struct p9_client *c)
rcu_read_lock();
idr_for_each_entry(&c->reqs, req, id) {
pr_info("Tag %d still in use\n", id);
- p9_tag_remove(c, req);
+ if (p9_tag_remove(c, req) == 0)
+ pr_warn("Packet with tag %d has still references",
+ req->tc.tag);
}
rcu_read_unlock();
}
@@ -403,6 +440,7 @@ void p9_client_cb(struct p9_client *c, struct p9_req_t *req, int status)
wake_up(&req->wq);
p9_debug(P9_DEBUG_MUX, "wakeup: %d\n", req->tc.tag);
+ p9_req_put(req);
}
EXPORT_SYMBOL(p9_client_cb);
@@ -643,9 +681,10 @@ static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq)
* if we haven't received a response for oldreq,
* remove it from the list
*/
- if (oldreq->status == REQ_STATUS_SENT)
+ if (oldreq->status == REQ_STATUS_SENT) {
if (c->trans_mod->cancelled)
c->trans_mod->cancelled(c, oldreq);
+ }
p9_tag_remove(c, req);
return 0;
@@ -682,6 +721,8 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
return req;
reterr:
p9_tag_remove(c, req);
+ /* We have to put also the 2nd reference as it won't be used */
+ p9_req_put(req);
return ERR_PTR(err);
}
@@ -716,6 +757,8 @@ p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
err = c->trans_mod->request(c, req);
if (err < 0) {
+ /* write won't happen */
+ p9_req_put(req);
if (err != -ERESTARTSYS && err != -EFAULT)
c->status = Disconnected;
goto recalc_sigpending;
@@ -2241,7 +2284,7 @@ EXPORT_SYMBOL(p9_client_readlink);
int __init p9_client_init(void)
{
- p9_req_cache = KMEM_CACHE(p9_req_t, 0);
+ p9_req_cache = KMEM_CACHE(p9_req_t, SLAB_TYPESAFE_BY_RCU);
return p9_req_cache ? 0 : -ENOMEM;
}
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 20f46f13fe83..686e24e355d0 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -132,6 +132,7 @@ struct p9_conn {
struct list_head req_list;
struct list_head unsent_req_list;
struct p9_req_t *req;
+ struct p9_req_t *wreq;
char tmp_buf[7];
struct p9_fcall rc;
int wpos;
@@ -383,6 +384,7 @@ static void p9_read_work(struct work_struct *work)
m->rc.sdata = NULL;
m->rc.offset = 0;
m->rc.capacity = 0;
+ p9_req_put(m->req);
m->req = NULL;
}
@@ -472,6 +474,8 @@ static void p9_write_work(struct work_struct *work)
m->wbuf = req->tc.sdata;
m->wsize = req->tc.size;
m->wpos = 0;
+ p9_req_get(req);
+ m->wreq = req;
spin_unlock(&m->client->lock);
}
@@ -492,8 +496,11 @@ static void p9_write_work(struct work_struct *work)
}
m->wpos += err;
- if (m->wpos == m->wsize)
+ if (m->wpos == m->wsize) {
m->wpos = m->wsize = 0;
+ p9_req_put(m->wreq);
+ m->wreq = NULL;
+ }
end_clear:
clear_bit(Wworksched, &m->wsched);
@@ -694,6 +701,7 @@ static int p9_fd_cancel(struct p9_client *client, struct p9_req_t *req)
if (req->status == REQ_STATUS_UNSENT) {
list_del(&req->req_list);
req->status = REQ_STATUS_FLSHD;
+ p9_req_put(req);
ret = 0;
}
spin_unlock(&client->lock);
@@ -711,6 +719,7 @@ static int p9_fd_cancelled(struct p9_client *client, struct p9_req_t *req)
spin_lock(&client->lock);
list_del(&req->req_list);
spin_unlock(&client->lock);
+ p9_req_put(req);
return 0;
}
diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
index 5b0cda1aaa7a..9cc9b3a19ee7 100644
--- a/net/9p/trans_rdma.c
+++ b/net/9p/trans_rdma.c
@@ -365,6 +365,7 @@ send_done(struct ib_cq *cq, struct ib_wc *wc)
c->busa, c->req->tc.size,
DMA_TO_DEVICE);
up(&rdma->sq_sem);
+ p9_req_put(c->req);
kfree(c);
}
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 3dd6ce1c0f2d..eb596c2ed546 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -207,6 +207,13 @@ static int p9_virtio_cancel(struct p9_client *client, struct p9_req_t *req)
return 1;
}
+/* Reply won't come, so drop req ref */
+static int p9_virtio_cancelled(struct p9_client *client, struct p9_req_t *req)
+{
+ p9_req_put(req);
+ return 0;
+}
+
/**
* pack_sg_list_p - Just like pack_sg_list. Instead of taking a buffer,
* this takes a list of pages.
@@ -404,6 +411,7 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
struct scatterlist *sgs[4];
size_t offs;
int need_drop = 0;
+ int kicked = 0;
p9_debug(P9_DEBUG_TRANS, "virtio request\n");
@@ -411,8 +419,10 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
__le32 sz;
int n = p9_get_mapped_pages(chan, &out_pages, uodata,
outlen, &offs, &need_drop);
- if (n < 0)
- return n;
+ if (n < 0) {
+ err = n;
+ goto err_out;
+ }
out_nr_pages = DIV_ROUND_UP(n + offs, PAGE_SIZE);
if (n != outlen) {
__le32 v = cpu_to_le32(n);
@@ -428,8 +438,10 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
} else if (uidata) {
int n = p9_get_mapped_pages(chan, &in_pages, uidata,
inlen, &offs, &need_drop);
- if (n < 0)
- return n;
+ if (n < 0) {
+ err = n;
+ goto err_out;
+ }
in_nr_pages = DIV_ROUND_UP(n + offs, PAGE_SIZE);
if (n != inlen) {
__le32 v = cpu_to_le32(n);
@@ -498,6 +510,7 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
}
virtqueue_kick(chan->vq);
spin_unlock_irqrestore(&chan->lock, flags);
+ kicked = 1;
p9_debug(P9_DEBUG_TRANS, "virtio request kicked\n");
err = wait_event_killable(req->wq, req->status >= REQ_STATUS_RCVD);
/*
@@ -518,6 +531,10 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
}
kvfree(in_pages);
kvfree(out_pages);
+ if (!kicked) {
+ /* reply won't come */
+ p9_req_put(req);
+ }
return err;
}
@@ -750,6 +767,7 @@ static struct p9_trans_module p9_virtio_trans = {
.request = p9_virtio_request,
.zc_request = p9_virtio_zc_request,
.cancel = p9_virtio_cancel,
+ .cancelled = p9_virtio_cancelled,
/*
* We leave one entry for input and one entry for response
* headers. We also skip one more entry to accomodate, address
diff --git a/net/9p/trans_xen.c b/net/9p/trans_xen.c
index 782a07f2ad0c..e2fbf3677b9b 100644
--- a/net/9p/trans_xen.c
+++ b/net/9p/trans_xen.c
@@ -185,6 +185,7 @@ static int p9_xen_request(struct p9_client *client, struct p9_req_t *p9_req)
ring->intf->out_prod = prod;
spin_unlock_irqrestore(&ring->lock, flags);
notify_remote_via_irq(ring->irq);
+ p9_req_put(p9_req);
return 0;
}
--
2.17.1
^ permalink raw reply related
* [PATCH] 9p/rdma: do not disconnect on down_interruptible EAGAIN
From: Dominique Martinet @ 2018-08-30 10:35 UTC (permalink / raw)
To: Eric Van Hensbergen, Latchesar Ionkov
Cc: Dominique Martinet, v9fs-developer, netdev, linux-kernel,
linux-rdma
From: Dominique Martinet <dominique.martinet@cea.fr>
9p/rdma would sometimes drop the connection and display errors in
recv_done when the user does ^C.
The errors were IB_WC_WR_FLUSH_ERR caused by recv buffers that were
posted at the time of disconnect, and we just do not want to disconnect
when down_interruptible is... interrupted.
As a bonus, re-evaluate the log importance of post_recv() failing: we
want to see that one if it ever happens.
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
---
this had actually been bugging me forever, but I couldn't test the refcount
stuff properly around flush with that so I had to finally figure it out..
net/9p/trans_rdma.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
index 9cc9b3a19ee7..9719bc4d9424 100644
--- a/net/9p/trans_rdma.c
+++ b/net/9p/trans_rdma.c
@@ -477,7 +477,7 @@ static int rdma_request(struct p9_client *client, struct p9_req_t *req)
err = post_recv(client, rpl_context);
if (err) {
- p9_debug(P9_DEBUG_FCALL, "POST RECV failed\n");
+ p9_debug(P9_DEBUG_ERROR, "POST RECV failed: %d\n", err);
goto recv_error;
}
/* remove posted receive buffer from request structure */
@@ -546,7 +546,7 @@ static int rdma_request(struct p9_client *client, struct p9_req_t *req)
recv_error:
kfree(rpl_context);
spin_lock_irqsave(&rdma->req_lock, flags);
- if (rdma->state < P9_RDMA_CLOSING) {
+ if (err != -EINTR && rdma->state < P9_RDMA_CLOSING) {
rdma->state = P9_RDMA_CLOSING;
spin_unlock_irqrestore(&rdma->req_lock, flags);
rdma_disconnect(rdma->cm_id);
--
2.17.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox