* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: John Fastabend @ 2016-12-13 17:43 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Christoph Lameter
Cc: Mike Rapoport, netdev@vger.kernel.org, linux-mm, Willem de Bruijn,
Björn Töpel, Karlsson, Magnus, Alexander Duyck,
Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
Saeed Mahameed, Jesse Brandeburg, Kalman Meth, Vladislav Yasevich
In-Reply-To: <20161213171028.24dbf519@redhat.com>
On 16-12-13 08:10 AM, Jesper Dangaard Brouer wrote:
>
> On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
>> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
>>
>>> Hmmm. If you can rely on hardware setup to give you steering and
>>> dedicated access to the RX rings. In those cases, I guess, the "push"
>>> model could be a more direct API approach.
>>
>> If the hardware does not support steering then one should be able to
>> provide those services in software.
>
> This is the early demux problem. With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.
>
> My model pre-VMA map all the pages in the RX ring (if zero-copy gets
> enabled, by a single user). The software step can filter and zero-copy
> send packet-pages to the application/socket that requested this. The
What does "zero-copy send packet-pages to the application/socket that
requested this" mean? At the moment on x86 page-flipping appears to be
more expensive than memcpy (I can post some data shortly) and shared
memory was proposed and rejected for security reasons when we were
working on bifurcated driver.
> disadvantage is all zero-copy application need to share this VMA
> mapping. This is solved by configuring HW filters into a RX-queue, and
> then only attach your zero-copy application to that queue.
>
>
>>> I was shooting for a model that worked without hardware support.
>>> And then transparently benefit from HW support by configuring a HW
>>> filter into a specific RX queue and attaching/using to that queue.
>>
>> The discussion here is a bit amusing since these issues have been
>> resolved a long time ago with the design of the RDMA subsystem. Zero
>> copy is already in wide use. Memory registration is used to pin down
>> memory areas. Work requests can be filed with the RDMA subsystem that
>> then send and receive packets from the registered memory regions.
>> This is not strictly remote memory access but this is a basic mode of
>> operations supported by the RDMA subsystem. The mlx5 driver quoted
>> here supports all of that.
>
> I hear what you are saying. I will look into a push-model, as it might
> be a better solution.
> I will read up on RDMA + verbs and learn more about their API model. I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointer here)
>
>
>> What is bad about RDMA is that it is a separate kernel subsystem.
>> What I would like to see is a deeper integration with the network
>> stack so that memory regions can be registred with a network socket
>> and work requests then can be submitted and processed that directly
>> read and write in these regions. The network stack should provide the
>> services that the hardware of the NIC does not suppport as usual.
>
> Interesting. So you even imagine sockets registering memory regions
> with the NIC. If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])
>
Note rte_flow is in the same family of APIs as the proposed Flow API
that was rejected as well. The features in Flow API that are not
included in the rte_flow proposal have logical extensions to support
them. In kernel we have 'tc' and multiple vendors support cls_flower
and cls_tc which offer a subset of the functionality in the DPDK
implementation.
Are you suggesting 'tc' is not a proper NIC HW filter API?
>
>> The RX/TX ring in user space should be an additional mode of
>> operation of the socket layer. Once that is in place the "Remote
>> memory acces" can be trivially implemented on top of that and the
>> ugly RDMA sidecar subsystem can go away.
>
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).
>
>
> Appreciate your input, it challenged my thinking.
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH] kcm: fix spelling mistake in Kconfig, "connectons"
From: Colin Ian King @ 2016-12-13 17:32 UTC (permalink / raw)
To: David S . Miller, Tom Herbert, netdev; +Cc: linux-kernel
In-Reply-To: <20161213173025.24331-1-colin.king@canonical.com>
On 13/12/16 17:30, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
>
> Trivial fix to spelling mistake "connectons" to "connections" in
> Kconfig text.
>
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
> net/kcm/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/kcm/Kconfig b/net/kcm/Kconfig
> index 87fca36..23b01e1 100644
> --- a/net/kcm/Kconfig
> +++ b/net/kcm/Kconfig
> @@ -7,5 +7,5 @@ config AF_KCM
> ---help---
> KCM (Kernel Connection Multiplexor) sockets provide a method
> for multiplexing messages of a message based application
> - protocol over kernel connectons (e.g. TCP connections).
> + protocol over kernel connections (e.g. TCP connections).
>
>
Oops, ignore that, I was working on the wrong tree.
^ permalink raw reply
* [PATCH] kcm: fix spelling mistake in Kconfig, "connectons"
From: Colin King @ 2016-12-13 17:30 UTC (permalink / raw)
To: David S . Miller, Tom Herbert, netdev; +Cc: linux-kernel
From: Colin Ian King <colin.king@canonical.com>
Trivial fix to spelling mistake "connectons" to "connections" in
Kconfig text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
net/kcm/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/kcm/Kconfig b/net/kcm/Kconfig
index 87fca36..23b01e1 100644
--- a/net/kcm/Kconfig
+++ b/net/kcm/Kconfig
@@ -7,5 +7,5 @@ config AF_KCM
---help---
KCM (Kernel Connection Multiplexor) sockets provide a method
for multiplexing messages of a message based application
- protocol over kernel connectons (e.g. TCP connections).
+ protocol over kernel connections (e.g. TCP connections).
--
2.10.2
^ permalink raw reply related
* [PATCH net-next] netlink: revert broken, broken "2-clause nla_ok()"
From: Alexey Dobriyan @ 2016-12-13 19:30 UTC (permalink / raw)
To: davem; +Cc: netdev, johannes
In-Reply-To: <1480950553.31788.40.camel@sipsolutions.net>
Commit 4f7df337fe79bba1e4c2d525525d63b5ba186bbd
"netlink: 2-clause nla_ok()" is BROKEN.
First clause tests if "->nla_len" could even be accessed at all,
it can not possibly be omitted.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
include/net/netlink.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -698,7 +698,8 @@ static inline int nla_len(const struct nlattr *nla)
*/
static inline int nla_ok(const struct nlattr *nla, int remaining)
{
- return nla->nla_len >= sizeof(*nla) &&
+ return remaining >= (int) sizeof(*nla) &&
+ nla->nla_len >= sizeof(*nla) &&
nla->nla_len <= remaining;
}
^ permalink raw reply
* Re: [PATCH] net: mvpp2: fix dma unmapping of TX buffers for fragments
From: Marcin Wojtas @ 2016-12-13 17:03 UTC (permalink / raw)
To: Thomas Petazzoni
Cc: David S . Miller, netdev, linux-arm-kernel@lists.infradead.org,
Jason Cooper, Andrew Lunn, Sebastian Hesselbarth, Gregory Clement,
Stefan Chulski, Nadav Haklai, Hanna Hawa, Yehuda Yitschak,
Raphael G, stable@vger.kernel.org
In-Reply-To: <1481647995-7213-1-git-send-email-thomas.petazzoni@free-electrons.com>
Hi Thomas,
Reviewed-by: Marcin Wojtas <mw@semihalf.com>
Best regards,
Marcin
2016-12-13 17:53 GMT+01:00 Thomas Petazzoni
<thomas.petazzoni@free-electrons.com>:
> Since commit 71ce391dfb784 ("net: mvpp2: enable proper per-CPU TX
> buffers unmapping"), we are not correctly DMA unmapping TX buffers for
> fragments.
>
> Indeed, the mvpp2_txq_inc_put() function only stores in the
> txq_cpu->tx_buffs[] array the physical address of the buffer to be
> DMA-unmapped when skb != NULL. In addition, when DMA-unmapping, we use
> skb_headlen(skb) to get the size to be unmapped. Both of this works fine
> for TX descriptors that are associated directly to a SKB, but not the
> ones that are used for fragments, with a NULL pointer as skb:
>
> - We have a NULL physical address when calling DMA unmap
> - skb_headlen(skb) crashes because skb is NULL
>
> This causes random crashes when fragments are used.
>
> To solve this problem, this commit:
>
> - Stores the physical address of the buffer to be unmapped
> unconditionally, regardless of whether it is tied to a SKB or not.
>
> - Adds a txq_cpu->tx_data_size[] array to store the size of the DMA
> buffer to be unmapped upon TX completion.
>
> Fixes: 71ce391dfb784 ("net: mvpp2: enable proper per-CPU TX buffers unmapping")
> Reported-by: Raphael G <raphael.glon@corp.ovh.com>
> Cc: Raphael G <raphael.glon@corp.ovh.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
> ---
> drivers/net/ethernet/marvell/mvpp2.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index 1026c45..d168b13 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -791,6 +791,8 @@ struct mvpp2_txq_pcpu {
> /* Array of transmitted buffers' physical addresses */
> dma_addr_t *tx_buffs;
>
> + size_t *tx_data_size;
> +
> /* Index of last TX DMA descriptor that was inserted */
> int txq_put_index;
>
> @@ -980,9 +982,10 @@ static void mvpp2_txq_inc_put(struct mvpp2_txq_pcpu *txq_pcpu,
> struct mvpp2_tx_desc *tx_desc)
> {
> txq_pcpu->tx_skb[txq_pcpu->txq_put_index] = skb;
> - if (skb)
> - txq_pcpu->tx_buffs[txq_pcpu->txq_put_index] =
> - tx_desc->buf_phys_addr;
> + txq_pcpu->tx_data_size[txq_pcpu->txq_put_index] =
> + tx_desc->data_size;
> + txq_pcpu->tx_buffs[txq_pcpu->txq_put_index] =
> + tx_desc->buf_phys_addr;
> txq_pcpu->txq_put_index++;
> if (txq_pcpu->txq_put_index == txq_pcpu->size)
> txq_pcpu->txq_put_index = 0;
> @@ -4404,11 +4407,13 @@ static void mvpp2_txq_bufs_free(struct mvpp2_port *port,
> dma_addr_t buf_phys_addr =
> txq_pcpu->tx_buffs[txq_pcpu->txq_get_index];
> struct sk_buff *skb = txq_pcpu->tx_skb[txq_pcpu->txq_get_index];
> + size_t data_size =
> + txq_pcpu->tx_data_size[txq_pcpu->txq_get_index];
>
> mvpp2_txq_inc_get(txq_pcpu);
>
> dma_unmap_single(port->dev->dev.parent, buf_phys_addr,
> - skb_headlen(skb), DMA_TO_DEVICE);
> + data_size, DMA_TO_DEVICE);
> if (!skb)
> continue;
> dev_kfree_skb_any(skb);
> @@ -4662,6 +4667,11 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
> if (!txq_pcpu->tx_buffs)
> goto error;
>
> + txq_pcpu->tx_data_size = kmalloc(txq_pcpu->size *
> + sizeof(size_t), GFP_KERNEL);
> + if (!txq_pcpu->tx_data_size)
> + goto error;
> +
> txq_pcpu->count = 0;
> txq_pcpu->reserved_num = 0;
> txq_pcpu->txq_put_index = 0;
> @@ -4675,6 +4685,7 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
> txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
> kfree(txq_pcpu->tx_skb);
> kfree(txq_pcpu->tx_buffs);
> + kfree(txq_pcpu->tx_data_size);
> }
>
> dma_free_coherent(port->dev->dev.parent,
> @@ -4695,6 +4706,7 @@ static void mvpp2_txq_deinit(struct mvpp2_port *port,
> txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
> kfree(txq_pcpu->tx_skb);
> kfree(txq_pcpu->tx_buffs);
> + kfree(txq_pcpu->tx_data_size);
> }
>
> if (txq->descs)
> --
> 2.7.4
>
^ permalink raw reply
* Re: [PATCH] ARM: add cmpxchg64 helper for ARMv7-M
From: Russell King - ARM Linux @ 2016-12-13 16:58 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Arnd Bergmann, netdev, linux-kernel, coreteam, netfilter-devel,
Paul E. McKenney, David S. Miller, linux-arm-kernel
In-Reply-To: <20161210123234.GA5468@salvia>
On Sat, Dec 10, 2016 at 01:32:34PM +0100, Pablo Neira Ayuso wrote:
> Hi Arnd,
>
> On Sat, Dec 10, 2016 at 11:36:34AM +0100, Arnd Bergmann wrote:
> > A change to the netfilter code in net-next introduced the first caller of
> > cmpxchg64 that can get built on ARMv7-M, leading to an error from the
> > assembler that points out the lack of 64-bit atomics on this architecture:
> >
> > /tmp/ccMe7djj.s: Assembler messages:
> > /tmp/ccMe7djj.s:367: Error: selected processor does not support `ldrexd r0,r1,[lr]' in Thumb mode
> > /tmp/ccMe7djj.s:371: Error: selected processor does not support `strexd ip,r2,r3,[lr]' in Thumb mode
> > /tmp/ccMe7djj.s:389: Error: selected processor does not support `ldrexd r8,r9,[r7]' in Thumb mode
> > /tmp/ccMe7djj.s:393: Error: selected processor does not support `strexd lr,r0,r1,[r7]' in Thumb mode
> > scripts/Makefile.build:299: recipe for target 'net/netfilter/nft_counter.o' failed
> >
> > This makes ARMv7-M use the same emulation from asm-generic/cmpxchg-local.h
> > that we use on architectures earlier than ARMv6K, to fix the build. The
> > 32-bit atomics are available on ARMv7-M and we keep using them there.
> > This ARM specific change is probably something we should do regardless
> > of the netfilter code.
> >
> > However, looking at the new nft_counter_reset() function in nft_counter.c,
> > this looks incorrect to me not just on ARMv7-M but also on other
> > architectures, with at least the following possible race:
>
> Right, Eric Dumazet already spotted this problem. I'm preparing a
> patch that doesn't require cmpxchg64(). Will keep you on Cc. Thanks.
Please keep me on the Cc as well so I know what's happening, thanks.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
^ permalink raw reply
* Re: [PATCH V2 18/22] bnxt_re: Support for DCB
From: Jason Gunthorpe @ 2016-12-13 16:56 UTC (permalink / raw)
To: Selvin Xavier
Cc: Or Gerlitz, Doug Ledford,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Linux Netdev List, Eddie Wai, Devesh Sharma, Somnath Kotur,
Sriharsha Basavapatna
In-Reply-To: <CA+sbYW1irBd0cTqJJSGJWRbBi-iFzvX3JpoTfF_daU47EqNtAg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Dec 13, 2016 at 11:55:55AM +0530, Selvin Xavier wrote:
> v1 eth_type is not defined. All vendor drivers have their own definition.
Send a cleanup patch?
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: double free issue, mvpp2 driver, armada375 modules
From: Thomas Petazzoni @ 2016-12-13 16:55 UTC (permalink / raw)
To: Raphael G
Cc: linux-kernel, linux-arm-kernel, netdev, marcin wojtas, davem,
Grégory Clement
In-Reply-To: <57272206.6070703@corp.ovh.com>
Hello,
On Mon, 2 May 2016 11:46:46 +0200, Raphael G wrote:
> enclosed the kernel panic we obtain after boot with a slightly patched
> upstream kernel (4.5.2).
>
> (as well as the patchset applied to the upstream kernel, so that you can
> know which code we are talking about. Too bad we cannot use the upstream
> kernel but we had no choice about this + we're no experts so we rely on
> provided patches, adapted to our bootloader and hardware for this)
>
> Reproduce:
> boot on kernel on an armada375 module, connect to it, launch a top in
> commandline
>
> As seen with Marcin Wojtas, reverting commit
> e864b4c7b184bde36fa6a02bb3190983d2f796f9 fixes this issue.
>
> Reporting upstream so that you can decide what should be done next
Thanks for your report. I have finally submitted a fix for this issue.
You can find it at:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-December/473989.html
If you have the time to test it and report back, it would be very
useful.
Thanks a lot!
Thomas
--
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
^ permalink raw reply
* [PATCH] net: mvpp2: fix dma unmapping of TX buffers for fragments
From: Thomas Petazzoni @ 2016-12-13 16:53 UTC (permalink / raw)
To: David S . Miller
Cc: Thomas Petazzoni, Andrew Lunn, Yehuda Yitschak, Jason Cooper,
Raphael G, netdev, Hanna Hawa, stable, Nadav Haklai,
Gregory Clement, Stefan Chulski, Marcin Wojtas, linux-arm-kernel,
Sebastian Hesselbarth
Since commit 71ce391dfb784 ("net: mvpp2: enable proper per-CPU TX
buffers unmapping"), we are not correctly DMA unmapping TX buffers for
fragments.
Indeed, the mvpp2_txq_inc_put() function only stores in the
txq_cpu->tx_buffs[] array the physical address of the buffer to be
DMA-unmapped when skb != NULL. In addition, when DMA-unmapping, we use
skb_headlen(skb) to get the size to be unmapped. Both of this works fine
for TX descriptors that are associated directly to a SKB, but not the
ones that are used for fragments, with a NULL pointer as skb:
- We have a NULL physical address when calling DMA unmap
- skb_headlen(skb) crashes because skb is NULL
This causes random crashes when fragments are used.
To solve this problem, this commit:
- Stores the physical address of the buffer to be unmapped
unconditionally, regardless of whether it is tied to a SKB or not.
- Adds a txq_cpu->tx_data_size[] array to store the size of the DMA
buffer to be unmapped upon TX completion.
Fixes: 71ce391dfb784 ("net: mvpp2: enable proper per-CPU TX buffers unmapping")
Reported-by: Raphael G <raphael.glon@corp.ovh.com>
Cc: Raphael G <raphael.glon@corp.ovh.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
---
drivers/net/ethernet/marvell/mvpp2.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
index 1026c45..d168b13 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -791,6 +791,8 @@ struct mvpp2_txq_pcpu {
/* Array of transmitted buffers' physical addresses */
dma_addr_t *tx_buffs;
+ size_t *tx_data_size;
+
/* Index of last TX DMA descriptor that was inserted */
int txq_put_index;
@@ -980,9 +982,10 @@ static void mvpp2_txq_inc_put(struct mvpp2_txq_pcpu *txq_pcpu,
struct mvpp2_tx_desc *tx_desc)
{
txq_pcpu->tx_skb[txq_pcpu->txq_put_index] = skb;
- if (skb)
- txq_pcpu->tx_buffs[txq_pcpu->txq_put_index] =
- tx_desc->buf_phys_addr;
+ txq_pcpu->tx_data_size[txq_pcpu->txq_put_index] =
+ tx_desc->data_size;
+ txq_pcpu->tx_buffs[txq_pcpu->txq_put_index] =
+ tx_desc->buf_phys_addr;
txq_pcpu->txq_put_index++;
if (txq_pcpu->txq_put_index == txq_pcpu->size)
txq_pcpu->txq_put_index = 0;
@@ -4404,11 +4407,13 @@ static void mvpp2_txq_bufs_free(struct mvpp2_port *port,
dma_addr_t buf_phys_addr =
txq_pcpu->tx_buffs[txq_pcpu->txq_get_index];
struct sk_buff *skb = txq_pcpu->tx_skb[txq_pcpu->txq_get_index];
+ size_t data_size =
+ txq_pcpu->tx_data_size[txq_pcpu->txq_get_index];
mvpp2_txq_inc_get(txq_pcpu);
dma_unmap_single(port->dev->dev.parent, buf_phys_addr,
- skb_headlen(skb), DMA_TO_DEVICE);
+ data_size, DMA_TO_DEVICE);
if (!skb)
continue;
dev_kfree_skb_any(skb);
@@ -4662,6 +4667,11 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
if (!txq_pcpu->tx_buffs)
goto error;
+ txq_pcpu->tx_data_size = kmalloc(txq_pcpu->size *
+ sizeof(size_t), GFP_KERNEL);
+ if (!txq_pcpu->tx_data_size)
+ goto error;
+
txq_pcpu->count = 0;
txq_pcpu->reserved_num = 0;
txq_pcpu->txq_put_index = 0;
@@ -4675,6 +4685,7 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
kfree(txq_pcpu->tx_skb);
kfree(txq_pcpu->tx_buffs);
+ kfree(txq_pcpu->tx_data_size);
}
dma_free_coherent(port->dev->dev.parent,
@@ -4695,6 +4706,7 @@ static void mvpp2_txq_deinit(struct mvpp2_port *port,
txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
kfree(txq_pcpu->tx_skb);
kfree(txq_pcpu->tx_buffs);
+ kfree(txq_pcpu->tx_data_size);
}
if (txq->descs)
--
2.7.4
^ permalink raw reply related
* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Christoph Lameter @ 2016-12-13 16:36 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: John Fastabend, Mike Rapoport, netdev@vger.kernel.org, linux-mm,
Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
Vladislav Yasevich
In-Reply-To: <20161213171028.24dbf519@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 3223 bytes --]
On Tue, 13 Dec 2016, Jesper Dangaard Brouer wrote:
> This is the early demux problem. With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.
Right. But we could fall back to software. Transfer to a kernel buffer and
then move stuff over. Not much of an improvment but it will make things
work.
> > The discussion here is a bit amusing since these issues have been
> > resolved a long time ago with the design of the RDMA subsystem. Zero
> > copy is already in wide use. Memory registration is used to pin down
> > memory areas. Work requests can be filed with the RDMA subsystem that
> > then send and receive packets from the registered memory regions.
> > This is not strictly remote memory access but this is a basic mode of
> > operations supported by the RDMA subsystem. The mlx5 driver quoted
> > here supports all of that.
>
> I hear what you are saying. I will look into a push-model, as it might
> be a better solution.
> I will read up on RDMA + verbs and learn more about their API model. I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointer here)
Great.
> > What is bad about RDMA is that it is a separate kernel subsystem.
> > What I would like to see is a deeper integration with the network
> > stack so that memory regions can be registred with a network socket
> > and work requests then can be submitted and processed that directly
> > read and write in these regions. The network stack should provide the
> > services that the hardware of the NIC does not suppport as usual.
>
> Interesting. So you even imagine sockets registering memory regions
> with the NIC. If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])
Well doing this would mean adding some features and that also would at
best allow general support for zero copy direct to user space with a
fallback to software if the hardware is missing some feature.
> > The RX/TX ring in user space should be an additional mode of
> > operation of the socket layer. Once that is in place the "Remote
> > memory acces" can be trivially implemented on top of that and the
> > ugly RDMA sidecar subsystem can go away.
>
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).
A work request contains the user space address of the data to be sent
and/or received. The address must be in a registered memory region. This
is different from copying the packet into kernel data structures.
I think this can easily be generalized. We need support for registering
memory regions, submissions of work request and the processing of
completion requets. QP (queue-pair) processing is probably the basis for
the whole scheme that is used in multiple context these days.
^ permalink raw reply
* Re: [PATCHv2 1/5] sh_eth: add generic wake-on-lan support via magic packet
From: Niklas Söderlund @ 2016-12-13 16:27 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Sergei Shtylyov, Simon Horman, netdev@vger.kernel.org,
Linux-Renesas, linux-pm
In-Reply-To: <CAMuHMdV6xrzjBQeVAWK3B_c077O6keWrH7Ndi5CX+2JLqjsL5A@mail.gmail.com>
Hi Geert,
Thanks for feedback and testing!
On 2016-12-13 14:03:52 +0100, Geert Uytterhoeven wrote:
> CC linux-pm
I think you forgot to CC linux-pm :-)
>
> On Mon, Dec 12, 2016 at 5:09 PM, Niklas Söderlund
> <niklas.soderlund+renesas@ragnatech.se> wrote:
> > Add generic functionality to support Wake-on-Lan using MagicPacket which
> > are supported by at least a few versions of sh_eth. Only add
> > functionality for WoL, no specific sh_eth version are marked to support
> > WoL yet.
> >
> > WoL is enabled in the suspend callback by setting MagicPacket detection
> > and disabling all interrupts expect MagicPacket. In the resume path the
> > driver needs to reset the hardware to rearm the WoL logic, this prevents
> > the driver from simply restoring the registers and to take advantage of
> > that sh_eth was not suspended to reduce resume time. To reset the
> > hardware the driver close and reopens the device just like it would do
> > in a normal suspend/resume scenario without WoL enabled, but it both
> > close and open the device in the resume callback since the device needs
> > to be open for WoL to work.
> >
> > One quirk needed for WoL is that the module clock needs to be prevented
> > from being switched off by Runtime PM. To keep the clock alive the
> > suspend callback need to call clk_enable() directly to increase the
> > usage count of the clock. Then when Runtime PM decreases the clock usage
> > count it won't reach 0 and be switched off.
> >
> > Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
>
> Thanks for the update!
>
> I've verified WoL is working on r8a7791/koelsch and r8a7740/armadillo.
>
> However, if I look at /sys/kernel/debug/wakeup_sources, "active_count" and
> "event_count" for the Ethernet device do not increase when using WoL, while
> they do for the GPIO keyboard when using the keyboard for wake up.
> So something seems to be missing from a bookkeeping point of view.
Cool, now I know why some net drivers call pm_wakeup_event() if the
wakeup source was WoL :-) I added this to sh_eth and now the bookkeeping
seems to work as you describe, "active_count" and "event_count" are
incremented while waking up from WoL. I will include this in the next
version, thanks for reporting I had no idea to check for this.
>
> Interestingly, "wakeup_count" does not increase for both?
> Has this been broken recently?
I had a quick look at this and the 'wakeup_count' is increased in
wakeup_source_report_event() which is in the call path from
pm_wakeup_event().
pm_wakeup_event()
__pm_wakeup_event()
wakeup_source_report_event()
static void wakeup_source_report_event(struct wakeup_source *ws)
{
ws->event_count++;
/* This is racy, but the counter is approximate anyway. */
if (events_check_enabled)
ws->wakeup_count++;
if (!ws->active)
wakeup_source_activate(ws);
}
So maybe 'wakeup_count' is not incremented due to the race with
events_check_enabled? But then again I'm new to PM so I might miss
something obvious. I'm also not sure if I can do anything in this series
to improve the behavior of 'wakeup_count' for sh_eth?
>
> > ---
> > drivers/net/ethernet/renesas/sh_eth.c | 112 +++++++++++++++++++++++++++++++---
> > drivers/net/ethernet/renesas/sh_eth.h | 3 +
> > 2 files changed, 107 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> > index 05b0dc5..87640b9 100644
> > --- a/drivers/net/ethernet/renesas/sh_eth.c
> > +++ b/drivers/net/ethernet/renesas/sh_eth.c
> > @@ -2199,6 +2199,33 @@ static int sh_eth_set_ringparam(struct net_device *ndev,
> > return 0;
> > }
> >
> > +static void sh_eth_get_wol(struct net_device *ndev, struct ethtool_wolinfo *wol)
> > +{
> > + struct sh_eth_private *mdp = netdev_priv(ndev);
> > +
> > + wol->supported = 0;
> > + wol->wolopts = 0;
> > +
> > + if (mdp->cd->magic && mdp->clk) {
> > + wol->supported = WAKE_MAGIC;
> > + wol->wolopts = mdp->wol_enabled ? WAKE_MAGIC : 0;
> > + }
> > +}
> > +
> > +static int sh_eth_set_wol(struct net_device *ndev, struct ethtool_wolinfo *wol)
> > +{
> > + struct sh_eth_private *mdp = netdev_priv(ndev);
> > +
> > + if (!mdp->cd->magic || !mdp->clk || wol->wolopts & ~WAKE_MAGIC)
> > + return -EOPNOTSUPP;
> > +
> > + mdp->wol_enabled = !!(wol->wolopts & WAKE_MAGIC);
> > +
> > + device_set_wakeup_enable(&mdp->pdev->dev, mdp->wol_enabled);
> > +
> > + return 0;
> > +}
> > +
> > static const struct ethtool_ops sh_eth_ethtool_ops = {
> > .get_regs_len = sh_eth_get_regs_len,
> > .get_regs = sh_eth_get_regs,
> > @@ -2213,6 +2240,8 @@ static const struct ethtool_ops sh_eth_ethtool_ops = {
> > .set_ringparam = sh_eth_set_ringparam,
> > .get_link_ksettings = sh_eth_get_link_ksettings,
> > .set_link_ksettings = sh_eth_set_link_ksettings,
> > + .get_wol = sh_eth_get_wol,
> > + .set_wol = sh_eth_set_wol,
> > };
> >
> > /* network device open function */
> > @@ -3017,6 +3046,11 @@ static int sh_eth_drv_probe(struct platform_device *pdev)
> > goto out_release;
> > }
> >
> > + /* Get clock, if not found that's OK but Wake-On-Lan is unavailable */
> > + mdp->clk = devm_clk_get(&pdev->dev, NULL);
> > + if (IS_ERR(mdp->clk))
> > + mdp->clk = NULL;
> > +
> > ndev->base_addr = res->start;
> >
> > spin_lock_init(&mdp->lock);
> > @@ -3111,6 +3145,9 @@ static int sh_eth_drv_probe(struct platform_device *pdev)
> > if (ret)
> > goto out_napi_del;
> >
> > + if (mdp->cd->magic && mdp->clk)
> > + device_set_wakeup_capable(&pdev->dev, 1);
> > +
> > /* print device information */
> > netdev_info(ndev, "Base address at 0x%x, %pM, IRQ %d.\n",
> > (u32)ndev->base_addr, ndev->dev_addr, ndev->irq);
> > @@ -3150,15 +3187,67 @@ static int sh_eth_drv_remove(struct platform_device *pdev)
> >
> > #ifdef CONFIG_PM
> > #ifdef CONFIG_PM_SLEEP
> > +static int sh_eth_wol_setup(struct net_device *ndev)
> > +{
> > + struct sh_eth_private *mdp = netdev_priv(ndev);
> > +
> > + /* Only allow ECI interrupts */
> > + synchronize_irq(ndev->irq);
> > + napi_disable(&mdp->napi);
> > + sh_eth_write(ndev, DMAC_M_ECI, EESIPR);
> > +
> > + /* Enable MagicPacket */
> > + sh_eth_modify(ndev, ECMR, 0, ECMR_PMDE);
> > +
> > + /* Increased clock usage so device won't be suspended */
> > + clk_enable(mdp->clk);
> > +
> > + return enable_irq_wake(ndev->irq);
> > +}
> > +
> > +static int sh_eth_wol_restore(struct net_device *ndev)
> > +{
> > + struct sh_eth_private *mdp = netdev_priv(ndev);
> > + int ret;
> > +
> > + napi_enable(&mdp->napi);
> > +
> > + /* Disable MagicPacket */
> > + sh_eth_modify(ndev, ECMR, ECMR_PMDE, 0);
> > +
> > + /* The device need to be reset to restore MagicPacket logic
> > + * for next wakeup. If we close and open the device it will
> > + * both be reset and all registers restored. This is what
> > + * happens during suspend and resume without WoL enabled.
> > + */
> > + ret = sh_eth_close(ndev);
> > + if (ret < 0)
> > + return ret;
> > + ret = sh_eth_open(ndev);
> > + if (ret < 0)
> > + return ret;
> > +
> > + /* Restore clock usage count */
> > + clk_disable(mdp->clk);
> > +
> > + return disable_irq_wake(ndev->irq);
> > +}
> > +
> > static int sh_eth_suspend(struct device *dev)
> > {
> > struct net_device *ndev = dev_get_drvdata(dev);
> > + struct sh_eth_private *mdp = netdev_priv(ndev);
> > int ret = 0;
> >
> > - if (netif_running(ndev)) {
> > - netif_device_detach(ndev);
> > + if (!netif_running(ndev))
> > + return 0;
> > +
> > + netif_device_detach(ndev);
> > +
> > + if (mdp->wol_enabled)
> > + ret = sh_eth_wol_setup(ndev);
> > + else
> > ret = sh_eth_close(ndev);
> > - }
> >
> > return ret;
> > }
> > @@ -3166,14 +3255,21 @@ static int sh_eth_suspend(struct device *dev)
> > static int sh_eth_resume(struct device *dev)
> > {
> > struct net_device *ndev = dev_get_drvdata(dev);
> > + struct sh_eth_private *mdp = netdev_priv(ndev);
> > int ret = 0;
> >
> > - if (netif_running(ndev)) {
> > + if (!netif_running(ndev))
> > + return 0;
> > +
> > + if (mdp->wol_enabled)
> > + ret = sh_eth_wol_restore(ndev);
> > + else
> > ret = sh_eth_open(ndev);
> > - if (ret < 0)
> > - return ret;
> > - netif_device_attach(ndev);
> > - }
> > +
> > + if (ret < 0)
> > + return ret;
> > +
> > + netif_device_attach(ndev);
> >
> > return ret;
> > }
> > diff --git a/drivers/net/ethernet/renesas/sh_eth.h b/drivers/net/ethernet/renesas/sh_eth.h
> > index d050f37..4ceed00 100644
> > --- a/drivers/net/ethernet/renesas/sh_eth.h
> > +++ b/drivers/net/ethernet/renesas/sh_eth.h
> > @@ -493,6 +493,7 @@ struct sh_eth_cpu_data {
> > unsigned shift_rd0:1; /* shift Rx descriptor word 0 right by 16 */
> > unsigned rmiimode:1; /* EtherC has RMIIMODE register */
> > unsigned rtrate:1; /* EtherC has RTRATE register */
> > + unsigned magic:1; /* EtherC has ECMR.PMDE and ECSR.MPD */
> > };
> >
> > struct sh_eth_private {
> > @@ -501,6 +502,7 @@ struct sh_eth_private {
> > const u16 *reg_offset;
> > void __iomem *addr;
> > void __iomem *tsu_addr;
> > + struct clk *clk;
> > u32 num_rx_ring;
> > u32 num_tx_ring;
> > dma_addr_t rx_desc_dma;
> > @@ -529,6 +531,7 @@ struct sh_eth_private {
> > unsigned no_ether_link:1;
> > unsigned ether_link_active_low:1;
> > unsigned is_opened:1;
> > + unsigned wol_enabled:1;
> > };
> >
> > static inline void sh_eth_soft_swap(char *src, int len)
> > --
> > 2.10.2
>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds
--
Regards,
Niklas Söderlund
^ permalink raw reply
* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Jesper Dangaard Brouer @ 2016-12-13 16:10 UTC (permalink / raw)
To: Christoph Lameter
Cc: John Fastabend, Mike Rapoport, netdev@vger.kernel.org, linux-mm,
Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
Vladislav Yasevich, brouer
In-Reply-To: <alpine.DEB.2.20.1612121200280.13607@east.gentwo.org>
On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
>
> > Hmmm. If you can rely on hardware setup to give you steering and
> > dedicated access to the RX rings. In those cases, I guess, the "push"
> > model could be a more direct API approach.
>
> If the hardware does not support steering then one should be able to
> provide those services in software.
This is the early demux problem. With the push-mode of registering
memory, you need hardware steering support, for zero-copy support, as
the software step happens after DMA engine have written into the memory.
My model pre-VMA map all the pages in the RX ring (if zero-copy gets
enabled, by a single user). The software step can filter and zero-copy
send packet-pages to the application/socket that requested this. The
disadvantage is all zero-copy application need to share this VMA
mapping. This is solved by configuring HW filters into a RX-queue, and
then only attach your zero-copy application to that queue.
> > I was shooting for a model that worked without hardware support.
> > And then transparently benefit from HW support by configuring a HW
> > filter into a specific RX queue and attaching/using to that queue.
>
> The discussion here is a bit amusing since these issues have been
> resolved a long time ago with the design of the RDMA subsystem. Zero
> copy is already in wide use. Memory registration is used to pin down
> memory areas. Work requests can be filed with the RDMA subsystem that
> then send and receive packets from the registered memory regions.
> This is not strictly remote memory access but this is a basic mode of
> operations supported by the RDMA subsystem. The mlx5 driver quoted
> here supports all of that.
I hear what you are saying. I will look into a push-model, as it might
be a better solution.
I will read up on RDMA + verbs and learn more about their API model. I
even plan to write a small sample program to get a feeling for the API,
and maybe we can use that as a baseline for the performance target we
can obtain on the same HW. (Thanks to Björn for already giving me some
pointer here)
> What is bad about RDMA is that it is a separate kernel subsystem.
> What I would like to see is a deeper integration with the network
> stack so that memory regions can be registred with a network socket
> and work requests then can be submitted and processed that directly
> read and write in these regions. The network stack should provide the
> services that the hardware of the NIC does not suppport as usual.
Interesting. So you even imagine sockets registering memory regions
with the NIC. If we had a proper NIC HW filter API across the drivers,
to register the steering rule (like ibv_create_flow), this would be
doable, but we don't (DPDK actually have an interesting proposal[1])
> The RX/TX ring in user space should be an additional mode of
> operation of the socket layer. Once that is in place the "Remote
> memory acces" can be trivially implemented on top of that and the
> ugly RDMA sidecar subsystem can go away.
I cannot follow that 100%, but I guess you are saying we also need a
more efficient mode of handing over pages/packet to userspace (than
going through the normal socket API calls).
Appreciate your input, it challenged my thinking.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[1] https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH net v3] ibmveth: set correct gso_size and gso_type
From: Thomas Falcon @ 2016-12-13 15:49 UTC (permalink / raw)
To: David Miller
Cc: netdev, brking, marcelo.leitner, pradeeps, jmaxwell37, zdai,
eric.dumazet
In-Reply-To: <20161210.155647.1988856288960520451.davem@davemloft.net>
On 12/10/2016 02:56 PM, David Miller wrote:
> From: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
> Date: Sat, 10 Dec 2016 12:39:48 -0600
>
>> v3: include a check for non-zero mss when calculating gso_segs
>>
>> v2: calculate gso_segs after Eric Dumazet's comments on the earlier patch
>> and make sure everyone is included on CC
> I already applied v1 which made it all the way even to Linus's
> tree. So you'll have to send me relative fixups if there are
> things to fix or change since v1.
>
> You must always generate patches against the current 'net' tree.
>
Sorry about that. Thank you for applying it on short notice. I agree that using the TCP checksum field is not ideal, but it was a compromise with the VIOS team. Hopefully, there will be a better way in the future.
Thanks again,
Tom
^ permalink raw reply
* Re: [PATCH net] virtio-net: correctly enable multiqueue
From: David Miller @ 2016-12-13 15:39 UTC (permalink / raw)
To: jasowang; +Cc: netdev, virtualization, tytso, nhorman, mst
In-Reply-To: <1481610185-12183-1-git-send-email-jasowang@redhat.com>
From: Jason Wang <jasowang@redhat.com>
Date: Tue, 13 Dec 2016 14:23:05 +0800
> Commit 4490001029012539937ff02778fe6180613fa949 ("virtio-net: enable
> multiqueue by default") blindly set the affinity instead of queues
> during probe which can cause a mismatch of #queues between guest and
> host. This patch fixes it by setting queues.
>
> Reported-by: Theodore Ts'o <tytso@mit.edu>
> Tested-by: Theodore Ts'o <tytso@mit.edu>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Fixes: 49000102901 ("virtio-net: enable multiqueue by default")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
Applied, thanks Jason.
^ permalink raw reply
* [PATCH/replace net-next 17/27] OVS: remove use of VLAN_TAG_PRESENT
From: Michał Mirosław @ 2016-12-13 15:31 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA; +Cc: open list:OPENVSWITCH, Jiri Benc
In-Reply-To: <e44219bc56d3e44aa0711c83c626adabf4c4ecd8.1481586602.git.mirq-linux-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org>
This is a minimal change to allow removing of VLAN_TAG_PRESENT.
It leaves OVS unable to use CFI bit, as fixing this would need
a deeper surgery involving userspace interface.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
net/openvswitch/actions.c | 13 +++++++++----
net/openvswitch/flow.c | 2 +-
net/openvswitch/flow.h | 2 +-
net/openvswitch/flow_netlink.c | 22 +++++++++++-----------
4 files changed, 22 insertions(+), 17 deletions(-)
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 514f7bc..fb41833 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -278,7 +278,7 @@ static int push_vlan(struct sk_buff *skb, struct sw_flow_key *key,
key->eth.vlan.tpid = vlan->vlan_tpid;
}
return skb_vlan_push(skb, vlan->vlan_tpid,
- ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT);
+ ntohs(vlan->vlan_tci) & ~VLAN_CFI_MASK);
}
/* 'src' is already properly masked. */
@@ -704,8 +704,10 @@ static int ovs_vport_output(struct net *net, struct sock *sk, struct sk_buff *sk
__skb_dst_copy(skb, data->dst);
*OVS_CB(skb) = data->cb;
skb->inner_protocol = data->inner_protocol;
- skb->vlan_tci = data->vlan_tci;
- skb->vlan_proto = data->vlan_proto;
+ if (data->vlan_tci & VLAN_CFI_MASK)
+ __vlan_hwaccel_put_tag(skb, data->vlan_proto, data->vlan_tci & ~VLAN_CFI_MASK);
+ else
+ __vlan_hwaccel_clear_tag(skb);
/* Reconstruct the MAC header. */
skb_push(skb, data->l2_len);
@@ -749,7 +751,10 @@ static void prepare_frag(struct vport *vport, struct sk_buff *skb,
data->cb = *OVS_CB(skb);
data->inner_protocol = skb->inner_protocol;
data->network_offset = orig_network_offset;
- data->vlan_tci = skb->vlan_tci;
+ if (skb_vlan_tag_present(skb))
+ data->vlan_tci = skb_vlan_tag_get(skb) | VLAN_CFI_MASK;
+ else
+ data->vlan_tci = 0;
data->vlan_proto = skb->vlan_proto;
data->mac_proto = mac_proto;
data->l2_len = hlen;
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 08aa926..2bd4eac 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -347,7 +347,7 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
int res;
if (skb_vlan_tag_present(skb)) {
- key->eth.vlan.tci = htons(skb->vlan_tci);
+ key->eth.vlan.tci = htons(skb->vlan_tci) | htons(VLAN_CFI_MASK);
key->eth.vlan.tpid = skb->vlan_proto;
} else {
/* Parse outer vlan tag in the non-accelerated case. */
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index f61cae7..c8724ca 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -58,7 +58,7 @@ struct ovs_tunnel_info {
struct vlan_head {
__be16 tpid; /* Vlan type. Generally 802.1q or 802.1ad.*/
- __be16 tci; /* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
+ __be16 tci; /* 0 if no VLAN, VLAN_CFI_MASK set otherwise. */
};
#define OVS_SW_FLOW_KEY_METADATA_SIZE \
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d19044f..b72fcbd 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -853,9 +853,9 @@ static int validate_vlan_from_nlattrs(const struct sw_flow_match *match,
if (a[OVS_KEY_ATTR_VLAN])
tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]);
- if (!(tci & htons(VLAN_TAG_PRESENT))) {
+ if (!(tci & htons(VLAN_CFI_MASK))) {
if (tci) {
- OVS_NLERR(log, "%s TCI does not have VLAN_TAG_PRESENT bit set.",
+ OVS_NLERR(log, "%s TCI does not have VLAN_CFI_MASK bit set.",
(inner) ? "C-VLAN" : "VLAN");
return -EINVAL;
} else if (nla_len(a[OVS_KEY_ATTR_ENCAP])) {
@@ -876,9 +876,9 @@ static int validate_vlan_mask_from_nlattrs(const struct sw_flow_match *match,
__be16 tci = 0;
__be16 tpid = 0;
bool encap_valid = !!(match->key->eth.vlan.tci &
- htons(VLAN_TAG_PRESENT));
+ htons(VLAN_CFI_MASK));
bool i_encap_valid = !!(match->key->eth.cvlan.tci &
- htons(VLAN_TAG_PRESENT));
+ htons(VLAN_CFI_MASK));
if (!(key_attrs & (1 << OVS_KEY_ATTR_ENCAP))) {
/* Not a VLAN. */
@@ -902,8 +902,8 @@ static int validate_vlan_mask_from_nlattrs(const struct sw_flow_match *match,
(inner) ? "C-VLAN" : "VLAN", ntohs(tpid));
return -EINVAL;
}
- if (!(tci & htons(VLAN_TAG_PRESENT))) {
- OVS_NLERR(log, "%s TCI mask does not have exact match for VLAN_TAG_PRESENT bit.",
+ if (!(tci & htons(VLAN_CFI_MASK))) {
+ OVS_NLERR(log, "%s TCI mask does not have exact match for VLAN_CFI_MASK bit.",
(inner) ? "C-VLAN" : "VLAN");
return -EINVAL;
}
@@ -958,7 +958,7 @@ static int parse_vlan_from_nlattrs(struct sw_flow_match *match,
if (err)
return err;
- encap_valid = !!(match->key->eth.vlan.tci & htons(VLAN_TAG_PRESENT));
+ encap_valid = !!(match->key->eth.vlan.tci & htons(VLAN_CFI_MASK));
if (encap_valid) {
err = __parse_vlan_from_nlattrs(match, key_attrs, true, a,
is_mask, log);
@@ -2444,7 +2444,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
vlan = nla_data(a);
if (!eth_type_vlan(vlan->vlan_tpid))
return -EINVAL;
- if (!(vlan->vlan_tci & htons(VLAN_TAG_PRESENT)))
+ if (!(vlan->vlan_tci & htons(VLAN_CFI_MASK)))
return -EINVAL;
vlan_tci = vlan->vlan_tci;
break;
@@ -2460,7 +2460,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
/* Prohibit push MPLS other than to a white list
* for packets that have a known tag order.
*/
- if (vlan_tci & htons(VLAN_TAG_PRESENT) ||
+ if (vlan_tci & htons(VLAN_CFI_MASK) ||
(eth_type != htons(ETH_P_IP) &&
eth_type != htons(ETH_P_IPV6) &&
eth_type != htons(ETH_P_ARP) &&
@@ -2472,7 +2472,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
}
case OVS_ACTION_ATTR_POP_MPLS:
- if (vlan_tci & htons(VLAN_TAG_PRESENT) ||
+ if (vlan_tci & htons(VLAN_CFI_MASK) ||
!eth_p_mpls(eth_type))
return -EINVAL;
@@ -2530,7 +2530,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
case OVS_ACTION_ATTR_POP_ETH:
if (mac_proto != MAC_PROTO_ETHERNET)
return -EINVAL;
- if (vlan_tci & htons(VLAN_TAG_PRESENT))
+ if (vlan_tci & htons(VLAN_CFI_MASK))
return -EINVAL;
mac_proto = MAC_PROTO_ETHERNET;
break;
--
2.10.2
_______________________________________________
dev mailing list
dev@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
^ permalink raw reply related
* Re: [PATCH v2 2/2] net: rfkill: Add rfkill-any LED trigger
From: Johannes Berg @ 2016-12-13 15:18 UTC (permalink / raw)
To: Michał Kępień, David S . Miller
Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20161208073052.12988-2-kernel@kempniu.pl>
On Thu, 2016-12-08 at 08:30 +0100, Michał Kępień wrote:
> Add a new "global" (i.e. not per-rfkill device) LED trigger, rfkill-
> any,
> which may be useful on laptops with a single "radio LED" and multiple
> radio transmitters. The trigger is meant to turn a LED on whenever
> there is at least one radio transmitter active and turn it off
> otherwise.
>
Also applied, but I moved the discussion of the mutex into the recorded
commit log.
johannes
^ permalink raw reply
* Re: [PATCH v2 1/2] net: rfkill: Cleanup error handling in rfkill_init()
From: Johannes Berg @ 2016-12-13 15:16 UTC (permalink / raw)
To: Michał Kępień, David S . Miller
Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20161208073052.12988-1-kernel@kempniu.pl>
On Thu, 2016-12-08 at 08:30 +0100, Michał Kępień wrote:
> Use a separate label per error condition in rfkill_init() to make it
> a bit cleaner and easier to extend.
applied.
johannes
^ permalink raw reply
* Re: [PATCH net-next 17/27] OVS: remove assumptions about VLAN_TAG_PRESENT bit
From: Michał Mirosław @ 2016-12-13 15:14 UTC (permalink / raw)
To: Jiri Benc; +Cc: open list:OPENVSWITCH, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161213114010.0fa2ddb6@griffin>
On Tue, Dec 13, 2016 at 11:40:10AM +0100, Jiri Benc wrote:
> On Tue, 13 Dec 2016 01:12:38 +0100 (CET), Michał Mirosław wrote:
> > @@ -850,20 +848,11 @@ static int validate_vlan_from_nlattrs(const struct sw_flow_match *match,
> > return -EINVAL;
> > }
> >
> > - if (a[OVS_KEY_ATTR_VLAN])
> > - tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]);
> > -
> > - if (!(tci & htons(VLAN_TAG_PRESENT))) {
> > - if (tci) {
> > - OVS_NLERR(log, "%s TCI does not have VLAN_TAG_PRESENT bit set.",
> > - (inner) ? "C-VLAN" : "VLAN");
> > - return -EINVAL;
> > - } else if (nla_len(a[OVS_KEY_ATTR_ENCAP])) {
> > - /* Corner case for truncated VLAN header. */
> > - OVS_NLERR(log, "Truncated %s header has non-zero encap attribute.",
> > - (inner) ? "C-VLAN" : "VLAN");
> > - return -EINVAL;
> > - }
> > + if (!a[OVS_KEY_ATTR_VLAN] && nla_len(a[OVS_KEY_ATTR_ENCAP])) {
> > + /* Corner case for truncated VLAN header. */
> > + OVS_NLERR(log, "Truncated %s header has non-zero encap attribute.",
> > + (inner) ? "C-VLAN" : "VLAN");
> > + return -EINVAL;
>
> This looks wrong. The zero value of nla_get_be16(a[OVS_KEY_ATTR_VLAN])
> together with empty a[OVS_KEY_ATTR_ENCAP] means truncated VLAN header
> and this needs to be preserved.
>
> In other words, you need to check also !nla_get_be16(a[OVS_KEY_ATTR_VLAN])
> here.
>
>
> Overall, this whole patch looks very suspicious from the very
> beginning. The xors used are strong indication that something is not
> right.
>
> And indeed, you're breaking uAPI compatibility. Previously, the
> VLAG_TAG_PRESENT bit set in OVS_KEY_ATTR_VLAN caused the flow to match
> on packets with VLAN tag present. After this patch, it causes the flow
> to match only on those packets that have a certain value of
> VLAN_CFI_MASK in their VLAN tag (I haven't bother deciphering what
> value is checked after all these xors, it's as well possible that it
> will also match on packets _without_ any VLAN header).
>
> You have to preserve the old meaning of VLAN_TAG_PRESENT in
> OVS_KEY_ATTR_VLAN. When doing flow lookups, VLAN_TAG_PRESENT must match
> on and only on packets that have VLAN tag present, irrespective of their
> VLAN_CFI_MASK.
>
> If you want to introduce support for lookups on VLAN_CFI_MASK, you'll
> have to do that by introducing a new netlink attribute.
Hmm. In that case I'll just mask the CFI bit on entry to OVS.
Otherwise it's for me like to stab in a dark at a large beast
I know nothing about.
Best Regards,
Michał Mirosław
^ permalink raw reply
* Re: [PATCH net-next 13/27] bridge: use __vlan_hwaccel helpers
From: Michał Mirosław @ 2016-12-13 15:11 UTC (permalink / raw)
To: Sergei Shtylyov
Cc: netdev, moderated list:ETHERNET BRIDGE,
open list:NETFILTER ({IP, IP6, ARP, EB , NF}TABLES),
Jozsef Kadlecsik (supporter:NETFILTER ({IP, IP6, ARP, EB, NF}TABLES)),
Patrick McHardy (supporter:NETFILTER ({IP, IP6, ARP , EB, NF}TABLES)),
Pablo Neira Ayuso (supporter:NETFILTER ({IP , IP6, ARP, EB, NF}TABLES))
In-Reply-To: <f5078470-0fc0-cfcd-36df-3ebb380b547a@cogentembedded.com>
On Tue, Dec 13, 2016 at 03:59:46PM +0300, Sergei Shtylyov wrote:
> Hello!
>
> On 12/13/2016 3:12 AM, Michał Mirosław wrote:
>
> > This removes assumption than vlan_tci != 0 when tag is present.
> >
> > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > ---
> > net/bridge/br_netfilter_hooks.c | 14 ++++++++------
> > net/bridge/br_private.h | 2 +-
> > net/bridge/br_vlan.c | 6 +++---
> > 3 files changed, 12 insertions(+), 10 deletions(-)
> >
> > diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
> > index b12501a..2cc0747 100644
> > --- a/net/bridge/br_netfilter_hooks.c
> > +++ b/net/bridge/br_netfilter_hooks.c
> [...]
> > @@ -749,8 +747,12 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
> >
> > data = this_cpu_ptr(&brnf_frag_data_storage);
> >
> > - data->vlan_tci = skb->vlan_tci;
> > - data->vlan_proto = skb->vlan_proto;
> > + if (skb_vlan_tag_present(skb)) {
> > + data->vlan_tci = skb->vlan_tci;
> > + data->vlan_proto = skb->vlan_proto;
> > + } else
> > + data->vlan_proto = 0;
>
> CodingStyle: should use {} in all branches of *if* if at least one branch
> has {}.
>
> [...]
> > diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
> > index b6de4f4..ef94664 100644
> > --- a/net/bridge/br_vlan.c
> > +++ b/net/bridge/br_vlan.c
>
> > @@ -444,8 +444,8 @@ static bool __allowed_ingress(const struct net_bridge *br,
> > __vlan_hwaccel_put_tag(skb, br->vlan_proto, pvid);
> > else
> > /* Priority-tagged Frame.
> > - * At this point, We know that skb->vlan_tci had
> > - * VLAN_TAG_PRESENT bit and its VID field was 0x000.
> > + * At this point, We know that skb->vlan_tci VID
>
> s/We/we/.
>
> > + * field was 0x000.
>
> Simply 0, maybe?
Thanks, fixed.
-- Michał Mirosław
^ permalink raw reply
* Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
From: Andrew Lunn @ 2016-12-13 15:09 UTC (permalink / raw)
To: Volodymyr Bendiuga
Cc: Florian Fainelli, Vivien Didelot, Volodymyr Bendiuga, netdev,
Volodymyr Bendiuga
In-Reply-To: <CAMr9LbqyvrRYe_m34Ufzxkf2Vef3LbMQiuoiwVX0mK=hW6euyA@mail.gmail.com>
> Another interesting thing is that fdb dump operation is slow
> too. The more entries you have the slower it gets. We have
> implemented caching with timing there as well (I have not submitted
> such patch yet), which makes fdb dump much more pleasant in user
> space, the operation happens instantly, where regular fdb dump op
> might take few seconds.
I know some people won't like the added MDIO transactions, because
they are doing PTP and want low jitter when talking to the switch and
PHYs. As things are now, once the system is configured, the MDIO bus
is idle. Hence there is very low jitter for PTP. For your cache to be
useful, i assume you are refreshing it at regular intervals. So you
are adding a constant load, which i guess 99.9% of the timer is never
used because there is no user looking at the table, where as the PTP
load is used to keep the clocks synchronised.
Andrew
^ permalink raw reply
* [RFC PATCH v3] audit: use proper refcount locking on audit_sock
From: Richard Guy Briggs @ 2016-12-13 15:03 UTC (permalink / raw)
To: netdev, linux-kernel, linux-audit
Cc: Richard Guy Briggs, dvyukov, xiyou.wangcong, edumazet, eparis,
pmoore, sgrubb
In-Reply-To: <20161212100215.GA1305@madcap2.tricolour.ca>
Resetting audit_sock appears to be racy.
audit_sock was being copied and dereferenced without using a refcount on
the source sock.
Bump the refcount on the underlying sock when we store a refrence in
audit_sock and release it when we reset audit_sock. audit_sock
modification needs the audit_cmd_mutex.
See: https://lkml.org/lkml/2016/11/26/232
Thanks to Eric Dumazet <edumazet@google.com> and Cong Wang
<xiyou.wangcong@gmail.com> on ideas how to fix it.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
There has been a lot of change in the audit code that is about to go
upstream to address audit queue issues. This patch is based on the
source tree: git://git.infradead.org/users/pcmoore/audit#next
---
kernel/audit.c | 28 +++++++++++++++++++++++-----
1 files changed, 23 insertions(+), 5 deletions(-)
diff --git a/kernel/audit.c b/kernel/audit.c
index f20eee0..3bb4126 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -446,14 +446,19 @@ static void kauditd_retry_skb(struct sk_buff *skb)
* Description:
* Break the auditd/kauditd connection and move all the records in the retry
* queue into the hold queue in case auditd reconnects.
+ * The audit_cmd_mutex must be held when calling this function.
*/
static void auditd_reset(void)
{
struct sk_buff *skb;
/* break the connection */
+ if (audit_sock) {
+ sock_put(audit_sock);
+ audit_sock = NULL;
+ }
audit_pid = 0;
- audit_sock = NULL;
+ audit_nlk_portid = 0;
/* flush all of the retry queue to the hold queue */
while ((skb = skb_dequeue(&audit_retry_queue)))
@@ -579,7 +584,9 @@ static int kauditd_thread(void *dummy)
auditd = 0;
if (AUDITD_BAD(rc, reschedule)) {
+ mutex_lock(&audit_cmd_mutex);
auditd_reset();
+ mutex_unlock(&audit_cmd_mutex);
reschedule = 0;
}
} else
@@ -594,7 +601,9 @@ static int kauditd_thread(void *dummy)
auditd = 0;
if (AUDITD_BAD(rc, reschedule)) {
kauditd_hold_skb(skb);
+ mutex_lock(&audit_cmd_mutex);
auditd_reset();
+ mutex_unlock(&audit_cmd_mutex);
reschedule = 0;
} else
/* temporary problem (we hope), queue
@@ -623,7 +632,9 @@ quick_loop:
if (rc) {
auditd = 0;
if (AUDITD_BAD(rc, reschedule)) {
+ mutex_lock(&audit_cmd_mutex);
auditd_reset();
+ mutex_unlock(&audit_cmd_mutex);
reschedule = 0;
}
@@ -1010,11 +1021,16 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
}
if (audit_enabled != AUDIT_OFF)
audit_log_config_change("audit_pid", new_pid, audit_pid, 1);
- audit_pid = new_pid;
- audit_nlk_portid = NETLINK_CB(skb).portid;
- audit_sock = skb->sk;
- if (!new_pid)
+ if (new_pid) {
+ if (audit_sock)
+ sock_put(audit_sock);
+ audit_pid = new_pid;
+ audit_nlk_portid = NETLINK_CB(skb).portid;
+ sock_hold(skb->sk);
+ audit_sock = skb->sk;
+ } else {
auditd_reset();
+ }
wake_up_interruptible(&kauditd_wait);
}
if (s.mask & AUDIT_STATUS_RATE_LIMIT) {
@@ -1283,8 +1299,10 @@ static void __net_exit audit_net_exit(struct net *net)
{
struct audit_net *aunet = net_generic(net, audit_net_id);
struct sock *sock = aunet->nlsk;
+ mutex_lock(&audit_cmd_mutex);
if (sock == audit_sock)
auditd_reset();
+ mutex_unlock(&audit_cmd_mutex);
RCU_INIT_POINTER(aunet->nlsk, NULL);
synchronize_net();
--
1.7.1
^ permalink raw reply related
* Re: [PATCH v2] audit: use proper refcount locking on audit_sock
From: Richard Guy Briggs @ 2016-12-13 15:01 UTC (permalink / raw)
To: Paul Moore
Cc: netdev, linux-kernel, linux-audit, edumazet, xiyou.wangcong,
dvyukov
In-Reply-To: <20161213051028.GE1305@madcap2.tricolour.ca>
On 2016-12-13 00:10, Richard Guy Briggs wrote:
> On 2016-12-12 15:18, Paul Moore wrote:
> > On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > > Resetting audit_sock appears to be racy.
> > >
> > > audit_sock was being copied and dereferenced without using a refcount on
> > > the source sock.
> > >
> > > Bump the refcount on the underlying sock when we store a refrence in
> > > audit_sock and release it when we reset audit_sock. audit_sock
> > > modification needs the audit_cmd_mutex.
> > >
> > > See: https://lkml.org/lkml/2016/11/26/232
> > >
> > > Thanks to Eric Dumazet <edumazet@google.com> and Cong Wang
> > > <xiyou.wangcong@gmail.com> on ideas how to fix it.
> > >
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > ---
> > > There has been a lot of change in the audit code that is about to go
> > > upstream to address audit queue issues. This patch is based on the
> > > source tree: git://git.infradead.org/users/pcmoore/audit#next
> > > ---
> > > kernel/audit.c | 34 ++++++++++++++++++++++++++++------
> > > 1 files changed, 28 insertions(+), 6 deletions(-)
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index f20eee0..439f7f3 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -452,7 +452,9 @@ static void auditd_reset(void)
> > > struct sk_buff *skb;
> > >
> > > /* break the connection */
> > > + sock_put(audit_sock);
> > > audit_pid = 0;
> > > + audit_nlk_portid = 0;
> > > audit_sock = NULL;
> > >
> > > /* flush all of the retry queue to the hold queue */
> > > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb)
> > > if (rc >= 0) {
> > > consume_skb(skb);
> > > rc = 0;
> > > + } else {
> > > + if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
> >
> > I dislike the way you wrote this because instead of simply looking at
> > this to see if it correct I need to sort out all the bits and find out
> > if there are other error codes that could run afoul of this check ...
> > make it simple, e.g. (rc == -ENOMEM || rc == -EPERM || ...).
> > Actually, since EPERM is 1, -EPERM (-1 in two's compliment is
> > 0xffffffff) is going to cause this to be true for pretty much any
> > value of rc, yes?
>
> Yes, you are correct. We need there a logical or on the results of each
> comparison to the return code rather than bit-wise or-ing the result
> codes together first to save a step.
>
> > > + mutex_lock(&audit_cmd_mutex);
> > > + auditd_reset();
> > > + mutex_unlock(&audit_cmd_mutex);
> > > + }
> >
> > The code in audit#next handles netlink_unicast() errors in
> > kauditd_thread() and you are adding error handling code here in
> > kauditd_send_unicast_skb() ... that's messy. I don't care too much
> > where the auditd_reset() call is made, but let's only do it in one
> > function; FWIW, I originally put the error handling code in
> > kauditd_thread() because there was other error handling code that
> > needed to done in that scope so it resulted in cleaner code.
>
> Hmmm, I seem to remember it not returning the return code and I thought
> I had changed it to do so, but I see now that it was already there.
> Agreed, I needlessly duplicated that error handling.
>
> > Related, I see you are now considering ENOMEM to be a fatal condition,
> > that differs from the AUDITD_BAD macro in kauditd_thread(); this
> > difference needs to be reconciled.
>
> Also correct about -EPERM now that I check back to the intent of commit
> 32a1dbaece7e ("audit: try harder to send to auditd upon netlink
> failure")
>
> > Finally, you should update the comment header block for auditd_reset()
> > that it needs to be called with the audit_cmd_mutex held.
>
> Yup.
>
> > > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
> > > return -EACCES;
> > > }
> > > if (audit_pid && new_pid &&
> > > - audit_replace(requesting_pid) != -ECONNREFUSED) {
> > > + (audit_replace(requesting_pid) & (-ECONNREFUSED|-EPERM|-ENOMEM))) {
> >
> > Do we simply want to treat any error here as fatal, and not just
> > ECONN/EPERM/ENOMEM? If not, let's come up with a single macro to
> > handle the fatal netlink_unicast() return codes so we have some chance
> > to keep things consistent in the future.
>
> I'll work through this before I post another patch...
Ok, I've gone back to look at the reasoning in commit 133e1e5acd4a
("audit: stop an old auditd being starved out by a new auditd") which
suggests only ECONNREFUSED can cause an audit_pid replace, so I've
returned it to its original state.
I'll post another tested patch, but I'm still not that happy that it
does not proactively reset audit_pid, audit_nlk_portid and audit_sock
when auditd's socket has a problem. I'll leave the test run overnight.
> > paul moore
>
> - RGB
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [PATCH v2] audit: use proper refcount locking on audit_sock
From: Richard Guy Briggs @ 2016-12-13 14:55 UTC (permalink / raw)
To: Cong Wang
Cc: Linux Kernel Network Developers, LKML, linux-audit, Dmitry Vyukov,
Eric Dumazet, Eric Paris, Paul Moore, sgrubb
In-Reply-To: <CAM_iQpXAZOOn7G-EbEy1T11w6uoqwx5M8jVt=iHfOj4TJYsqpA@mail.gmail.com>
On 2016-12-12 15:58, Cong Wang wrote:
> On Mon, Dec 12, 2016 at 2:03 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > Resetting audit_sock appears to be racy.
> >
> > audit_sock was being copied and dereferenced without using a refcount on
> > the source sock.
> >
> > Bump the refcount on the underlying sock when we store a refrence in
> > audit_sock and release it when we reset audit_sock. audit_sock
> > modification needs the audit_cmd_mutex.
> >
> > See: https://lkml.org/lkml/2016/11/26/232
> >
> > Thanks to Eric Dumazet <edumazet@google.com> and Cong Wang
> > <xiyou.wangcong@gmail.com> on ideas how to fix it.
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> > There has been a lot of change in the audit code that is about to go
> > upstream to address audit queue issues. This patch is based on the
> > source tree: git://git.infradead.org/users/pcmoore/audit#next
> > ---
> > kernel/audit.c | 34 ++++++++++++++++++++++++++++------
> > 1 files changed, 28 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index f20eee0..439f7f3 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -452,7 +452,9 @@ static void auditd_reset(void)
> > struct sk_buff *skb;
> >
> > /* break the connection */
> > + sock_put(audit_sock);
>
> Why audit_sock can't be NULL here?
Fixed.
> > audit_pid = 0;
> > + audit_nlk_portid = 0;
> > audit_sock = NULL;
> >
> > /* flush all of the retry queue to the hold queue */
> > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb)
> > if (rc >= 0) {
> > consume_skb(skb);
> > rc = 0;
> > + } else {
> > + if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
>
> Are these errno's bits??
No, I've fixed this silly error.
> > + mutex_lock(&audit_cmd_mutex);
> > + auditd_reset();
> > + mutex_unlock(&audit_cmd_mutex);
> > + }
> > }
> >
> > return rc;
> > @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy)
> >
> > auditd = 0;
> > if (AUDITD_BAD(rc, reschedule)) {
> > + mutex_lock(&audit_cmd_mutex);
> > auditd_reset();
> > + mutex_unlock(&audit_cmd_mutex);
> > reschedule = 0;
> > }
> > } else
> > @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy)
> > auditd = 0;
> > if (AUDITD_BAD(rc, reschedule)) {
> > kauditd_hold_skb(skb);
> > + mutex_lock(&audit_cmd_mutex);
> > auditd_reset();
> > + mutex_unlock(&audit_cmd_mutex);
> > reschedule = 0;
> > } else
> > /* temporary problem (we hope), queue
> > @@ -623,7 +635,9 @@ quick_loop:
> > if (rc) {
> > auditd = 0;
> > if (AUDITD_BAD(rc, reschedule)) {
> > + mutex_lock(&audit_cmd_mutex);
> > auditd_reset();
> > + mutex_unlock(&audit_cmd_mutex);
> > reschedule = 0;
> > }
> >
> > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
> > return -EACCES;
> > }
> > if (audit_pid && new_pid &&
> > - audit_replace(requesting_pid) != -ECONNREFUSED) {
> > + (audit_replace(requesting_pid) & (-ECONNREFUSED|-EPERM|-ENOMEM))) {
> > audit_log_config_change("audit_pid", new_pid, audit_pid, 0);
> > return -EEXIST;
> > }
> > if (audit_enabled != AUDIT_OFF)
> > audit_log_config_change("audit_pid", new_pid, audit_pid, 1);
> > - audit_pid = new_pid;
> > - audit_nlk_portid = NETLINK_CB(skb).portid;
> > - audit_sock = skb->sk;
> > - if (!new_pid)
> > + if (new_pid) {
> > + if (audit_sock)
> > + sock_put(audit_sock);
> > + audit_pid = new_pid;
> > + audit_nlk_portid = NETLINK_CB(skb).portid;
> > + sock_hold(skb->sk);
>
> Why refcnt is still needed here? I need it because I removed the code
> in net exit code path.
Because there is a chance that auditd exits abnormally and no message is
send from the kauditd thread to discover it has gone.
> > + audit_sock = skb->sk;
> > + } else {
> > auditd_reset();
> > + }
> > wake_up_interruptible(&kauditd_wait);
> > }
> > if (s.mask & AUDIT_STATUS_RATE_LIMIT) {
> > @@ -1283,8 +1302,11 @@ static void __net_exit audit_net_exit(struct net *net)
> > {
> > struct audit_net *aunet = net_generic(net, audit_net_id);
> > struct sock *sock = aunet->nlsk;
> > - if (sock == audit_sock)
> > + if (sock == audit_sock) {
> > + mutex_lock(&audit_cmd_mutex);
>
> You need to put the if check inside the mutex too. Again, this could be
> removed if you use refcnt.
Ok, right, fixed.
That last patch was a bit of a mess! Thanks for your patience in
checking it...
> > auditd_reset();
> > + mutex_unlock(&audit_cmd_mutex);
> > + }
> >
> > RCU_INIT_POINTER(aunet->nlsk, NULL);
> > synchronize_net();
> > --
> > 1.7.1
> >
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: mlx5: net_device.addr_list_lock usage before initialization
From: Saeed Mahameed @ 2016-12-13 14:48 UTC (permalink / raw)
To: Sebastian Ott
Cc: Saeed Mahameed, Matan Barak, Leon Romanovsky, Linux Netdev List,
linux-kernel
In-Reply-To: <alpine.LFD.2.20.1612131405080.4914@schleppi>
On Tue, Dec 13, 2016 at 3:22 PM, Sebastian Ott
<sebott@linux.vnet.ibm.com> wrote:
> Hi,
>
> I ran into the following lockdep complaint:
>
> [ 7.059561] INFO: trying to register non-static key.
> [ 7.059566] the code is fine but needs lockdep annotation.
> [ 7.059570] turning off the locking correctness validator.
> [ 7.059579] CPU: 6 PID: 6 Comm: kworker/u32:0 Not tainted 4.9.0-02683-g784243e-dirty #77
> [ 7.059582] Hardware name: IBM 2964 N96 704 (LPAR)
> [ 7.061260] Workqueue: mlx5e mlx5e_set_rx_mode_work [mlx5_core]
> [ 7.061268] Stack:
> [ 7.061270] 00000000f95739c0 00000000f9573a50 0000000000000003 0000000000000000
> [ 7.061278] 00000000f9573af0 00000000f9573a68 00000000f9573a68 0000000000000020
> [ 7.061286] 0000000000000000 0000000000000020 000000000000000a 000000000000000a
> [ 7.061294] 000000000000000c 00000000f9573ab8 0000000000000000 0000000000000000
> [ 7.061301] 00000000008a1038 0000000000112a50 00000000f9573a50 00000000f9573aa8
> [ 7.061314] Call Trace:
> [ 7.061321] ([<000000000011292a>] show_trace+0x8a/0xe0)
> [ 7.061327] [<0000000000112a00>] show_stack+0x80/0xd8
> [ 7.061334] [<00000000005cdce6>] dump_stack+0x96/0xd8
> [ 7.061338] [<00000000001ae352>] register_lock_class+0x1d2/0x530
> [ 7.061341] [<00000000001b33f6>] __lock_acquire+0xfe/0x7d8
> [ 7.061345] [<00000000001b4394>] lock_acquire+0x30c/0x358
> [ 7.061352] [<000000000089454c>] _raw_spin_lock_bh+0x64/0xa0
> [ 7.062171] [<000003ff81465858>] mlx5e_set_rx_mode_work+0x248/0x490 [mlx5_core]
> [ 7.062178] [<0000000000163864>] process_one_work+0x41c/0x830
> [ 7.062181] [<0000000000163f2c>] worker_thread+0x2b4/0x478
> [ 7.062186] [<000000000016c46c>] kthread+0x15c/0x170
> [ 7.062190] [<0000000000895a52>] kernel_thread_starter+0x6/0xc
> [ 7.062193] [<0000000000895a4c>] kernel_thread_starter+0x0/0xc
> [ 7.062196] INFO: lockdep is turned off.
>
> The problematic lock is net_device.addr_list_lock whose usage is
> asynchronously triggered by:
>
> mlx5e_add -> mlx5e_attach -> mlx5e_attach_netdev -> mlx5e_nic_enable
> [workq] mlx5e_set_rx_mode_work -> mlx5e_handle_netdev_addr -> mlx5e_sync_netdev_addr
>
> Initialization of this lock is triggered by:
> mlx5e_add -> register_netdev
>
> ...after the call to mlx5e_attach which is obviously racy.
>
Thanks Sebastian for the report,
indeed there is an issue, I wonder why the net_device.addr_list_lock
is initialized so late (at register_netdevice) IMHO it should be
initialized at
alloc_netdev_mqs->dev_addr_init
where all the other net_device fields are initialized!
We will handle this.
Thanks,
Saeed.
^ permalink raw reply
* RE: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat
From: David Laight @ 2016-12-13 14:42 UTC (permalink / raw)
To: 'Alexey Dobriyan'
Cc: davem@davemloft.net, netdev@vger.kernel.org, xemul@openvz.org
In-Reply-To: <CACVxJT8fT0Lbe_ojNjU9DYYO=PO+QzA=jpQb7V6_US8W9D-KTQ@mail.gmail.com>
From: Alexey Dobriyan
> Sent: 13 December 2016 14:23
...
> Well, the point of the patch is to save .text, so might as well save
> as much as possible. Any form other than "ptr[id]" is going
> to be either bigger or bigger and slower and "ptr" should be the first field.
You've not read and understood the next bit:
> > However if you offset the 'id' values so that only
> > values 2 up are valid the code becomes:
> > return net->gen2->ptr[id - 2];
> > which will be exactly the same code as:
> > return net->gen1->ptr[id];
> > but it is much more obvious that 'id' values must be >= 2.
> >
> > The '2' should be generated from the structure offset, but with my method
> > is doesn't actually matter if it is wrong.
If you have foo->bar[id - const] then the compiler has to add the
offset of 'bar' and subtract for 'const'.
If the numbers match no add or subtract is needed.
It is much cleaner to do this by explicitly removing the offset on the
accesses than using a union.
David
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox