* [net-next PATCH V2 0/3] Fix page_pool API and dma address storage
From: Jesper Dangaard Brouer @ 2019-02-12 14:48 UTC (permalink / raw)
To: netdev, linux-mm
Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
Andrew Morton, mgorman, David S. Miller, Tariq Toukan
As pointed out by David Miller in [1] the current page_pool implementation
stores dma_addr_t in page->private. This won't work on 32-bit platforms with
64-bit DMA addresses since the page->private is an unsigned long and the
dma_addr_t a u64.
Since no driver is yet using the DMA mapping capabilities of the API let's
fix this by storing the information in 'struct page' and use that to store
and retrieve DMA addresses from network drivers.
As long as the addresses returned from dma_map_page() are aligned the first
bit, used by the compound pages code should not be set.
Ilias tested this on Espressobin driver mvneta, for which we have patches
for using the DMA API of page_pool.
[1]: https://lore.kernel.org/netdev/20181207.230655.1261252486319967024.davem@davemloft.net/
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
---
Ilias Apalodimas (1):
net: page_pool: don't use page->private to store dma_addr_t
Jesper Dangaard Brouer (2):
mm: add dma_addr_t to struct page
page_pool: use DMA_ATTR_SKIP_CPU_SYNC for DMA mappings
include/linux/mm_types.h | 7 +++++++
net/core/page_pool.c | 22 ++++++++++++++--------
2 files changed, 21 insertions(+), 8 deletions(-)
--
^ permalink raw reply
* [net-next PATCH V2 1/3] mm: add dma_addr_t to struct page
From: Jesper Dangaard Brouer @ 2019-02-12 14:49 UTC (permalink / raw)
To: netdev, linux-mm
Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
Andrew Morton, mgorman, David S. Miller, Tariq Toukan
In-Reply-To: <154998290571.8783.11827147914798438839.stgit@firesoul>
The page_pool API is using page->private to store DMA addresses.
As pointed out by David Miller we can't use that on 32-bit architectures
with 64-bit DMA
This patch adds a new dma_addr_t struct to allow storing DMA addresses
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/mm_types.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..581737bd0878 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -95,6 +95,13 @@ struct page {
*/
unsigned long private;
};
+ struct { /* page_pool used by netstack */
+ /**
+ * @dma_addr: page_pool requires a 64-bit value even on
+ * 32-bit architectures.
+ */
+ dma_addr_t dma_addr;
+ };
struct { /* slab, slob and slub */
union {
struct list_head slab_list; /* uses lru */
^ permalink raw reply related
* [net-next PATCH V2 2/3] net: page_pool: don't use page->private to store dma_addr_t
From: Jesper Dangaard Brouer @ 2019-02-12 14:49 UTC (permalink / raw)
To: netdev, linux-mm
Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
Andrew Morton, mgorman, David S. Miller, Tariq Toukan
In-Reply-To: <154998290571.8783.11827147914798438839.stgit@firesoul>
From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
As pointed out by David Miller the current page_pool implementation
stores dma_addr_t in page->private.
This won't work on 32-bit platforms with 64-bit DMA addresses since the
page->private is an unsigned long and the dma_addr_t a u64.
A previous patch is adding dma_addr_t on struct page to accommodate this.
This patch adapts the page_pool related functions to use the newly added
struct for storing and retrieving DMA addresses from network drivers.
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
net/core/page_pool.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 43a932cb609b..897a69a1477e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -136,7 +136,9 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
if (!(pool->p.flags & PP_FLAG_DMA_MAP))
goto skip_dma_map;
- /* Setup DMA mapping: use page->private for DMA-addr
+ /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
+ * since dma_addr_t can be either 32 or 64 bits and does not always fit
+ * into page private data (i.e 32bit cpu with 64bit DMA caps)
* This mapping is kept for lifetime of page, until leaving pool.
*/
dma = dma_map_page(pool->p.dev, page, 0,
@@ -146,7 +148,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
put_page(page);
return NULL;
}
- set_page_private(page, dma); /* page->private = dma; */
+ page->dma_addr = dma;
skip_dma_map:
/* When page just alloc'ed is should/must have refcnt 1. */
@@ -175,13 +177,16 @@ EXPORT_SYMBOL(page_pool_alloc_pages);
static void __page_pool_clean_page(struct page_pool *pool,
struct page *page)
{
+ dma_addr_t dma;
+
if (!(pool->p.flags & PP_FLAG_DMA_MAP))
return;
+ dma = page->dma_addr;
/* DMA unmap */
- dma_unmap_page(pool->p.dev, page_private(page),
+ dma_unmap_page(pool->p.dev, dma,
PAGE_SIZE << pool->p.order, pool->p.dma_dir);
- set_page_private(page, 0);
+ page->dma_addr = 0;
}
/* Return a page to the page allocator, cleaning up our state */
^ permalink raw reply related
* [net-next PATCH V2 3/3] page_pool: use DMA_ATTR_SKIP_CPU_SYNC for DMA mappings
From: Jesper Dangaard Brouer @ 2019-02-12 14:49 UTC (permalink / raw)
To: netdev, linux-mm
Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
Andrew Morton, mgorman, David S. Miller, Tariq Toukan
In-Reply-To: <154998290571.8783.11827147914798438839.stgit@firesoul>
As pointed out by Alexander Duyck, the DMA mapping done in page_pool needs
to use the DMA attribute DMA_ATTR_SKIP_CPU_SYNC.
As the principle behind page_pool keeping the pages mapped is that the
driver takes over the DMA-sync steps.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
---
net/core/page_pool.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 897a69a1477e..7e624c2cd709 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -141,9 +141,9 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
* into page private data (i.e 32bit cpu with 64bit DMA caps)
* This mapping is kept for lifetime of page, until leaving pool.
*/
- dma = dma_map_page(pool->p.dev, page, 0,
- (PAGE_SIZE << pool->p.order),
- pool->p.dma_dir);
+ dma = dma_map_page_attrs(pool->p.dev, page, 0,
+ (PAGE_SIZE << pool->p.order),
+ pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
if (dma_mapping_error(pool->p.dev, dma)) {
put_page(page);
return NULL;
@@ -184,8 +184,9 @@ static void __page_pool_clean_page(struct page_pool *pool,
dma = page->dma_addr;
/* DMA unmap */
- dma_unmap_page(pool->p.dev, dma,
- PAGE_SIZE << pool->p.order, pool->p.dma_dir);
+ dma_unmap_page_attr(pool->p.dev, dma,
+ PAGE_SIZE << pool->p.order, pool->p.dma_dir,
+ DMA_ATTR_SKIP_CPU_SYNC);
page->dma_addr = 0;
}
^ permalink raw reply related
* Re: [PATCH] ser_gigaset: mark expected switch fall-through
From: Gustavo A. R. Silva @ 2019-02-12 14:58 UTC (permalink / raw)
To: Paul Bolle
Cc: gigaset307x-common, netdev, linux-kernel, Kees Cook, Karsten Keil
In-Reply-To: <3dbf6ecb46ad65b3b4bddd4154e31fde2c79d13d.camel@tiscali.nl>
On 2/12/19 2:45 AM, Paul Bolle wrote:
> Gustavo A. R. Silva schreef op ma 11-02-2019 om 16:34 [-0600]:
>> In preparation to enabling -Wimplicit-fallthrough, mark switch
>> cases where we are expecting to fall through.
>>
>> This patch fixes the following warning:
>>
>> drivers/isdn/gigaset/ser-gigaset.c: In function ‘gigaset_tty_ioctl’:
>> drivers/isdn/gigaset/ser-gigaset.c:627:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
>> switch (arg) {
>> ^~~~~~
>> drivers/isdn/gigaset/ser-gigaset.c:638:2: note: here
>> default:
>> ^~~~~~~
>>
>> Warning level 3 was used: -Wimplicit-fallthrough=3
>>
>> Notice that, in this particular case, the code comment is modified
>> in accordance with what GCC is expecting to find.
>>
>> This patch is part of the ongoing efforts to enable
>> -Wimplicit-fallthrough.
>>
>> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
>
> Acked-by: Paul Bolle <pebolle@tiscali.nl>
>
Thanks, Paul.
--
Gustavo
^ permalink raw reply
* Re: [RFC, PATCH] net: page_pool: Don't use page->private to store dma_addr_t
From: Tariq Toukan @ 2019-02-12 14:58 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, Ilias Apalodimas, Matthew Wilcox, David Miller,
toke@redhat.com, netdev@vger.kernel.org,
mgorman@techsingularity.net, linux-mm@kvack.org
In-Reply-To: <20190212144938.36dd45b4@carbon>
On 2/12/2019 3:49 PM, Jesper Dangaard Brouer wrote:
> On Tue, 12 Feb 2019 12:39:59 +0000
> Tariq Toukan <tariqt@mellanox.com> wrote:
>
>> On 2/11/2019 7:14 PM, Eric Dumazet wrote:
>>>
>>> On 02/11/2019 12:53 AM, Tariq Toukan wrote:
>>>>
>>>
>>>> Hi,
>>>>
>>>> It's great to use the struct page to store its dma mapping, but I am
>>>> worried about extensibility.
>>>> page_pool is evolving, and it would need several more per-page fields.
>>>> One of them would be pageref_bias, a planned optimization to reduce the
>>>> number of the costly atomic pageref operations (and replace existing
>>>> code in several drivers).
>>>>
>>>
>>> But the point about pageref_bias is to place it in a different
>>> cache line than "struct page"
>
> Yes, exactly.
>
>
>>> The major cost is having a cache line bouncing between producer and
>>> consumer.
>>
>> pageref_bias is meant to be dirtied only by the page requester, i.e. the
>> NIC driver / page_pool.
>> All other components (basically, SKB release flow / put_page) should
>> continue working with the atomic page_refcnt, and not dirty the
>> pageref_bias.
>>
>> However, what bothers me more is another issue.
>> The optimization doesn't cleanly combine with the new page_pool
>> direction for maintaining a queue for "available" pages, as the put_page
>> flow would need to read pageref_bias, asynchronously, and act accordingly.
>>
>> The suggested hook in put_page (to catch the 2 -> 1 "biased refcnt"
>> transition) causes a problem to the traditional pageref_bias idea, as it
>> implies a new point in which the pageref_bias field is read
>> *asynchronously*. This would risk missing the this critical 2 -> 1
>> transition! Unless pageref_bias is atomic...
>
> I want to stop you here...
>
> It seems to me that you are trying to shoehorn in a refcount
> optimization into page_pool. The page_pool is optimized for the XDP
> case of one-frame-per-page, where we can avoid changing the refcount,
> and tradeoff memory usage for speed. It is compatible with the elevated
> refcount usage, but that is not the optimization target.
>
> If the case you are optimizing for is "packing" more frames in a page,
> then the page_pool might be the wrong choice. To me it would make more
> sense to create another enum xdp_mem_type, that generalize the
> pageref_bias tricks also used by some drivers.
>
Hi Jesper,
We share the same interest, I tried to combine the pageref_bias
optimization on top of the put_page hook, but turns out it doesn't fit.
That's all.
Of course, I am aware of the fact that page_pool is optimized for XDP
use cases. But, as drivers prefer a single flow for their
page-allocation management, rather than having several allocation/free
methods depending on whether XDP program is loaded or not, the
performance for non-XDP flow also matters.
I know you're not ignoring this, the fact that you're adding
compatibility for the elevated refcount usage is a key step in this
direction.
Another key benefit of page_pool is providing a netdev-optimized API
that can replace the page allocation / dma mapping logic of the
different drivers, and take it into one common shared unit.
This helps remove many LOCs from drivers, significantly improves
modularity, and eases the support of new optimizations.
By improving the general non-XDP flow (packing several packets in a
page) you encourage more and more drivers to do the transition.
We all look to further improve the page-pool performance. The
pageref_bias idea does not fit, that's fine.
We can still introduce an API for bulk page-allocation, it will improve
both XDP and non-XDP flows.
Regards,
Tariq
^ permalink raw reply
* Re: [bpf-next 1/2] tcp: replace SOCK_DEBUG() with tcp_stats()
From: Eric Dumazet @ 2019-02-12 15:07 UTC (permalink / raw)
To: Yafang Shao, daniel, ast
Cc: yhs, brakmo, edumazet, davem, netdev, linux-kernel, shaoyafang
In-Reply-To: <1549971097-12627-2-git-send-email-laoar.shao@gmail.com>
On 02/12/2019 03:31 AM, Yafang Shao wrote:
> SOCK_DEBUG is a very ancient debugging interface, and it's not very useful
> for debugging.
> So this patch removes the SOCK_DEBUG() and introduce a new function
> tcp_stats() to trace this kind of events.
> Some MIBs are added for these events.
>
> Regarding the SO_DEBUG in sock_{s,g}etsockopt, I think it is better to
> keep as-is, because if we return an errno to tell the application that
> this optname isn't supported for TCP, it may break the application.
> The application still can use this option but don't take any effect for
> TCP.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> include/uapi/linux/snmp.h | 3 +++
> net/ipv4/proc.c | 3 +++
> net/ipv4/tcp_input.c | 26 +++++++++++---------------
> net/ipv6/tcp_ipv6.c | 2 --
> 4 files changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index 86dc24a..fd5c09c 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -283,6 +283,9 @@ enum
> LINUX_MIB_TCPACKCOMPRESSED, /* TCPAckCompressed */
> LINUX_MIB_TCPZEROWINDOWDROP, /* TCPZeroWindowDrop */
> LINUX_MIB_TCPRCVQDROP, /* TCPRcvQDrop */
> + LINUX_MIB_TCPINVALIDACK, /* TCPInvalidAck */
> + LINUX_MIB_TCPOLDACK, /* TCPOldAck */
> + LINUX_MIB_TCPPARTIALPACKET, /* TCPPartialPacket */
> __LINUX_MIB_MAX
> };
>
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index c3610b3..1b0320a 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -291,6 +291,9 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
> SNMP_MIB_ITEM("TCPAckCompressed", LINUX_MIB_TCPACKCOMPRESSED),
> SNMP_MIB_ITEM("TCPZeroWindowDrop", LINUX_MIB_TCPZEROWINDOWDROP),
> SNMP_MIB_ITEM("TCPRcvQDrop", LINUX_MIB_TCPRCVQDROP),
> + SNMP_MIB_ITEM("TCPInvalidAck", LINUX_MIB_TCPINVALIDACK),
> + SNMP_MIB_ITEM("TCPOldAck", LINUX_MIB_TCPOLDACK),
> + SNMP_MIB_ITEM("TCPPartialPacket", LINUX_MIB_TCPPARTIALPACKET),
> SNMP_MIB_SENTINEL
> };
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 7a027dec..88deb1f 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3554,6 +3554,11 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
> return delivered;
> }
>
> +static void tcp_stats(struct sock *sk, int mib_idx)
> +{
> + NET_INC_STATS(sock_net(sk), mib_idx);
> +}
This is not a very descriptive name.
Why is it static, and in net/ipv4/tcp_input.c ???
> +
> /* This routine deals with incoming acks, but not outgoing ones. */
> static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
> {
> @@ -3715,7 +3720,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
> return 1;
>
> invalid_ack:
> - SOCK_DEBUG(sk, "Ack %u after %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
> + tcp_stats(sk, LINUX_MIB_TCPINVALIDACK);
> return -1;
>
> old_ack:
> @@ -3731,7 +3736,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
> tcp_xmit_recovery(sk, rexmit);
> }
>
> - SOCK_DEBUG(sk, "Ack %u before %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
> + tcp_stats(sk, LINUX_MIB_TCPOLDACK);
> return 0;
> }
>
These counters will add noise to an already crowded MIB space.
What bug do you expect to track and fix with these ?
I see many TCP patches coming adding icache pressure, enabling companies to build their own modified
TCP stack, but no real meat.
^ permalink raw reply
* [PATCH] NETWORKING: avoid use IPCB in cipso_v4_error
From: Nazarov Sergey @ 2019-02-12 15:10 UTC (permalink / raw)
To: netdev@vger.kernel.org
Cc: linux-security-module@vger.kernel.org, davem, kuznet, yoshfuji,
Paul Moore
In-Reply-To: <CAHC9VhSDot+uNDi7ULoT7dry4FW9Jgp=NcqDh6TmA=z--kmqkA@mail.gmail.com>
Since cipso_v4_error might be called from different network stack layers, we can't safely use icmp_send there.
icmp_send copies IP options with ip_option_echo, which uses IPCB to take access to IP header compiled data.
But after commit 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses"), IPCB can't be used
above IP layer.
This patch fixes the problem by creating in cipso_v4_error a local copy of compiled IP options and using it with
introduced __icmp_send function. This looks some overloaded, but in quite rare error conditions only.
The original discussion is here:
https://lore.kernel.org/linux-security-module/16659801547571984@sas1-890ba5c2334a.qloud-c.yandex.net/
Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
---
include/net/icmp.h | 9 ++++++++-
net/ipv4/cipso_ipv4.c | 18 ++++++++++++++++--
net/ipv4/icmp.c | 7 ++++---
3 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/include/net/icmp.h b/include/net/icmp.h
index 6ac3a5b..e0f709d 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -22,6 +22,7 @@
#include <net/inet_sock.h>
#include <net/snmp.h>
+#include <net/ip.h>
struct icmp_err {
int errno;
@@ -39,7 +40,13 @@ struct icmp_err {
struct sk_buff;
struct net;
-void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
+void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
+ const struct ip_options *opt);
+static inline void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
+{
+ __icmp_send(skb_in, type, code, info, &IPCB(skb_in)->opt);
+}
+
int icmp_rcv(struct sk_buff *skb);
int icmp_err(struct sk_buff *skb, u32 info);
int icmp_init(void);
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 777fa3b..234d12e 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1735,13 +1735,27 @@ int cipso_v4_validate(const struct sk_buff *skb, unsigned char **option)
*/
void cipso_v4_error(struct sk_buff *skb, int error, u32 gateway)
{
+ unsigned char optbuf[sizeof(struct ip_options) + 40];
+ struct ip_options *opt = (struct ip_options *)optbuf;
+
if (ip_hdr(skb)->protocol == IPPROTO_ICMP || error != -EACCES)
return;
+ /*
+ * We might be called above the IP layer,
+ * so we can not use icmp_send and IPCB here.
+ */
+
+ memset(opt, 0, sizeof(struct ip_options));
+ opt->optlen = ip_hdr(skb)->ihl*4 - sizeof(struct iphdr);
+ memcpy(opt->__data, (unsigned char *)&(ip_hdr(skb)[1]), opt->optlen);
+ if (ip_options_compile(dev_net(skb->dev), opt, NULL))
+ return;
+
if (gateway)
- icmp_send(skb, ICMP_DEST_UNREACH, ICMP_NET_ANO, 0);
+ __icmp_send(skb, ICMP_DEST_UNREACH, ICMP_NET_ANO, 0, opt);
else
- icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_ANO, 0);
+ __icmp_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_ANO, 0, opt);
}
/**
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 065997f..3f24414 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -570,7 +570,8 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
* MUST reply to only the first fragment.
*/
-void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
+void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
+ const struct ip_options *opt)
{
struct iphdr *iph;
int room;
@@ -691,7 +692,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
iph->tos;
mark = IP4_REPLY_MARK(net, skb_in->mark);
- if (ip_options_echo(net, &icmp_param.replyopts.opt.opt, skb_in))
+ if (__ip_options_echo(net, &icmp_param.replyopts.opt.opt, skb_in, opt))
goto out_unlock;
@@ -742,7 +743,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
local_bh_enable();
out:;
}
-EXPORT_SYMBOL(icmp_send);
+EXPORT_SYMBOL(__icmp_send);
static void icmp_socket_deliver(struct sk_buff *skb, u32 info)
--
^ permalink raw reply related
* Re: [bpf-next 2/2] bpf: add BPF_SOCK_OPS_STATS_CB for tcp_stats()
From: Eric Dumazet @ 2019-02-12 15:11 UTC (permalink / raw)
To: Yafang Shao, daniel, ast
Cc: yhs, brakmo, edumazet, davem, netdev, linux-kernel, shaoyafang
In-Reply-To: <1549971097-12627-3-git-send-email-laoar.shao@gmail.com>
On 02/12/2019 03:31 AM, Yafang Shao wrote:
> Introuce this new op BPF_SOCK_OPS_STATS_CB for tcp_stats() such that it
> can be traced via BPF on a per socket basis.
> There's one argument in BPF_SOCK_OPS_STATS_CB, which is Linux MIB index
> LINUX_MIB_* to indicate the TCP event.
> All these Linux MIBs are defined in include/uapi/linux/snmp.h.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> include/uapi/linux/bpf.h | 5 +++++
> net/ipv4/tcp_input.c | 1 +
> 2 files changed, 6 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1777fa0..0314ddd 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2894,6 +2894,11 @@ enum {
> BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after
> * socket transition to LISTEN state.
> */
> + BPF_SOCK_OPS_STATS_CB, /*
> + * Called on tcp_stats().
> + * Arg1: Linux MIB index
> + * LINUX_MIB_*
> + */
> };
>
> /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 88deb1f..4acf458 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3557,6 +3557,7 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
> static void tcp_stats(struct sock *sk, int mib_idx)
> {
> NET_INC_STATS(sock_net(sk), mib_idx);
> + tcp_call_bpf(sk, BPF_SOCK_OPS_STATS_CB, 1, &mib_idx);
> }
>
> /* This routine deals with incoming acks, but not outgoing ones. */
>
If the plan is to add to all NET_INC_STATS() calls in TCP an additional tcp_call_bpf()
I will say no.
So far, tcp_call_bpf() has not been used in fast path.
^ permalink raw reply
* Re: [RFC, PATCH] net: page_pool: Don't use page->private to store dma_addr_t
From: Eric Dumazet @ 2019-02-12 15:15 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Ilias Apalodimas, Matthew Wilcox,
brouer@redhat.com
Cc: David Miller, toke@redhat.com, netdev@vger.kernel.org,
mgorman@techsingularity.net, linux-mm@kvack.org
In-Reply-To: <27e97aac-f25b-d46c-3e70-7d0d44f784b5@mellanox.com>
On 02/12/2019 04:39 AM, Tariq Toukan wrote:
>
>
> On 2/11/2019 7:14 PM, Eric Dumazet wrote:
>>
>>
>> On 02/11/2019 12:53 AM, Tariq Toukan wrote:
>>>
>>
>>> Hi,
>>>
>>> It's great to use the struct page to store its dma mapping, but I am
>>> worried about extensibility.
>>> page_pool is evolving, and it would need several more per-page fields.
>>> One of them would be pageref_bias, a planned optimization to reduce the
>>> number of the costly atomic pageref operations (and replace existing
>>> code in several drivers).
>>>
>>
>> But the point about pageref_bias is to place it in a different cache line than "struct page"
>>
>> The major cost is having a cache line bouncing between producer and consumer.
>>
>
> pageref_bias is meant to be dirtied only by the page requester, i.e. the
> NIC driver / page_pool.
> All other components (basically, SKB release flow / put_page) should
> continue working with the atomic page_refcnt, and not dirty the
> pageref_bias.
This is exactly my point.
You suggested to put pageref_bias in struct page, which breaks this completely.
pageref_bias is better kept in a driver structure, with appropriate prefetching
since most NIC use a ring buffer for their queues.
The dma address _can_ be put in the struct page, since the driver does not dirty it
and does not even read it when page can be recycled.
^ permalink raw reply
* [PATCH net-next 1/7] net/smc: reset cursor update required flag
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>
From: Karsten Graul <kgraul@linux.ibm.com>
When an updated rx_cursor_confirmed field was sent to the peer then
reset the cons_curs_upd_req flag. And remove the duplicate reset and
cursor update in smc_tx_consumer_update().
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
net/smc/smc_cdc.c | 5 ++++-
net/smc/smc_tx.c | 3 ---
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index a712c9f8699b..99d9d6e85dfb 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -105,8 +105,10 @@ int smc_cdc_msg_send(struct smc_connection *conn,
&conn->local_tx_ctrl, conn);
smc_curs_copy(&cfed, &((struct smc_host_cdc_msg *)wr_buf)->cons, conn);
rc = smc_wr_tx_send(link, (struct smc_wr_tx_pend_priv *)pend);
- if (!rc)
+ if (!rc) {
smc_curs_copy(&conn->rx_curs_confirmed, &cfed, conn);
+ conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
+ }
return rc;
}
@@ -194,6 +196,7 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
if (rc)
return rc;
smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
+ conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
/* Calculate transmitted data and increment free send buffer space */
diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
&conn->tx_curs_sent);
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index f93f3580c100..ce9586bce364 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -610,9 +610,6 @@ void smc_tx_consumer_update(struct smc_connection *conn, bool force)
SMC_TX_WORK_DELAY);
return;
}
- smc_curs_copy(&conn->rx_curs_confirmed,
- &conn->local_tx_ctrl.cons, conn);
- conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
}
if (conn->local_rx_ctrl.prod_flags.write_blocked &&
!atomic_read(&conn->bytes_to_rcv))
--
2.16.4
^ permalink raw reply related
* [PATCH net-next 2/7] net/smc: move wake up of close waiter
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>
From: Karsten Graul <kgraul@linux.ibm.com>
Move the call to smc_close_wake_tx_prepared() (which wakes up a possibly
waiting close processing that might wait for 'all data sent') to
smc_tx_sndbuf_nonempty() (which is the main function to send data).
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
net/smc/smc_cdc.c | 2 --
net/smc/smc_tx.c | 7 +++++++
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 99d9d6e85dfb..780b36c69292 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -290,8 +290,6 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
/* trigger sndbuf consumer: RDMA write into peer RMBE and CDC */
if (diff_cons && smc_tx_prepared_sends(conn)) {
smc_tx_sndbuf_nonempty(conn);
- /* trigger socket release if connection closed */
- smc_close_wake_tx_prepared(smc);
}
if (diff_cons && conn->urg_tx_pend &&
atomic_read(&conn->peer_rmbe_space) == conn->peer_rmbe_size) {
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index ce9586bce364..dd10a913b38e 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -24,6 +24,7 @@
#include "smc.h"
#include "smc_wr.h"
#include "smc_cdc.h"
+#include "smc_close.h"
#include "smc_ism.h"
#include "smc_tx.h"
@@ -554,6 +555,12 @@ int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
else
rc = smcr_tx_sndbuf_nonempty(conn);
+ if (!rc) {
+ /* trigger socket release if connection is closing */
+ struct smc_sock *smc = container_of(conn, struct smc_sock,
+ conn);
+ smc_close_wake_tx_prepared(smc);
+ }
return rc;
}
--
2.16.4
^ permalink raw reply related
* [PATCH net-next 3/7] net/smc: no delay for free tx buffer wait
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>
From: Karsten Graul <kgraul@linux.ibm.com>
When no free transfer buffers are available then a work to call
smc_tx_work() is scheduled. Set the schedule delay to zero, because for
the out-of-buffers condition the work can start immediately and will
block in the called function smc_wr_tx_get_free_slot(), waiting for free
buffers.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
net/smc/smc_tx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index dd10a913b38e..a3bff08ff8c8 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -28,7 +28,7 @@
#include "smc_ism.h"
#include "smc_tx.h"
-#define SMC_TX_WORK_DELAY HZ
+#define SMC_TX_WORK_DELAY 0
#define SMC_TX_CORK_DELAY (HZ >> 2) /* 250 ms */
/***************************** sndbuf producer *******************************/
--
2.16.4
^ permalink raw reply related
* [PATCH net-next 4/7] net/smc: reduce amount of status updates to peer
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>
From: Karsten Graul <kgraul@linux.ibm.com>
In smc_cdc_msg_recv_action() the received cdc message is evaluated.
To reduce the number of messaged triggered by this evaluation the logic
is streamlined. For the write_blocked condition we do not need to send
a response immediately. The remaining conditions can be put together
into one if clause.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
net/smc/smc_cdc.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 780b36c69292..28bbdb04bc35 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -273,24 +273,18 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
smp_mb__after_atomic();
smc->sk.sk_data_ready(&smc->sk);
} else {
- if (conn->local_rx_ctrl.prod_flags.write_blocked ||
- conn->local_rx_ctrl.prod_flags.cons_curs_upd_req ||
- conn->local_rx_ctrl.prod_flags.urg_data_pending) {
- if (conn->local_rx_ctrl.prod_flags.urg_data_pending)
- conn->urg_state = SMC_URG_NOTYET;
- /* force immediate tx of current consumer cursor, but
- * under send_lock to guarantee arrival in seqno-order
- */
- if (smc->sk.sk_state != SMC_INIT)
- smc_tx_sndbuf_nonempty(conn);
- }
+ if (conn->local_rx_ctrl.prod_flags.write_blocked)
+ smc->sk.sk_data_ready(&smc->sk);
+ if (conn->local_rx_ctrl.prod_flags.urg_data_pending)
+ conn->urg_state = SMC_URG_NOTYET;
}
- /* piggy backed tx info */
/* trigger sndbuf consumer: RDMA write into peer RMBE and CDC */
- if (diff_cons && smc_tx_prepared_sends(conn)) {
+ if ((diff_cons && smc_tx_prepared_sends(conn)) ||
+ conn->local_rx_ctrl.prod_flags.cons_curs_upd_req ||
+ conn->local_rx_ctrl.prod_flags.urg_data_pending)
smc_tx_sndbuf_nonempty(conn);
- }
+
if (diff_cons && conn->urg_tx_pend &&
atomic_read(&conn->peer_rmbe_space) == conn->peer_rmbe_size) {
/* urg data confirmed by peer, indicate we're ready for more */
--
2.16.4
^ permalink raw reply related
* [PATCH net-next 6/7] net/smc: check port_idx of ib event
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>
From: Karsten Graul <kgraul@linux.ibm.com>
For robustness protect of higher port numbers than expected to avoid
setting bits behind our port_event_mask. In case of an DEVICE_FATAL
event all ports must be checked. The IB_EVENT_GID_CHANGE event is
provided in the global event handler, so handle it there. And handle a
QP_FATAL event instead of an DEVICE_FATAL event in the qp handler.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
net/smc/smc_ib.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 76487a16934e..0b244be24fe0 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -257,12 +257,20 @@ static void smc_ib_global_event_handler(struct ib_event_handler *handler,
smcibdev = container_of(handler, struct smc_ib_device, event_handler);
switch (ibevent->event) {
- case IB_EVENT_PORT_ERR:
case IB_EVENT_DEVICE_FATAL:
+ /* terminate all ports on device */
+ for (port_idx = 0; port_idx < SMC_MAX_PORTS; port_idx++)
+ set_bit(port_idx, &smcibdev->port_event_mask);
+ schedule_work(&smcibdev->port_event_work);
+ break;
+ case IB_EVENT_PORT_ERR:
case IB_EVENT_PORT_ACTIVE:
+ case IB_EVENT_GID_CHANGE:
port_idx = ibevent->element.port_num - 1;
- set_bit(port_idx, &smcibdev->port_event_mask);
- schedule_work(&smcibdev->port_event_work);
+ if (port_idx < SMC_MAX_PORTS) {
+ set_bit(port_idx, &smcibdev->port_event_mask);
+ schedule_work(&smcibdev->port_event_work);
+ }
break;
default:
break;
@@ -294,13 +302,13 @@ static void smc_ib_qp_event_handler(struct ib_event *ibevent, void *priv)
u8 port_idx;
switch (ibevent->event) {
- case IB_EVENT_DEVICE_FATAL:
- case IB_EVENT_GID_CHANGE:
- case IB_EVENT_PORT_ERR:
+ case IB_EVENT_QP_FATAL:
case IB_EVENT_QP_ACCESS_ERR:
port_idx = ibevent->element.qp->port - 1;
- set_bit(port_idx, &smcibdev->port_event_mask);
- schedule_work(&smcibdev->port_event_work);
+ if (port_idx < SMC_MAX_PORTS) {
+ set_bit(port_idx, &smcibdev->port_event_mask);
+ schedule_work(&smcibdev->port_event_work);
+ }
break;
default:
break;
--
2.16.4
^ permalink raw reply related
* [PATCH net-next 7/7] MAINTAINERS: add Karsten as SMC maintainer
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>
Add Karsten as additional maintainer for Shared Memory Communications
(SMC) Sockets.
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
MAINTAINERS | 1 +
1 file changed, 1 insertion(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 604bca2fc05d..f3af5cde6456 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13760,6 +13760,7 @@ F: drivers/misc/sgi-xp/
SHARED MEMORY COMMUNICATIONS (SMC) SOCKETS
M: Ursula Braun <ubraun@linux.ibm.com>
+M: Karsten Graul <kgraul@linux.ibm.com>
L: linux-s390@vger.kernel.org
W: http://www.ibm.com/developerworks/linux/linux390/
S: Supported
--
2.16.4
^ permalink raw reply related
* [PATCH net-next 5/7] net/smc: check connections in smc_lgr_free_work
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20190212152956.71041-1-ubraun@linux.ibm.com>
From: Karsten Graul <kgraul@linux.ibm.com>
Remove the shortcut that smc_lgr_free() would skip the check for
existing connections when the link group is not in the link group list.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
---
net/smc/smc_core.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 349d789a9728..53a17cfa61af 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -160,8 +160,6 @@ static void smc_lgr_free_work(struct work_struct *work)
bool conns;
spin_lock_bh(&smc_lgr_list.lock);
- if (list_empty(&lgr->list))
- goto free;
read_lock_bh(&lgr->conns_lock);
conns = RB_EMPTY_ROOT(&lgr->conns_all);
read_unlock_bh(&lgr->conns_lock);
@@ -169,8 +167,8 @@ static void smc_lgr_free_work(struct work_struct *work)
spin_unlock_bh(&smc_lgr_list.lock);
return;
}
- list_del_init(&lgr->list); /* remove from smc_lgr_list */
-free:
+ if (!list_empty(&lgr->list))
+ list_del_init(&lgr->list); /* remove from smc_lgr_list */
spin_unlock_bh(&smc_lgr_list.lock);
if (!lgr->is_smcd && !lgr->terminating) {
--
2.16.4
^ permalink raw reply related
* [PATCH net-next 0/7] net/smc: patches 2019-02-12
From: Ursula Braun @ 2019-02-12 15:29 UTC (permalink / raw)
To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
Dave,
here are patches for SMC:
* patches 1 and 3 optimize SMC-R tx logic
* patch 2 is a cleanup without functional change
* patch 4 optimizes rx logic
* patches 5 and 6 improve robustness in link group and IB event handling
* patch 7 establishes Karsten Graul as another SMC maintainer
Thanks, Ursula
Karsten Graul (6):
net/smc: reset cursor update required flag
net/smc: move wake up of close waiters to main send function
net/smc: no delay when waiting for free transfer buffers
net/smc: reduce amount of status updates to peer
net/smc: always check for connections in smc_lgr_free_work()
net/smc: check port_idx of ib event, update handled events
Ursula Braun (1):
MAINTAINERS: add Karsten as SMC maintainer
MAINTAINERS | 1 +
net/smc/smc_cdc.c | 29 ++++++++++++-----------------
net/smc/smc_core.c | 6 ++----
net/smc/smc_ib.c | 24 ++++++++++++++++--------
net/smc/smc_tx.c | 12 ++++++++----
5 files changed, 39 insertions(+), 33 deletions(-)
--
2.16.4
^ permalink raw reply
* Re: [iproute PATCH] man: ip-link: Describe promisc mode
From: Phil Sutter @ 2019-02-12 15:42 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, William Flanagan
In-Reply-To: <20190211113610.6c6604e4@hermes.lan>
Hi,
On Mon, Feb 11, 2019 at 11:36:10AM -0800, Stephen Hemminger wrote:
> On Mon, 11 Feb 2019 10:17:06 +0100
> Phil Sutter <phil@nwl.cc> wrote:
[...]
> > diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
> > index 73d37c190fffa..5c327f01b6b45 100644
> > --- a/man/man8/ip-link.8.in
> > +++ b/man/man8/ip-link.8.in
> > @@ -1780,6 +1780,14 @@ flag on the device. Indicates that address can change when interface goes down (
> > .B NOT
> > used by the Linux).
> >
> > +.TP
> > +.BR "promisc on " or " promisc off"
> > +change the
> > +.B PROMISC
> > +flag on the device. This requests receipt of all packets arriving at the NIC
> > +irrespective of their destination MAC address. It is typically used by traffic
> > +sniffers and also set by Linux bridges for their ports.
>
> This added sentence is confusing. The Linux bridge enables it by default,
> and if a sniffer wants to enable it then it is best done from the application.
> In either case the user should not need to directly set this through ip commands.
Well, "used by traffic sniffers" does not imply they don't set it by
themselves (at least not in the German accent I'm reading it :). And
there probably are ones that don't.
> Yes, there are a lots of incorrect web pages out there that say you need to
> set an interface into promiscious mode (with ifconfig) before adding it to a bridge.
> That might have been true 20 years ago, but hasn't been needed since Linux 2.4
In this case ip-link.8 would become a resource pointing out that bridges
do that by themselves nowadays.
> Bottom line, adding this to the documentation is not going to be helpful.
OK, so I'll send a v2 with that last sentence removed?
Thanks, Phil
^ permalink raw reply
* Re: [PATCH net] sctp: call gso_reset_checksum when computing checksum in sctp_gso_segment
From: Neil Horman @ 2019-02-12 15:46 UTC (permalink / raw)
To: Xin Long
Cc: linux-kernel, network dev, linux-sctp, davem,
Marcelo Ricardo Leitner, hange-folder>?
In-Reply-To: <5b8187d1eabd52e4db7d3e4506d98c33571c1c83.1549968450.git.lucien.xin@gmail.com>
On Tue, Feb 12, 2019 at 06:47:30PM +0800, Xin Long wrote:
> Jianlin reported a panic when running sctp gso over gre over vlan device:
>
> [ 84.772930] RIP: 0010:do_csum+0x6d/0x170
> [ 84.790605] Call Trace:
> [ 84.791054] csum_partial+0xd/0x20
> [ 84.791657] gre_gso_segment+0x2c3/0x390
> [ 84.792364] inet_gso_segment+0x161/0x3e0
> [ 84.793071] skb_mac_gso_segment+0xb8/0x120
> [ 84.793846] __skb_gso_segment+0x7e/0x180
> [ 84.794581] validate_xmit_skb+0x141/0x2e0
> [ 84.795297] __dev_queue_xmit+0x258/0x8f0
> [ 84.795949] ? eth_header+0x26/0xc0
> [ 84.796581] ip_finish_output2+0x196/0x430
> [ 84.797295] ? skb_gso_validate_network_len+0x11/0x80
> [ 84.798183] ? ip_finish_output+0x169/0x270
> [ 84.798875] ip_output+0x6c/0xe0
> [ 84.799413] ? ip_append_data.part.50+0xc0/0xc0
> [ 84.800145] iptunnel_xmit+0x144/0x1c0
> [ 84.800814] ip_tunnel_xmit+0x62d/0x930 [ip_tunnel]
> [ 84.801699] gre_tap_xmit+0xac/0xf0 [ip_gre]
> [ 84.802395] dev_hard_start_xmit+0xa5/0x210
> [ 84.803086] sch_direct_xmit+0x14f/0x340
> [ 84.803733] __dev_queue_xmit+0x799/0x8f0
> [ 84.804472] ip_finish_output2+0x2e0/0x430
> [ 84.805255] ? skb_gso_validate_network_len+0x11/0x80
> [ 84.806154] ip_output+0x6c/0xe0
> [ 84.806721] ? ip_append_data.part.50+0xc0/0xc0
> [ 84.807516] sctp_packet_transmit+0x716/0xa10 [sctp]
> [ 84.808337] sctp_outq_flush+0xd7/0x880 [sctp]
>
> It was caused by SKB_GSO_CB(skb)->csum_start not set in sctp_gso_segment.
> sctp_gso_segment() calls skb_segment() with 'feature | NETIF_F_HW_CSUM',
> which causes SKB_GSO_CB(skb)->csum_start not to be set in skb_segment().
>
> For TCP/UDP, when feature supports HW_CSUM, CHECKSUM_PARTIAL will be set
> and gso_reset_checksum will be called to set SKB_GSO_CB(skb)->csum_start.
>
> So SCTP should do the same as TCP/UDP, to call gso_reset_checksum() when
> computing checksum in sctp_gso_segment.
>
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
> net/sctp/offload.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
> index 123e9f2..edfcf16 100644
> --- a/net/sctp/offload.c
> +++ b/net/sctp/offload.c
> @@ -36,6 +36,7 @@ static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
> {
> skb->ip_summed = CHECKSUM_NONE;
> skb->csum_not_inet = 0;
> + gso_reset_checksum(skb, ~0);
> return sctp_compute_cksum(skb, skb_transport_offset(skb));
> }
>
> --
> 2.1.0
>
>
>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
^ permalink raw reply
* Re: [PATCH net] sctp: set stream ext to NULL after freeing it in sctp_stream_outq_migrate
From: Neil Horman @ 2019-02-12 15:46 UTC (permalink / raw)
To: Xin Long
Cc: linux-kernel, network dev, linux-sctp, davem,
Marcelo Ricardo Leitner
In-Reply-To: <0cb9e543c21495df48c3723044d6c9f64f238eca.1549968661.git.lucien.xin@gmail.com>
On Tue, Feb 12, 2019 at 06:51:01PM +0800, Xin Long wrote:
> In sctp_stream_init(), after sctp_stream_outq_migrate() freed the
> surplus streams' ext, but sctp_stream_alloc_out() returns -ENOMEM,
> stream->outcnt will not be set to 'outcnt'.
>
> With the bigger value on stream->outcnt, when closing the assoc and
> freeing its streams, the ext of those surplus streams will be freed
> again since those stream exts were not set to NULL after freeing in
> sctp_stream_outq_migrate(). Then the invalid-free issue reported by
> syzbot would be triggered.
>
> We fix it by simply setting them to NULL after freeing.
>
> Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations")
> Reported-by: syzbot+58e480e7b28f2d890bfd@syzkaller.appspotmail.com
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
> net/sctp/stream.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> index f246331..2936ed1 100644
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -144,8 +144,10 @@ static void sctp_stream_outq_migrate(struct sctp_stream *stream,
> }
> }
>
> - for (i = outcnt; i < stream->outcnt; i++)
> + for (i = outcnt; i < stream->outcnt; i++) {
> kfree(SCTP_SO(stream, i)->ext);
> + SCTP_SO(stream, i)->ext = NULL;
> + }
> }
>
> static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
> --
> 2.1.0
>
>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
^ permalink raw reply
* [PATCH net] net: neterion: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
From: Yang Wei @ 2019-02-12 15:47 UTC (permalink / raw)
To: netdev; +Cc: jdmason, davem, yang.wei9, albin_yang
From: Yang Wei <yang.wei9@zte.com.cn>
dev_consume_skb_irq() should be called when skb xmit done. It makes
drop profiles(dropwatch, perf) more friendly.
Signed-off-by: Yang Wei <yang.wei9@zte.com.cn>
---
drivers/net/ethernet/neterion/s2io.c | 2 +-
drivers/net/ethernet/neterion/vxge/vxge-main.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/neterion/s2io.c b/drivers/net/ethernet/neterion/s2io.c
index 82be900..feda964 100644
--- a/drivers/net/ethernet/neterion/s2io.c
+++ b/drivers/net/ethernet/neterion/s2io.c
@@ -3055,7 +3055,7 @@ static void tx_intr_handler(struct fifo_info *fifo_data)
/* Updating the statistics block */
swstats->mem_freed += skb->truesize;
- dev_kfree_skb_irq(skb);
+ dev_consume_skb_irq(skb);
get_info.offset++;
if (get_info.offset == get_info.fifo_len + 1)
diff --git a/drivers/net/ethernet/neterion/vxge/vxge-main.c b/drivers/net/ethernet/neterion/vxge/vxge-main.c
index 5ae3fa8..76410ae 100644
--- a/drivers/net/ethernet/neterion/vxge/vxge-main.c
+++ b/drivers/net/ethernet/neterion/vxge/vxge-main.c
@@ -114,7 +114,7 @@ static inline void VXGE_COMPLETE_VPATH_TX(struct vxge_fifo *fifo)
/* free SKBs */
for (temp = completed; temp != skb_ptr; temp++)
- dev_kfree_skb_irq(*temp);
+ dev_consume_skb_irq(*temp);
} while (more);
}
--
2.7.4
^ permalink raw reply related
* [PATCH net] net: atheros: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
From: Yang Wei @ 2019-02-12 15:51 UTC (permalink / raw)
To: netdev; +Cc: jcliburn, chris.snook, davem, yang.wei9, albin_yang
From: Yang Wei <yang.wei9@zte.com.cn>
dev_consume_skb_irq() should be called when skb xmit done. It makes
drop profiles(dropwatch, perf) more friendly.
Signed-off-by: Yang Wei <yang.wei9@zte.com.cn>
---
drivers/net/ethernet/atheros/atl1e/atl1e_main.c | 2 +-
drivers/net/ethernet/atheros/atlx/atl1.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
index 3164aad..9dfe6a9 100644
--- a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
+++ b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
@@ -1259,7 +1259,7 @@ static bool atl1e_clean_tx_irq(struct atl1e_adapter *adapter)
}
if (tx_buffer->skb) {
- dev_kfree_skb_irq(tx_buffer->skb);
+ dev_consume_skb_irq(tx_buffer->skb);
tx_buffer->skb = NULL;
}
diff --git a/drivers/net/ethernet/atheros/atlx/atl1.c b/drivers/net/ethernet/atheros/atlx/atl1.c
index 63edc57..9e07b469 100644
--- a/drivers/net/ethernet/atheros/atlx/atl1.c
+++ b/drivers/net/ethernet/atheros/atlx/atl1.c
@@ -2088,7 +2088,7 @@ static int atl1_intr_tx(struct atl1_adapter *adapter)
}
if (buffer_info->skb) {
- dev_kfree_skb_irq(buffer_info->skb);
+ dev_consume_skb_irq(buffer_info->skb);
buffer_info->skb = NULL;
}
--
2.7.4
^ permalink raw reply related
* [PATCH net] net: apple: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
From: Yang Wei @ 2019-02-12 15:52 UTC (permalink / raw)
To: netdev; +Cc: davem, yang.wei9, albin_yang
From: Yang Wei <yang.wei9@zte.com.cn>
dev_consume_skb_irq() should be called in mace_interrupt() when skb
xmit done. It makes drop profiles(dropwatch, perf) more friendly.
Signed-off-by: Yang Wei <yang.wei9@zte.com.cn>
---
drivers/net/ethernet/apple/mace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/apple/mace.c b/drivers/net/ethernet/apple/mace.c
index 68b9ee4..4d9819d 100644
--- a/drivers/net/ethernet/apple/mace.c
+++ b/drivers/net/ethernet/apple/mace.c
@@ -764,7 +764,7 @@ static irqreturn_t mace_interrupt(int irq, void *dev_id)
dev->stats.tx_bytes += mp->tx_bufs[i]->len;
++dev->stats.tx_packets;
}
- dev_kfree_skb_irq(mp->tx_bufs[i]);
+ dev_consume_skb_irq(mp->tx_bufs[i]);
--mp->tx_active;
if (++i >= N_TX_RING)
i = 0;
--
2.7.4
^ permalink raw reply related
* [PATCH net] net: qualcomm: emac: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
From: Yang Wei @ 2019-02-12 15:49 UTC (permalink / raw)
To: netdev; +Cc: timur, davem, yang.wei9, albin_yang
From: Yang Wei <yang.wei9@zte.com.cn>
dev_consume_skb_irq() should be called in emac_mac_tx_process() when
skb xmit done. It makes drop profiles(dropwatch, perf) more friendly.
Signed-off-by: Yang Wei <yang.wei9@zte.com.cn>
---
drivers/net/ethernet/qualcomm/emac/emac-mac.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
index 8d79031..20d2400 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
@@ -1204,7 +1204,7 @@ void emac_mac_tx_process(struct emac_adapter *adpt, struct emac_tx_queue *tx_q)
if (tpbuf->skb) {
pkts_compl++;
bytes_compl += tpbuf->skb->len;
- dev_kfree_skb_irq(tpbuf->skb);
+ dev_consume_skb_irq(tpbuf->skb);
tpbuf->skb = NULL;
}
--
2.7.4
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox