* Re: [ovs-discuss] [ovs-dev] [PATCH] datapath: fix crash when ipv6 fragment pkt recalculate L4 checksum
From: Tonghao Zhang @ 2021-12-10 4:48 UTC (permalink / raw)
To: zhounan (E)
Cc: netdev@vger.kernel.org, dev@openvswitch.org, bugs@openvswitch.org,
pravin.ovn@gmail.com, Lichunhe, liucheng (J),
Hejiajun (he jiajun, SOCF&uDF ), Greg Rose
In-Reply-To: <396da6f61fa948ac854531e935921dfc@huawei.com>
On Fri, Dec 10, 2021 at 10:59 AM zhounan (E) via discuss
<ovs-discuss@openvswitch.org> wrote:
>
> From: Zhou Nan <zhounan14@huawei.com>
>
> When we set an IPv6 address, we need to recalculate the checksum of the L4 header.
> In our test case, after sending IPv6 fragment packets, KASAN detected a "use after free" in update_ipv6_checksum(), and a crash occurred after a while.
> If an IPv6 packet is a fragment and is not the first fragment, we should not recalculate the L4 checksum, since such a packet has no
> L4 header.
> To prevent the crash, we set "recalc_csum" to "false" when calling set_ipv6_addr() for later fragments.
> We also found that calling skb_ensure_writable() (to make sure the L4 header is writable) before inet_proto_csum_replace16() is helpful when recalculating the checksum.
>
> Fixes: ada5efce102d ("datapath: Fix IPv6 later frags parsing")
>
> Signed-off-by: Zhou Nan <zhounan14@huawei.com>
> ---
> datapath/actions.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/datapath/actions.c b/datapath/actions.c
> index fbf4457..52cf03e 100644
> --- a/datapath/actions.c
> +++ b/datapath/actions.c
> @@ -456,12 +456,21 @@ static void update_ipv6_checksum(struct sk_buff *skb, u8 l4_proto,
> __be32 addr[4], const __be32 new_addr[4]) {
> int transport_len = skb->len - skb_transport_offset(skb);
> + int err;
>
> if (l4_proto == NEXTHDR_TCP) {
> + err = skb_ensure_writable(skb, skb_transport_offset(skb) +
> + sizeof(struct tcphdr));
> + if (unlikely(err))
> + return;
> if (likely(transport_len >= sizeof(struct tcphdr)))
> inet_proto_csum_replace16(&tcp_hdr(skb)->check, skb,
> addr, new_addr, true);
> } else if (l4_proto == NEXTHDR_UDP) {
> + err = skb_ensure_writable(skb, skb_transport_offset(skb) +
> + sizeof(struct udphdr));
> + if (unlikely(err))
> + return;
> if (likely(transport_len >= sizeof(struct udphdr))) {
> struct udphdr *uh = udp_hdr(skb);
>
> @@ -473,6 +482,10 @@ static void update_ipv6_checksum(struct sk_buff *skb, u8 l4_proto,
> }
> }
> } else if (l4_proto == NEXTHDR_ICMP) {
> + err = skb_ensure_writable(skb, skb_transport_offset(skb) +
> + sizeof(struct icmp6hdr));
> + if (unlikely(err))
> + return;
> if (likely(transport_len >= sizeof(struct icmp6hdr)))
> inet_proto_csum_replace16(&icmp6_hdr(skb)->icmp6_cksum,
> skb, addr, new_addr, true);
> @@ -589,12 +602,15 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
> if (is_ipv6_mask_nonzero(mask->ipv6_src)) {
> __be32 *saddr = (__be32 *)&nh->saddr;
> __be32 masked[4];
> + bool recalc_csum = true;
>
> mask_ipv6_addr(saddr, key->ipv6_src, mask->ipv6_src, masked);
>
> if (unlikely(memcmp(saddr, masked, sizeof(masked)))) {
> + if (flow_key->ip.frag == OVS_FRAG_TYPE_LATER)
> + recalc_csum = false;
> set_ipv6_addr(skb, flow_key->ip.proto, saddr, masked,
> - true);
> + recalc_csum);
> memcpy(&flow_key->ipv6.addr.src, masked,
> sizeof(flow_key->ipv6.addr.src));
> }
> @@ -614,6 +630,8 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
> NEXTHDR_ROUTING,
> NULL, &flags)
> != NEXTHDR_ROUTING);
> + if (flow_key->ip.frag == OVS_FRAG_TYPE_LATER)
> + recalc_csum = false;
>
> set_ipv6_addr(skb, flow_key->ip.proto, daddr, masked,
> recalc_csum);
> --
> 2.27.0
>
> _______________________________________________
> discuss mailing list
As Gregory said, you should rebase your patch on upstream Linux, where
it will be reviewed on the netdev@vger.kernel.org mailing list.
The upstream OvS kernel module is at:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/openvswitch
Once the patch has been applied upstream, you can backport it.
Please see the section "Changes to Linux kernel components":
https://docs.openvswitch.org/en/latest/internals/contributing/backporting-patches/
> discuss@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
--
Best regards, Tonghao
^ permalink raw reply
* Re: [RFC PATCH v2 net-next 0/4] DSA master state tracking
From: Ansuel Smith @ 2021-12-10 3:37 UTC (permalink / raw)
To: Vladimir Oltean
Cc: netdev, David S. Miller, Jakub Kicinski, Andrew Lunn,
Vivien Didelot, Florian Fainelli
In-Reply-To: <20211209173927.4179375-1-vladimir.oltean@nxp.com>
On Thu, Dec 09, 2021 at 07:39:23PM +0200, Vladimir Oltean wrote:
> This patch set is provided solely for review purposes (therefore not to
> be applied anywhere) and for Ansuel to test whether they resolve the
> slowdown reported here:
> https://patchwork.kernel.org/project/netdevbpf/cover/20211207145942.7444-1-ansuelsmth@gmail.com/
>
> The patches posted here are mainly to offer a consistent
> "master_state_change" chain of events to switches, without duplicates,
> and always starting with operational=true and ending with
> operational=false. This way, drivers should know when they can perform
> Ethernet-based register access, and need not care about more than that.
>
> Changes in v2:
> - dropped some useless patches
> - also check master operstate.
>
> Vladimir Oltean (4):
> net: dsa: provide switch operations for tracking the master state
> net: dsa: stop updating master MTU from master.c
> net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
> net: dsa: replay master state events in
> dsa_tree_{setup,teardown}_master
>
> include/net/dsa.h | 11 +++++++
> net/dsa/dsa2.c | 80 +++++++++++++++++++++++++++++++++++++++++++---
> net/dsa/dsa_priv.h | 13 ++++++++
> net/dsa/master.c | 29 ++---------------
> net/dsa/slave.c | 27 ++++++++++++++++
> net/dsa/switch.c | 15 +++++++++
> 6 files changed, 145 insertions(+), 30 deletions(-)
>
> --
> 2.25.1
>
Hi, I tested this v2 and I still have 2 Ethernet MDIO accesses failing on init.
I don't think we have any other way to track this. Am I wrong?
Everything works correctly with this and promisc_on_master.
If you have other tests, feel free to send me more stuff to test.
(I'm starting to think the failure is caused by some delay that the switch
requires before it actually starts accepting packets, or by the reinit? But I'm
not sure... I don't know if you noticed something in the pcap.)
--
Ansuel
^ permalink raw reply
* Re: [PATCH] sh_eth: Use dev_err_probe() helper
From: patchwork-bot+netdevbpf @ 2021-12-10 3:20 UTC (permalink / raw)
To: Geert Uytterhoeven; +Cc: s.shtylyov, davem, kuba, netdev, linux-renesas-soc
In-Reply-To: <2576cc15bdbb5be636640f491bcc087a334e2c02.1638959463.git.geert+renesas@glider.be>
Hello:
This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 8 Dec 2021 11:32:07 +0100 you wrote:
> Use the dev_err_probe() helper, instead of open-coding the same
> operation.
>
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> ---
> drivers/net/ethernet/renesas/sh_eth.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
Here is the summary with links:
- sh_eth: Use dev_err_probe() helper
https://git.kernel.org/netdev/net-next/c/e5d75fc20b92
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH v3] selftests: net: Correct ping6 expected rc from 2 to 1
From: patchwork-bot+netdevbpf @ 2021-12-10 3:20 UTC (permalink / raw)
To: Jie2x Zhou
Cc: davem, kuba, shuah, dsahern, netdev, linux-kselftest,
linux-kernel, lkp, xinjianx.ma, zhijianx.li, philip.li
In-Reply-To: <20211209020230.37270-1-jie2x.zhou@intel.com>
Hello:
This patch was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 9 Dec 2021 10:02:30 +0800 you wrote:
> ./fcnal-test.sh -v -t ipv6_ping
> TEST: ping out, VRF bind - ns-B IPv6 LLA [FAIL]
> TEST: ping out, VRF bind - multicast IP [FAIL]
>
> ping6 is failing as it should.
> COMMAND: ip netns exec ns-A /bin/ping6 -c1 -w1 fe80::7c4c:bcff:fe66:a63a%red
> strace of ping6 shows it is failing with '1',
> so change the expected rc from 2 to 1.
>
> [...]
Here is the summary with links:
- [v3] selftests: net: Correct ping6 expected rc from 2 to 1
https://git.kernel.org/netdev/net/c/92816e262980
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net-next] net: x25: drop harmless check of !more
From: patchwork-bot+netdevbpf @ 2021-12-10 3:20 UTC (permalink / raw)
To: Jεan Sacren <sakiwit@gmail.com>
Cc: ms, davem, kuba, linux-x25, netdev
In-Reply-To: <20211208024732.142541-5-sakiwit@gmail.com>
Hello:
This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 8 Dec 2021 00:20:25 -0700 you wrote:
> From: Jean Sacren <sakiwit@gmail.com>
>
> 'more' is checked first. When !more is checked immediately after that,
> it is always true. We should drop this check.
>
> Signed-off-by: Jean Sacren <sakiwit@gmail.com>
>
> [...]
Here is the summary with links:
- [net-next] net: x25: drop harmless check of !more
https://git.kernel.org/netdev/net-next/c/9745177c9489
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* [PATCH] net/netfilter/x_tables.c: Use kvalloc to make your code better
From: lizhe @ 2021-12-10 3:12 UTC (permalink / raw)
To: pablo, kadlec, fw, davem, kuba, sensor1010
Cc: netfilter-devel, coreteam, netdev, linux-kernel
Use kvzalloc() instead of kvmalloc() followed by memset().
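One detail visible in the diff below: the removed memset() cleared only
sizeof(*info) bytes, while a zeroing allocator clears all sz bytes. The
following is a minimal userspace sketch of that difference, not kernel
code. malloc()/calloc() stand in for kvmalloc()/kvzalloc(), and the
struct and function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct xt_table_info. */
struct demo_table_info {
	unsigned int size;
	unsigned int number;
	unsigned char entries[];	/* variable-sized tail */
};

/* Old pattern: allocate sz bytes, then zero only the fixed header. */
static struct demo_table_info *demo_alloc_header_zeroed(size_t sz)
{
	struct demo_table_info *info = malloc(sz);

	if (!info)
		return NULL;
	memset(info, 0, sizeof(*info));	/* the tail stays uninitialized */
	return info;
}

/* New pattern: a zeroing allocator clears the whole allocation. */
static struct demo_table_info *demo_alloc_fully_zeroed(size_t sz)
{
	return calloc(1, sz);
}
```

Both variants return a zeroed header; the second also pays for zeroing
the tail, which may matter for large tables if the tail is overwritten
later anyway.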
Signed-off-by: lizhe <sensor1010@163.com>
---
net/netfilter/x_tables.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 25524e393349..8d6ffed7d526 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1189,11 +1189,10 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
return NULL;
- info = kvmalloc(sz, GFP_KERNEL_ACCOUNT);
+ info = kvzalloc(sz, GFP_KERNEL_ACCOUNT);
if (!info)
return NULL;
- memset(info, 0, sizeof(*info));
info->size = size;
return info;
}
--
2.25.1
^ permalink raw reply related
* RE: [EXT] Re: [PATCH net-next] net: stmmac: bump tc when get underflow error from DMA descriptor
From: Xiaoliang Yang @ 2021-12-10 3:11 UTC (permalink / raw)
To: Jakub Kicinski, Joakim Zhang
Cc: davem@davemloft.net, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, peppe.cavallaro@st.com,
alexandre.torgue@foss.st.com, joabreu@synopsys.com,
Yannick Vignon, boon.leong.ong@intel.com, Jose.Abreu@synopsys.com,
mst@redhat.com, Joao.Pinto@synopsys.com, Mingkai Hu, Leo Li
In-Reply-To: <20211209184123.63117f42@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Hi Jakub,
On Fri, 10 Dec 2021 10:41:00 +0800 Jakub Kicinski wrote:
> > > net: stmmac: bump tc when get underflow error from DMA descriptor
> > >
> > > In DMA threshold mode, frame underflow errors may sometimes occur
> > > when the TC(threshold control) value is not enough. The TC value
> > > need to be bumped up in this case.
> > >
> > > There is no underflow interrupt bit on DMA_CH(#i)_Status of dwmac4,
> > > so the DMA threshold cannot be bumped up in stmmac_dma_interrupt().
> > > The i.mx8mp board observed an underflow error while running NFS
> > > boot, the NFS rootfs could not be mounted.
> > >
> > > The underflow error can be got from the DMA descriptor TDES3 on
> dwmac4.
> > > This patch bump up tc value once underflow error is got from TDES3.
> > >
> > > Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
> >
> > 5 queues with FIFO cut-through mode can work well after applying this
> patch.
>
> This never worked, correct? It's not a regression fix?
Yes, it has never worked when the underflow error is observed during NFS boot
on i.MX8MP. I'm not sure whether other SoCs have the same issue in this case,
but I think it is necessary to increase the threshold value when an underflow
error occurs.
Do you mean that I need to send the patch as a bug fix to the net branch?
Regards,
Xiaoliang
^ permalink raw reply
* Re: [PATCH bpf-next] libbpf: Skip the pinning of global data map for old kernels.
From: Shuyi Cheng @ 2021-12-10 3:02 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, open list:BPF (Safe dynamic programs and tools),
open list:BPF (Safe dynamic programs and tools)
In-Reply-To: <CAEf4BzbtQGnGZTLbTdy1GHK54f5S7YNFQak7BuEfaqGEwqNNJA@mail.gmail.com>
On 12/10/21 1:26 AM, Andrii Nakryiko wrote:
> On Thu, Dec 9, 2021 at 12:44 AM Shuyi Cheng
> <chengshuyi@linux.alibaba.com> wrote:
>>
>>
>> Fix error: "failed to pin map: Bad file descriptor, path:
>> /sys/fs/bpf/_rodata_str1_1."
>>
>> In the old kernel, the global data map will not be created, see [0]. So
>> we should skip the pinning of the global data map to avoid
>> bpf_object__pin_maps returning error.
>>
>> [0]: https://lore.kernel.org/bpf/20211123200105.387855-1-andrii@kernel.org
>>
>> Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>
>> ---
>> tools/lib/bpf/libbpf.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 6db0b5e8540e..d96cf49cebab 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -7884,6 +7884,10 @@ int bpf_object__pin_maps(struct bpf_object *obj,
>> const char *path)
>> char *pin_path = NULL;
>> char buf[PATH_MAX];
>>
>> + if (bpf_map__is_internal(map) &&
>> + !kernel_supports(obj, FEAT_GLOBAL_DATA))
>
>
> doing the same check in 3 different places sucks. Let's add "bool
> skipped" to struct bpf_map, which will be set in one place (at the map
> creation time) and then check during relocation and during pinning?
>
Agree, thanks.
regards,
Shuyi
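For reference, the "bool skipped" idea could look roughly like the
userspace sketch below. This is only an illustration under assumed
names (demo_map, demo_create_maps, demo_pin_maps); it is not libbpf
code, and the real field and call sites may differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct bpf_map. */
struct demo_map {
	const char *name;
	bool internal;	/* global data map (.data, .rodata, ...) */
	bool skipped;	/* decided once, at map creation time */
};

/* Decide in one place whether each map is created at all. */
static void demo_create_maps(struct demo_map *maps, size_t n,
			     bool kernel_has_global_data)
{
	for (size_t i = 0; i < n; i++)
		maps[i].skipped = maps[i].internal && !kernel_has_global_data;
}

/* Later stages only test the flag; returns the number of maps pinned. */
static size_t demo_pin_maps(const struct demo_map *maps, size_t n)
{
	size_t pinned = 0;

	for (size_t i = 0; i < n; i++) {
		if (maps[i].skipped)
			continue;
		pinned++;	/* real code would pin the map here */
	}
	return pinned;
}
```

With this shape, relocation and pinning never repeat the feature check;
they only look at the flag set during creation.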
>> + continue;
>> +
>> if (path) {
>> int len;
>>
>> --
>> 2.19.1.6.gb485710b
^ permalink raw reply
* [ovs-dev] [PATCH] datapath: fix crash when ipv6 fragment pkt recalculate L4 checksum
From: zhounan (E) @ 2021-12-10 2:59 UTC (permalink / raw)
To: netdev@vger.kernel.org, dev@openvswitch.org, bugs@openvswitch.org
Cc: liucheng (J), Hejiajun (he jiajun, SOCF&uDF ), Lichunhe,
Gregory Rose, pravin.ovn@gmail.com
In-Reply-To: <35aa84e0d1fe4bd1ad1bf6fb61c83338@huawei.com>
From: Zhou Nan <zhounan14@huawei.com>
When we set an IPv6 address, we need to recalculate the checksum of the L4 header.
In our test case, after sending IPv6 fragment packets, KASAN detected a "use after free" in update_ipv6_checksum(), and a crash occurred after a while.
If an IPv6 packet is a fragment and is not the first fragment, we should not recalculate the L4 checksum, since such a packet has no
L4 header.
To prevent the crash, we set "recalc_csum" to "false" when calling set_ipv6_addr() for later fragments.
We also found that calling skb_ensure_writable() (to make sure the L4 header is writable) before inet_proto_csum_replace16() is helpful when recalculating the checksum.
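The decision described above can be sketched in plain userspace C as
follows. This is only a model of the rule, with hypothetical names
(demo_frag_type, demo_should_recalc_l4_csum); the enum values are
illustrative and are not the OVS definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the fragment classification of a flow. */
enum demo_frag_type {
	DEMO_FRAG_NONE,		/* not fragmented */
	DEMO_FRAG_FIRST,	/* first fragment, carries the L4 header */
	DEMO_FRAG_LATER,	/* later fragment, no L4 header present */
};

/*
 * An address rewrite changes the pseudo-header, so the L4 checksum has
 * to be patched, but only if the packet actually contains an L4 header.
 */
static bool demo_should_recalc_l4_csum(enum demo_frag_type frag)
{
	return frag != DEMO_FRAG_LATER;
}
```

For a later fragment the bytes at the transport offset are payload, so
treating them as a TCP/UDP/ICMPv6 header would patch unrelated data.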
Fixes: ada5efce102d ("datapath: Fix IPv6 later frags parsing")
Signed-off-by: Zhou Nan <zhounan14@huawei.com>
---
datapath/actions.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/datapath/actions.c b/datapath/actions.c
index fbf4457..52cf03e 100644
--- a/datapath/actions.c
+++ b/datapath/actions.c
@@ -456,12 +456,21 @@ static void update_ipv6_checksum(struct sk_buff *skb, u8 l4_proto,
__be32 addr[4], const __be32 new_addr[4]) {
int transport_len = skb->len - skb_transport_offset(skb);
+ int err;
if (l4_proto == NEXTHDR_TCP) {
+ err = skb_ensure_writable(skb, skb_transport_offset(skb) +
+ sizeof(struct tcphdr));
+ if (unlikely(err))
+ return;
if (likely(transport_len >= sizeof(struct tcphdr)))
inet_proto_csum_replace16(&tcp_hdr(skb)->check, skb,
addr, new_addr, true);
} else if (l4_proto == NEXTHDR_UDP) {
+ err = skb_ensure_writable(skb, skb_transport_offset(skb) +
+ sizeof(struct udphdr));
+ if (unlikely(err))
+ return;
if (likely(transport_len >= sizeof(struct udphdr))) {
struct udphdr *uh = udp_hdr(skb);
@@ -473,6 +482,10 @@ static void update_ipv6_checksum(struct sk_buff *skb, u8 l4_proto,
}
}
} else if (l4_proto == NEXTHDR_ICMP) {
+ err = skb_ensure_writable(skb, skb_transport_offset(skb) +
+ sizeof(struct icmp6hdr));
+ if (unlikely(err))
+ return;
if (likely(transport_len >= sizeof(struct icmp6hdr)))
inet_proto_csum_replace16(&icmp6_hdr(skb)->icmp6_cksum,
skb, addr, new_addr, true);
@@ -589,12 +602,15 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
if (is_ipv6_mask_nonzero(mask->ipv6_src)) {
__be32 *saddr = (__be32 *)&nh->saddr;
__be32 masked[4];
+ bool recalc_csum = true;
mask_ipv6_addr(saddr, key->ipv6_src, mask->ipv6_src, masked);
if (unlikely(memcmp(saddr, masked, sizeof(masked)))) {
+ if (flow_key->ip.frag == OVS_FRAG_TYPE_LATER)
+ recalc_csum = false;
set_ipv6_addr(skb, flow_key->ip.proto, saddr, masked,
- true);
+ recalc_csum);
memcpy(&flow_key->ipv6.addr.src, masked,
sizeof(flow_key->ipv6.addr.src));
}
@@ -614,6 +630,8 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
NEXTHDR_ROUTING,
NULL, &flags)
!= NEXTHDR_ROUTING);
+ if (flow_key->ip.frag == OVS_FRAG_TYPE_LATER)
+ recalc_csum = false;
set_ipv6_addr(skb, flow_key->ip.proto, daddr, masked,
recalc_csum);
--
2.27.0
^ permalink raw reply
* Re: [PATCH bpf-next 0/2] Introduce TCP_ULP option for bpf_{set,get}sockopt
From: Tony Lu @ 2021-12-10 2:54 UTC (permalink / raw)
To: John Fastabend; +Cc: ast, daniel, andrii, bpf, netdev
In-Reply-To: <61b258ad273a9_6bfb2084d@john.notmuch>
On Thu, Dec 09, 2021 at 11:27:41AM -0800, John Fastabend wrote:
> Tony Lu wrote:
> > This patch set introduces a new option TCP_ULP for bpf_{set,get}sockopt
> > helper. The bpf prog can set and get TCP_ULP sock option on demand.
> >
> > With this, the bpf prog can set TCP_ULP based on strategies when socket
> > create or other's socket hook point. For example, the bpf prog can
> > control which socket should use tls or smc (WIP) ULP modules without
> > modifying the applications.
> >
> > Patch 1 replaces if statement with switch to make it easy to extend.
> >
> > Patch 2 introduces TCP_ULP sock option.
>
> Can you be a bit more specific on what ULP you are going to load on
> demand here and how that would work? For TLS I can't see how this will
> work, please elaborate. Because the user space side (e.g. openssl) behaves
> differently if running in kTLS vs uTLS modes I don't think you can
> from kernel side just flip it on? I'm a bit intrigued though on what
> might happen if we do did do this on an active socket, but seems it
> wouldn't be normal TLS with handshake and keys at that point? I'm
> not sure we need to block it from happening, but struggling to see
> how its useful at the moment.
>
> The smc case looks promising, but for that we need to get the order
> correct and merge smc first and then this series.
Yep, we are developing a set of patches that use smc for transparent
replacement. smc is compatible with TCP, so applications can be
switched to smc without side effects.
In most cases, it is impossible to modify the compiled application
binary or inject into applications' containers with LD_PRELOAD. So we
are using the smc ULP to replace TCP with smc at socket creation.
These patches will be sent out soon; I will send them after the smc
patches. Thank you.
>
> Also this will need a selftests.
I will fix it.
>
> Thanks,
> John
Thanks,
Tony Lu
^ permalink raw reply
* Re: [PATCH core-next] net/core: remove unneeded variable
From: Jakub Kicinski @ 2021-12-10 2:50 UTC (permalink / raw)
To: cgel.zte
Cc: davem, pablo, contact, justin.iurman, chi.minghao, netdev,
linux-kernel, Zeal Robot
In-Reply-To: <20211210022012.423994-1-chi.minghao@zte.com.cn>
On Fri, 10 Dec 2021 02:20:12 +0000 cgel.zte@gmail.com wrote:
> From: Minghao Chi <chi.minghao@zte.com.cn>
>
> Return status directly from function called.
>
> Reported-by: Zeal Robot <zealci@zte.com.cm>
> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
> ---
> net/core/lwtunnel.c | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index 2820aca2173a..c34248e358ac 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -63,11 +63,7 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
>
> struct lwtunnel_state *lwtunnel_state_alloc(int encap_len)
> {
> - struct lwtunnel_state *lws;
> -
> - lws = kzalloc(sizeof(*lws) + encap_len, GFP_ATOMIC);
> -
> - return lws;
> + return kzalloc(sizeof(*lws) + encap_len, GFP_ATOMIC);
> }
> EXPORT_SYMBOL_GPL(lwtunnel_state_alloc);
I don't think any of your "remove unneeded variable" patches are worth
applying, sorry.
This one doesn't even build.
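Presumably the build failure is that the new return statement still
uses sizeof(*lws) after the declaration of lws was removed. A minimal
userspace sketch of a version that would compile, assuming a
hypothetical stand-in struct and calloc() in place of kzalloc():

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for struct lwtunnel_state. */
struct demo_lwt_state {
	unsigned short type;
	unsigned short flags;
	unsigned short headroom;
	unsigned char data[];	/* encap payload follows the header */
};

static struct demo_lwt_state *demo_lwt_state_alloc(size_t encap_len)
{
	/*
	 * No local variable exists any more, so the size has to come
	 * from the type itself rather than from sizeof(*lws).
	 */
	return calloc(1, sizeof(struct demo_lwt_state) + encap_len);
}
```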
^ permalink raw reply
* Re: [PATCH] selftests: icmp_redirect: pass xfail=0 to log_test() for non-xfail cases
From: Jakub Kicinski @ 2021-12-10 2:46 UTC (permalink / raw)
To: Po-Hsu Lin; +Cc: netdev, linux-kselftest, linux-kernel, davem, skhan
In-Reply-To: <20211208071151.63971-1-po-hsu.lin@canonical.com>
On Wed, 8 Dec 2021 15:11:51 +0800 Po-Hsu Lin wrote:
> If any sub-test in this icmp_redirect.sh is failing but not expected
> to fail. The script will complain:
> ./icmp_redirect.sh: line 72: [: 1: unary operator expected
>
> This is because when the sub-test is not expected to fail, we won't
> pass any value for the xfail local variable in log_test() and thus
> it's empty. Fix this by passing 0 as the 4th variable to log_test()
> for non-xfail cases.
>
> Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com>
Thanks, could you please add a fixes tag (even if the breakage is only
present in linux-next) and CC David Ahern on v2?
^ permalink raw reply
* Re: [PATCH net-next] net: stmmac: bump tc when get underflow error from DMA descriptor
From: Jakub Kicinski @ 2021-12-10 2:41 UTC (permalink / raw)
To: Joakim Zhang, Xiaoliang Yang
Cc: davem@davemloft.net, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, peppe.cavallaro@st.com,
alexandre.torgue@foss.st.com, joabreu@synopsys.com,
Yannick Vignon, boon.leong.ong@intel.com, Jose.Abreu@synopsys.com,
mst@redhat.com, sonic.zhang@analog.com, Joao.Pinto@synopsys.com,
Mingkai Hu, Leo Li
In-Reply-To: <VI1PR04MB68009F16CAA80DCEBFA8F170E6709@VI1PR04MB6800.eurprd04.prod.outlook.com>
On Thu, 9 Dec 2021 01:31:52 +0000 Joakim Zhang wrote:
> > net: stmmac: bump tc when get underflow error from DMA descriptor
> >
> > In DMA threshold mode, frame underflow errors may sometimes occur
> > when the TC(threshold control) value is not enough. The TC value need to be
> > bumped up in this case.
> >
> > There is no underflow interrupt bit on DMA_CH(#i)_Status of dwmac4, so
> > the DMA threshold cannot be bumped up in stmmac_dma_interrupt(). The
> > i.mx8mp board observed an underflow error while running NFS boot, the
> > NFS rootfs could not be mounted.
> >
> > The underflow error can be got from the DMA descriptor TDES3 on dwmac4.
> > This patch bump up tc value once underflow error is got from TDES3.
> >
> > Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
>
> 5 queues with FIFO cut-through mode can work well after applying this patch.
This never worked, correct? It's not a regression fix?
^ permalink raw reply
* [net-next v3 2/2] net: sched: support hash/classid/cpuid selecting tx queue
From: xiangxia.m.yue @ 2021-12-10 2:36 UTC (permalink / raw)
To: netdev
Cc: Tonghao Zhang, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
Antoine Tenart, Wei Wang, Arnd Bergmann
In-Reply-To: <20211210023626.20905-1-xiangxia.m.yue@gmail.com>
From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
This patch allows users to select a queue_mapping range, from A to B.
Users can use the skb hash, cgroup classid, or cpuid to select Tx
queues, so that packets are load balanced across queues A to B.
Each bound of the range is an unsigned 16-bit value in decimal
format.
$ tc filter ... action skbedit queue_mapping hash-type normal A B
"skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit") is
enhanced with flags:
* SKBEDIT_F_QUEUE_MAPPING_HASH
* SKBEDIT_F_QUEUE_MAPPING_CLASSID
* SKBEDIT_F_QUEUE_MAPPING_CPUID
Use skb->hash, cgroup classid, or cpuid to distribute packets.
The same range of Tx queues can then be shared by different flows,
cgroups, or CPUs in a variety of scenarios.
For example, flows F1 may share range R1 with flows F2. The best
way to do that is to set the flag SKBEDIT_F_QUEUE_MAPPING_HASH.
If cgroup C1 shares R1 with cgroups C2 .. Cn, use
SKBEDIT_F_QUEUE_MAPPING_CLASSID. Of course, in some other scenario,
C1 can use R1 while Cn uses Rn.
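The range selection described above can be modelled in userspace as
follows. This is only a sketch: the demo_* names are hypothetical, the
flag values mirror the ones proposed in this patch, and the span is
kept in 32 bits here (an assumption of this sketch) so that a full
0..65535 range does not wrap to zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the SKBEDIT_F_QUEUE_MAPPING_* flags from this patch. */
#define DEMO_MAP_HASH		0x40u
#define DEMO_MAP_CLASSID	0x80u
#define DEMO_MAP_CPUID		0x100u
#define DEMO_MAP_MASK		(DEMO_MAP_HASH | DEMO_MAP_CLASSID | DEMO_MAP_CPUID)

/* Hash types are mutually exclusive: exactly one bit may be set. */
static bool demo_one_hash_type(uint32_t flags)
{
	uint32_t type = flags & DEMO_MAP_MASK;

	return type != 0 && (type & (type - 1)) == 0;
}

/* Number of queues in [qmin, qmax]; returns -1 if the range is invalid. */
static int demo_queue_span(uint16_t qmin, uint16_t qmax, uint32_t *span)
{
	if (qmax < qmin)
		return -1;
	*span = (uint32_t)qmax - qmin + 1;
	return 0;
}

/* Map a hash (skb hash, classid hash, or cpu id) into [qmin, qmax]. */
static uint16_t demo_pick_queue(uint16_t qmin, uint32_t span, uint32_t hash)
{
	return (uint16_t)(qmin + hash % span);
}
```

For a range of 2 to 5 the span is 4, so a hash of 7 selects queue 5 and
a hash of 8 wraps back to queue 2.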
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
include/net/tc_act/tc_skbedit.h | 1 +
include/uapi/linux/tc_act/tc_skbedit.h | 8 +++
net/sched/act_skbedit.c | 74 ++++++++++++++++++++++++--
3 files changed, 79 insertions(+), 4 deletions(-)
diff --git a/include/net/tc_act/tc_skbedit.h b/include/net/tc_act/tc_skbedit.h
index 00bfee70609e..ee96e0fa6566 100644
--- a/include/net/tc_act/tc_skbedit.h
+++ b/include/net/tc_act/tc_skbedit.h
@@ -17,6 +17,7 @@ struct tcf_skbedit_params {
u32 mark;
u32 mask;
u16 queue_mapping;
+ u16 mapping_mod;
u16 ptype;
struct rcu_head rcu;
};
diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h
index 800e93377218..5642b095d206 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -29,6 +29,13 @@
#define SKBEDIT_F_PTYPE 0x8
#define SKBEDIT_F_MASK 0x10
#define SKBEDIT_F_INHERITDSFIELD 0x20
+#define SKBEDIT_F_QUEUE_MAPPING_HASH 0x40
+#define SKBEDIT_F_QUEUE_MAPPING_CLASSID 0x80
+#define SKBEDIT_F_QUEUE_MAPPING_CPUID 0x100
+
+#define SKBEDIT_F_QUEUE_MAPPING_HASH_MASK (SKBEDIT_F_QUEUE_MAPPING_HASH | \
+ SKBEDIT_F_QUEUE_MAPPING_CLASSID | \
+ SKBEDIT_F_QUEUE_MAPPING_CPUID)
struct tc_skbedit {
tc_gen;
@@ -45,6 +52,7 @@ enum {
TCA_SKBEDIT_PTYPE,
TCA_SKBEDIT_MASK,
TCA_SKBEDIT_FLAGS,
+ TCA_SKBEDIT_QUEUE_MAPPING_MAX,
__TCA_SKBEDIT_MAX
};
#define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1)
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 498feedad70a..0b0d65d7112e 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -10,6 +10,7 @@
#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/rtnetlink.h>
+#include <net/cls_cgroup.h>
#include <net/netlink.h>
#include <net/pkt_sched.h>
#include <net/ip.h>
@@ -23,6 +24,37 @@
static unsigned int skbedit_net_id;
static struct tc_action_ops act_skbedit_ops;
+static u16 tcf_skbedit_hash(struct tcf_skbedit_params *params,
+ struct sk_buff *skb)
+{
+ u16 queue_mapping = params->queue_mapping;
+ u16 mapping_mod = params->mapping_mod;
+ u32 mapping_hash_type = params->flags &
+ SKBEDIT_F_QUEUE_MAPPING_HASH_MASK;
+ u32 hash = 0;
+
+ if (!mapping_hash_type)
+ return netdev_cap_txqueue(skb->dev, queue_mapping);
+
+ switch (mapping_hash_type) {
+ case SKBEDIT_F_QUEUE_MAPPING_CLASSID:
+ hash = jhash_1word(task_get_classid(skb), 0);
+ break;
+ case SKBEDIT_F_QUEUE_MAPPING_HASH:
+ hash = skb_get_hash(skb);
+ break;
+ case SKBEDIT_F_QUEUE_MAPPING_CPUID:
+ hash = raw_smp_processor_id();
+ break;
+ default:
+ net_warn_ratelimited("The type of queue_mapping hash is not supported. 0x%x\n",
+ mapping_hash_type);
+ }
+
+ queue_mapping = queue_mapping + hash % mapping_mod;
+ return netdev_cap_txqueue(skb->dev, queue_mapping);
+}
+
static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
struct tcf_result *res)
{
@@ -57,10 +89,9 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
break;
}
}
- if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
- skb->dev->real_num_tx_queues > params->queue_mapping) {
+ if (params->flags & SKBEDIT_F_QUEUE_MAPPING) {
netdev_xmit_skip_txqueue();
- skb_set_queue_mapping(skb, params->queue_mapping);
+ skb_set_queue_mapping(skb, tcf_skbedit_hash(params, skb));
}
if (params->flags & SKBEDIT_F_MARK) {
skb->mark &= ~params->mask;
@@ -94,6 +125,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
[TCA_SKBEDIT_PTYPE] = { .len = sizeof(u16) },
[TCA_SKBEDIT_MASK] = { .len = sizeof(u32) },
[TCA_SKBEDIT_FLAGS] = { .len = sizeof(u64) },
+ [TCA_SKBEDIT_QUEUE_MAPPING_MAX] = { .len = sizeof(u16) },
};
static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
@@ -110,6 +142,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
struct tcf_skbedit *d;
u32 flags = 0, *priority = NULL, *mark = NULL, *mask = NULL;
u16 *queue_mapping = NULL, *ptype = NULL;
+ u16 mapping_mod = 0;
bool exists = false;
int ret = 0, err;
u32 index;
@@ -154,7 +187,30 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
if (tb[TCA_SKBEDIT_FLAGS] != NULL) {
u64 *pure_flags = nla_data(tb[TCA_SKBEDIT_FLAGS]);
+ u64 mapping_hash_type = *pure_flags &
+ SKBEDIT_F_QUEUE_MAPPING_HASH_MASK;
+ if (mapping_hash_type) {
+ u16 *queue_mapping_max;
+
+ /* Hash types are mutually exclusive. */
+ if (mapping_hash_type & (mapping_hash_type - 1))
+ return -EINVAL;
+
+ if (!tb[TCA_SKBEDIT_QUEUE_MAPPING_MAX])
+ return -EINVAL;
+ if (!tb[TCA_SKBEDIT_QUEUE_MAPPING])
+ return -EINVAL;
+
+ queue_mapping_max =
+ nla_data(tb[TCA_SKBEDIT_QUEUE_MAPPING_MAX]);
+
+ if (*queue_mapping_max < *queue_mapping)
+ return -EINVAL;
+
+ mapping_mod = *queue_mapping_max - *queue_mapping + 1;
+ flags |= mapping_hash_type;
+ }
if (*pure_flags & SKBEDIT_F_INHERITDSFIELD)
flags |= SKBEDIT_F_INHERITDSFIELD;
}
@@ -206,8 +262,10 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
params_new->flags = flags;
if (flags & SKBEDIT_F_PRIORITY)
params_new->priority = *priority;
- if (flags & SKBEDIT_F_QUEUE_MAPPING)
+ if (flags & SKBEDIT_F_QUEUE_MAPPING) {
params_new->queue_mapping = *queue_mapping;
+ params_new->mapping_mod = mapping_mod;
+ }
if (flags & SKBEDIT_F_MARK)
params_new->mark = *mark;
if (flags & SKBEDIT_F_PTYPE)
@@ -274,6 +332,14 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct tc_action *a,
goto nla_put_failure;
if (params->flags & SKBEDIT_F_INHERITDSFIELD)
pure_flags |= SKBEDIT_F_INHERITDSFIELD;
+ if (params->flags & SKBEDIT_F_QUEUE_MAPPING_HASH_MASK) {
+ if (nla_put_u16(skb, TCA_SKBEDIT_QUEUE_MAPPING_MAX,
+ params->queue_mapping + params->mapping_mod - 1))
+ goto nla_put_failure;
+
+ pure_flags |= params->flags &
+ SKBEDIT_F_QUEUE_MAPPING_HASH_MASK;
+ }
if (pure_flags != 0 &&
nla_put(skb, TCA_SKBEDIT_FLAGS, sizeof(pure_flags), &pure_flags))
goto nla_put_failure;
--
2.27.0
^ permalink raw reply related
* [net-next v3 1/2] net: sched: use queue_mapping to pick tx queue
From: xiangxia.m.yue @ 2021-12-10 2:36 UTC (permalink / raw)
To: netdev
Cc: Tonghao Zhang, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
Antoine Tenart, Wei Wang, Arnd Bergmann
In-Reply-To: <20211210023626.20905-1-xiangxia.m.yue@gmail.com>
From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
This patch fixes the following issue:
* Tc filters with act_skbedit installed in the clsact hook do not
work, because netdev_core_pick_tx() overwrites queue_mapping.
$ tc filter ... action skbedit queue_mapping 1
This patch is also useful:
* We can use FQ + EDT to implement efficient policies. Tx queues
are picked by XPS, by the netdev driver's ndo_select_queue, or by the
skb hash in netdev_core_pick_tx(). In fact, the netdev driver and the
skb hash are _not_ under our control. XPS uses the CPU map to select Tx
queues, but in most cases we can't tell which pod/container's
task_struct is running on a given CPU. We can use clsact filters to
classify the traffic of one pod/container to one Tx queue. Why?
In a container networking environment, there are two kinds of
pod/container/net-namespace. For the first kind (e.g. P1, P2), high
throughput is key. But to avoid running out of network resources,
the outbound traffic of these pods is limited, and they use or share
dedicated Tx queues with an HTB/TBF/FQ Qdisc assigned. For the other kind
of pods (e.g. Pn), low latency of data access is key, and their traffic is
not limited. These pods use or share other dedicated Tx queues with a FIFO
Qdisc assigned.
This choice provides two benefits. First, contention on the HTB/FQ Qdisc
lock is significantly reduced since fewer CPUs contend for the same queue.
More importantly, Qdisc contention can be eliminated completely if each
CPU has its own FIFO Qdisc for the second kind of pods.
There must be a mechanism in place to classify traffic from
pods/containers to different Tx queues. Note that clsact is outside of the
Qdisc, while a Qdisc can run a classifier to select a sub-queue under the lock.
In general, recording the decision in the skb seems a little heavy-handed,
so this patch introduces a per-CPU variable instead, as suggested by Eric.
The skip-txqueue flag is cleared after use so that the queue mapping is not
applied again on the next netdev, for example (not the usual case):
eth0 (macvlan in Pod, skbedit queue_mapping) -> eth0.3 (vlan in Host)
-> eth0 (ixgbe in Host).
+----+ +----+ +----+
| P1 | | P2 | | Pn |
+----+ +----+ +----+
| | |
+-----------+-----------+
|
| clsact/skbedit
| MQ
v
+-----------+-----------+
| q0 | q1 | qn
v v v
HTB/FQ HTB/FQ ... FIFO
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
include/linux/netdevice.h | 21 +++++++++++++++++++++
net/core/dev.c | 6 +++++-
net/sched/act_skbedit.c | 4 +++-
3 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 65117f01d5f2..64f12a819246 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2997,6 +2997,7 @@ struct softnet_data {
/* written and read only by owning cpu: */
struct {
u16 recursion;
+ u8 skip_txqueue;
u8 more;
} xmit;
#ifdef CONFIG_RPS
@@ -4633,6 +4634,26 @@ static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_devi
return rc;
}
+static inline void netdev_xmit_skip_txqueue(void)
+{
+ __this_cpu_write(softnet_data.xmit.skip_txqueue, 1);
+}
+
+static inline bool netdev_xmit_txqueue_skipped(void)
+{
+ return __this_cpu_read(softnet_data.xmit.skip_txqueue);
+}
+
+static inline struct netdev_queue *
+netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
+{
+ int qm = skb_get_queue_mapping(skb);
+
+ /* Take effect only on current netdev. */
+ __this_cpu_write(softnet_data.xmit.skip_txqueue, 0);
+ return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
+}
+
int netdev_class_create_file_ns(const struct class_attribute *class_attr,
const void *ns);
void netdev_class_remove_file_ns(const struct class_attribute *class_attr,
diff --git a/net/core/dev.c b/net/core/dev.c
index aba8acc1238c..a64297a4cc89 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4069,7 +4069,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
else
skb_dst_force(skb);
- txq = netdev_core_pick_tx(dev, skb, sb_dev);
+ if (netdev_xmit_txqueue_skipped())
+ txq = netdev_tx_queue_mapping(dev, skb);
+ else
+ txq = netdev_core_pick_tx(dev, skb, sb_dev);
+
q = rcu_dereference_bh(txq->qdisc);
trace_net_dev_queue(skb);
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index d30ecbfc8f84..498feedad70a 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -58,8 +58,10 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
}
}
if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
- skb->dev->real_num_tx_queues > params->queue_mapping)
+ skb->dev->real_num_tx_queues > params->queue_mapping) {
+ netdev_xmit_skip_txqueue();
skb_set_queue_mapping(skb, params->queue_mapping);
+ }
if (params->flags & SKBEDIT_F_MARK) {
skb->mark &= ~params->mask;
skb->mark |= params->mark & params->mask;
--
2.27.0
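
The interplay between act_skbedit and __dev_queue_xmit() in this patch can be modelled in userspace C. This is a sketch only: the model_* types and functions are hypothetical stand-ins for the per-CPU softnet_data, struct sk_buff and struct net_device, and model_default_pick() is a placeholder for netdev_core_pick_tx().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for per-CPU state, skb and netdev. */
struct model_cpu { uint8_t skip_txqueue; };
struct model_skb { uint16_t queue_mapping; };
struct model_dev { uint16_t real_num_tx_queues; };

/* What act_skbedit does when the requested queue exists on the device. */
static void model_skbedit(struct model_cpu *cpu, struct model_skb *skb,
			  const struct model_dev *dev, uint16_t queue)
{
	if (dev->real_num_tx_queues > queue) {
		cpu->skip_txqueue = 1;
		skb->queue_mapping = queue;
	}
}

/* Placeholder for netdev_core_pick_tx(); always picks queue 0 here. */
static uint16_t model_default_pick(const struct model_dev *dev)
{
	(void)dev;
	return 0;
}

/* Queue choice in the transmit path: honour the mapping once, then clear. */
static uint16_t model_xmit_pick(struct model_cpu *cpu,
				const struct model_skb *skb,
				const struct model_dev *dev)
{
	assert(dev->real_num_tx_queues > 0);
	if (cpu->skip_txqueue) {
		uint16_t qm = skb->queue_mapping;

		cpu->skip_txqueue = 0; /* take effect only on current netdev */
		/* Out-of-range mappings fall back to queue 0. */
		return qm < dev->real_num_tx_queues ? qm : 0;
	}
	return model_default_pick(dev);
}
```

Because the flag is cleared on first use, a second transmit of the same skb through a stacked device goes back to the default queue selection, which is the behaviour the commit message describes for the macvlan -> vlan -> ixgbe case.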
* [net-next v3 0/2] net: sched: allow user to select txqueue
From: xiangxia.m.yue @ 2021-12-10 2:36 UTC (permalink / raw)
To: netdev
Cc: Tonghao Zhang, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
David S. Miller, Jakub Kicinski, Jonathan Lemon, Eric Dumazet,
Alexander Lobakin, Paolo Abeni, Talal Ahmad, Kevin Hao,
Ilias Apalodimas, Kees Cook, Kumar Kartikeya Dwivedi,
Antoine Tenart, Wei Wang, Arnd Bergmann
From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Patch 1 allows the user to select the txqueue in the clsact hook.
Patch 2 supports using skb-hash, classid or cpuid to select the txqueue.
Tonghao Zhang (2):
net: sched: use queue_mapping to pick tx queue
net: sched: support hash/classid/cpuid selecting tx queue
include/linux/netdevice.h | 21 +++++++
include/net/tc_act/tc_skbedit.h | 1 +
include/uapi/linux/tc_act/tc_skbedit.h | 8 +++
net/core/dev.c | 6 +-
net/sched/act_skbedit.c | 76 ++++++++++++++++++++++++--
5 files changed, 107 insertions(+), 5 deletions(-)
--
v2:
* 1/2 change skb->tc_skip_txqueue to a per-cpu var, add more to the
  commit message.
* 2/2 optimize the code.
v3:
* 2/2 fix the warning, add cpuid hash type.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
--
2.27.0
* [BPF PATCH for-next] cgroup/bpf: fast path for not loaded skb BPF filtering
From: Pavel Begunkov @ 2021-12-10 2:23 UTC (permalink / raw)
To: netdev, bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, linux-kernel, Pavel Begunkov
The cgroup_bpf_enabled_key static key guards against overhead in cases where
no cgroup bpf program of a specific type is loaded in any cgroup. It turns
out that's not always good enough, e.g. when there are many cgroups but
the ones we're interested in have no bpf attached. It's seen in server
environments, but the problem seems to be even wider, as apparently
systemd loads some BPF, affecting my laptop.
Profiles for small packet or zerocopy transmissions over a fast network
show __cgroup_bpf_run_filter_skb() taking 2-3%, 1% of which is from
migrate_disable/enable(), and similarly on the receiving side. Also
got +4-5% of throughput in local testing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/bpf-cgroup.h | 24 +++++++++++++++++++++---
kernel/bpf/cgroup.c | 23 +++++++----------------
2 files changed, 28 insertions(+), 19 deletions(-)
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 11820a430d6c..99b01201d7db 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -141,6 +141,9 @@ struct cgroup_bpf {
struct list_head progs[MAX_CGROUP_BPF_ATTACH_TYPE];
u32 flags[MAX_CGROUP_BPF_ATTACH_TYPE];
+ /* for each type tracks whether effective prog array is not empty */
+ unsigned long enabled_mask;
+
/* list of cgroup shared storages */
struct list_head storages;
@@ -219,11 +222,25 @@ int bpf_percpu_cgroup_storage_copy(struct bpf_map *map, void *key, void *value);
int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
void *value, u64 flags);
+static inline bool __cgroup_bpf_type_enabled(struct cgroup_bpf *cgrp_bpf,
+ enum cgroup_bpf_attach_type atype)
+{
+ return test_bit(atype, &cgrp_bpf->enabled_mask);
+}
+
+#define CGROUP_BPF_TYPE_ENABLED(sk, atype) \
+({ \
+ struct cgroup *__cgrp = sock_cgroup_ptr(&(sk)->sk_cgrp_data); \
+ \
+ __cgroup_bpf_type_enabled(&__cgrp->bpf, (atype)); \
+})
+
/* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb) \
({ \
int __ret = 0; \
- if (cgroup_bpf_enabled(CGROUP_INET_INGRESS)) \
+ if (cgroup_bpf_enabled(CGROUP_INET_INGRESS) && sk && \
+ CGROUP_BPF_TYPE_ENABLED((sk), CGROUP_INET_INGRESS)) \
__ret = __cgroup_bpf_run_filter_skb(sk, skb, \
CGROUP_INET_INGRESS); \
\
@@ -235,9 +252,10 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
int __ret = 0; \
if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \
typeof(sk) __sk = sk_to_full_sk(sk); \
- if (sk_fullsock(__sk)) \
+ if (sk_fullsock(__sk) && \
+ CGROUP_BPF_TYPE_ENABLED(__sk, CGROUP_INET_EGRESS)) \
__ret = __cgroup_bpf_run_filter_skb(__sk, skb, \
- CGROUP_INET_EGRESS); \
+ CGROUP_INET_EGRESS); \
} \
__ret; \
})
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 2ca643af9a54..28c8d0d6ea45 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -272,6 +272,11 @@ static void activate_effective_progs(struct cgroup *cgrp,
enum cgroup_bpf_attach_type atype,
struct bpf_prog_array *old_array)
{
+ if (!bpf_prog_array_is_empty(old_array))
+ set_bit(atype, &cgrp->bpf.enabled_mask);
+ else
+ clear_bit(atype, &cgrp->bpf.enabled_mask);
+
old_array = rcu_replace_pointer(cgrp->bpf.effective[atype], old_array,
lockdep_is_held(&cgroup_mutex));
/* free prog array after grace period, since __cgroup_bpf_run_*()
@@ -1354,20 +1359,6 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
}
#ifdef CONFIG_NET
-static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp,
- enum cgroup_bpf_attach_type attach_type)
-{
- struct bpf_prog_array *prog_array;
- bool empty;
-
- rcu_read_lock();
- prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
- empty = bpf_prog_array_is_empty(prog_array);
- rcu_read_unlock();
-
- return empty;
-}
-
static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
struct bpf_sockopt_buf *buf)
{
@@ -1430,7 +1421,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
* attached to the hook so we don't waste time allocating
* memory and locking the socket.
*/
- if (__cgroup_bpf_prog_array_is_empty(cgrp, CGROUP_SETSOCKOPT))
+ if (!__cgroup_bpf_type_enabled(&cgrp->bpf, CGROUP_SETSOCKOPT))
return 0;
/* Allocate a bit more than the initial user buffer for
@@ -1526,7 +1517,7 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
* attached to the hook so we don't waste time allocating
* memory and locking the socket.
*/
- if (__cgroup_bpf_prog_array_is_empty(cgrp, CGROUP_GETSOCKOPT))
+ if (!__cgroup_bpf_type_enabled(&cgrp->bpf, CGROUP_GETSOCKOPT))
return retval;
ctx.optlen = max_optlen;
--
2.34.0
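
The idea behind enabled_mask can be sketched in userspace C: the slow path that installs an effective program array also keeps one bit per attach type in sync, and the fast path only tests that bit. The model_* names and the bound MODEL_NR_TYPES are hypothetical, and plain bit operations are used here where the patch uses the kernel's atomic set_bit()/clear_bit()/test_bit().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_TYPES 32 /* hypothetical bound, must fit in unsigned long */

struct model_cgroup_bpf {
	size_t nr_progs[MODEL_NR_TYPES]; /* stand-in for effective arrays */
	unsigned long enabled_mask;      /* bit N set when type N has progs */
};

/* Slow path: install a new effective array and keep the bit in sync. */
static void model_activate(struct model_cgroup_bpf *b, unsigned int atype,
			   size_t nr_progs)
{
	assert(atype < MODEL_NR_TYPES);
	if (nr_progs)
		b->enabled_mask |= 1UL << atype;
	else
		b->enabled_mask &= ~(1UL << atype);
	b->nr_progs[atype] = nr_progs;
}

/* Fast path: a single load and bit test, no array dereference. */
static bool model_type_enabled(const struct model_cgroup_bpf *b,
			       unsigned int atype)
{
	return (b->enabled_mask & (1UL << atype)) != 0;
}
```

A caller would check model_type_enabled() before doing any per-packet work, which mirrors how the BPF_CGROUP_RUN_PROG_INET_* macros in the patch test CGROUP_BPF_TYPE_ENABLED() before calling __cgroup_bpf_run_filter_skb().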
* Re: [PATCH] bpf: return EOPNOTSUPP when JIT is needed and not possible
From: Jakub Kicinski @ 2021-12-10 2:23 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Ido Schimmel, John Fastabend, Thadeu Lima de Souza Cascardo, bpf,
netdev, ast, linux-kernel
In-Reply-To: <b294e66b-0bac-008b-52b4-6f1a90215baa@iogearbox.net>
On Fri, 10 Dec 2021 00:03:40 +0100 Daniel Borkmann wrote:
> > Similar issue was discussed in the past. See:
> > https://lore.kernel.org/netdev/20191204.125135.750458923752225025.davem@davemloft.net/
>
> With regards to ENOTSUPP exposure, if the consensus is that we should fix all
> occurrences over to EOPNOTSUPP even if they've been exposed for quite some time
> (Jakub?),
Did you mean me? :) In case you did - I think we should avoid it
for new code but changing existing now seems risky. Alexei and Andrii
would know best but quick search of code bases at work reveals some
scripts looking for ENOTSUPP.
Thadeu, what motivated the change?
If we're getting these fixes based on checkpatch output, maybe
there is a way to mute the checkpatch warnings when it's not run on a
diff?
> we could give this patch a try maybe via bpf-next and see if anyone complains.
>
> Thadeu, I think you also need to fix up BPF selftests, as test_verifier, to mention
> one example (there are also a bunch of others under tools/testing/selftests/), is
> checking for ENOTSUPP specifically.
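
For readers following the thread, a userspace check that tolerates both error codes might look like the sketch below. ENOTSUPP is a kernel-internal code with the value 524 that is not provided by the userspace errno headers, so programs that care about it define it themselves; the helper name err_is_not_supported() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal value; userspace headers do not define ENOTSUPP. */
#define KERNEL_ENOTSUPP 524

/*
 * Hypothetical helper: treat both codes as "not supported".
 * Takes a positive errno value, e.g. errno after a failed syscall.
 */
static bool err_is_not_supported(int err)
{
	assert(err >= 0);
	return err == EOPNOTSUPP || err == KERNEL_ENOTSUPP;
}
```

Code written this way keeps working whichever of the two values a given kernel returns, which is one way to reduce the risk discussed above.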
* [PATCH core-next] net/core: remove unneeded variable
From: cgel.zte @ 2021-12-10 2:20 UTC (permalink / raw)
To: davem
Cc: kuba, pablo, contact, justin.iurman, chi.minghao, netdev,
linux-kernel, Zeal Robot
From: Minghao Chi <chi.minghao@zte.com.cn>
Return status directly from function called.
Reported-by: Zeal Robot <zealci@zte.com.cm>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
---
net/core/lwtunnel.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 2820aca2173a..c34248e358ac 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -63,11 +63,7 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
struct lwtunnel_state *lwtunnel_state_alloc(int encap_len)
{
- struct lwtunnel_state *lws;
-
- lws = kzalloc(sizeof(*lws) + encap_len, GFP_ATOMIC);
-
- return lws;
+ return kzalloc(sizeof(struct lwtunnel_state) + encap_len, GFP_ATOMIC);
}
EXPORT_SYMBOL_GPL(lwtunnel_state_alloc);
--
2.25.1
* [PATCH] net/batman-adv:remove unneeded variable
From: cgel.zte @ 2021-12-10 2:19 UTC (permalink / raw)
To: mareklindner
Cc: sw, a, sven, davem, kuba, b.a.t.m.a.n, netdev, linux-kernel,
Minghao Chi, Zeal Robot
From: Minghao Chi <chi.minghao@zte.com.cn>
Return status directly from function called.
Reported-by: Zeal Robot <zealci@zte.com.cm>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
---
net/batman-adv/network-coding.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c
index 0a7f1d36a6a8..0c300476d335 100644
--- a/net/batman-adv/network-coding.c
+++ b/net/batman-adv/network-coding.c
@@ -58,13 +58,9 @@ static int batadv_nc_recv_coded_packet(struct sk_buff *skb,
*/
int __init batadv_nc_init(void)
{
- int ret;
-
/* Register our packet type */
- ret = batadv_recv_handler_register(BATADV_CODED,
+ return batadv_recv_handler_register(BATADV_CODED,
batadv_nc_recv_coded_packet);
-
- return ret;
}
/**
--
2.25.1
* [PATCH] ethernet:octeontx2:remove unneeded variable
From: cgel.zte @ 2021-12-10 2:16 UTC (permalink / raw)
To: sgoutham
Cc: lcherian, gakula, jerinj, hkelam, sbhatta, davem, kuba, netdev,
linux-kernel, Minghao Chi, Zeal Robot
From: Minghao Chi <chi.minghao@zte.com.cn>
Return status directly from function called.
Reported-by: Zeal Robot <zealci@zte.com.cm>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
---
drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
index 2ca182a4ce82..05694cd5ed15 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
@@ -815,12 +815,10 @@ int rvu_mbox_handler_cgx_features_get(struct rvu *rvu,
u32 rvu_cgx_get_fifolen(struct rvu *rvu)
{
struct mac_ops *mac_ops;
- u32 fifo_len;
mac_ops = get_mac_ops(rvu_first_cgx_pdata(rvu));
- fifo_len = mac_ops ? mac_ops->fifo_len : 0;
- return fifo_len;
+ return mac_ops ? mac_ops->fifo_len : 0;
}
static int rvu_cgx_config_intlbk(struct rvu *rvu, u16 pcifunc, bool en)
--
2.25.1
* Re: [PATCH bpf-next v2 1/3] net: Parse IPv6 ext headers from TCP sock_ops
From: Jakub Kicinski @ 2021-12-10 2:01 UTC (permalink / raw)
To: Mathieu Jadin
Cc: bpf, KP Singh, netdev, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Eric Dumazet, David S. Miller, Joe Stringer,
David Ahern, Hideaki YOSHIFUJI
In-Reply-To: <20211207225635.113904-1-mathjadin@gmail.com>
On Tue, 7 Dec 2021 23:56:33 +0100 Mathieu Jadin wrote:
> Add a flag that, if set, triggers the call of eBPF program for each
> packet holding an IPv6 extension header. Also add a sock_ops operator
> that identifies such call.
>
> This change uses skb_data and skb_data_end introduced for TCP options'
> parsing but these pointers cover the IPv6 header and its extension
> headers.
>
> For instance, this change allows an eBPF sock_ops program to
> read complex Segment Routing Headers carrying complex messages in TLV or
> to observe their intermediate segments as soon as they are received.
Can you share example use cases this opens up?
* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
From: Martin KaFai Lau @ 2021-12-10 1:37 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Willem de Bruijn, netdev, Alexei Starovoitov, David Miller,
Eric Dumazet, Jakub Kicinski, kernel-team
In-Reply-To: <b7989f8a-3f04-5186-a9f1-50f101575cfa@iogearbox.net>
On Thu, Dec 09, 2021 at 01:58:52PM +0100, Daniel Borkmann wrote:
> > Daniel, do you have suggestion on where to temporarily store
> > the forwarded EDT so that the bpf@ingress can access?
>
> Hm, was thinking maybe moving skb->skb_mstamp_ns into the shared info as
> in skb_hwtstamps(skb)->skb_mstamp_ns could work. In other words, as a union
> with hwtstamp to not bloat it further. And TCP stack as well as everything
> else (like sch_fq) could switch to it natively (hwtstamp might only be used
> on RX or TX completion from driver side if I'm not mistaken).
>
> But then while this would solve the netns transfer, we would run into the
> /same/ issue again when implementing a hairpinning LB where we loop from RX
> to TX given this would have to be cleared somewhere again if driver populates
> hwtstamp, so not really feasible and bloating shared info with a second
> tstamp would bump it by one cacheline. :(
If the edt is set at skb_hwtstamps,
skb->tstamp probably needs to be re-populated for the bpf@tc-egress
but should be minor since there is a skb_at_tc_ingress() test.
It seems fq does not need shinfo now, so that will be an extra cacheline to
bring... hmm
> A cleaner BUT still non-generic solution compared to the previous diff I could
> think of might be the below. So no change in behavior in general, but if the
> bpf@ingress@veth@host needs to access the original tstamp, it could do so
> via existing mapping we already have in BPF, and then it could transfer it
> for all or certain traffic (up to the prog) via BPF code setting ...
>
> skb->tstamp = skb->hwtstamp
>
> ... and do the redirect from there to the phys dev with BPF_F_KEEP_TSTAMP
> flag. Minimal intrusive, but unfortunately only accessible for BPF. Maybe use
> of skb_hwtstamps(skb)->nststamp could be extended though (?)
I like the idea of the possibility in temporarily storing a future mono EDT
in skb_shared_hwtstamps.
It may open up some possibilities. Not sure what that may look like yet
but I will try to develop on this.
I may have to separate the fwd-edt problem from __sk_buff->tstamp accessibility
@ingress to keep it simple first.
will try to make it generic also before scaling back to a bpf-specific solution.
Thanks for the code and the idea !
* [PATCH net-next v8 5/6] stmmac: dwmac-mediatek: add support for mt8195
From: Biao Huang @ 2021-12-10 1:31 UTC (permalink / raw)
To: davem, Jakub Kicinski, Rob Herring
Cc: Matthias Brugger, Giuseppe Cavallaro, Alexandre Torgue,
Jose Abreu, Maxime Coquelin, Biao Huang, netdev, devicetree,
linux-kernel, linux-arm-kernel, linux-mediatek, linux-stm32,
srv_heupstream, macpaul.lin, angelogioacchino.delregno, dkirjanov
In-Reply-To: <20211210013129.811-1-biao.huang@mediatek.com>
Add Ethernet support for MediaTek SoCs from the mt8195 family.
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Acked-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
---
.../ethernet/stmicro/stmmac/dwmac-mediatek.c | 253 +++++++++++++++++-
1 file changed, 252 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
index 8747aa4403e8..bbb94aeee104 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
@@ -39,6 +39,33 @@
#define ETH_FINE_DLY_GTXC BIT(1)
#define ETH_FINE_DLY_RXC BIT(0)
+/* Peri Configuration register for mt8195 */
+#define MT8195_PERI_ETH_CTRL0 0xFD0
+#define MT8195_RMII_CLK_SRC_INTERNAL BIT(28)
+#define MT8195_RMII_CLK_SRC_RXC BIT(27)
+#define MT8195_ETH_INTF_SEL GENMASK(26, 24)
+#define MT8195_RGMII_TXC_PHASE_CTRL BIT(22)
+#define MT8195_EXT_PHY_MODE BIT(21)
+#define MT8195_DLY_GTXC_INV BIT(12)
+#define MT8195_DLY_GTXC_ENABLE BIT(5)
+#define MT8195_DLY_GTXC_STAGES GENMASK(4, 0)
+
+#define MT8195_PERI_ETH_CTRL1 0xFD4
+#define MT8195_DLY_RXC_INV BIT(25)
+#define MT8195_DLY_RXC_ENABLE BIT(18)
+#define MT8195_DLY_RXC_STAGES GENMASK(17, 13)
+#define MT8195_DLY_TXC_INV BIT(12)
+#define MT8195_DLY_TXC_ENABLE BIT(5)
+#define MT8195_DLY_TXC_STAGES GENMASK(4, 0)
+
+#define MT8195_PERI_ETH_CTRL2 0xFD8
+#define MT8195_DLY_RMII_RXC_INV BIT(25)
+#define MT8195_DLY_RMII_RXC_ENABLE BIT(18)
+#define MT8195_DLY_RMII_RXC_STAGES GENMASK(17, 13)
+#define MT8195_DLY_RMII_TXC_INV BIT(12)
+#define MT8195_DLY_RMII_TXC_ENABLE BIT(5)
+#define MT8195_DLY_RMII_TXC_STAGES GENMASK(4, 0)
+
struct mac_delay_struct {
u32 tx_delay;
u32 rx_delay;
@@ -57,11 +84,13 @@ struct mediatek_dwmac_plat_data {
int num_clks_to_config;
bool rmii_clk_from_mac;
bool rmii_rxc;
+ bool mac_wol;
};
struct mediatek_dwmac_variant {
int (*dwmac_set_phy_interface)(struct mediatek_dwmac_plat_data *plat);
int (*dwmac_set_delay)(struct mediatek_dwmac_plat_data *plat);
+ void (*dwmac_fix_mac_speed)(void *priv, unsigned int speed);
/* clock ids to be requested */
const char * const *clk_list;
@@ -77,6 +106,10 @@ static const char * const mt2712_dwmac_clk_l[] = {
"axi", "apb", "mac_main", "ptp_ref", "rmii_internal"
};
+static const char * const mt8195_dwmac_clk_l[] = {
+ "axi", "apb", "mac_cg", "mac_main", "ptp_ref", "rmii_internal"
+};
+
static int mt2712_set_interface(struct mediatek_dwmac_plat_data *plat)
{
int rmii_clk_from_mac = plat->rmii_clk_from_mac ? RMII_CLK_SRC_INTERNAL : 0;
@@ -267,6 +300,204 @@ static const struct mediatek_dwmac_variant mt2712_gmac_variant = {
.tx_delay_max = 17600,
};
+static int mt8195_set_interface(struct mediatek_dwmac_plat_data *plat)
+{
+ int rmii_clk_from_mac = plat->rmii_clk_from_mac ? MT8195_RMII_CLK_SRC_INTERNAL : 0;
+ int rmii_rxc = plat->rmii_rxc ? MT8195_RMII_CLK_SRC_RXC : 0;
+ u32 intf_val = 0;
+
+ /* The clock labeled as "rmii_internal" in mt8195_dwmac_clk_l is needed
+ * only in RMII(when MAC provides the reference clock), and useless for
+ * RGMII/MII/RMII(when PHY provides the reference clock).
+ * num_clks_to_config indicates the real number of clocks should be
+ * configured, equals to (plat->variant->num_clks - 1) in default for all the case,
+ * then +1 for rmii_clk_from_mac case.
+ */
+ plat->num_clks_to_config = plat->variant->num_clks - 1;
+
+ /* select phy interface in top control domain */
+ switch (plat->phy_mode) {
+ case PHY_INTERFACE_MODE_MII:
+ intf_val |= FIELD_PREP(MT8195_ETH_INTF_SEL, PHY_INTF_MII);
+ break;
+ case PHY_INTERFACE_MODE_RMII:
+ if (plat->rmii_clk_from_mac)
+ plat->num_clks_to_config++;
+ intf_val |= (rmii_rxc | rmii_clk_from_mac);
+ intf_val |= FIELD_PREP(MT8195_ETH_INTF_SEL, PHY_INTF_RMII);
+ break;
+ case PHY_INTERFACE_MODE_RGMII:
+ case PHY_INTERFACE_MODE_RGMII_TXID:
+ case PHY_INTERFACE_MODE_RGMII_RXID:
+ case PHY_INTERFACE_MODE_RGMII_ID:
+ intf_val |= FIELD_PREP(MT8195_ETH_INTF_SEL, PHY_INTF_RGMII);
+ break;
+ default:
+ dev_err(plat->dev, "phy interface not supported\n");
+ return -EINVAL;
+ }
+
+ /* MT8195 only support external PHY */
+ intf_val |= MT8195_EXT_PHY_MODE;
+
+ regmap_write(plat->peri_regmap, MT8195_PERI_ETH_CTRL0, intf_val);
+
+ return 0;
+}
+
+static void mt8195_delay_ps2stage(struct mediatek_dwmac_plat_data *plat)
+{
+ struct mac_delay_struct *mac_delay = &plat->mac_delay;
+
+ /* 290ps per stage */
+ mac_delay->tx_delay /= 290;
+ mac_delay->rx_delay /= 290;
+}
+
+static void mt8195_delay_stage2ps(struct mediatek_dwmac_plat_data *plat)
+{
+ struct mac_delay_struct *mac_delay = &plat->mac_delay;
+
+ /* 290ps per stage */
+ mac_delay->tx_delay *= 290;
+ mac_delay->rx_delay *= 290;
+}
+
+static int mt8195_set_delay(struct mediatek_dwmac_plat_data *plat)
+{
+ struct mac_delay_struct *mac_delay = &plat->mac_delay;
+ u32 gtxc_delay_val = 0, delay_val = 0, rmii_delay_val = 0;
+
+ mt8195_delay_ps2stage(plat);
+
+ switch (plat->phy_mode) {
+ case PHY_INTERFACE_MODE_MII:
+ delay_val |= FIELD_PREP(MT8195_DLY_TXC_ENABLE, !!mac_delay->tx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_TXC_STAGES, mac_delay->tx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_TXC_INV, mac_delay->tx_inv);
+
+ delay_val |= FIELD_PREP(MT8195_DLY_RXC_ENABLE, !!mac_delay->rx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_RXC_STAGES, mac_delay->rx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_RXC_INV, mac_delay->rx_inv);
+ break;
+ case PHY_INTERFACE_MODE_RMII:
+ if (plat->rmii_clk_from_mac) {
+ /* case 1: mac provides the rmii reference clock,
+ * and the clock output to TXC pin.
+ * The egress timing can be adjusted by RMII_TXC delay macro circuit.
+ * The ingress timing can be adjusted by RMII_RXC delay macro circuit.
+ */
+ rmii_delay_val |= FIELD_PREP(MT8195_DLY_RMII_TXC_ENABLE,
+ !!mac_delay->tx_delay);
+ rmii_delay_val |= FIELD_PREP(MT8195_DLY_RMII_TXC_STAGES,
+ mac_delay->tx_delay);
+ rmii_delay_val |= FIELD_PREP(MT8195_DLY_RMII_TXC_INV,
+ mac_delay->tx_inv);
+
+ rmii_delay_val |= FIELD_PREP(MT8195_DLY_RMII_RXC_ENABLE,
+ !!mac_delay->rx_delay);
+ rmii_delay_val |= FIELD_PREP(MT8195_DLY_RMII_RXC_STAGES,
+ mac_delay->rx_delay);
+ rmii_delay_val |= FIELD_PREP(MT8195_DLY_RMII_RXC_INV,
+ mac_delay->rx_inv);
+ } else {
+ /* case 2: the rmii reference clock is from external phy,
+ * and the property "rmii_rxc" indicates which pin(TXC/RXC)
+ * the reference clk is connected to. The reference clock is a
+ * received signal, so rx_delay/rx_inv are used to indicate
+ * the reference clock timing adjustment
+ */
+ if (plat->rmii_rxc) {
+ /* the rmii reference clock from outside is connected
+ * to RXC pin, the reference clock will be adjusted
+ * by RXC delay macro circuit.
+ */
+ delay_val |= FIELD_PREP(MT8195_DLY_RXC_ENABLE,
+ !!mac_delay->rx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_RXC_STAGES,
+ mac_delay->rx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_RXC_INV,
+ mac_delay->rx_inv);
+ } else {
+ /* the rmii reference clock from outside is connected
+ * to TXC pin, the reference clock will be adjusted
+ * by TXC delay macro circuit.
+ */
+ delay_val |= FIELD_PREP(MT8195_DLY_TXC_ENABLE,
+ !!mac_delay->rx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_TXC_STAGES,
+ mac_delay->rx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_TXC_INV,
+ mac_delay->rx_inv);
+ }
+ }
+ break;
+ case PHY_INTERFACE_MODE_RGMII:
+ case PHY_INTERFACE_MODE_RGMII_TXID:
+ case PHY_INTERFACE_MODE_RGMII_RXID:
+ case PHY_INTERFACE_MODE_RGMII_ID:
+ gtxc_delay_val |= FIELD_PREP(MT8195_DLY_GTXC_ENABLE, !!mac_delay->tx_delay);
+ gtxc_delay_val |= FIELD_PREP(MT8195_DLY_GTXC_STAGES, mac_delay->tx_delay);
+ gtxc_delay_val |= FIELD_PREP(MT8195_DLY_GTXC_INV, mac_delay->tx_inv);
+
+ delay_val |= FIELD_PREP(MT8195_DLY_RXC_ENABLE, !!mac_delay->rx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_RXC_STAGES, mac_delay->rx_delay);
+ delay_val |= FIELD_PREP(MT8195_DLY_RXC_INV, mac_delay->rx_inv);
+
+ break;
+ default:
+ dev_err(plat->dev, "phy interface not supported\n");
+ return -EINVAL;
+ }
+
+ regmap_update_bits(plat->peri_regmap,
+ MT8195_PERI_ETH_CTRL0,
+ MT8195_RGMII_TXC_PHASE_CTRL |
+ MT8195_DLY_GTXC_INV |
+ MT8195_DLY_GTXC_ENABLE |
+ MT8195_DLY_GTXC_STAGES,
+ gtxc_delay_val);
+ regmap_write(plat->peri_regmap, MT8195_PERI_ETH_CTRL1, delay_val);
+ regmap_write(plat->peri_regmap, MT8195_PERI_ETH_CTRL2, rmii_delay_val);
+
+ mt8195_delay_stage2ps(plat);
+
+ return 0;
+}
+
+static void mt8195_fix_mac_speed(void *priv, unsigned int speed)
+{
+ struct mediatek_dwmac_plat_data *priv_plat = priv;
+
+ if ((phy_interface_mode_is_rgmii(priv_plat->phy_mode))) {
+ /* prefer 2ns fixed delay which is controlled by TXC_PHASE_CTRL,
+ * when link speed is 1Gbps with RGMII interface,
+ * Fall back to delay macro circuit for 10/100Mbps link speed.
+ */
+ if (speed == SPEED_1000)
+ regmap_update_bits(priv_plat->peri_regmap,
+ MT8195_PERI_ETH_CTRL0,
+ MT8195_RGMII_TXC_PHASE_CTRL |
+ MT8195_DLY_GTXC_ENABLE |
+ MT8195_DLY_GTXC_INV |
+ MT8195_DLY_GTXC_STAGES,
+ MT8195_RGMII_TXC_PHASE_CTRL);
+ else
+ mt8195_set_delay(priv_plat);
+ }
+}
+
+static const struct mediatek_dwmac_variant mt8195_gmac_variant = {
+ .dwmac_set_phy_interface = mt8195_set_interface,
+ .dwmac_set_delay = mt8195_set_delay,
+ .dwmac_fix_mac_speed = mt8195_fix_mac_speed,
+ .clk_list = mt8195_dwmac_clk_l,
+ .num_clks = ARRAY_SIZE(mt8195_dwmac_clk_l),
+ .dma_bit_mask = 35,
+ .rx_delay_max = 9280,
+ .tx_delay_max = 9280,
+};
+
static int mediatek_dwmac_config_dt(struct mediatek_dwmac_plat_data *plat)
{
struct mac_delay_struct *mac_delay = &plat->mac_delay;
@@ -307,6 +538,7 @@ static int mediatek_dwmac_config_dt(struct mediatek_dwmac_plat_data *plat)
mac_delay->rx_inv = of_property_read_bool(plat->np, "mediatek,rxc-inverse");
plat->rmii_rxc = of_property_read_bool(plat->np, "mediatek,rmii-rxc");
plat->rmii_clk_from_mac = of_property_read_bool(plat->np, "mediatek,rmii-clk-from-mac");
+ plat->mac_wol = of_property_read_bool(plat->np, "mediatek,mac-wol");
return 0;
}
@@ -383,6 +615,7 @@ static int mediatek_dwmac_clks_config(void *priv, bool enabled)
return ret;
}
+
static int mediatek_dwmac_probe(struct platform_device *pdev)
{
struct mediatek_dwmac_plat_data *priv_plat;
@@ -420,7 +653,7 @@ static int mediatek_dwmac_probe(struct platform_device *pdev)
return PTR_ERR(plat_dat);
plat_dat->interface = priv_plat->phy_mode;
- plat_dat->use_phy_wol = 1;
+ plat_dat->use_phy_wol = priv_plat->mac_wol ? 0 : 1;
plat_dat->riwt_off = 1;
plat_dat->maxmtu = ETH_DATA_LEN;
plat_dat->addr64 = priv_plat->variant->dma_bit_mask;
@@ -428,7 +661,23 @@ static int mediatek_dwmac_probe(struct platform_device *pdev)
plat_dat->init = mediatek_dwmac_init;
plat_dat->exit = mediatek_dwmac_exit;
plat_dat->clks_config = mediatek_dwmac_clks_config;
+ if (priv_plat->variant->dwmac_fix_mac_speed)
+ plat_dat->fix_mac_speed = priv_plat->variant->dwmac_fix_mac_speed;
+ plat_dat->safety_feat_cfg = devm_kzalloc(&pdev->dev,
+ sizeof(*plat_dat->safety_feat_cfg),
+ GFP_KERNEL);
+ if (!plat_dat->safety_feat_cfg)
+ return -ENOMEM;
+ plat_dat->safety_feat_cfg->tsoee = 1;
+ plat_dat->safety_feat_cfg->mrxpee = 0;
+ plat_dat->safety_feat_cfg->mestee = 1;
+ plat_dat->safety_feat_cfg->mrxee = 1;
+ plat_dat->safety_feat_cfg->mtxee = 1;
+ plat_dat->safety_feat_cfg->epsi = 0;
+ plat_dat->safety_feat_cfg->edpp = 1;
+ plat_dat->safety_feat_cfg->prtyen = 1;
+ plat_dat->safety_feat_cfg->tmouten = 1;
mediatek_dwmac_init(pdev, priv_plat);
ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
@@ -443,6 +692,8 @@ static int mediatek_dwmac_probe(struct platform_device *pdev)
static const struct of_device_id mediatek_dwmac_match[] = {
{ .compatible = "mediatek,mt2712-gmac",
.data = &mt2712_gmac_variant },
+ { .compatible = "mediatek,mt8195-gmac",
+ .data = &mt8195_gmac_variant },
{ }
};
--
2.25.1
* [PATCH net-next v8 4/6] net: dt-bindings: dwmac: Convert mediatek-dwmac to DT schema
From: Biao Huang @ 2021-12-10 1:31 UTC (permalink / raw)
To: davem, Jakub Kicinski, Rob Herring
Cc: Matthias Brugger, Giuseppe Cavallaro, Alexandre Torgue,
Jose Abreu, Maxime Coquelin, Biao Huang, netdev, devicetree,
linux-kernel, linux-arm-kernel, linux-mediatek, linux-stm32,
srv_heupstream, macpaul.lin, angelogioacchino.delregno, dkirjanov
In-Reply-To: <20211210013129.811-1-biao.huang@mediatek.com>
Convert mediatek-dwmac to DT schema, and delete the old mediatek-dwmac.txt.
The .yaml differs from the .txt in the following ways; everything else is
kept essentially the same:
1. Add "const: snps,dwmac-4.20a" to compatible.
2. Delete "snps,reset-active-low;" from the example, since the driver dropped
this property long ago.
3. Add "snps,reset-delays-us = <0 10000 10000>" to the example.
4. The example is for the RGMII interface, so keep only the related properties.
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
---
.../bindings/net/mediatek-dwmac.txt | 91 ----------
.../bindings/net/mediatek-dwmac.yaml | 156 ++++++++++++++++++
2 files changed, 156 insertions(+), 91 deletions(-)
delete mode 100644 Documentation/devicetree/bindings/net/mediatek-dwmac.txt
create mode 100644 Documentation/devicetree/bindings/net/mediatek-dwmac.yaml
diff --git a/Documentation/devicetree/bindings/net/mediatek-dwmac.txt b/Documentation/devicetree/bindings/net/mediatek-dwmac.txt
deleted file mode 100644
index afbcaebf062e..000000000000
--- a/Documentation/devicetree/bindings/net/mediatek-dwmac.txt
+++ /dev/null
@@ -1,91 +0,0 @@
-MediaTek DWMAC glue layer controller
-
-This file documents platform glue layer for stmmac.
-Please see stmmac.txt for the other unchanged properties.
-
-The device node has following properties.
-
-Required properties:
-- compatible: Should be "mediatek,mt2712-gmac" for MT2712 SoC
-- reg: Address and length of the register set for the device
-- interrupts: Should contain the MAC interrupts
-- interrupt-names: Should contain a list of interrupt names corresponding to
- the interrupts in the interrupts property, if available.
- Should be "macirq" for the main MAC IRQ
-- clocks: Must contain a phandle for each entry in clock-names.
-- clock-names: The name of the clock listed in the clocks property. These are
- "axi", "apb", "mac_main", "ptp_ref", "rmii_internal" for MT2712 SoC.
-- mac-address: See ethernet.txt in the same directory
-- phy-mode: See ethernet.txt in the same directory
-- mediatek,pericfg: A phandle to the syscon node that control ethernet
- interface and timing delay.
-
-Optional properties:
-- mediatek,tx-delay-ps: TX clock delay macro value. Default is 0.
- It should be defined for RGMII/MII interface.
- It should be defined for RMII interface when the reference clock is from MT2712 SoC.
-- mediatek,rx-delay-ps: RX clock delay macro value. Default is 0.
- It should be defined for RGMII/MII interface.
- It should be defined for RMII interface.
-Both delay properties need to be a multiple of 170 for RGMII interface,
-or will round down. Range 0~31*170.
-Both delay properties need to be a multiple of 550 for MII/RMII interface,
-or will round down. Range 0~31*550.
-
-- mediatek,rmii-rxc: boolean property, if present indicates that the RMII
- reference clock, which is from external PHYs, is connected to RXC pin
- on MT2712 SoC.
- Otherwise, is connected to TXC pin.
-- mediatek,rmii-clk-from-mac: boolean property, if present indicates that
- MT2712 SoC provides the RMII reference clock, which outputs to TXC pin only.
-- mediatek,txc-inverse: boolean property, if present indicates that
- 1. tx clock will be inversed in MII/RGMII case,
- 2. tx clock inside MAC will be inversed relative to reference clock
- which is from external PHYs in RMII case, and it rarely happen.
- 3. the reference clock, which outputs to TXC pin will be inversed in RMII case
- when the reference clock is from MT2712 SoC.
-- mediatek,rxc-inverse: boolean property, if present indicates that
- 1. rx clock will be inversed in MII/RGMII case.
- 2. reference clock will be inversed when arrived at MAC in RMII case, when
- the reference clock is from external PHYs.
- 3. the inside clock, which be sent to MAC, will be inversed in RMII case when
- the reference clock is from MT2712 SoC.
-- assigned-clocks: mac_main and ptp_ref clocks
-- assigned-clock-parents: parent clocks of the assigned clocks
-
-Example:
- eth: ethernet@1101c000 {
- compatible = "mediatek,mt2712-gmac";
- reg = <0 0x1101c000 0 0x1300>;
- interrupts = <GIC_SPI 237 IRQ_TYPE_LEVEL_LOW>;
- interrupt-names = "macirq";
- phy-mode ="rgmii-rxid";
- mac-address = [00 55 7b b5 7d f7];
- clock-names = "axi",
- "apb",
- "mac_main",
- "ptp_ref",
- "rmii_internal";
- clocks = <&pericfg CLK_PERI_GMAC>,
- <&pericfg CLK_PERI_GMAC_PCLK>,
- <&topckgen CLK_TOP_ETHER_125M_SEL>,
- <&topckgen CLK_TOP_ETHER_50M_SEL>,
- <&topckgen CLK_TOP_ETHER_50M_RMII_SEL>;
- assigned-clocks = <&topckgen CLK_TOP_ETHER_125M_SEL>,
- <&topckgen CLK_TOP_ETHER_50M_SEL>,
- <&topckgen CLK_TOP_ETHER_50M_RMII_SEL>;
- assigned-clock-parents = <&topckgen CLK_TOP_ETHERPLL_125M>,
- <&topckgen CLK_TOP_APLL1_D3>,
- <&topckgen CLK_TOP_ETHERPLL_50M>;
- power-domains = <&scpsys MT2712_POWER_DOMAIN_AUDIO>;
- mediatek,pericfg = <&pericfg>;
- mediatek,tx-delay-ps = <1530>;
- mediatek,rx-delay-ps = <1530>;
- mediatek,rmii-rxc;
- mediatek,txc-inverse;
- mediatek,rxc-inverse;
- snps,txpbl = <1>;
- snps,rxpbl = <1>;
- snps,reset-gpio = <&pio 87 GPIO_ACTIVE_LOW>;
- snps,reset-active-low;
- };
diff --git a/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml b/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml
new file mode 100644
index 000000000000..9207266a6e69
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml
@@ -0,0 +1,156 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/mediatek-dwmac.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek DWMAC glue layer controller
+
+maintainers:
+ - Biao Huang <biao.huang@mediatek.com>
+
+description:
+ This file documents platform glue layer for stmmac.
+
+# We need a select here so we don't match all nodes with 'snps,dwmac'
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - mediatek,mt2712-gmac
+ required:
+ - compatible
+
+allOf:
+ - $ref: "snps,dwmac.yaml#"
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - mediatek,mt2712-gmac
+ - const: snps,dwmac-4.20a
+
+ clocks:
+ items:
+ - description: AXI clock
+ - description: APB clock
+ - description: MAC Main clock
+ - description: PTP clock
+ - description: RMII reference clock provided by MAC
+
+ clock-names:
+ items:
+ - const: axi
+ - const: apb
+ - const: mac_main
+ - const: ptp_ref
+ - const: rmii_internal
+
+ mediatek,pericfg:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ The phandle to the syscon node that controls the ethernet
+ interface and timing delay.
+
+ mediatek,tx-delay-ps:
+ description:
+ The internal TX clock delay (provided by this driver) in picoseconds.
+ For MT2712 RGMII interface, the allowed value needs to be a multiple of 170,
+ or it will be rounded down. Range 0~31*170.
+ For MT2712 RMII/MII interface, the allowed value needs to be a multiple of 550,
+ or it will be rounded down. Range 0~31*550.
+
+ mediatek,rx-delay-ps:
+ description:
+ The internal RX clock delay (provided by this driver) in picoseconds.
+ For MT2712 RGMII interface, the allowed value needs to be a multiple of 170,
+ or it will be rounded down. Range 0~31*170.
+ For MT2712 RMII/MII interface, the allowed value needs to be a multiple of 550,
+ or it will be rounded down. Range 0~31*550.
+
+ mediatek,rmii-rxc:
+ type: boolean
+ description:
+ If present, indicates that the RMII reference clock, which is from external
+ PHYs, is connected to RXC pin. Otherwise, it is connected to TXC pin.
+
+ mediatek,rmii-clk-from-mac:
+ type: boolean
+ description:
+ If present, indicates that MAC provides the RMII reference clock, which
+ outputs to TXC pin only.
+
+ mediatek,txc-inverse:
+ type: boolean
+ description:
+ If present, indicates that
+ 1. tx clock will be inverted in MII/RGMII case,
+ 2. tx clock inside MAC will be inverted relative to reference clock
+ which is from external PHYs in RMII case, which rarely happens.
+ 3. the reference clock, which outputs to TXC pin, will be inverted in RMII case
+ when the reference clock is from MAC.
+
+ mediatek,rxc-inverse:
+ type: boolean
+ description:
+ If present, indicates that
+ 1. rx clock will be inverted in MII/RGMII case.
+ 2. reference clock will be inverted when it arrives at MAC in RMII case, when
+ the reference clock is from external PHYs.
+ 3. the internal clock, which is sent to MAC, will be inverted in RMII case when
+ the reference clock is from MAC.
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - interrupt-names
+ - clocks
+ - clock-names
+ - phy-mode
+ - mediatek,pericfg
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/mt2712-clk.h>
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/power/mt2712-power.h>
+
+ eth: ethernet@1101c000 {
+ compatible = "mediatek,mt2712-gmac", "snps,dwmac-4.20a";
+ reg = <0x1101c000 0x1300>;
+ interrupts = <GIC_SPI 237 IRQ_TYPE_LEVEL_LOW>;
+ interrupt-names = "macirq";
+ phy-mode = "rgmii-rxid";
+ mac-address = [00 55 7b b5 7d f7];
+ clock-names = "axi",
+ "apb",
+ "mac_main",
+ "ptp_ref",
+ "rmii_internal";
+ clocks = <&pericfg CLK_PERI_GMAC>,
+ <&pericfg CLK_PERI_GMAC_PCLK>,
+ <&topckgen CLK_TOP_ETHER_125M_SEL>,
+ <&topckgen CLK_TOP_ETHER_50M_SEL>,
+ <&topckgen CLK_TOP_ETHER_50M_RMII_SEL>;
+ assigned-clocks = <&topckgen CLK_TOP_ETHER_125M_SEL>,
+ <&topckgen CLK_TOP_ETHER_50M_SEL>,
+ <&topckgen CLK_TOP_ETHER_50M_RMII_SEL>;
+ assigned-clock-parents = <&topckgen CLK_TOP_ETHERPLL_125M>,
+ <&topckgen CLK_TOP_APLL1_D3>,
+ <&topckgen CLK_TOP_ETHERPLL_50M>;
+ power-domains = <&scpsys MT2712_POWER_DOMAIN_AUDIO>;
+ mediatek,pericfg = <&pericfg>;
+ mediatek,tx-delay-ps = <1530>;
+ snps,txpbl = <1>;
+ snps,rxpbl = <1>;
+ snps,reset-gpio = <&pio 87 GPIO_ACTIVE_LOW>;
+ snps,reset-delays-us = <0 10000 10000>;
+ };
--
2.25.1
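
The rounding rule that the binding describes for mediatek,tx-delay-ps and mediatek,rx-delay-ps on MT2712 can be illustrated with a small standalone sketch. This is not the driver's code: the helper names and macros below are hypothetical, and the handling of out-of-range values (returning -1) is an assumption, since the binding text only states the allowed range. The step sizes (170 for RGMII, 550 for RMII/MII) and the 0~31 stage range are taken from the binding description.

```c
#include <assert.h>
#include <stdint.h>

/* Values from the binding text; macro names are hypothetical. */
#define MT2712_DELAY_STAGE_MAX   31u   /* range is 0~31 steps */
#define MT2712_RGMII_STEP        170u  /* step for RGMII */
#define MT2712_MII_RMII_STEP     550u  /* step for MII/RMII */

/*
 * Hypothetical helper: map a requested delay to a stage index.
 * Values that are not a multiple of the step are rounded down.
 * Returns -1 for a delay above 31 steps (assumed policy, not from the source).
 */
static int delay_to_stage(uint32_t delay, uint32_t step)
{
	uint32_t stage;

	assert(step != 0);
	stage = delay / step;	/* integer division rounds down */
	if (stage > MT2712_DELAY_STAGE_MAX)
		return -1;
	return (int)stage;
}

/* Hypothetical helper: the delay actually applied after rounding, or -1. */
static int64_t effective_delay(uint32_t delay, uint32_t step)
{
	int stage = delay_to_stage(delay, step);

	if (stage < 0)
		return -1;
	return (int64_t)stage * step;
}
```

With the value from the example node, delay_to_stage(1530, MT2712_RGMII_STEP) gives stage 9 with no rounding, since 1530 = 9 * 170. The same request on an MII/RMII interface would round down to stage 2, i.e. an effective delay of 1100.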