* Re: [PATCH 29/38] cls_flower: Convert handle_idr to XArray
From: Matthew Wilcox @ 2019-08-21 0:50 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20190820.165834.1420751898749952901.davem@davemloft.net>
On Tue, Aug 20, 2019 at 04:58:34PM -0700, David Miller wrote:
> From: Matthew Wilcox <willy@infradead.org>
> Date: Tue, 20 Aug 2019 15:32:50 -0700
>
> > - idr_replace(&head->handle_idr, fnew, fnew->handle);
> > + xa_store(&head->filters, fnew->handle, fnew, 0);
>
> Passing a gfp_t of zero? :-)
Yes! We know we'll never do an allocation here because we're replacing
an entry that already exists. It wouldn't harm us to pass a real
GFP flag, so I'll probably just change that. It might help a future
implementation, and it will definitely save confusion.
^ permalink raw reply
* Re: [PATCH 23/38] cls_api: Convert tcf_net to XArray
From: Matthew Wilcox @ 2019-08-21 0:52 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20190820.165728.2062957580528299761.davem@davemloft.net>
On Tue, Aug 20, 2019 at 04:57:28PM -0700, David Miller wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> >
> > This module doesn't use the allocating functionality; convert it to a
> > plain XArray instead of an allocating one. I've left struct tcf_net
> > in place in case more objects are added to it in future, although
> > it now only contains an XArray. We don't need to call xa_destroy()
> > if the array is empty, so I've removed the contents of tcf_net_exit()
> > -- if it can be called with entries still in place, then it shoud call
> > xa_destroy() instead.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>
> I don't know if the net exit can be invoked with entires still in place,
> however if the tcf_net_exit() function is made empty it should be removed
> along with the assignment to the per-netns ops.
Thanks! I wasn't sure what the rule was for these ops. I'll fix that up.
^ permalink raw reply
* Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp
From: Alexander Duyck @ 2019-08-21 1:17 UTC (permalink / raw)
To: Ilya Maximets
Cc: Netdev, LKML, bpf, David S. Miller, Björn Töpel,
Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
Daniel Borkmann, Jeff Kirsher, intel-wired-lan, Eelco Chaudron,
William Tu
In-Reply-To: <625791af-c656-1e42-b60e-b3a5cedcb4c4@samsung.com>
On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> On 20.08.2019 18:35, Alexander Duyck wrote:
> > On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>
> >> Tx code doesn't clear the descriptor status after cleaning.
> >> So, if the budget is larger than number of used elems in a ring, some
> >> descriptors will be accounted twice and xsk_umem_complete_tx will move
> >> prod_tail far beyond the prod_head breaking the comletion queue ring.
> >>
> >> Fix that by limiting the number of descriptors to clean by the number
> >> of used descriptors in the tx ring.
> >>
> >> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> >> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> >
> > I'm not sure this is the best way to go. My preference would be to
> > have something in the ring that would prevent us from racing which I
> > don't think this really addresses. I am pretty sure this code is safe
> > on x86 but I would be worried about weak ordered systems such as
> > PowerPC.
> >
> > It might make sense to look at adding the eop_desc logic like we have
> > in the regular path with a proper barrier before we write it and after
> > we read it. So for example we could hold of on writing the bytecount
> > value until the end of an iteration and call smp_wmb before we write
> > it. Then on the cleanup we could read it and if it is non-zero we take
> > an smp_rmb before proceeding further to process the Tx descriptor and
> > clearing the value. Otherwise this code is going to just keep popping
> > up with issues.
>
> But, unlike regular case, xdp zero-copy xmit and clean for particular
> tx ring always happens in the same NAPI context and even on the same
> CPU core.
>
> I saw the 'eop_desc' manipulations in regular case and yes, we could
> use 'next_to_watch' field just as a flag of descriptor existence,
> but it seems unnecessarily complicated. Am I missing something?
>
So is it always in the same NAPI context?. I forgot, I was thinking
that somehow the socket could possibly make use of XDP for transmit.
As far as the logic to use I would be good with just using a value you
are already setting such as the bytecount value. All that would need
to happen is to guarantee that the value is cleared in the Tx path. So
if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
theoretically just use that as well to flag that a descriptor has been
populated and is ready to be cleaned. Assuming the logic about this
all being in the same NAPI context anyway you wouldn't need to mess
with the barrier stuff I mentioned before.
- Alex
^ permalink raw reply
* Re: [RFC 1/4] Add usb_get_address and usb_set_address support
From: Andrew Lunn @ 2019-08-21 1:22 UTC (permalink / raw)
To: Charles.Hyde
Cc: linux-usb, linux-acpi, gregkh, Mario.Limonciello, oliver, netdev,
nic_swsd
In-Reply-To: <1566339522507.45056@Dellteam.com>
On Tue, Aug 20, 2019 at 10:18:42PM +0000, Charles.Hyde@dellteam.com wrote:
> The core USB driver message.c is missing get/set address functionality
> that stops ifconfig from being able to push MAC addresses out to USB
> based ethernet devices. Without this functionality, some USB devices
> stop responding to ethernet packets when using ifconfig to change MAC
> addresses.
Hi Charles
ifconfig has been deprecated for years, maybe a decade. Please
reference the current tools, iproute2.
Andrew
^ permalink raw reply
* Re: [PATCH v1] ocfs2/dlm: Move BITS_TO_BYTES() to bitops.h for wider use
From: Joseph Qi @ 2019-08-21 1:29 UTC (permalink / raw)
To: Andy Shevchenko, Mark Fasheh, Joel Becker, ocfs2-devel,
Ariel Elior, Sudarsana Kalluru, GR-everest-linux-l2,
David S. Miller, netdev, Colin Ian King
In-Reply-To: <20190820163112.50818-1-andriy.shevchenko@linux.intel.com>
On 19/8/21 00:31, Andy Shevchenko wrote:
> There are users already and will be more of BITS_TO_BYTES() macro.
> Move it to bitops.h for wider use.
>
> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> ---
> drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h | 1 -
> fs/ocfs2/dlm/dlmcommon.h | 4 ----
> include/linux/bitops.h | 1 +
> 3 files changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> index 066765fbef06..0a59a09ef82f 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> @@ -296,7 +296,6 @@ static inline void bnx2x_dcb_config_qm(struct bnx2x *bp, enum cos_mode mode,
> * possible, the driver should only write the valid vnics into the internal
> * ram according to the appropriate port mode.
> */
> -#define BITS_TO_BYTES(x) ((x)/8)>
I don't think this is a equivalent replace, or it is in fact
wrong before?
> /* CMNG constants, as derived from system spec calculations */
>
> diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
> index aaf24548b02a..0463dce65bb2 100644
> --- a/fs/ocfs2/dlm/dlmcommon.h
> +++ b/fs/ocfs2/dlm/dlmcommon.h
> @@ -688,10 +688,6 @@ struct dlm_begin_reco
> __be32 pad2;
> };
>
> -
> -#define BITS_PER_BYTE 8
> -#define BITS_TO_BYTES(bits) (((bits)+BITS_PER_BYTE-1)/BITS_PER_BYTE)
> -
For ocfs2 part, it looks good to me.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> struct dlm_query_join_request
> {
> u8 node_idx;
> diff --git a/include/linux/bitops.h b/include/linux/bitops.h
> index cf074bce3eb3..79d80f5ddf7b 100644
> --- a/include/linux/bitops.h
> +++ b/include/linux/bitops.h
> @@ -5,6 +5,7 @@
> #include <linux/bits.h>
>
> #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
> +#define BITS_TO_BYTES(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE)
> #define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
>
> extern unsigned int __sw_hweight8(unsigned int w);
>
^ permalink raw reply
* Re: [RFC 3/4] Move ACPI functionality out of r8152 driver
From: Andrew Lunn @ 2019-08-21 1:30 UTC (permalink / raw)
To: Charles.Hyde
Cc: linux-usb, linux-acpi, gregkh, Mario.Limonciello, oliver, netdev,
nic_swsd
In-Reply-To: <1566339738195.2913@Dellteam.com>
On Tue, Aug 20, 2019 at 10:22:18PM +0000, Charles.Hyde@dellteam.com wrote:
> This change moves ACPI functionality out of the Realtek r8152 driver to
> its own source and header file, making it available to other drivers as
> needed now and into the future. At the time this ACPI snippet was
> introduced in 2016, only the Realtek driver made use of it in support of
> Dell's enterprise IT policy efforts. There comes now a need for this
> same support in a different driver, also in support of Dell's enterprise
> IT policy efforts.
>
> Signed-off-by: Charles Hyde <charles.hyde@dellteam.com>
> Cc: Mario Limonciello <mario.limonciello@dell.com>
> Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
> Cc: linux-usb@vger.kernel.org
> Cc: linux-acpi@vger.kernel.org
> ---
> drivers/net/usb/r8152.c | 44 ++++-------------------------------------
> lib/Makefile | 3 ++-
Hi Charles
I think your forgot to add the new files?
Andrew
^ permalink raw reply
* [v4] ocelot_ace: fix action of trap
From: Yangbo Lu @ 2019-08-21 1:59 UTC (permalink / raw)
To: netdev, David S . Miller, Allan W . Nielsen, Alexandre Belloni,
Microchip Linux Driver Support, Andrew Lunn
Cc: Yangbo Lu
The trap action should be copying the frame to CPU and
dropping it for forwarding, but current setting was just
copying frame to CPU.
Fixes: b596229448dd ("net: mscc: ocelot: Add support for tcam")
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Acked-by: Allan W. Nielsen <allan.nielsen@microchip.com>
---
Changes for v4:
- Added ACK and Fixes info in commit message.
Changes for v3:
- Set MASK_MODE to 1 for dropping forwarding.
- Dropped other patches of patch-set.
Changes for v2:
- None.
---
drivers/net/ethernet/mscc/ocelot_ace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mscc/ocelot_ace.c b/drivers/net/ethernet/mscc/ocelot_ace.c
index 39aca1a..86fc6e6 100644
--- a/drivers/net/ethernet/mscc/ocelot_ace.c
+++ b/drivers/net/ethernet/mscc/ocelot_ace.c
@@ -317,7 +317,7 @@ static void is2_action_set(struct vcap_data *data,
break;
case OCELOT_ACL_ACTION_TRAP:
VCAP_ACT_SET(PORT_MASK, 0x0);
- VCAP_ACT_SET(MASK_MODE, 0x0);
+ VCAP_ACT_SET(MASK_MODE, 0x1);
VCAP_ACT_SET(POLICE_ENA, 0x0);
VCAP_ACT_SET(POLICE_IDX, 0x0);
VCAP_ACT_SET(CPU_QU_NUM, 0x0);
--
2.7.4
^ permalink raw reply related
* Re: [PATCH 1/1] net: rds: add service level support in rds-info
From: Zhu Yanjun @ 2019-08-21 2:10 UTC (permalink / raw)
To: Doug Ledford, davem, netdev, linux-rdma, rds-devel
In-Reply-To: <f3de2e40f1bc2eb219d3056ee954747db90dbbb4.camel@redhat.com>
Hi,Doug
My reply is in line.
On 2019/8/20 23:28, Doug Ledford wrote:
> On Mon, 2019-08-19 at 20:52 -0400, Zhu Yanjun wrote:
>> diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
>> index fd6b5f6..cba368e 100644
>> --- a/include/uapi/linux/rds.h
>> +++ b/include/uapi/linux/rds.h
>> @@ -250,6 +250,7 @@ struct rds_info_rdma_connection {
>> __u32 rdma_mr_max;
>> __u32 rdma_mr_size;
>> __u8 tos;
>> + __u8 sl;
>> __u32 cache_allocs;
>> };
>>
>> @@ -265,6 +266,7 @@ struct rds6_info_rdma_connection {
>> __u32 rdma_mr_max;
>> __u32 rdma_mr_size;
>> __u8 tos;
>> + __u8 sl;
>> __u32 cache_allocs;
>> };
>>
> This is a user space API break (as was the prior patch mentioned
> below)...
>
>> The commit fe3475af3bdf ("net: rds: add per rds connection cache
>> statistics") adds cache_allocs in struct rds_info_rdma_connection
>> as below:
>> struct rds_info_rdma_connection {
>> ...
>> __u32 rdma_mr_max;
>> __u32 rdma_mr_size;
>> __u8 tos;
>> __u32 cache_allocs;
>> };
>> The peer struct in rds-tools of struct rds_info_rdma_connection is as
>> below:
>> struct rds_info_rdma_connection {
>> ...
>> uint32_t rdma_mr_max;
>> uint32_t rdma_mr_size;
>> uint8_t tos;
>> uint8_t sl;
>> uint32_t cache_allocs;
>> };
> Why are the user space rds tools not using the kernel provided abi
> files?
Perhaps it is a long story.
>
> In order to know if this ABI breakage is safe, we need to know what
> versions of rds-tools are out in the wild and have their own headers
> that we need to match up with.
From my works in LAB and in the customer's host, rds-tools 2.0.7 is the
popular
version. Other versions rds-tools are used less.
> Are there any versions of rds-tools that
> actually use the kernel provided headers?
"the kernel provided headers", do you mean include/uapi/linux/rds.h?
I checked the rds-tools source code. I do not find any version of
rds-tools us this header files.
> Are there any other users of
> uapi/linux/rds.h besides rds-tools?
Not sure. But in Oracle, there are some rds applications. I am not sure
whether these rds applications
will use include/uapi/linux/rds.h file or not.
I will investigate it.
>
> Once the kernel and rds-tools package are in sync,
After this commit is merged into mailine, the kernel and rds-tools
package are in sync.
I will make investigations about rds-tools using the kernel header
include/uapi/linux/rds.h.
Thanks a lot for your comments.
Zhu Yanjun
> rds-tools needs to be
> modified to use the kernel header and proper ABI maintenance needs to be
> started.
>
^ permalink raw reply
* [PATCHv3 0/2] fix dev null pointer dereference when send packets larger than mtu in collect_md mode
From: Hangbin Liu @ 2019-08-21 2:09 UTC (permalink / raw)
To: netdev
Cc: Stefano Brivio, wenxu, Alexei Starovoitov, David S . Miller,
Eric Dumazet, Hangbin Liu
In-Reply-To: <20190815060904.19426-1-liuhangbin@gmail.com>
When we send a packet larger than PMTU, we need to reply with
icmp_send(ICMP_FRAG_NEEDED) or icmpv6_send(ICMPV6_PKT_TOOBIG).
But with collect_md mode, kernel will crash while accessing the dst dev
as __metadata_dst_init() init dst->dev to NULL by default. Here is what
the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv4
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmp_send
- net = dev_net(rt->dst.dev); <-- here
- ip6gre_xmit_ipv6
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmpv6_send
...
- decode_session4
- oif = skb_dst(skb)->dev->ifindex; <-- here
- decode_session6
- oif = skb_dst(skb)->dev->ifindex; <-- here
We could not fix it in __metadata_dst_init() as there is no dev supplied.
Look in to the __icmp_send()/decode_session{4,6} code we could find the dst
dev is actually not needed. In __icmp_send(), we could get the net by skb->dev.
For decode_session{4,6}, as it was called by xfrm_decode_session_reverse()
in this scenario, the oif is not used by
fl4->flowi4_oif = reverse ? skb->skb_iif : oif;
The reproducer is easy:
ovs-vsctl add-br br0
ip link set br0 up
ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=$dst_addr
ip link set gre0 up
ip addr add ${local_gre6}/64 dev br0
ping6 $remote_gre6 -s 1500
The kernel will crash like
[40595.821651] BUG: kernel NULL pointer dereference, address: 0000000000000108
[40595.822411] #PF: supervisor read access in kernel mode
[40595.822949] #PF: error_code(0x0000) - not-present page
[40595.823492] PGD 0 P4D 0
[40595.823767] Oops: 0000 [#1] SMP PTI
[40595.824139] CPU: 0 PID: 2831 Comm: handler12 Not tainted 5.2.0 #57
[40595.824788] Hardware name: Red Hat KVM, BIOS 1.11.1-3.module+el8.1.0+2983+b2ae9c0a 04/01/2014
[40595.825680] RIP: 0010:__xfrm_decode_session+0x6b/0x930
[40595.826219] Code: b7 c0 00 00 00 b8 06 00 00 00 66 85 d2 0f b7 ca 48 0f 45 c1 44 0f b6 2c 06 48 8b 47 58 48 83 e0 fe 0f 84 f4 04 00 00 48 8b 00 <44> 8b 80 08 01 00 00 41 f6 c4 01 4c 89 e7
ba 58 00 00 00 0f 85 47
[40595.828155] RSP: 0018:ffffc90000a73438 EFLAGS: 00010286
[40595.828705] RAX: 0000000000000000 RBX: ffff8881329d7100 RCX: 0000000000000000
[40595.829450] RDX: 0000000000000000 RSI: ffff8881339e70ce RDI: ffff8881329d7100
[40595.830191] RBP: ffffc90000a73470 R08: 0000000000000000 R09: 000000000000000a
[40595.830936] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000a73490
[40595.831682] R13: 000000000000002c R14: ffff888132ff1301 R15: ffff8881329d7100
[40595.832427] FS: 00007f5bfcfd6700(0000) GS:ffff88813ba00000(0000) knlGS:0000000000000000
[40595.833266] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40595.833883] CR2: 0000000000000108 CR3: 000000013a368000 CR4: 00000000000006f0
[40595.834633] Call Trace:
[40595.835392] ? rt6_multipath_hash+0x4c/0x390
[40595.835853] icmpv6_route_lookup+0xcb/0x1d0
[40595.836296] ? icmpv6_xrlim_allow+0x3e/0x140
[40595.836751] icmp6_send+0x537/0x840
[40595.837125] icmpv6_send+0x20/0x30
[40595.837494] tnl_update_pmtu.isra.27+0x19d/0x2a0 [ip_tunnel]
[40595.838088] ip_md_tunnel_xmit+0x1b6/0x510 [ip_tunnel]
[40595.838633] gre_tap_xmit+0x10c/0x160 [ip_gre]
[40595.839103] dev_hard_start_xmit+0x93/0x200
[40595.839551] sch_direct_xmit+0x101/0x2d0
[40595.839967] __dev_queue_xmit+0x69f/0x9c0
[40595.840399] do_execute_actions+0x1717/0x1910 [openvswitch]
[40595.840987] ? validate_set.isra.12+0x2f5/0x3d0 [openvswitch]
[40595.841596] ? reserve_sfa_size+0x31/0x130 [openvswitch]
[40595.842154] ? __ovs_nla_copy_actions+0x1b4/0xad0 [openvswitch]
[40595.842778] ? __kmalloc_reserve.isra.50+0x2e/0x80
[40595.843285] ? should_failslab+0xa/0x20
[40595.843696] ? __kmalloc+0x188/0x220
[40595.844078] ? __alloc_skb+0x97/0x270
[40595.844472] ovs_execute_actions+0x47/0x120 [openvswitch]
[40595.845041] ovs_packet_cmd_execute+0x27d/0x2b0 [openvswitch]
[40595.845648] genl_family_rcv_msg+0x3a8/0x430
[40595.846101] genl_rcv_msg+0x47/0x90
[40595.846476] ? __alloc_skb+0x83/0x270
[40595.846866] ? genl_family_rcv_msg+0x430/0x430
[40595.847335] netlink_rcv_skb+0xcb/0x100
[40595.847777] genl_rcv+0x24/0x40
[40595.848113] netlink_unicast+0x17f/0x230
[40595.848535] netlink_sendmsg+0x2ed/0x3e0
[40595.848951] sock_sendmsg+0x4f/0x60
[40595.849323] ___sys_sendmsg+0x2bd/0x2e0
[40595.849733] ? sock_poll+0x6f/0xb0
[40595.850098] ? ep_scan_ready_list.isra.14+0x20b/0x240
[40595.850634] ? _cond_resched+0x15/0x30
[40595.851032] ? ep_poll+0x11b/0x440
[40595.851401] ? _copy_to_user+0x22/0x30
[40595.851799] __sys_sendmsg+0x58/0xa0
[40595.852180] do_syscall_64+0x5b/0x190
[40595.852574] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[40595.853105] RIP: 0033:0x7f5c00038c7d
[40595.853489] Code: c7 20 00 00 75 10 b8 2e 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 8e f7 ff ff 48 89 04 24 b8 2e 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 d7 f7 ff ff 48 89
d0 48 83 c4 08 48 3d 01
[40595.855443] RSP: 002b:00007f5bfcf73c00 EFLAGS: 00003293 ORIG_RAX: 000000000000002e
[40595.856244] RAX: ffffffffffffffda RBX: 00007f5bfcf74a60 RCX: 00007f5c00038c7d
[40595.856990] RDX: 0000000000000000 RSI: 00007f5bfcf73c60 RDI: 0000000000000015
[40595.857736] RBP: 0000000000000004 R08: 0000000000000b7c R09: 0000000000000110
[40595.858613] R10: 0001000800050004 R11: 0000000000003293 R12: 000055c2d8329da0
[40595.859401] R13: 00007f5bfcf74120 R14: 0000000000000347 R15: 00007f5bfcf73c60
[40595.860185] Modules linked in: ip_gre ip_tunnel gre openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sunrpc bochs_drm ttm drm_kms_helper drm pcspkr joydev i2c_piix4 qemu_fw_cfg xfs libcrc32c virtio_net net_failover serio_raw failover ata_generic virtio_blk pata_acpi floppy
[40595.863155] CR2: 0000000000000108
[40595.863551] ---[ end trace 22209bbcacb4addd ]---
v3: only replace pkg to packets in cover letter. So I didn't update the version
info in the follow up patches.
v2: fix it in __icmp_send() and decode_session{4,6} separately instead of
updating shared dst dev in {ip_md, ip6}_tunnel_xmit.
Hangbin Liu (2):
ipv4/icmp: fix rt dst dev null pointer dereference
xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md
mode
net/ipv4/icmp.c | 5 ++++-
net/xfrm/xfrm_policy.c | 4 ++--
2 files changed, 6 insertions(+), 3 deletions(-)
--
2.19.2
^ permalink raw reply
* [PATCHv3 1/2] ipv4/icmp: fix rt dst dev null pointer dereference
From: Hangbin Liu @ 2019-08-21 2:09 UTC (permalink / raw)
To: netdev
Cc: Stefano Brivio, wenxu, Alexei Starovoitov, David S . Miller,
Eric Dumazet, Hangbin Liu
In-Reply-To: <20190821020954.21165-1-liuhangbin@gmail.com>
In __icmp_send() there is a possibility that the rt->dst.dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv4
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmp_send
- net = dev_net(rt->dst.dev); <-- here
The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the reason we need rt->dst.dev is to get the net.
So we can just get it from skb->dev, just like commit 8d9336704521
("ipv6: make icmp6_send() robust against null skb->dev") did.
Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
net/ipv4/icmp.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1510e951f451..5f00c9d18b02 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -582,7 +582,10 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
if (!rt)
goto out;
- net = dev_net(rt->dst.dev);
+
+ if (!skb_in->dev)
+ goto out;
+ net = dev_net(skb_in->dev);
/*
* Find the original header. It is expected to be valid, of course.
--
2.19.2
^ permalink raw reply related
* [PATCHv3 2/2] xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode
From: Hangbin Liu @ 2019-08-21 2:09 UTC (permalink / raw)
To: netdev
Cc: Stefano Brivio, wenxu, Alexei Starovoitov, David S . Miller,
Eric Dumazet, Hangbin Liu
In-Reply-To: <20190821020954.21165-1-liuhangbin@gmail.com>
In decode_session{4,6} there is a possibility that the skb dst dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv6
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmpv6_send
- icmpv6_route_lookup
- xfrm_decode_session_reverse
- decode_session4
- oif = skb_dst(skb)->dev->ifindex; <-- here
- decode_session6
- oif = skb_dst(skb)->dev->ifindex; <-- here
The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the skb_dst(skb)->dev is actually not needed as we
called decode_session{4,6} via xfrm_decode_session_reverse(), so oif is not
used by: fl4->flowi4_oif = reverse ? skb->skb_iif : oif;
So make a dst dev check here should be clean and safe.
Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
net/xfrm/xfrm_policy.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 8ca637a72697..ec94f5795ea4 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -3269,7 +3269,7 @@ decode_session4(struct sk_buff *skb, struct flowi *fl, bool reverse)
struct flowi4 *fl4 = &fl->u.ip4;
int oif = 0;
- if (skb_dst(skb))
+ if (skb_dst(skb) && skb_dst(skb)->dev)
oif = skb_dst(skb)->dev->ifindex;
memset(fl4, 0, sizeof(struct flowi4));
@@ -3387,7 +3387,7 @@ decode_session6(struct sk_buff *skb, struct flowi *fl, bool reverse)
nexthdr = nh[nhoff];
- if (skb_dst(skb))
+ if (skb_dst(skb) && skb_dst(skb)->dev)
oif = skb_dst(skb)->dev->ifindex;
memset(fl6, 0, sizeof(struct flowi6));
--
2.19.2
^ permalink raw reply related
* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-21 2:42 UTC (permalink / raw)
To: Cornelia Huck
Cc: Christophe de Dinechin, Alex Williamson, Jiri Pirko,
David S . Miller, Kirti Wankhede, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, cjia, netdev@vger.kernel.org
In-Reply-To: <20190820183106.1680d0d9.cohuck@redhat.com>
> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Tuesday, August 20, 2019 10:01 PM
> > > > Option-1: mdev index
> > > > Introduce an optional mdev index/handle as u32 during mdev create
> time.
> > > > User passes mdev index/handle as input.
> > > >
> > > > phys_port_name=mIndex=m%u
> > > > mdev_index will be available in sysfs as mdev attribute for udev
> > > > to name the
> > > mdev's netdev.
> > > >
> > > > example mdev create command:
> > > > UUID=$(uuidgen)
> > > > echo $UUID index=10 >
> > > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
> > > > example netdevs:
> > > > repnetdev=ens2f0_m10 /*ens2f0 is parent PF's netdevice */
> > > > mdev_netdev=enm10
> > > >
> > > > Pros:
> > > > 1. mdevctl and any other existing tools are unaffected.
> > > > 2. netdev stack, ovs and other switching platforms are unaffected.
> > > > 3. achieves unique phys_port_name for representor netdev 4.
> > > > achieves unique mdev eth netdev name for the mdev using udev/systemd
> extension.
> > > > 5. Aligns well with mdev and netdev subsystem and similar to
> > > > existing sriov
> > > bdf's.
> > > >
> > > > Option-2: shorter mdev name
> > > > Extend mdev to have shorter mdev device name in addition to UUID.
> > > > such as 'foo', 'bar'.
> > > > Mdev will continue to have UUID.
>
> I fail to understand how 'uses uuid' and 'allow shorter device name'
> are supposed to play together?
>
Each mdev will have uuid as today. Instead of naming device based on UUID, name it based on explicit name given by the user.
Again, I want to repeat, this name parameter is optional.
> > > > phys_port_name=mdev_name
> > > >
> > > > Pros:
> > > > 1. All same as option-1, except mdevctl needs upgrade for newer usage.
> > > > It is common practice to upgrade iproute2 package along with the kernel.
> > > > Similar practice to be done with mdevctl.
> > > > 2. Newer users of mdevctl who wants to work with non_UUID names,
> > > > will use
> > > newer mdevctl/tools.
> > > > Cons:
> > > > 1. Dual naming scheme of mdev might affect some of the existing tools.
> > > > It's unclear how/if it actually affects.
> > > > mdevctl [2] is very recently developed and can be enhanced for
> > > > dual naming
> > > scheme.
>
> The main problem is not tools we know about (i.e. mdevctl), but those we don't
> know about.
>
Well, if it not part of the distros, there is very little can do about it by kernel.
I tried mdevctl with mdev named using non UUID and it were able to list them.
> IOW, this (and the IFNAMESIZ change, which seems even worse) are the
> options I would not want at all.
>
Ok.
> > > >
> > > > Option-3: mdev uuid alias
> > > > Instead of shorter mdev name or mdev index, have alpha-numeric
> > > > name
> > > alias.
> > > > Alias is an optional mdev sysfs attribute such as 'foo', 'bar'.
> > > > example mdev create command:
> > > > UUID=$(uuidgen)
> > > > echo $UUID alias=foo >
> > > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
> > > > example netdevs:
> > > > examle netdevs:
> > > > repnetdev = ens2f0_mfoo
> > > > mdev_netdev=enmfoo
> > > >
> > > > Pros:
> > > > 1. All same as option-1.
> > > > 2. Doesn't affect existing mdev naming scheme.
> > > > Cons:
> > > > 1. Index scheme of option-1 is better which can number large
> > > > number of
> > > mdevs with fewer characters, simplifying the management tool.
> > >
> > > I believe that Alex pointed out another "Cons" to all three options,
> > > which is that it forces user-space to resolve potential race
> > > conditions when creating an index or short name or alias.
> > >
> > This race condition exists for at least two subsystems that I know of, i.e.
> netdev and rdma.
> > If a device with a given name exists, subsystem returns error.
> > When user space gets error code EEXIST, and it can picks up different
> identifier(s).
>
> If you decouple device creation and setting the alias/index, you make the issue
> visible and thus much more manageable.
>
I thought about it. It has two issues.
1. user should be able to set this only once. Repeatedly setting it requires changing/notifying it.
2. setting alias translating in creating devlink port doesn't sound correct.
Because if user attempts to reset to different value, it required unregistration, reregistration.
All of such race conditions handling it not worth it.
So setting the index, I liked Alex's term more 'instance number', at instance creation time is lot more simple.
> >
> > > Also, what happens if `index=10` is not provided on the command-line?
> > > Does that make the device unusable for your purpose?
> > Yes, it is unusable to an extent.
> > Currently we have DEVLINK_PORT_FLAVOUR_PCI_VF in
> > include/uapi/linux/devlink.h Similar to it, we need to have
> DEVLINK_PORT_FLAVOUR_MDEV for mdev eswitch ports.
> > This port flavour needs to generate phys_port_name(). This should be user
> parameter driven.
> > Because representor netdevice name is generated based on this parameter.
>
> I'm also unsure how the extra parameter is supposed to work; writing it to the
> create attribute does not sound right.
>
Why? When you create a device it takes multiple mandatory and optional parameters.
This is common for netdev (vxlan, vlan, macvlan, ipvlan, gre and more).
> mdevctl supports setting additional parameters on an already created device
> (see the examples provided for vfio-ap), so going that route would actually
> work out of the box from the tooling side.
>
I explained that setting and re-setting attributes for instance create time value is not worth.
> What you would need is some kind of synchronization/locking to make sure that
> you only link up to the other device after the extra attribute has been set and
> that you don't allow to change it as long as it is associated with the other side. I
> do not know enough about the actual devices to suggest something here; if you
> need userspace cooperation, maybe uevents would be an option.
^ permalink raw reply
* Re: VRF notes when using ipv6 and flushing tables.
From: David Ahern @ 2019-08-21 3:02 UTC (permalink / raw)
To: Ben Greear, netdev
In-Reply-To: <8977a25e-29c1-5375-cc97-950dc7c2eb0f@candelatech.com>
On 8/20/19 2:27 PM, Ben Greear wrote:
> I recently spend a few days debugging what in the end was user error on
> my part.
>
> Here are my notes in hope they help someone else.
>
> First, 'ip -6 route show vrf vrfX' will not show some of the
> routes (like local routes) that will show up with
> 'ip -6 route show table X', where X == vrfX's table-id
>
> If you run 'ip -6 route flush table X', then you will loose all of the auto
> generated routes, including anycast, ff00::/8, and local routes.
>
> ff00::/8 is needed for neigh discovery to work (probably among other
> things)
>
> local route is needed or packets won't actually be accepted up the stack
> (I think that is the symptom at least)
>
> Not sure exactly what anycast does, but I'm guessing it is required for
> something useful.
>
> You must manually re-add those to the table unless you for certain know
> that
> you do not need them for whatever reason.
>
sorry you went through such a long and painful debugging session.
yes, the kernel doc for VRF needs to be updated that 'ip route show vrf
X' and 'ip route show table X' are different ('show vrf' mimics the main
table in not showing local, broadcast, anycast; 'table vrf' shows all).
A suggestion for others: the documentation and selftests directory have
a lot of VRF examples now. If something basic is not working (e.g., arp
or neigh discovery), see if it works there and if so compare the outputs
of the route table along the way.
^ permalink raw reply
* [PATCH] net: Add the same IP detection for duplicate address.
From: Dongxu Liu @ 2019-08-21 3:20 UTC (permalink / raw)
To: davem, kuznet, yoshfuji, netdev, linux-kernel
The network sends an ARP REQUEST packet to determine
whether there is a host with the same IP.
Windows and some other hosts may send the source IP
address instead of 0.
When IN_DEV_ORCONF(in_dev, DROP_GRATUITOUS_ARP) is enable,
the REQUEST will be dropped.
When IN_DEV_ORCONF(in_dev, DROP_GRATUITOUS_ARP) is disable,
The case should be added to the IP conflict handling process.
Signed-off-by: Dongxu Liu <liudongxu3@huawei.com>
---
net/ipv4/arp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 05eb42f..a51c921 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -801,7 +801,7 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
GFP_ATOMIC);
/* Special case: IPv4 duplicate address detection packet (RFC2131) */
- if (sip == 0) {
+ if (sip == 0 || sip == tip) {
if (arp->ar_op == htons(ARPOP_REQUEST) &&
inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL &&
!arp_ignore(in_dev, sip, tip))
--
2.12.3
^ permalink raw reply related
* Re: [PATCH net-next 3/6] net: dsa: Delete the VID from the upstream port as well
From: Vivien Didelot @ 2019-08-21 3:30 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Florian Fainelli, Andrew Lunn, Ido Schimmel, Roopa Prabhu,
nikolay, David S. Miller, netdev
In-Reply-To: <CA+h21hodsDTwPHY1pxQA-ucu6FU7rkOQa7Y4HJGZC0fRd8zmDA@mail.gmail.com>
On Wed, 21 Aug 2019 01:09:39 +0300, Vladimir Oltean <olteanv@gmail.com> wrote:
> I mean I made an argument already for the hack in 4/6 ("Don't program
> the VLAN as pvid on the upstream port"). If the hack gets accepted
> like that, I have no further need of any change in the implicit VLAN
> configuration. But it's still a hack, so in that sense it would be
> nicer to not need it and have a better amount of control.
How come you simply cannot ignore the PVID flag for the CPU port in the
driver directly, as mv88e6xxx does in preference of the Marvell specific
"unmodified" mode? What PVID are you programming on the CPU port already?
Thanks,
Vivien
^ permalink raw reply
* Re: [pull request][net-next v2 00/16] Mellanox, mlx5 devlink RX health reporters
From: Jakub Kicinski @ 2019-08-21 3:36 UTC (permalink / raw)
To: Saeed Mahameed; +Cc: David S. Miller, netdev@vger.kernel.org
In-Reply-To: <20190820202352.2995-1-saeedm@mellanox.com>
On Tue, 20 Aug 2019 20:24:10 +0000, Saeed Mahameed wrote:
> This patchset introduces changes in mlx5 devlink health reporters.
> The highlight of these changes is adding a new reporter: RX reporter
>
> mlx5 RX reporter: reports and recovers from timeouts and RX completion
> error.
>
> 1) Perform TX reporter cleanup. In order to maintain the
> code flow as similar as possible between RX and TX reporters, start the
> set with cleanup.
>
> 2) Prepare for code sharing, generalize and move shared
> functionality.
>
> 3) Refactor and extend TX reporter diagnostics information
> to align the TX reporter diagnostics output with the RX reporter's
> diagnostics output.
>
> 4) Add helper functions Patch 11: Add RX reporter, initially
> supports only the diagnostics call back.
>
> 5) Change ICOSQ (Internal Operations Send Queue) open/close flow to
> avoid race between interface down and completion error recovery.
>
> 6) Introduce recovery flows for RX ring population timeout on ICOSQ,
> and for completion errors on ICOSQ and on RQ (Regular receive queues).
>
> 7) Include RX reporters in mlx5 documentation.
>
> 8) Last two patches of this series, are trivial fixes for previously
> submitted patches on this release cycle.
Not really something I can competently ack, but FWIW doesn't raise any
red flags for me.
^ permalink raw reply
* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-21 3:42 UTC (permalink / raw)
To: Alex Williamson
Cc: Jiri Pirko, David S . Miller, Kirti Wankhede, Cornelia Huck,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org, cjia,
netdev@vger.kernel.org
In-Reply-To: <20190820111904.75515f58@x1.home>
> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, August 20, 2019 10:49 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> <cohuck@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
>
> On Tue, 20 Aug 2019 08:58:02 +0000
> Parav Pandit <parav@mellanox.com> wrote:
>
> > + Dave.
> >
> > Hi Jiri, Dave, Alex, Kirti, Cornelia,
> >
> > Please provide your feedback on it, how shall we proceed?
> >
> > Short summary of requirements.
> > For a given mdev (mediated device [1]), there is one representor
> > netdevice and devlink port in switchdev mode (similar to SR-IOV VF),
> > And there is one netdevice for the actual mdev when mdev is probed.
> >
> > (a) representor netdev and devlink port should be able derive
> > phys_port_name(). So that representor netdev name can be built
> > deterministically across reboots.
> >
> > (b) for mdev's netdevice, mdev's device should have an attribute.
> > This attribute can be used by udev rules/systemd or something else to
> > rename netdev name deterministically.
> >
> > (c) IFNAMSIZ of 16 bytes is too small to fit whole UUID.
> > A simple grep IFNAMSIZ in stack hints hundreds of users of IFNAMSIZ in
> > drivers, uapi, netlink, boot config area and more. Changing IFNAMSIZ
> > for a mdev bus doesn't really look reasonable option to me.
>
> How many characters do we really have to work with? Your examples below
> prepend various characters, ex. option-1 results in ens2f0_m10 or enm10. Do
> the extra 8 or 3 characters in these count against IFNAMSIZ?
>
Maximum 15. Last is null termination.
Some udev rules setting by user prefix the PF netdev interface. I took such example below where ens2f0 netdev named is prefixed.
Some prefer not to prefix.
> > Hence, I would like to discuss below options.
> >
> > Option-1: mdev index
> > Introduce an optional mdev index/handle as u32 during mdev create
> > time. User passes mdev index/handle as input.
> >
> > phys_port_name=mIndex=m%u
> > mdev_index will be available in sysfs as mdev attribute for udev to
> > name the mdev's netdev.
> >
> > example mdev create command:
> > UUID=$(uuidgen)
> > echo $UUID index=10
> > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
>
> Nit, IIRC previous discussions of additional parameters used comma separators,
> ex. echo $UUID,index=10 >...
>
Yes, ok.
> > > example netdevs:
> > repnetdev=ens2f0_m10 /*ens2f0 is parent PF's netdevice */
>
> Is the parent really relevant in the name?
No. I just picked one udev example who prefixed the parent netdev name.
But there are users who do not prefix it.
> Tools like mdevctl are meant to
> provide persistence, creating the same mdev devices on the same parent, but
> that's simply the easiest policy decision. We can also imagine that multiple
> parent devices might support a specified mdev type and policies factoring in
> proximity, load-balancing, power consumption, etc might be weighed such that
> we really don't want to promote userspace creating dependencies on the
> parent association.
>
> > mdev_netdev=enm10
> >
> > Pros:
> > 1. mdevctl and any other existing tools are unaffected.
> > 2. netdev stack, ovs and other switching platforms are unaffected.
> > 3. achieves unique phys_port_name for representor netdev 4. achieves
> > unique mdev eth netdev name for the mdev using udev/systemd extension.
> > 5. Aligns well with mdev and netdev subsystem and similar to existing
> > sriov bdf's.
>
> A user provided index seems strange to me. It's not really an index, just a user
> specified instance number. Presumably you have the user providing this
> because if it really were an index, then the value depends on the creation order
> and persistence is lost. Now the user needs to both avoid uuid collision as well
> as "index" number collision. The uuid namespace is large enough to mostly
> ignore this, but this is not. This seems like a burden.
>
I liked the term 'instance number', which is lot better way to say than index/handle.
Yes, user needs to avoid both the collision.
UUID collision should not occur in most cases, they way UUID are generated.
So practically users needs to pick unique 'instance number', similar to how it picks unique netdev names.
Burden to user comes from the requirement to get uniqueness.
> > Option-2: shorter mdev name
> > Extend mdev to have shorter mdev device name in addition to UUID.
> > such as 'foo', 'bar'.
> > Mdev will continue to have UUID.
> > phys_port_name=mdev_name
> >
> > Pros:
> > 1. All same as option-1, except mdevctl needs upgrade for newer usage.
> > It is common practice to upgrade iproute2 package along with the
> > kernel. Similar practice to be done with mdevctl.
> > 2. Newer users of mdevctl who wants to work with non_UUID names, will
> > use newer mdevctl/tools. Cons:
> > 1. Dual naming scheme of mdev might affect some of the existing tools.
> > It's unclear how/if it actually affects.
> > mdevctl [2] is very recently developed and can be enhanced for dual
> > naming scheme.
>
> I think we've already nak'ed this one, the device namespace becomes
> meaningless if the name becomes just a string where a uuid might be an
> example string. mdevs are named by uuid.
>
> > Option-3: mdev uuid alias
> > Instead of shorter mdev name or mdev index, have alpha-numeric name
> > alias. Alias is an optional mdev sysfs attribute such as 'foo', 'bar'.
> > example mdev create command:
> > UUID=$(uuidgen)
> > echo $UUID alias=foo
> > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
> > > example netdevs:
> > examle netdevs:
> > repnetdev = ens2f0_mfoo
> > mdev_netdev=enmfoo
> >
> > Pros:
> > 1. All same as option-1.
> > 2. Doesn't affect existing mdev naming scheme.
> > Cons:
> > 1. Index scheme of option-1 is better which can number large number of
> > mdevs with fewer characters, simplifying the management tool.
>
> No better than option-1, simply a larger secondary namespace, but still
> requires the user to come up with two independent names for the device.
>
> > Option-4: extend IFNAMESZ to be 64 bytes Extended IFNAMESZ from 16 to
> > 64 bytes phys_port_name=mdev_UUID_string mdev_netdev_name=enmUUID
> >
> > Pros:
> > 1. Doesn't require mdev extension
> > Cons:
> > 1. netdev stack, driver, uapi, user space, boot config wide changes 2.
> > Possible user space extensions who assumed name size being 16
> > characters 3. Single device type demands namesize change for all
> > netdev types
>
> What about an alias based on the uuid? For example, we use 160-bit sha1s
> daily with git (uuids are only 128-bit), but we generally don't reference git
> commits with the full 20 character string. Generally 12 characters is
> recommended to avoid ambiguity. Could mdev automatically create an
> abbreviated sha1 alias for the device? If so, how many characters should we
> use and what do we do on collision? The colliding device could add enough
> alias characters to disambiguate (we likely couldn't re-alias the existing device
> to disambiguate, but I'm not sure it matters, userspace has sysfs to associate
> aliases). Ex.
>
> UUID=$(uuidgen)
> ALIAS=$(echo $UUID | sha1sum | colrm 13)
>
I explained in previous reply to Cornelia, we should set UUID and ALIAS at the same time.
Setting is via different sysfs attribute is lot code burden with no extra benefit.
> Since there seems to be some prefix overhead, as I ask about above in how
> many characters we actually have to work with in IFNAMESZ, maybe we start
> with 8 characters (matching your "index" namespace) and expand as necessary
> for disambiguation. If we can eliminate overhead in IFNAMESZ, let's start with
> 12. Thanks,
>
If user is going to choose the alias, why does it have to be limited to sha1?
Or you just told it as an example?
It can be an alpha-numeric string.
Instead of mdev imposing number of characters on the alias, it should be best left to the user.
Because in future if netdev improves on the naming scheme, mdev will be limiting it, which is not right.
So not restricting alias size seems right to me.
User configuring mdev for networking devices in a given kernel knows what user is doing.
So user can choose alias name size as it finds suitable.
> Alex
^ permalink raw reply
* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-21 3:57 UTC (permalink / raw)
To: Cornelia Huck, Alex Williamson
Cc: Jiri Pirko, David S . Miller, Kirti Wankhede, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, cjia, netdev@vger.kernel.org
In-Reply-To: <20190820195519.47d6fd6a.cohuck@redhat.com>
> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Tuesday, August 20, 2019 11:25 PM
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: Parav Pandit <parav@mellanox.com>; Jiri Pirko <jiri@mellanox.com>;
> David S . Miller <davem@davemloft.net>; Kirti Wankhede
> <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
>
> On Tue, 20 Aug 2019 11:19:04 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
> > What about an alias based on the uuid? For example, we use 160-bit
> > sha1s daily with git (uuids are only 128-bit), but we generally don't
> > reference git commits with the full 20 character string. Generally 12
> > characters is recommended to avoid ambiguity. Could mdev
> > automatically create an abbreviated sha1 alias for the device? If so,
> > how many characters should we use and what do we do on collision? The
> > colliding device could add enough alias characters to disambiguate (we
> > likely couldn't re-alias the existing device to disambiguate, but I'm
> > not sure it matters, userspace has sysfs to associate aliases). Ex.
> >
> > UUID=$(uuidgen)
> > ALIAS=$(echo $UUID | sha1sum | colrm 13)
> >
> > Since there seems to be some prefix overhead, as I ask about above in
> > how many characters we actually have to work with in IFNAMESZ, maybe
> > we start with 8 characters (matching your "index" namespace) and
> > expand as necessary for disambiguation. If we can eliminate overhead
> > in IFNAMESZ, let's start with 12. Thanks,
> >
> > Alex
>
> I really like that idea, and it seems the best option proposed yet, as we don't
> need to create a secondary identifier.
User setting this alias at mdev creation time and exposed via sysfs as read only attribute works.
Exposing that as
const char *mdev_alias(struct mdev_device *dev) to vendor drivers..
^ permalink raw reply
* Re: [PATCH 08/38] nfp: Convert to XArray
From: Jakub Kicinski @ 2019-08-21 3:59 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: netdev
In-Reply-To: <20190820223259.22348-9-willy@infradead.org>
On Tue, 20 Aug 2019 15:32:29 -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> A minor change in semantics where we simply store into the XArray rather
> than insert; this only matters if there could already be something stored
> at that index, and from my reading of the code that can't happen.
>
> Use xa_for_each() rather than xas_for_each() as none of these loops
> appear to be performance-critical.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Looks reasonable (indeed IIRC there should not be anything at the
index we try to store to). I'll try to test tomorrow (CCing maintainers
could speed things up a little.. 🤭)
> @@ -285,9 +275,9 @@ static void
> nfp_abm_qdisc_clear_mq(struct net_device *netdev, struct nfp_abm_link *alink,
> struct nfp_qdisc *qdisc)
> {
> - struct radix_tree_iter iter;
> unsigned int mq_refs = 0;
> - void __rcu **slot;
> + unsigned long index;
> + struct nfp_qdisc *mq;
Could you keep the variables sorted longest to shortest as is customary
in networking code if you respin?
^ permalink raw reply
* Re: [PATCH] net: pch_gbe: Fix memory leaks
From: Wenwen Wang @ 2019-08-21 4:10 UTC (permalink / raw)
To: David Miller
Cc: Richard Fontana, Allison Randal, Alexios Zavras,
Greg Kroah-Hartman, Thomas Gleixner,
open list:NETWORKING [GENERAL], open list, Wenwen Wang
In-Reply-To: <20190815.135111.1048854967874803531.davem@davemloft.net>
On Thu, Aug 15, 2019 at 4:51 PM David Miller <davem@davemloft.net> wrote:
>
> From: Wenwen Wang <wenwen@cs.uga.edu>
> Date: Thu, 15 Aug 2019 16:46:05 -0400
>
> > On Thu, Aug 15, 2019 at 4:42 PM David Miller <davem@davemloft.net> wrote:
> >>
> >> From: Wenwen Wang <wenwen@cs.uga.edu>
> >> Date: Thu, 15 Aug 2019 16:03:39 -0400
> >>
> >> > On Thu, Aug 15, 2019 at 3:34 PM David Miller <davem@davemloft.net> wrote:
> >> >>
> >> >> From: Wenwen Wang <wenwen@cs.uga.edu>
> >> >> Date: Tue, 13 Aug 2019 20:33:45 -0500
> >> >>
> >> >> > In pch_gbe_set_ringparam(), if netif_running() returns false, 'tx_old' and
> >> >> > 'rx_old' are not deallocated, leading to memory leaks. To fix this issue,
> >> >> > move the free statements after the if branch.
> >> >> >
> >> >> > Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
> >> >>
> >> >> Why would they be "deallocated"? They are still assigned to
> >> >> adapter->tx_ring and adapter->rx_ring.
> >> >
> >> > 'adapter->tx_ring' and 'adapter->rx_ring' has been covered by newly
> >> > allocated 'txdr' and 'rxdr' respectively before this if statement.
> >>
> >> That only happens inside of the if() statement, that's why rx_old and
> >> tx_old are only freed in that code path.
> >
> > That happens not only inside of the if statement, but also before the
> > if statement, just after 'txdr' and 'rxdr' are allocated.
>
> Then the assignments inside of the if() statement are redundant.
>
> Something doesn't add up here, please make the code consistent.
Thanks for your suggestion! I will remove the assignments inside of
the if() statement.
Wenwen
^ permalink raw reply
* [PATCH v2] net: pch_gbe: Fix memory leaks
From: Wenwen Wang @ 2019-08-21 4:20 UTC (permalink / raw)
To: Wenwen Wang
Cc: David S. Miller, Richard Fontana, Alexios Zavras, Allison Randal,
Greg Kroah-Hartman, Thomas Gleixner, open list:NETWORKING DRIVERS,
open list
In pch_gbe_set_ringparam(), if netif_running() returns false, 'tx_old' and
'rx_old' are not deallocated, leading to memory leaks. To fix this issue,
move the free statements to the outside of the if() statement.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
---
drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_ethtool.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_ethtool.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_ethtool.c
index 1a3008e..cb43919 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_ethtool.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_ethtool.c
@@ -340,12 +340,10 @@ static int pch_gbe_set_ringparam(struct net_device *netdev,
goto err_setup_tx;
pch_gbe_free_rx_resources(adapter, rx_old);
pch_gbe_free_tx_resources(adapter, tx_old);
- kfree(tx_old);
- kfree(rx_old);
- adapter->rx_ring = rxdr;
- adapter->tx_ring = txdr;
err = pch_gbe_up(adapter);
}
+ kfree(tx_old);
+ kfree(rx_old);
return err;
err_setup_tx:
--
2.7.4
^ permalink raw reply related
* Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Alex Williamson @ 2019-08-21 4:20 UTC (permalink / raw)
To: Parav Pandit
Cc: Jiri Pirko, David S . Miller, Kirti Wankhede, Cornelia Huck,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org, cjia,
netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB486686D3C311F3C61BE0997DD1AA0@AM0PR05MB4866.eurprd05.prod.outlook.com>
On Wed, 21 Aug 2019 03:42:25 +0000
Parav Pandit <parav@mellanox.com> wrote:
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, August 20, 2019 10:49 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> > Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> > <cohuck@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> >
> > On Tue, 20 Aug 2019 08:58:02 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >
> > > + Dave.
> > >
> > > Hi Jiri, Dave, Alex, Kirti, Cornelia,
> > >
> > > Please provide your feedback on it, how shall we proceed?
> > >
> > > Short summary of requirements.
> > > For a given mdev (mediated device [1]), there is one representor
> > > netdevice and devlink port in switchdev mode (similar to SR-IOV VF),
> > > And there is one netdevice for the actual mdev when mdev is probed.
> > >
> > > (a) representor netdev and devlink port should be able derive
> > > phys_port_name(). So that representor netdev name can be built
> > > deterministically across reboots.
> > >
> > > (b) for mdev's netdevice, mdev's device should have an attribute.
> > > This attribute can be used by udev rules/systemd or something else to
> > > rename netdev name deterministically.
> > >
> > > (c) IFNAMSIZ of 16 bytes is too small to fit whole UUID.
> > > A simple grep IFNAMSIZ in stack hints hundreds of users of IFNAMSIZ in
> > > drivers, uapi, netlink, boot config area and more. Changing IFNAMSIZ
> > > for a mdev bus doesn't really look reasonable option to me.
> >
> > How many characters do we really have to work with? Your examples below
> > prepend various characters, ex. option-1 results in ens2f0_m10 or enm10. Do
> > the extra 8 or 3 characters in these count against IFNAMSIZ?
> >
> Maximum 15. Last is null termination.
> Some udev rules setting by user prefix the PF netdev interface. I took such example below where ens2f0 netdev named is prefixed.
> Some prefer not to prefix.
>
> > > Hence, I would like to discuss below options.
> > >
> > > Option-1: mdev index
> > > Introduce an optional mdev index/handle as u32 during mdev create
> > > time. User passes mdev index/handle as input.
> > >
> > > phys_port_name=mIndex=m%u
> > > mdev_index will be available in sysfs as mdev attribute for udev to
> > > name the mdev's netdev.
> > >
> > > example mdev create command:
> > > UUID=$(uuidgen)
> > > echo $UUID index=10
> > > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
> >
> > Nit, IIRC previous discussions of additional parameters used comma separators,
> > ex. echo $UUID,index=10 >...
> >
> Yes, ok.
>
> > > > example netdevs:
> > > repnetdev=ens2f0_m10 /*ens2f0 is parent PF's netdevice */
> >
> > Is the parent really relevant in the name?
> No. I just picked one udev example who prefixed the parent netdev name.
> But there are users who do not prefix it.
>
> > Tools like mdevctl are meant to
> > provide persistence, creating the same mdev devices on the same parent, but
> > that's simply the easiest policy decision. We can also imagine that multiple
> > parent devices might support a specified mdev type and policies factoring in
> > proximity, load-balancing, power consumption, etc might be weighed such that
> > we really don't want to promote userspace creating dependencies on the
> > parent association.
> >
> > > mdev_netdev=enm10
> > >
> > > Pros:
> > > 1. mdevctl and any other existing tools are unaffected.
> > > 2. netdev stack, ovs and other switching platforms are unaffected.
> > > 3. achieves unique phys_port_name for representor netdev 4. achieves
> > > unique mdev eth netdev name for the mdev using udev/systemd extension.
> > > 5. Aligns well with mdev and netdev subsystem and similar to existing
> > > sriov bdf's.
> >
> > A user provided index seems strange to me. It's not really an index, just a user
> > specified instance number. Presumably you have the user providing this
> > because if it really were an index, then the value depends on the creation order
> > and persistence is lost. Now the user needs to both avoid uuid collision as well
> > as "index" number collision. The uuid namespace is large enough to mostly
> > ignore this, but this is not. This seems like a burden.
> >
> I liked the term 'instance number', which is lot better way to say than index/handle.
> Yes, user needs to avoid both the collision.
> UUID collision should not occur in most cases, they way UUID are generated.
> So practically users needs to pick unique 'instance number', similar to how it picks unique netdev names.
>
> Burden to user comes from the requirement to get uniqueness.
>
> > > Option-2: shorter mdev name
> > > Extend mdev to have shorter mdev device name in addition to UUID.
> > > such as 'foo', 'bar'.
> > > Mdev will continue to have UUID.
> > > phys_port_name=mdev_name
> > >
> > > Pros:
> > > 1. All same as option-1, except mdevctl needs upgrade for newer usage.
> > > It is common practice to upgrade iproute2 package along with the
> > > kernel. Similar practice to be done with mdevctl.
> > > 2. Newer users of mdevctl who wants to work with non_UUID names, will
> > > use newer mdevctl/tools. Cons:
> > > 1. Dual naming scheme of mdev might affect some of the existing tools.
> > > It's unclear how/if it actually affects.
> > > mdevctl [2] is very recently developed and can be enhanced for dual
> > > naming scheme.
> >
> > I think we've already nak'ed this one, the device namespace becomes
> > meaningless if the name becomes just a string where a uuid might be an
> > example string. mdevs are named by uuid.
> >
> > > Option-3: mdev uuid alias
> > > Instead of shorter mdev name or mdev index, have alpha-numeric name
> > > alias. Alias is an optional mdev sysfs attribute such as 'foo', 'bar'.
> > > example mdev create command:
> > > UUID=$(uuidgen)
> > > echo $UUID alias=foo
> > > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
> > > > example netdevs:
> > > examle netdevs:
> > > repnetdev = ens2f0_mfoo
> > > mdev_netdev=enmfoo
> > >
> > > Pros:
> > > 1. All same as option-1.
> > > 2. Doesn't affect existing mdev naming scheme.
> > > Cons:
> > > 1. Index scheme of option-1 is better which can number large number of
> > > mdevs with fewer characters, simplifying the management tool.
> >
> > No better than option-1, simply a larger secondary namespace, but still
> > requires the user to come up with two independent names for the device.
> >
> > > Option-4: extend IFNAMESZ to be 64 bytes Extended IFNAMESZ from 16 to
> > > 64 bytes phys_port_name=mdev_UUID_string mdev_netdev_name=enmUUID
> > >
> > > Pros:
> > > 1. Doesn't require mdev extension
> > > Cons:
> > > 1. netdev stack, driver, uapi, user space, boot config wide changes 2.
> > > Possible user space extensions who assumed name size being 16
> > > characters 3. Single device type demands namesize change for all
> > > netdev types
> >
> > What about an alias based on the uuid? For example, we use 160-bit sha1s
> > daily with git (uuids are only 128-bit), but we generally don't reference git
> > commits with the full 20 character string. Generally 12 characters is
> > recommended to avoid ambiguity. Could mdev automatically create an
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > abbreviated sha1 alias for the device? If so, how many characters should we
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > use and what do we do on collision? The colliding device could add enough
> > alias characters to disambiguate (we likely couldn't re-alias the existing device
> > to disambiguate, but I'm not sure it matters, userspace has sysfs to associate
> > aliases). Ex.
> >
> > UUID=$(uuidgen)
> > ALIAS=$(echo $UUID | sha1sum | colrm 13)
> >
> I explained in previous reply to Cornelia, we should set UUID and ALIAS at the same time.
> Setting is via different sysfs attribute is lot code burden with no extra benefit.
Just an example of the alias, not proposing how it's set. In fact,
proposing that the user does not set it, mdev-core provides one
automatically.
> > Since there seems to be some prefix overhead, as I ask about above in how
> > many characters we actually have to work with in IFNAMESZ, maybe we start
> > with 8 characters (matching your "index" namespace) and expand as necessary
> > for disambiguation. If we can eliminate overhead in IFNAMESZ, let's start with
> > 12. Thanks,
> >
> If user is going to choose the alias, why does it have to be limited to sha1?
> Or you just told it as an example?
>
> It can be an alpha-numeric string.
No, I'm proposing a different solution where mdev-core creates an alias
based on an abbreviated sha1. The user does not provide the alias.
> Instead of mdev imposing number of characters on the alias, it should be best left to the user.
> Because in future if netdev improves on the naming scheme, mdev will be limiting it, which is not right.
> So not restricting alias size seems right to me.
> User configuring mdev for networking devices in a given kernel knows what user is doing.
> So user can choose alias name size as it finds suitable.
That's not what I'm proposing, please read again. Thanks,
Alex
^ permalink raw reply
* Re: [EXT] [PATCH] qed: Add cleanup in qed_slowpath_start()
From: Wenwen Wang @ 2019-08-21 4:31 UTC (permalink / raw)
To: Sudarsana Reddy Kalluru
Cc: Ariel Elior, GR-everest-linux-l2, David S. Miller,
open list:QLOGIC QL4xxx ETHERNET DRIVER, open list, Wenwen Wang
In-Reply-To: <MN2PR18MB2528D8046DFC6BB880D8EFF6D3D20@MN2PR18MB2528.namprd18.prod.outlook.com>
On Tue, Aug 13, 2019 at 6:46 AM Sudarsana Reddy Kalluru
<skalluru@marvell.com> wrote:
>
> > -----Original Message-----
> > From: Wenwen Wang <wenwen@cs.uga.edu>
> > Sent: Tuesday, August 13, 2019 3:35 PM
> > To: Wenwen Wang <wenwen@cs.uga.edu>
> > Cc: Ariel Elior <aelior@marvell.com>; GR-everest-linux-l2 <GR-everest-linux-
> > l2@marvell.com>; David S. Miller <davem@davemloft.net>; open
> > list:QLOGIC QL4xxx ETHERNET DRIVER <netdev@vger.kernel.org>; open list
> > <linux-kernel@vger.kernel.org>
> > Subject: [EXT] [PATCH] qed: Add cleanup in qed_slowpath_start()
> >
> > External Email
> >
> > ----------------------------------------------------------------------
> > If qed_mcp_send_drv_version() fails, no cleanup is executed, leading to
> > memory leaks. To fix this issue, redirect the execution to the label 'err3'
> > before returning the error.
> >
> > Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
> > ---
> > drivers/net/ethernet/qlogic/qed/qed_main.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c
> > b/drivers/net/ethernet/qlogic/qed/qed_main.c
> > index 829dd60..d16a251 100644
> > --- a/drivers/net/ethernet/qlogic/qed/qed_main.c
> > +++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
> > @@ -1325,7 +1325,7 @@ static int qed_slowpath_start(struct qed_dev
> > *cdev,
> > &drv_version);
> > if (rc) {
> > DP_NOTICE(cdev, "Failed sending drv version
> > command\n");
> > - return rc;
> > + goto err3;
>
> In this case, we might need to free the ll2-buf allocated at the below path (?),
> 1312 /* Allocate LL2 interface if needed */
> 1313 if (QED_LEADING_HWFN(cdev)->using_ll2) {
> 1314 rc = qed_ll2_alloc_if(cdev);
> May be by adding a new goto label 'err4'.
Thanks for your suggestion! I will rework the patch.
Wenwen
>
> > }
> > }
> >
> > --
> > 2.7.4
>
^ permalink raw reply
* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Richard Cochran @ 2019-08-21 4:38 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Mark Brown, Hubert Feurstein, mlichvar, Andrew Lunn,
Florian Fainelli, linux-spi, netdev
In-Reply-To: <CA+h21hr4UcoJK7upNJjG0ibtX7CkF=akxVdrb--1AJn6-z=sUQ@mail.gmail.com>
On Tue, Aug 20, 2019 at 06:57:10PM +0300, Vladimir Oltean wrote:
> What if all we need is just a mini-"phc2sys-over-Ethernet" that runs
> on a kernel thread internally to DSA? We say that DSA switches are
> "slave" to the "master" netdevice - their PTP clock can be the same.
> I am fairly confident that the sja1105 at least can be configured in
> hardware to work in this mode. One just needs to enable the CPU port
> in its own reachability matrix. None of the switch ports is really a
> "CPU port" hardware speaking.
I did consider this method when working on an early version of the
marvell driver. At least for that chip, there was no way to get the
time stamps on the port attached to the CPU port.
The DSA system (as I understand it) does not allow using the Linux
interface acting as the CPU port in a way that would allow taking time
stamps with the existing code base. You might find a way, but I guess
it won't be easy.
Overall, the PTP switch use case is well supported by Linux. The
synchronization of the management CPU to the PTP, while nice to have,
is not required to implement a Transparent Clock. Your specific
application might require it, but honestly, if the management CPU
needs good synchronization, then you really aught to feed a PPS from
the switch into a gpio (for example) on the CPU.
Using SPI or MDIO or I2C or UART as a synchronization bus is not a
wise solution.
Having said that, I don't oppose improving the situation for these
slow, non-deterministic serial buses, but you will have to sell your
solution to the maintainers of said buses.
Thanks,
Richard
^ permalink raw reply
* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-21 4:40 UTC (permalink / raw)
To: Alex Williamson
Cc: Jiri Pirko, David S . Miller, Kirti Wankhede, Cornelia Huck,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org, cjia,
netdev@vger.kernel.org
In-Reply-To: <20190820222051.7aeafb69@x1.home>
> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, August 21, 2019 9:51 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> <cohuck@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
>
> On Wed, 21 Aug 2019 03:42:25 +0000
> Parav Pandit <parav@mellanox.com> wrote:
>
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Tuesday, August 20, 2019 10:49 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller
> > > <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>;
> > > Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org;
> > > linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> > > netdev@vger.kernel.org
> > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >
> > > On Tue, 20 Aug 2019 08:58:02 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > + Dave.
> > > >
> > > > Hi Jiri, Dave, Alex, Kirti, Cornelia,
> > > >
> > > > Please provide your feedback on it, how shall we proceed?
> > > >
> > > > Short summary of requirements.
> > > > For a given mdev (mediated device [1]), there is one representor
> > > > netdevice and devlink port in switchdev mode (similar to SR-IOV
> > > > VF), And there is one netdevice for the actual mdev when mdev is probed.
> > > >
> > > > (a) representor netdev and devlink port should be able derive
> > > > phys_port_name(). So that representor netdev name can be built
> > > > deterministically across reboots.
> > > >
> > > > (b) for mdev's netdevice, mdev's device should have an attribute.
> > > > This attribute can be used by udev rules/systemd or something else
> > > > to rename netdev name deterministically.
> > > >
> > > > (c) IFNAMSIZ of 16 bytes is too small to fit whole UUID.
> > > > A simple grep IFNAMSIZ in stack hints hundreds of users of
> > > > IFNAMSIZ in drivers, uapi, netlink, boot config area and more.
> > > > Changing IFNAMSIZ for a mdev bus doesn't really look reasonable option
> to me.
> > >
> > > How many characters do we really have to work with? Your examples
> > > below prepend various characters, ex. option-1 results in ens2f0_m10
> > > or enm10. Do the extra 8 or 3 characters in these count against IFNAMSIZ?
> > >
> > Maximum 15. Last is null termination.
> > Some udev rules setting by user prefix the PF netdev interface. I took such
> example below where ens2f0 netdev named is prefixed.
> > Some prefer not to prefix.
> >
> > > > Hence, I would like to discuss below options.
> > > >
> > > > Option-1: mdev index
> > > > Introduce an optional mdev index/handle as u32 during mdev create
> > > > time. User passes mdev index/handle as input.
> > > >
> > > > phys_port_name=mIndex=m%u
> > > > mdev_index will be available in sysfs as mdev attribute for udev
> > > > to name the mdev's netdev.
> > > >
> > > > example mdev create command:
> > > > UUID=$(uuidgen)
> > > > echo $UUID index=10
> > > > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
> > >
> > > Nit, IIRC previous discussions of additional parameters used comma
> > > separators, ex. echo $UUID,index=10 >...
> > >
> > Yes, ok.
> >
> > > > > example netdevs:
> > > > repnetdev=ens2f0_m10 /*ens2f0 is parent PF's netdevice */
> > >
> > > Is the parent really relevant in the name?
> > No. I just picked one udev example who prefixed the parent netdev name.
> > But there are users who do not prefix it.
> >
> > > Tools like mdevctl are meant to
> > > provide persistence, creating the same mdev devices on the same
> > > parent, but that's simply the easiest policy decision. We can also
> > > imagine that multiple parent devices might support a specified mdev
> > > type and policies factoring in proximity, load-balancing, power
> > > consumption, etc might be weighed such that we really don't want to
> > > promote userspace creating dependencies on the parent association.
> > >
> > > > mdev_netdev=enm10
> > > >
> > > > Pros:
> > > > 1. mdevctl and any other existing tools are unaffected.
> > > > 2. netdev stack, ovs and other switching platforms are unaffected.
> > > > 3. achieves unique phys_port_name for representor netdev 4.
> > > > achieves unique mdev eth netdev name for the mdev using udev/systemd
> extension.
> > > > 5. Aligns well with mdev and netdev subsystem and similar to
> > > > existing sriov bdf's.
> > >
> > > A user provided index seems strange to me. It's not really an
> > > index, just a user specified instance number. Presumably you have
> > > the user providing this because if it really were an index, then the
> > > value depends on the creation order and persistence is lost. Now
> > > the user needs to both avoid uuid collision as well as "index"
> > > number collision. The uuid namespace is large enough to mostly ignore
> this, but this is not. This seems like a burden.
> > >
> > I liked the term 'instance number', which is lot better way to say than
> index/handle.
> > Yes, user needs to avoid both the collision.
> > UUID collision should not occur in most cases, they way UUID are generated.
> > So practically users needs to pick unique 'instance number', similar to how it
> picks unique netdev names.
> >
> > Burden to user comes from the requirement to get uniqueness.
> >
> > > > Option-2: shorter mdev name
> > > > Extend mdev to have shorter mdev device name in addition to UUID.
> > > > such as 'foo', 'bar'.
> > > > Mdev will continue to have UUID.
> > > > phys_port_name=mdev_name
> > > >
> > > > Pros:
> > > > 1. All same as option-1, except mdevctl needs upgrade for newer usage.
> > > > It is common practice to upgrade iproute2 package along with the
> > > > kernel. Similar practice to be done with mdevctl.
> > > > 2. Newer users of mdevctl who wants to work with non_UUID names,
> > > > will use newer mdevctl/tools. Cons:
> > > > 1. Dual naming scheme of mdev might affect some of the existing tools.
> > > > It's unclear how/if it actually affects.
> > > > mdevctl [2] is very recently developed and can be enhanced for
> > > > dual naming scheme.
> > >
> > > I think we've already nak'ed this one, the device namespace becomes
> > > meaningless if the name becomes just a string where a uuid might be
> > > an example string. mdevs are named by uuid.
> > >
> > > > Option-3: mdev uuid alias
> > > > Instead of shorter mdev name or mdev index, have alpha-numeric
> > > > name alias. Alias is an optional mdev sysfs attribute such as 'foo', 'bar'.
> > > > example mdev create command:
> > > > UUID=$(uuidgen)
> > > > echo $UUID alias=foo
> > > > > /sys/class/net/ens2f0/mdev_supported_types/mlx5_core_mdev/create
> > > > > example netdevs:
> > > > examle netdevs:
> > > > repnetdev = ens2f0_mfoo
> > > > mdev_netdev=enmfoo
> > > >
> > > > Pros:
> > > > 1. All same as option-1.
> > > > 2. Doesn't affect existing mdev naming scheme.
> > > > Cons:
> > > > 1. Index scheme of option-1 is better which can number large
> > > > number of mdevs with fewer characters, simplifying the management
> tool.
> > >
> > > No better than option-1, simply a larger secondary namespace, but
> > > still requires the user to come up with two independent names for the
> device.
> > >
> > > > Option-4: extend IFNAMESZ to be 64 bytes Extended IFNAMESZ from 16
> > > > to
> > > > 64 bytes phys_port_name=mdev_UUID_string
> mdev_netdev_name=enmUUID
> > > >
> > > > Pros:
> > > > 1. Doesn't require mdev extension
> > > > Cons:
> > > > 1. netdev stack, driver, uapi, user space, boot config wide changes 2.
> > > > Possible user space extensions who assumed name size being 16
> > > > characters 3. Single device type demands namesize change for all
> > > > netdev types
> > >
> > > What about an alias based on the uuid? For example, we use 160-bit
> > > sha1s daily with git (uuids are only 128-bit), but we generally
> > > don't reference git commits with the full 20 character string.
> > > Generally 12 characters is recommended to avoid ambiguity. Could
> > > mdev automatically create an
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > abbreviated sha1 alias for the device? If so, how many characters
> > > should we
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > use and what do we do on collision? The colliding device could add
> > > enough alias characters to disambiguate (we likely couldn't re-alias
> > > the existing device to disambiguate, but I'm not sure it matters,
> > > userspace has sysfs to associate aliases). Ex.
> > >
> > > UUID=$(uuidgen)
> > > ALIAS=$(echo $UUID | sha1sum | colrm 13)
> > >
> > I explained in previous reply to Cornelia, we should set UUID and ALIAS at the
> same time.
> > Setting is via different sysfs attribute is lot code burden with no extra benefit.
>
> Just an example of the alias, not proposing how it's set. In fact, proposing that
> the user does not set it, mdev-core provides one automatically.
>
> > > Since there seems to be some prefix overhead, as I ask about above
> > > in how many characters we actually have to work with in IFNAMESZ,
> > > maybe we start with 8 characters (matching your "index" namespace)
> > > and expand as necessary for disambiguation. If we can eliminate
> > > overhead in IFNAMESZ, let's start with 12. Thanks,
> > >
> > If user is going to choose the alias, why does it have to be limited to sha1?
> > Or you just told it as an example?
> >
> > It can be an alpha-numeric string.
>
> No, I'm proposing a different solution where mdev-core creates an alias based
> on an abbreviated sha1. The user does not provide the alias.
>
> > Instead of mdev imposing number of characters on the alias, it should be best
> left to the user.
> > Because in future if netdev improves on the naming scheme, mdev will be
> limiting it, which is not right.
> > So not restricting alias size seems right to me.
> > User configuring mdev for networking devices in a given kernel knows what
> user is doing.
> > So user can choose alias name size as it finds suitable.
>
> That's not what I'm proposing, please read again. Thanks,
I understood your point. But mdev doesn't know how user is going to use udev/systemd to name the netdev.
So even if mdev chose to pick 12 characters, it could result in collision.
Hence the proposal to provide the alias by the user, as user know the best policy for its use case in the environment its using.
So 12 character sha1 method will still work by user.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox