Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next iproute2 v2 1/2] devlink: Update kernel header to commit
From: Parav Pandit @ 2019-07-10 12:39 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern, jiri, Parav Pandit
In-Reply-To: <20190701183017.25407-1-parav@mellanox.com>

Update kernel header to commit:
e41b6bf3cdd4 ("devlink: Introduce PCI VF port flavour and port attribute")

Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 include/uapi/linux/devlink.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 6544824a..fc195cbd 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -169,6 +169,14 @@ enum devlink_port_flavour {
 	DEVLINK_PORT_FLAVOUR_DSA, /* Distributed switch architecture
 				   * interconnect port.
 				   */
+	DEVLINK_PORT_FLAVOUR_PCI_PF, /* Represents eswitch port for
+				      * the PCI PF. It is an internal
+				      * port that faces the PCI PF.
+				      */
+	DEVLINK_PORT_FLAVOUR_PCI_VF, /* Represents eswitch port
+				      * for the PCI VF. It is an internal
+				      * port that faces the PCI VF.
+				      */
 };
 
 enum devlink_param_cmode {
@@ -337,6 +345,9 @@ enum devlink_attr {
 	DEVLINK_ATTR_FLASH_UPDATE_STATUS_DONE,	/* u64 */
 	DEVLINK_ATTR_FLASH_UPDATE_STATUS_TOTAL,	/* u64 */
 
+	DEVLINK_ATTR_PORT_PCI_PF_NUMBER,	/* u16 */
+	DEVLINK_ATTR_PORT_PCI_VF_NUMBER,	/* u16 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
-- 
2.19.2


^ permalink raw reply related

* Re: [PATCH net-next v6 06/15] ethtool: netlink bitset handling
From: Michal Kubecek @ 2019-07-10 12:38 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, David Miller, Jakub Kicinski, Andrew Lunn,
	Florian Fainelli, John Linville, Stephen Hemminger, Johannes Berg,
	linux-kernel
In-Reply-To: <20190709141817.GE2301@nanopsycho.orion>

On Tue, Jul 09, 2019 at 04:18:17PM +0200, Jiri Pirko wrote:
> 
> I understand. So how about avoid the bitfield all together and just
> have array of either bits of strings or combinations?
> 
> ETHTOOL_CMD_SETTINGS_SET (U->K)
>     ETHTOOL_A_HEADER
>         ETHTOOL_A_DEV_NAME = "eth3"
>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>        ETHTOOL_A_SETTINGS_PRIV_FLAG
>            ETHTOOL_A_FLAG_NAME = "legacy-rx"
> 	   ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> 
> or the same with index instead of string
> 
> ETHTOOL_CMD_SETTINGS_SET (U->K)
>     ETHTOOL_A_HEADER
>         ETHTOOL_A_DEV_NAME = "eth3"
>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 0
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> 
> 
> For set you can combine both when you want to set multiple bits:
> 
> ETHTOOL_CMD_SETTINGS_SET (U->K)
>     ETHTOOL_A_HEADER
>         ETHTOOL_A_DEV_NAME = "eth3"
>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 2
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 8
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_NAME = "legacy-rx"
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
> 
> 
> For get this might be a bit bigger message:
> 
> ETHTOOL_CMD_SETTINGS_GET_REPLY (K->U)
>     ETHTOOL_A_HEADER
>         ETHTOOL_A_DEV_NAME = "eth3"
>     ETHTOOL_A_SETTINGS_PRIV_FLAGS
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 0
>             ETHTOOL_A_FLAG_NAME = "legacy-rx"
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 1
>             ETHTOOL_A_FLAG_NAME = "vf-ipsec"
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)
>         ETHTOOL_A_SETTINGS_PRIV_FLAG
>             ETHTOOL_A_FLAG_INDEX = 8
>             ETHTOOL_A_FLAG_NAME = "something-else"
>  	    ETHTOOL_A_FLAG_VALUE   (NLA_FLAG)

This is perfect for "one shot" applications but not so much for long
running ones, either "ethtool --monitor" or management or monitoring
daemons. Repeating the names in every notification message would be
a waste, it's much more convenient to load the strings only once and
cache them. Even if we omit the names in notifications (and possibly the
GET replies if client opts for it), this format still takes 12-16 bytes
per bit.

So the problem I'm trying to address is that there are two types of
clients with very different mode of work and different preferences.

Looking at the bitset.c, I would rather say that most of the complexity
and ugliness comes from dealing with both unsigned long based bitmaps
and u32 based ones. Originally, there were functions working with
unsigned long based bitmaps and the variants with "32" suffix were
wrappers around them which converted u32 bitmaps to unsigned long ones
and back. This became a problem when kernel started issuing warnings
about variable length arrays as getting rid of them meant two kmalloc()
and two kfree() for each u32 bitmap operation, even if most of the
bitmaps are in rather short in practice.

Maybe the wrapper could do something like

int ethnl_put_bitset32(const u32 *value, const u32 *mask,
		       unsigned int size,  ...)
{
	unsigned long fixed_value[2], fixed_mask[2];
	unsigned long *tmp_value = fixed_value;
	unsigned long *tmp_mask = fixed_mask;

	if (size > sizeof(fixed_value) * BITS_PER_BYTE) {
		tmp_value = bitmap_alloc(size);
		if (!tmp_value)
			return -ENOMEM;
		tmp_mask = bitmap_alloc(size);
		if (!tmp_mask) {
			kfree(tmp_value);
			return -ENOMEM;
		}
	}

	bitmap_from_arr32(tmp_value, value, size);
	bitmap_from_arr32(tmp_mask, mask, size);
	ret = ethnl_put_bitset(tmp_value, tmp_mask, size, ...);
}

This way we would make bitset.c code cleaner while avoiding allocating
short bitmaps (which is the most common case). 

Michal

^ permalink raw reply

* Re: Question about linux kernel commit: "net/ipv6: move metrics from dst to rt6_info"
From: David Ahern @ 2019-07-10 12:27 UTC (permalink / raw)
  To: Jan Szewczyk, davem@davemloft.net; +Cc: netdev@vger.kernel.org
In-Reply-To: <AM6PR07MB56397A8BC53D9A525BC9C489F2F00@AM6PR07MB5639.eurprd07.prod.outlook.com>

[ adding netdev so others can chime in ]

On 7/10/19 2:28 AM, Jan Szewczyk wrote:
> Hi guys!
> 
> We can see different behavior of one of our commands that supposed to
> show pmtu information.
> 
> It’s using netlink message RTM_GETROUTE to get the information and in
> Linux kernel version 4.12 after sending big packet (and triggering
> “packet too big”) there is an entry with PMTU and expiration time.
> 
> In the version 4.18 unfortunately the entry looks different and there is
> no PMTU information.

Can you try with 4.19.58 (latest stable release for 4.19)? Perhaps there
was a bugfix that is missing from 4.18.

The kernel has 2 commands under tools/testing/selftests/net -- pmtu.sh
and icmp_redirect.sh -- that verify exceptions are created and use 'ip
ro get' to verify the mtu.


> 
> I can see that in your commit
> https://github.com/torvalds/linux/commit/d4ead6b34b67fd711639324b6465a050bcb197d4,
> these lines disappeared from route.c:
> 
>  
> 
>      if (rt->rt6i_pmtu)
> 
>            metrics[RTAX_MTU - 1] = rt->rt6i_pmtu;
> 
>  
> 
> I’m very beginner in linux kernel code, can you help me and tell me if
> that could cause this different behavior?
> 
>  
> 
>  
> 
> BR,
> 
> Jan Szewczyk
> 


^ permalink raw reply

* Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions
From: Gal Pressman @ 2019-07-10 12:19 UTC (permalink / raw)
  To: Michal Kalderon, ariel.elior, jgg, dledford; +Cc: linux-rdma, davem, netdev
In-Reply-To: <20190709141735.19193-2-michal.kalderon@marvell.com>

On 09/07/2019 17:17, Michal Kalderon wrote:
> Create some common API's for adding entries to a xa_mmap.
> Searching for an entry and freeing one.
> 
> The code was copied from the efa driver almost as is, just renamed
> function to be generic and not efa specific.
> 
> Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>

Reviewed-by: Gal Pressman <galpress@amazon.com>

^ permalink raw reply

* Re: [PATCH net-next v6 04/15] ethtool: introduce ethtool netlink interface
From: Michal Kubecek @ 2019-07-10 12:12 UTC (permalink / raw)
  To: netdev
  Cc: Jiri Pirko, David Miller, Jakub Kicinski, Andrew Lunn,
	Florian Fainelli, John Linville, Stephen Hemminger, Johannes Berg,
	linux-kernel
In-Reply-To: <20190709134212.GD2301@nanopsycho.orion>

On Tue, Jul 09, 2019 at 03:42:12PM +0200, Jiri Pirko wrote:
> Mon, Jul 08, 2019 at 10:22:19PM CEST, mkubecek@suse.cz wrote:
> >On Mon, Jul 08, 2019 at 09:26:29PM +0200, Jiri Pirko wrote:
> >> Mon, Jul 08, 2019 at 07:27:29PM CEST, mkubecek@suse.cz wrote:
> >> >
> >> >There are two reasons for this design. First is to reduce the number of
> >> >requests needed to get the information. This is not so much a problem of
> >> >ethtool itself; the only existing commands that would result in multiple
> >> >request messages would be "ethtool <dev>" and "ethtool -s <dev>". Maybe
> >> >also "ethtool -x/-X <dev>" but even if the indirection table and hash
> >> >key have different bits assigned now, they don't have to be split even
> >> >if we split other commands. It may be bigger problem for daemons wanting
> >> >to keep track of system configuration which would have to issue many
> >> >requests whenever a new device appears.
> >> >
> >> >Second reason is that with 8-bit genetlink command/message id, the space
> >> >is not as infinite as it might seem. I counted quickly, right now the
> >> >full series uses 14 ids for kernel messages, with split you propose it
> >> >would most likely grow to 44. For full implementation of all ethtool
> >> >functionality, we could get to ~60 ids. It's still only 1/4 of the
> >> >available space but it's not clear what the future development will look
> >> >like. We would certainly need to be careful not to start allocating new
> >> >commands for single parameters and try to be foreseeing about what can
> >> >be grouped together. But we will need to do that in any case.
> >> >
> >> >On kernel side, splitting existing messages would make some things a bit
> >> >easier. It would also reduce the number of scenarios where only part of
> >> >requested information is available or only part of a SET request fails.
> >> 
> >> Okay, I got your point. So why don't we look at if from the other angle.
> >> Why don't we have only single get/set command that would be in general
> >> used to get/set ALL info from/to the kernel. Where we can have these
> >> bits (perhaps rather varlen bitfield) to for user to indicate which data
> >> is he interested in? This scales. The other commands would be
> >> just for action.
> >> 
> >> Something like RTM_GETLINK/RTM_SETLINK. Makes sense?
> >
> >It's certainly an option but at the first glance it seems as just moving
> >what I tried to avoid one level lower. It would work around the u8 issue
> >(but as Johannes pointed out, we can handle it with genetlink when/if
> >the time comes). We would almost certainly have to split the replies
> >into multiple messages to keep the packet size reasonable. I'll have to
> >think more about the consequences for both kernel and userspace.
> >
> >My gut feeling is that out of the two extreme options (one universal
> >message type and message types corresponding to current infomask bits),
> >the latter is more appealing. After all, ethtool has been gathering
> >features that would need those ~60 message types for 20 years.
> 
> Yeah, but I think that we have to do one or another. Anything in between
> makes the code complex and uapi confusing. Let's start clean :)

I'll split the messages for v7.

Michal

^ permalink raw reply

* Re: [PATCH v6 rdma-next 2/6] RDMA/efa: Use the common mmap_xa helpers
From: Gal Pressman @ 2019-07-10 12:09 UTC (permalink / raw)
  To: Michal Kalderon, ariel.elior, jgg, dledford; +Cc: linux-rdma, davem, netdev
In-Reply-To: <20190709141735.19193-3-michal.kalderon@marvell.com>

On 09/07/2019 17:17, Michal Kalderon wrote:
> Remove the functions related to managing the mmap_xa database.
> This code was copied to the ib_core. Use the common API's instead.
> 
> Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>

Thanks Michal,
Acked-by: Gal Pressman <galpress@amazon.com>

^ permalink raw reply

* Re: [PATCH v12 1/5] can: m_can: Create a m_can platform framework
From: Dan Murphy @ 2019-07-10 12:08 UTC (permalink / raw)
  To: wg, mkl, davem; +Cc: linux-can, netdev, linux-kernel
In-Reply-To: <6fa79302-ad32-7f43-f9d5-af70aa789284@ti.com>

Hello

On 6/17/19 10:09 AM, Dan Murphy wrote:
> Marc
>
> On 6/10/19 11:35 AM, Dan Murphy wrote:
>> Bump
>>
>> On 6/6/19 8:16 AM, Dan Murphy wrote:
>>> Marc
>>>
>>> Bump
>>>
>>> On 5/31/19 6:51 AM, Dan Murphy wrote:
>>>> Marc
>>>>
>>>> On 5/15/19 3:54 PM, Dan Murphy wrote:
>>>>> Marc
>>>>>
>>>>> On 5/9/19 11:11 AM, Dan Murphy wrote:
>>>>>> Create a m_can platform framework that peripheral
>>>>>> devices can register to and use common code and register sets.
>>>>>> The peripheral devices may provide read/write and configuration
>>>>>> support of the IP.
>>>>>>
>>>>>> Acked-by: Wolfgang Grandegger <wg@grandegger.com>
>>>>>> Signed-off-by: Dan Murphy <dmurphy@ti.com>
>>>>>> ---
>>>>>>
>>>>>> v12 - Update the m_can_read/write functions to create a backtrace 
>>>>>> if the callback
>>>>>> pointer is NULL. - https://lore.kernel.org/patchwork/patch/1052302/
>>>>>>
>>>>> Is this able to be merged now?
>>>>
>>>> ping
>
> Wondering if there is anything else we need to do?
>
> The part has officially shipped and we had hoped to have driver 
> support in Linux as part of the announcement.
>
Is this being sent in a PR for 5.3?

Dan


> Dan
>
>
>>>>
>>>>
>>>>> Dan
>>>>>
>>>>> <snip>

^ permalink raw reply

* [PATCH v2 bpf] selftests/bpf: fix bpf_target_sparc check
From: Ilya Leoshkevich @ 2019-07-10 11:56 UTC (permalink / raw)
  To: bpf, netdev; +Cc: andrii.nakryiko, Ilya Leoshkevich

bpf_helpers.h fails to compile on sparc: the code should be checking
for defined(bpf_target_sparc), but checks simply for bpf_target_sparc.

Also change #ifdef bpf_target_powerpc to #if defined() for consistency.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---

v1->v2: bpf_target_powerpc change

 tools/testing/selftests/bpf/bpf_helpers.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 5f6f9e7aba2a..0214797518ce 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -440,10 +440,10 @@ static int (*bpf_skb_adjust_room)(void *ctx, __s32 len_diff, __u32 mode,
 
 #endif
 
-#ifdef bpf_target_powerpc
+#if defined(bpf_target_powerpc)
 #define BPF_KPROBE_READ_RET_IP(ip, ctx)		({ (ip) = (ctx)->link; })
 #define BPF_KRETPROBE_READ_RET_IP		BPF_KPROBE_READ_RET_IP
-#elif bpf_target_sparc
+#elif defined(bpf_target_sparc)
 #define BPF_KPROBE_READ_RET_IP(ip, ctx)		({ (ip) = PT_REGS_RET(ctx); })
 #define BPF_KRETPROBE_READ_RET_IP		BPF_KPROBE_READ_RET_IP
 #else
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH iproute2 master 1/3] devlink: Change devlink health dump show command to dumpit
From: Jiri Pirko @ 2019-07-10 11:47 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: Stephen Hemminger, netdev, moshe, ayal
In-Reply-To: <1562756601-19171-2-git-send-email-tariqt@mellanox.com>

Wed, Jul 10, 2019 at 01:03:19PM CEST, tariqt@mellanox.com wrote:
>From: Aya Levin <ayal@mellanox.com>
>
>Although devlink health dump show command is given per reporter, it
>returns large amounts of data. Trying to use the doit cb results in
>OUT-OF-BUFFER error. This complementary patch raises the DUMP flag in
>order to invoke the dumpit cb. We're safe as no existing drivers
>implement the dump health reporter option yet.
>
>Fixes: 041e6e651a8e ("devlink: Add devlink health dump show command")
>Signed-off-by: Aya Levin <ayal@mellanox.com>
>Signed-off-by: Tariq Toukan <tariqt@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH v3 bpf-next 3/4] selftests/bpf: make PT_REGS_* work in userspace
From: Ilya Leoshkevich @ 2019-07-10 11:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Stanislav Fomichev, Y Song, David S. Miller,
	Alexei Starovoitov, Daniel Borkmann, Adrian Ratiu, david.daney
In-Reply-To: <CAEf4BzY0kO2si_ajouYNfsauaWdHkj042++bLaHe1W_G885i=g@mail.gmail.com>

> Am 09.07.2019 um 19:48 schrieb Andrii Nakryiko <andrii.nakryiko@gmail.com>:
> 
> On Tue, Jul 9, 2019 at 8:19 AM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>> 
>> Right now, on certain architectures, these macros are usable only with
>> kernel headers. This patch makes it possible to use them with userspace
>> headers and, as a consequence, not only in BPF samples, but also in BPF
>> selftests.
>> 
>> On s390, provide the forward declaration of struct pt_regs and cast it
>> to user_pt_regs in PT_REGS_* macros. This is necessary, because instead
>> of the full struct pt_regs, s390 exposes only its first member
>> user_pt_regs to userspace, and bpf_helpers.h is used with both userspace
>> (in selftests) and kernel (in samples) headers. It was added in commit
>> 466698e654e8 ("s390/bpf: correct broken uapi for
>> BPF_PROG_TYPE_PERF_EVENT program type").
>> 
>> Ditto on arm64.
>> 
>> On x86, provide userspace versions of PT_REGS_* macros. Unlike s390 and
>> arm64, x86 provides struct pt_regs to both userspace and kernel, however,
>> with different member names.
>> 
>> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
>> ---
> 
> Just curious, what did you use as a reference for which register
> corresponds to which PARM, RET, etc for different archs? I've tried to
> look it up the other day, and it wasn't as straightforward to find as
> I hoped for, so maybe I'm missing something obvious.

For this particular change I did not have to look it up, because it all
was already in the code, I just needed to adapt it to userspace headers.
Normally I would google for „abi supplement“ to find this information.
A lazy way would be to simply ask the (cross-)compiler:

cat <<HERE | aarch64-linux-gnu-gcc -x c -O3 -S - -o -
int f(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j);
int g() { return -f(1, 2, 3, 4, 5, 6, 7, 8, 9, 10); }
HERE

I’ve just double checked the supported arches, and noticed that:

#define PT_REGS_PARM5(x) ((x)->uregs[4])
for bpf_target_arm (arm-linux-gnueabihf) looks wrong:
the 5th parameter should be passed on stack. This observation matches
[1].

#define PT_REGS_RC(x) ((x)->regs[1])
for bpf_target_mips (mips64el-linux-gnuabi64) also looks wrong:
the return value should be in register 2. This observation matches [2].

Since I’m not an expert on those architectures, my conclusions could be
incorrect (e.g. becase a different ABI is normally used in practice).
Adrian and David, could you please correct me if I’m wrong?

[1] http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf
[2] ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf

>> tools/testing/selftests/bpf/bpf_helpers.h | 61 +++++++++++++++--------
>> 1 file changed, 41 insertions(+), 20 deletions(-)
>> 
>> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>> index 73071a94769a..212ec564e5c3 100644
>> --- a/tools/testing/selftests/bpf/bpf_helpers.h
>> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
>> @@ -358,6 +358,7 @@ static int (*bpf_skb_adjust_room)(void *ctx, __s32 len_diff, __u32 mode,
>> 
>> #if defined(bpf_target_x86)
>> 
>> +#ifdef __KERNEL__
>> #define PT_REGS_PARM1(x) ((x)->di)
>> #define PT_REGS_PARM2(x) ((x)->si)
>> #define PT_REGS_PARM3(x) ((x)->dx)
>> @@ -368,19 +369,35 @@ static int (*bpf_skb_adjust_room)(void *ctx, __s32 len_diff, __u32 mode,
>> #define PT_REGS_RC(x) ((x)->ax)
>> #define PT_REGS_SP(x) ((x)->sp)
>> #define PT_REGS_IP(x) ((x)->ip)
>> +#else
>> +#define PT_REGS_PARM1(x) ((x)->rdi)
>> +#define PT_REGS_PARM2(x) ((x)->rsi)
>> +#define PT_REGS_PARM3(x) ((x)->rdx)
>> +#define PT_REGS_PARM4(x) ((x)->rcx)
>> +#define PT_REGS_PARM5(x) ((x)->r8)
>> +#define PT_REGS_RET(x) ((x)->rsp)
>> +#define PT_REGS_FP(x) ((x)->rbp)
>> +#define PT_REGS_RC(x) ((x)->rax)
>> +#define PT_REGS_SP(x) ((x)->rsp)
>> +#define PT_REGS_IP(x) ((x)->rip)
> 
> Will this also work for 32-bit x86?

Thanks, this is a good catch: this builds, but makes 64-bit accesses, as if it used the 64-bit
variant of pt_regs. I will fix this.

^ permalink raw reply

* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Toke Høiland-Jørgensen @ 2019-07-10 11:39 UTC (permalink / raw)
  To: Ido Schimmel, Jakub Kicinski
  Cc: David Miller, netdev, jiri, mlxsw, dsahern, roopa, nikolay, andy,
	pablo, pieter.jansenvanvuuren, andrew, f.fainelli, vivien.didelot,
	idosch, Alexei Starovoitov
In-Reply-To: <20190710112011.GA552@splinter>

Ido Schimmel <idosch@idosch.org> writes:

> On Tue, Jul 09, 2019 at 03:34:30PM -0700, Jakub Kicinski wrote:
>> On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
>> > On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
>> > > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:  
>> > > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:  
>> > > > > From: Ido Schimmel <idosch@idosch.org>
>> > > > > Date: Sun,  7 Jul 2019 10:58:17 +0300
>> > > > >     
>> > > > > > Users have several ways to debug the kernel and understand why a packet
>> > > > > > was dropped. For example, using "drop monitor" and "perf". Both
>> > > > > > utilities trace kfree_skb(), which is the function called when a packet
>> > > > > > is freed as part of a failure. The information provided by these tools
>> > > > > > is invaluable when trying to understand the cause of a packet loss.
>> > > > > > 
>> > > > > > In recent years, large portions of the kernel data path were offloaded
>> > > > > > to capable devices. Today, it is possible to perform L2 and L3
>> > > > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
>> > > > > > Different TC classifiers and actions are also offloaded to capable
>> > > > > > devices, at both ingress and egress.
>> > > > > > 
>> > > > > > However, when the data path is offloaded it is not possible to achieve
>> > > > > > the same level of introspection as tools such "perf" and "drop monitor"
>> > > > > > become irrelevant.
>> > > > > > 
>> > > > > > This patchset aims to solve this by allowing users to monitor packets
>> > > > > > that the underlying device decided to drop along with relevant metadata
>> > > > > > such as the drop reason and ingress port.    
>> > > > > 
>> > > > > We are now going to have 5 or so ways to capture packets passing through
>> > > > > the system, this is nonsense.
>> > > > > 
>> > > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
>> > > > > devlink thing.
>> > > > > 
>> > > > > This is insanity, too many ways to do the same thing and therefore the
>> > > > > worst possible user experience.
>> > > > > 
>> > > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
>> > > > > XDP perf events, and these taps there too.
>> > > > > 
>> > > > > I mean really, think about it from the average user's perspective.  To
>> > > > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
>> > > > > listen on devlink but configure a special tap thing beforehand and then
>> > > > > if someone is using XDP I gotta setup another perf event buffer capture
>> > > > > thing too.    
>> > > > 
>> > > > Let me try to explain again because I probably wasn't clear enough. The
>> > > > devlink-trap mechanism is not doing the same thing as other solutions.
>> > > > 
>> > > > The packets we are capturing in this patchset are packets that the
>> > > > kernel (the CPU) never saw up until now - they were silently dropped by
>> > > > the underlying device performing the packet forwarding instead of the
>> > > > CPU.  
>> > 
>> > Jakub,
>> > 
>> > It seems to me that most of the criticism is about consolidation of
>> > interfaces because you believe I'm doing something you can already do
>> > today, but this is not the case.
>> 
>> To be clear I'm not opposed to the patches, I'm just trying to
>> facilitate a discussion.
>
> Sure, sorry if it came out the wrong way. I appreciate your feedback and
> the time you have spent on this subject.
>
>> 
>> > Switch ASICs have dedicated traps for specific packets. Usually, these
>> > packets are control packets (e.g., ARP, BGP) which are required for the
>> > correct functioning of the control plane. You can see this in the SAI
>> > interface, which is an abstraction layer over vendors' SDKs:
>> > 
>> > https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
>> > 
>> > We need to be able to configure the hardware policers of these traps and
>> > read their statistics to understand how many packets they dropped. We
>> > currently do not have a way to do any of that and we rely on hardcoded
>> > defaults in the driver which do not fit every use case (from
>> > experience):
>> > 
>> > https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
>> > 
>> > We plan to extend devlink-trap mechanism to cover all these use cases. I
>> > hope you agree that this functionality belongs in devlink given it is a
>> > device-specific configuration and not a netdev-specific one.
>> 
>> No disagreement on providing knobs for traps.
>> 
>> > That being said, in its current form, this mechanism is focused on traps
>> > that correlate to packets the device decided to drop as this is very
>> > useful for debugging.
>> 
>> That'd be mixing two things - trap configuration and tracing exceptions
>> in one API. That's a little suboptimal but not too terrible, especially
>> if there is a higher level APIs users can default to.
>
> TBH, initially I was only focused on the drops, but then it occurred to
> me that this is a too narrow scope. These traps are only a subset of the
> complete list of traps we have and we have similar requirements for both
> (statistics, setting policers etc.). Therefore, I decided to design this
> interface in a more generic way, so that it could support the different
> use cases.
>
>> 
>> > Given that the entire configuration is done via devlink and that devlink
>> > stores all the information about these traps, it seems logical to also
>> > report these packets and their metadata to user space as devlink events.
>> > 
>> > If this is not desirable, we can try to call into drop_monitor from
>> > devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
>> > encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
>> > 
>> > IMO, this is less desirable, as instead of having one tool (devlink) to
>> > interact with this mechanism we will need two (devlink & dropwatch).
>> > 
>> > Below I tried to answer all your questions and refer to all the points
>> > you brought up.
>> > 
>> > > When you say silently dropped do you mean that mlxsw as of today
>> > > doesn't have any counters exposed for those events?  
>> > 
>> > Some of these packets are counted, but not all of them.
>> > 
>> > > If we wanted to consolidate this into something existing we can either
>> > >  (a) add similar traps in the kernel data path;
>> > >  (b) make these traps extension of statistics.
>> > > 
>> > > My knee jerk reaction to seeing the patches was that it adds a new
>> > > place where device statistics are reported.  
>> > 
>> > Not at all. This would be a step back. We can already count discards due
>> > to VLAN membership on ingress on a per-port basis. A software maintained
>> > global counter does not buy us anything.
>> > 
>> > By also getting the dropped packet - coupled with the drop reason and
>> > ingress port - you can understand exactly why and on which VLAN the
>> > packet was dropped. I wrote a Wireshark dissector for these netlink
>> > packets to make our life easier. You can see the details in my comment
>> > to the cover letter:
>> > 
>> > https://marc.info/?l=linux-netdev&m=156248736710238&w=2
>> > 
>> > In case you do not care about individual packets, but still want more
>> > fine-grained statistics for your monitoring application, you can use
>> > eBPF. For example, one thing we did is attaching a kprobe to
>> > devlink_trap_report() with an eBPF program that dissects the incoming
>> > skbs and maintains a counter per-{5 tuple, drop reason}. With
>> > ebpf_exporter you can export these statistics to Prometheus on which you
>> > can run queries and visualize the results with Grafana. This is
>> > especially useful for tail and early drops since it allows you to
>> > understand which flows contribute to most of the drops.
>> 
>> No question that the mechanism is useful.
>> 
>> > > Users who want to know why things are dropped will not get detailed
>> > > breakdown from ethtool -S which for better or worse is the one stop
>> > > shop for device stats today.  
>> > 
>> > I hope I managed to explain why counters are not enough, but I also want
>> > to point out that ethtool statistics are not properly documented and
>> > this hinders their effectiveness. I did my best to document the exposed
>> > traps in order to avoid the same fate:
>> > 
>> > https://patchwork.ozlabs.org/patch/1128585/
>> > 
>> > In addition, there are selftests to show how each trap can be triggered
>> > to reduce the ambiguity even further:
>> > 
>> > https://patchwork.ozlabs.org/patch/1128610/
>> > 
>> > And a note in the documentation to make sure future functionality is
>> > tested as well:
>> > 
>> > https://patchwork.ozlabs.org/patch/1128608/
>> > 
>> > > Having thought about it some more, however, I think that having a
>> > > forwarding "exception" object and hanging statistics off of it is a
>> > > better design, even if we need to deal with some duplication to get
>> > > there.
>> > > 
>> > > IOW having an way to "trap all packets which would increment a
>> > > statistic" (option (b) above) is probably a bad design.
>> > > 
>> > > As for (a) I wonder how many of those events have a corresponding event
>> > > in the kernel stack?  
>> > 
>> > Generic packet drops all have a corresponding kfree_skb() calls in the
>> > kernel, but that does not mean that every packet dropped by the hardware
>> > would also be dropped by the kernel if it were to be injected to its Rx
>> > path.
>> 
>> The notion that all SW events get captured by kfree_skb() would not be
>> correct.
>
> I meant that the generic drop reasons I'm exposing with this patchset
> all correspond to reasons for which the kernel would drop packets.
>
>> We have the kfree_skb(), and xdp_exception(), and drivers can
>> drop packets if various allocations fail.. the situation is already not
>> great.
>> 
>> I think that having a single useful place where users can look to see
>> all traffic exception events would go a long way. 
>
> I believe this was Dave's point as well. We have one tool to monitor
> kfree_skb() drops and with this patchset we will have another to monitor
> HW drops. As I mentioned in my previous reply, I will look into sending
> the events via drop_monitor by calling into it from devlink.
>
> I'm not involved with XDP (as you might have noticed), but I assume
> drop_monitor could be extended for this use case as well by doing
> register_trace_xdp_exception(). Then you could monitor SW, HW and XDP
> events using a single netlink channel, potentially split into different
> multicast groups to allow user space programs to receive only the events
> they care about.

Yes, having XDP hook into this would be good! This ties in nicely with
the need for better statistics for XDP (see
https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#consistency-for-statistics-with-xdp).

Unfortunately, it's not enough to just hook into the xdp_exception trace
point, as that is generally only triggered on XDP_ABORTED, not if the
XDP program just decides to drop the packet with XDP_DROP. So we may
need another mechanism to hook this if it's going to comprehensive; and
we'd probably want a way to do this that doesn't impose a performance
penalty when the drop monitor is not enabled...

-Toke

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: update BPF JIT S390 maintainers
From: Vasily Gorbik @ 2019-07-10 11:34 UTC (permalink / raw)
  To: David S. Miller, Alexei Starovoitov, Daniel Borkmann
  Cc: Heiko Carstens, Christian Borntraeger, Ilya Leoshkevich, netdev,
	bpf, linux-s390
In-Reply-To: <patch.git-d365382dfc69.your-ad-here.call-01562755343-ext-3127@work.hours>

On Wed, Jul 10, 2019 at 12:43:45PM +0200, Vasily Gorbik wrote:
> Ilya Leoshkevich is joining as s390 bpf maintainer. With his background
> as gcc developer he would be valuable for the team and community as a
> whole. Ilya, have fun!
> 
> Since there is now enough eyes on s390 bpf, relieve Christian Borntraeger,
> so that he could focus on his maintainer tasks for other components.
> 
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---
>  MAINTAINERS | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 558acf24ea1e..98e7411dfe56 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3066,9 +3066,9 @@ S:	Maintained
>  F:	arch/riscv/net/
>  
>  BPF JIT for S390
> +M:	Ilya Leoshkevich <iii@linux.ibm.com>
>  M:	Heiko Carstens <heiko.carstens@de.ibm.com>
>  M:	Vasily Gorbik <gor@linux.ibm.com>
> -M:	Christian Borntraeger <borntraeger@de.ibm.com>
>  L:	netdev@vger.kernel.org
>  L:	bpf@vger.kernel.org
>  S:	Maintained
> -- 
> 2.21.0

Dave, Alexei, Daniel,
would you take it via one of your trees? Or should I take it via s390?

Thanks,
Vasily


^ permalink raw reply

* Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths
From: Ido Schimmel @ 2019-07-10 11:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, netdev, jiri, mlxsw, dsahern, roopa, nikolay, andy,
	pablo, pieter.jansenvanvuuren, andrew, f.fainelli, vivien.didelot,
	idosch, Alexei Starovoitov
In-Reply-To: <20190709153430.5f0f5295@cakuba.netronome.com>

On Tue, Jul 09, 2019 at 03:34:30PM -0700, Jakub Kicinski wrote:
> On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
> > On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
> > > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:  
> > > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:  
> > > > > From: Ido Schimmel <idosch@idosch.org>
> > > > > Date: Sun,  7 Jul 2019 10:58:17 +0300
> > > > >     
> > > > > > Users have several ways to debug the kernel and understand why a packet
> > > > > > was dropped. For example, using "drop monitor" and "perf". Both
> > > > > > utilities trace kfree_skb(), which is the function called when a packet
> > > > > > is freed as part of a failure. The information provided by these tools
> > > > > > is invaluable when trying to understand the cause of a packet loss.
> > > > > > 
> > > > > > In recent years, large portions of the kernel data path were offloaded
> > > > > > to capable devices. Today, it is possible to perform L2 and L3
> > > > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > > > > Different TC classifiers and actions are also offloaded to capable
> > > > > > devices, at both ingress and egress.
> > > > > > 
> > > > > > However, when the data path is offloaded it is not possible to achieve
> > > > > > the same level of introspection as tools such "perf" and "drop monitor"
> > > > > > become irrelevant.
> > > > > > 
> > > > > > This patchset aims to solve this by allowing users to monitor packets
> > > > > > that the underlying device decided to drop along with relevant metadata
> > > > > > such as the drop reason and ingress port.    
> > > > > 
> > > > > We are now going to have 5 or so ways to capture packets passing through
> > > > > the system, this is nonsense.
> > > > > 
> > > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > > > > devlink thing.
> > > > > 
> > > > > This is insanity, too many ways to do the same thing and therefore the
> > > > > worst possible user experience.
> > > > > 
> > > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > > > > XDP perf events, and these taps there too.
> > > > > 
> > > > > I mean really, think about it from the average user's perspective.  To
> > > > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > > > > listen on devlink but configure a special tap thing beforehand and then
> > > > > if someone is using XDP I gotta setup another perf event buffer capture
> > > > > thing too.    
> > > > 
> > > > Let me try to explain again because I probably wasn't clear enough. The
> > > > devlink-trap mechanism is not doing the same thing as other solutions.
> > > > 
> > > > The packets we are capturing in this patchset are packets that the
> > > > kernel (the CPU) never saw up until now - they were silently dropped by
> > > > the underlying device performing the packet forwarding instead of the
> > > > CPU.  
> > 
> > Jakub,
> > 
> > It seems to me that most of the criticism is about consolidation of
> > interfaces because you believe I'm doing something you can already do
> > today, but this is not the case.
> 
> To be clear I'm not opposed to the patches, I'm just trying to
> facilitate a discussion.

Sure, sorry if it came out the wrong way. I appreciate your feedback and
the time you have spent on this subject.

> 
> > Switch ASICs have dedicated traps for specific packets. Usually, these
> > packets are control packets (e.g., ARP, BGP) which are required for the
> > correct functioning of the control plane. You can see this in the SAI
> > interface, which is an abstraction layer over vendors' SDKs:
> > 
> > https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
> > 
> > We need to be able to configure the hardware policers of these traps and
> > read their statistics to understand how many packets they dropped. We
> > currently do not have a way to do any of that and we rely on hardcoded
> > defaults in the driver which do not fit every use case (from
> > experience):
> > 
> > https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
> > 
> > We plan to extend devlink-trap mechanism to cover all these use cases. I
> > hope you agree that this functionality belongs in devlink given it is a
> > device-specific configuration and not a netdev-specific one.
> 
> No disagreement on providing knobs for traps.
> 
> > That being said, in its current form, this mechanism is focused on traps
> > that correlate to packets the device decided to drop as this is very
> > useful for debugging.
> 
> That'd be mixing two things - trap configuration and tracing exceptions
> in one API. That's a little suboptimal but not too terrible, especially
> if there is a higher level APIs users can default to.

TBH, initially I was only focused on the drops, but then it occurred to
me that this is a too narrow scope. These traps are only a subset of the
complete list of traps we have and we have similar requirements for both
(statistics, setting policers etc.). Therefore, I decided to design this
interface in a more generic way, so that it could support the different
use cases.

> 
> > Given that the entire configuration is done via devlink and that devlink
> > stores all the information about these traps, it seems logical to also
> > report these packets and their metadata to user space as devlink events.
> > 
> > If this is not desirable, we can try to call into drop_monitor from
> > devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
> > encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
> > 
> > IMO, this is less desirable, as instead of having one tool (devlink) to
> > interact with this mechanism we will need two (devlink & dropwatch).
> > 
> > Below I tried to answer all your questions and refer to all the points
> > you brought up.
> > 
> > > When you say silently dropped do you mean that mlxsw as of today
> > > doesn't have any counters exposed for those events?  
> > 
> > Some of these packets are counted, but not all of them.
> > 
> > > If we wanted to consolidate this into something existing we can either
> > >  (a) add similar traps in the kernel data path;
> > >  (b) make these traps extension of statistics.
> > > 
> > > My knee jerk reaction to seeing the patches was that it adds a new
> > > place where device statistics are reported.  
> > 
> > Not at all. This would be a step back. We can already count discards due
> > to VLAN membership on ingress on a per-port basis. A software maintained
> > global counter does not buy us anything.
> > 
> > By also getting the dropped packet - coupled with the drop reason and
> > ingress port - you can understand exactly why and on which VLAN the
> > packet was dropped. I wrote a Wireshark dissector for these netlink
> > packets to make our life easier. You can see the details in my comment
> > to the cover letter:
> > 
> > https://marc.info/?l=linux-netdev&m=156248736710238&w=2
> > 
> > In case you do not care about individual packets, but still want more
> > fine-grained statistics for your monitoring application, you can use
> > eBPF. For example, one thing we did is attaching a kprobe to
> > devlink_trap_report() with an eBPF program that dissects the incoming
> > skbs and maintains a counter per-{5 tuple, drop reason}. With
> > ebpf_exporter you can export these statistics to Prometheus on which you
> > can run queries and visualize the results with Grafana. This is
> > especially useful for tail and early drops since it allows you to
> > understand which flows contribute to most of the drops.
> 
> No question that the mechanism is useful.
> 
> > > Users who want to know why things are dropped will not get detailed
> > > breakdown from ethtool -S which for better or worse is the one stop
> > > shop for device stats today.  
> > 
> > I hope I managed to explain why counters are not enough, but I also want
> > to point out that ethtool statistics are not properly documented and
> > this hinders their effectiveness. I did my best to document the exposed
> > traps in order to avoid the same fate:
> > 
> > https://patchwork.ozlabs.org/patch/1128585/
> > 
> > In addition, there are selftests to show how each trap can be triggered
> > to reduce the ambiguity even further:
> > 
> > https://patchwork.ozlabs.org/patch/1128610/
> > 
> > And a note in the documentation to make sure future functionality is
> > tested as well:
> > 
> > https://patchwork.ozlabs.org/patch/1128608/
> > 
> > > Having thought about it some more, however, I think that having a
> > > forwarding "exception" object and hanging statistics off of it is a
> > > better design, even if we need to deal with some duplication to get
> > > there.
> > > 
> > > IOW having an way to "trap all packets which would increment a
> > > statistic" (option (b) above) is probably a bad design.
> > > 
> > > As for (a) I wonder how many of those events have a corresponding event
> > > in the kernel stack?  
> > 
> > Generic packet drops all have a corresponding kfree_skb() calls in the
> > kernel, but that does not mean that every packet dropped by the hardware
> > would also be dropped by the kernel if it were to be injected to its Rx
> > path.
> 
> The notion that all SW events get captured by kfree_skb() would not be
> correct.

I meant that the generic drop reasons I'm exposing with this patchset
all correspond to reasons for which the kernel would drop packets.

> We have the kfree_skb(), and xdp_exception(), and drivers can
> drop packets if various allocations fail.. the situation is already not
> great.
> 
> I think that having a single useful place where users can look to see
> all traffic exception events would go a long way. 

I believe this was Dave's point as well. We have one tool to monitor
kfree_skb() drops and with this patchset we will have another to monitor
HW drops. As I mentioned in my previous reply, I will look into sending
the events via drop_monitor by calling into it from devlink.

I'm not involved with XDP (as you might have noticed), but I assume
drop_monitor could be extended for this use case as well by doing
register_trace_xdp_exception(). Then you could monitor SW, HW and XDP
events using a single netlink channel, potentially split into different
multicast groups to allow user space programs to receive only the events
they care about.

> Software side as I mentioned is pretty brutal, IDK how many users are
> actually willing to decode stack traces to figure out why their system
> is dropping packets :/
> 
> > In my reply to Dave I gave buffer drops as an example.
> 
> The example of buffer drops is also probably the case where having the
> packet is least useful, but yes, I definitely agree devices need a way
> of reporting events that can't happen in SW.
> 
> > There are also situations in which packets can be dropped due to
> > device-specific exceptions and these do not have a corresponding drop
> > reason in the kernel. See example here:
> > 
> > https://patchwork.ozlabs.org/patch/1128587/
> > 
> > > If we could add corresponding trace points and just feed those from
> > > the device driver, that'd obviously be a holy grail.  
> > 
> > Unlike tracepoints, netlink gives you a structured and extensible
> > interface. For example, in Spectrum-1 we cannot provide the Tx port for
> > early/tail drops, whereas for Spectrum-2 and later we can. With netlink,
> > we can just omit the DEVLINK_ATTR_TRAP_OUT_PORT attribute for
> > Spectrum-1. You also get a programmatic interface that you can query for
> > this information:
> > 
> > # devlink -v trap show netdevsim/netdevsim10 trap ingress_vlan_filter
> > netdevsim/netdevsim10:
> >   name ingress_vlan_filter type drop generic true report false action drop group l2_drops
> >     metadata:
> >        input_port
> 
> Right, you can set or not set skb fields to some extent but its
> definitely not as flexible as netlink.

^ permalink raw reply

* Re: [PATCH bpf-next] net: Don't uninstall an XDP program when none is installed
From: Maxim Mikityanskiy @ 2019-07-10 11:16 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller
  Cc: Daniel Borkmann, Björn Töpel, bpf@vger.kernel.org,
	netdev@vger.kernel.org, Saeed Mahameed, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song
In-Reply-To: <20190612161405.24064-1-maximmi@mellanox.com>

On 2019-06-12 19:14, Maxim Mikityanskiy wrote:
> dev_change_xdp_fd doesn't perform any checks in case it uninstalls an
> XDP program. It means that the driver's ndo_bpf can be called with
> XDP_SETUP_PROG asking to set it to NULL even if it's already NULL. This
> case happens if the user runs `ip link set eth0 xdp off` when there is
> no XDP program attached.
> 
> The drivers typically perform some heavy operations on XDP_SETUP_PROG,
> so they all have to handle this case internally to return early if it
> happens. This patch puts this check into the kernel code, so that all
> drivers will benefit from it.
> 
> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> ---
> Björn, please take a look at this, Saeed told me you were doing
> something related, but I couldn't find it. If this fix is already
> covered by your work, please tell about that.
> 
>   net/core/dev.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 66f7508825bd..68b3e3320ceb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -8089,6 +8089,9 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
>   			bpf_prog_put(prog);
>   			return -EINVAL;
>   		}
> +	} else {
> +		if (!__dev_xdp_query(dev, bpf_op, query))
> +			return 0;
>   	}
>   
>   	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
> 

Alexei, so what about this patch? It's marked as "Changed Requested" in 
patchwork, but Jakub's point looks resolved - I don't see any changes 
required from my side.

^ permalink raw reply

* [PATCH iproute2 master 1/3] devlink: Change devlink health dump show command to dumpit
From: Tariq Toukan @ 2019-07-10 11:03 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, moshe, ayal, Tariq Toukan
In-Reply-To: <1562756601-19171-1-git-send-email-tariqt@mellanox.com>

From: Aya Levin <ayal@mellanox.com>

Although devlink health dump show command is given per reporter, it
returns large amounts of data. Trying to use the doit cb results in
OUT-OF-BUFFER error. This complementary patch raises the DUMP flag in
order to invoke the dumpit cb. We're safe as no existing drivers
implement the dump health reporter option yet.

Fixes: 041e6e651a8e ("devlink: Add devlink health dump show command")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 devlink/devlink.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index ac8c0fb149b6..e3e1e27ab312 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -6078,13 +6078,13 @@ static int cmd_fmsg_object_cb(const struct nlmsghdr *nlh, void *data)
 	return MNL_CB_OK;
 }
 
-static int cmd_health_object_common(struct dl *dl, uint8_t cmd)
+static int cmd_health_object_common(struct dl *dl, uint8_t cmd, uint16_t flags)
 {
 	struct fmsg_cb_data data;
 	struct nlmsghdr *nlh;
 	int err;
 
-	nlh = mnlg_msg_prepare(dl->nlg, cmd,  NLM_F_REQUEST | NLM_F_ACK);
+	nlh = mnlg_msg_prepare(dl->nlg, cmd, flags | NLM_F_REQUEST | NLM_F_ACK);
 
 	err = dl_argv_parse_put(nlh, dl,
 				DL_OPT_HANDLE | DL_OPT_HEALTH_REPORTER_NAME, 0);
@@ -6099,12 +6099,16 @@ static int cmd_health_object_common(struct dl *dl, uint8_t cmd)
 
 static int cmd_health_dump_show(struct dl *dl)
 {
-	return cmd_health_object_common(dl, DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET);
+	return cmd_health_object_common(dl,
+					DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET,
+					NLM_F_DUMP);
 }
 
 static int cmd_health_diagnose(struct dl *dl)
 {
-	return cmd_health_object_common(dl, DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE);
+	return cmd_health_object_common(dl,
+					DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE,
+					0);
 }
 
 static int cmd_health_recover(struct dl *dl)
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH iproute2 master 2/3] devlink: Fix binary values print
From: Tariq Toukan @ 2019-07-10 11:03 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, moshe, ayal, Tariq Toukan
In-Reply-To: <1562756601-19171-1-git-send-email-tariqt@mellanox.com>

From: Aya Levin <ayal@mellanox.com>

Fix function pr_out_binary_value() to start printing the binary buffer
from offset 0 instead of offset 1. Remove redundant new line at the
beginning of the output

Example:
With patch:
 mlx5e_txqsq:
   05 00 00 00 05 00 00 00 01 00 00 00 00 00 00 00
   00 00 00 00 00 00 00 00 8e 6e 3a 13 07 00 00 00
   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   c0
Without patch
  mlx5e_txqsq:

  00 00 00 05 00 00 00 01 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 8e 6e 3a 13 07 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0

Fixes: 844a61764c6f ("devlink: Add helper functions for name and value separately")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 devlink/devlink.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index e3e1e27ab312..4bced4e60ae8 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -1779,29 +1779,31 @@ static void pr_out_uint64_value(struct dl *dl, uint64_t value)
 		pr_out(" %"PRIu64, value);
 }
 
+static bool is_binary_eol(int i)
+{
+	return !(i%16);
+}
+
 static void pr_out_binary_value(struct dl *dl, uint8_t *data, uint32_t len)
 {
-	int i = 1;
+	int i = 0;
 
 	if (dl->json_output)
 		jsonw_start_array(dl->jw);
-	else
-		pr_out("\n");
 
 	while (i < len) {
-		if (dl->json_output) {
+		if (dl->json_output)
 			jsonw_printf(dl->jw, "%d", data[i]);
-		} else {
-			pr_out(" %02x", data[i]);
-			if (!(i % 16))
-				pr_out("\n");
-		}
+		else
+			pr_out("%02x ", data[i]);
 		i++;
+		if (!dl->json_output && is_binary_eol(i))
+			__pr_out_newline();
 	}
 	if (dl->json_output)
 		jsonw_end_array(dl->jw);
-	else if ((i - 1) % 16)
-		pr_out("\n");
+	else if (!is_binary_eol(i))
+		__pr_out_newline();
 }
 
 static void pr_out_str_value(struct dl *dl, const char *value)
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH iproute2 master 0/3] devlink dumpit fixes
From: Tariq Toukan @ 2019-07-10 11:03 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, moshe, ayal, Tariq Toukan

Hi,

This series from Aya contains several fixes for devlink health
dump show command with binary data.

In patch 1 we replace the usage of doit with a dumpit, which
is non-blocking and allows transferring larger amount of data.

Patches 2 and 3 fix the output for binary data prints, for both
json and non-json.

Series generated against master commit:
2eb23f3e7aaf devlink: Show devlink port number

Regards,
Tariq

Aya Levin (3):
  devlink: Change devlink health dump show command to dumpit
  devlink: Fix binary values print
  devlink: Remove enclosing array brackets binary print with json format

 devlink/devlink.c | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

-- 
1.8.3.1


^ permalink raw reply

* [PATCH iproute2 master 3/3] devlink: Remove enclosing array brackets binary print with json format
From: Tariq Toukan @ 2019-07-10 11:03 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, moshe, ayal, Tariq Toukan
In-Reply-To: <1562756601-19171-1-git-send-email-tariqt@mellanox.com>

From: Aya Levin <ayal@mellanox.com>

Keep pr_out_binary_value function only for printing. Inner relations
like array grouping should be done outside the function.

Fixes: 844a61764c6f ("devlink: Add helper functions for name and value separately")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 devlink/devlink.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index 4bced4e60ae8..7532c3f888f9 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -1788,9 +1788,6 @@ static void pr_out_binary_value(struct dl *dl, uint8_t *data, uint32_t len)
 {
 	int i = 0;
 
-	if (dl->json_output)
-		jsonw_start_array(dl->jw);
-
 	while (i < len) {
 		if (dl->json_output)
 			jsonw_printf(dl->jw, "%d", data[i]);
@@ -1800,9 +1797,7 @@ static void pr_out_binary_value(struct dl *dl, uint8_t *data, uint32_t len)
 		if (!dl->json_output && is_binary_eol(i))
 			__pr_out_newline();
 	}
-	if (dl->json_output)
-		jsonw_end_array(dl->jw);
-	else if (!is_binary_eol(i))
+	if (!dl->json_output && !is_binary_eol(i))
 		__pr_out_newline();
 }
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH] MAINTAINERS: update BPF JIT S390 maintainers
From: Ilya Leoshkevich @ 2019-07-10 10:50 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: Alexei Starovoitov, Daniel Borkmann, Heiko Carstens,
	Christian Borntraeger, David S. Miller, netdev, bpf, linux-s390
In-Reply-To: <patch.git-d365382dfc69.your-ad-here.call-01562755343-ext-3127@work.hours>

> Am 10.07.2019 um 12:43 schrieb Vasily Gorbik <gor@linux.ibm.com>:
> 
> Ilya Leoshkevich is joining as s390 bpf maintainer. With his background
> as gcc developer he would be valuable for the team and community as a
> whole. Ilya, have fun!
> 
> Since there is now enough eyes on s390 bpf, relieve Christian Borntraeger,
> so that he could focus on his maintainer tasks for other components.
> 
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---
> MAINTAINERS | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 558acf24ea1e..98e7411dfe56 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3066,9 +3066,9 @@ S:	Maintained
> F:	arch/riscv/net/
> 
> BPF JIT for S390
> +M:	Ilya Leoshkevich <iii@linux.ibm.com>
> M:	Heiko Carstens <heiko.carstens@de.ibm.com>
> M:	Vasily Gorbik <gor@linux.ibm.com>
> -M:	Christian Borntraeger <borntraeger@de.ibm.com>
> L:	netdev@vger.kernel.org
> L:	bpf@vger.kernel.org
> S:	Maintained
> -- 
> 2.21.0

Thanks, Vasily!

Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>


^ permalink raw reply

* Re: [PATCH] MAINTAINERS: update BPF JIT S390 maintainers
From: Christian Borntraeger @ 2019-07-10 10:44 UTC (permalink / raw)
  To: Vasily Gorbik, Alexei Starovoitov, Daniel Borkmann
  Cc: Heiko Carstens, Ilya Leoshkevich, David S. Miller, netdev, bpf,
	linux-s390
In-Reply-To: <patch.git-d365382dfc69.your-ad-here.call-01562755343-ext-3127@work.hours>



On 10.07.19 12:43, Vasily Gorbik wrote:
> Ilya Leoshkevich is joining as s390 bpf maintainer. With his background
> as gcc developer he would be valuable for the team and community as a
> whole. Ilya, have fun!

> 
> Since there is now enough eyes on s390 bpf, relieve Christian Borntraeger,
> so that he could focus on his maintainer tasks for other components.
> 
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>

Acked-by: Christian Borntraeger <borntraeger@e.ibm.com>

> ---
>  MAINTAINERS | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 558acf24ea1e..98e7411dfe56 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3066,9 +3066,9 @@ S:	Maintained
>  F:	arch/riscv/net/
>  
>  BPF JIT for S390
> +M:	Ilya Leoshkevich <iii@linux.ibm.com>
>  M:	Heiko Carstens <heiko.carstens@de.ibm.com>
>  M:	Vasily Gorbik <gor@linux.ibm.com>
> -M:	Christian Borntraeger <borntraeger@de.ibm.com>
>  L:	netdev@vger.kernel.org
>  L:	bpf@vger.kernel.org
>  S:	Maintained
> 


^ permalink raw reply

* [PATCH] MAINTAINERS: update BPF JIT S390 maintainers
From: Vasily Gorbik @ 2019-07-10 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Heiko Carstens, Christian Borntraeger, Ilya Leoshkevich,
	David S. Miller, netdev, bpf, linux-s390

Ilya Leoshkevich is joining as s390 bpf maintainer. With his background
as gcc developer he would be valuable for the team and community as a
whole. Ilya, have fun!

Since there is now enough eyes on s390 bpf, relieve Christian Borntraeger,
so that he could focus on his maintainer tasks for other components.

Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 558acf24ea1e..98e7411dfe56 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3066,9 +3066,9 @@ S:	Maintained
 F:	arch/riscv/net/
 
 BPF JIT for S390
+M:	Ilya Leoshkevich <iii@linux.ibm.com>
 M:	Heiko Carstens <heiko.carstens@de.ibm.com>
 M:	Vasily Gorbik <gor@linux.ibm.com>
-M:	Christian Borntraeger <borntraeger@de.ibm.com>
 L:	netdev@vger.kernel.org
 L:	bpf@vger.kernel.org
 S:	Maintained
-- 
2.21.0


^ permalink raw reply related

* RE: [PATCH 08/12] net: stmmac: Fix misuses of GENMASK macro
From: Jose Abreu @ 2019-07-10 10:33 UTC (permalink / raw)
  To: Joe Perches, Andrew Morton, Giuseppe Cavallaro, Alexandre Torgue,
	Maxime Coquelin, Maxime Ripard, Chen-Yu Tsai
  Cc: David S. Miller, netdev@vger.kernel.org,
	linux-stm32@st-md-mailman.stormreply.com,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <b38b0b67e724cd026709194b68c2be5ee1058c57.1562734889.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Jul/10/2019, 06:04:21 (UTC+00:00)

> Arguments are supposed to be ordered high then low.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

If you submit another version please add:

Fixes: 293e4365a1ad ("stmmac: change descriptor layout")
Fixes: 9f93ac8d4085 ("net-next: stmmac: Add dwmac-sun8i")

---
Thanks,
Jose Miguel Abreu

^ permalink raw reply

* Re: [PATCH iproute2 1/2] Revert "ip6tunnel: fix 'ip -6 {show|change} dev <name>' cmds"
From: Andrea Claudi @ 2019-07-10 10:11 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश बंडेवार)
  Cc: linux-netdev, Stephen Hemminger, David Ahern
In-Reply-To: <CAF2d9jiGnR-A-A-mEv-84Mu6xfwFNvWt5jp+iiBhMGNPMkaDZg@mail.gmail.com>

On Tue, Jul 9, 2019 at 7:31 PM Mahesh Bandewar (महेश बंडेवार)
<maheshb@google.com> wrote:
>
> On Tue, Jul 9, 2019 at 6:16 AM Andrea Claudi <aclaudi@redhat.com> wrote:
> >
> > This reverts commit ba126dcad20e6d0e472586541d78bdd1ac4f1123.
> > It breaks tunnel creation when using 'dev' parameter:
> >
> > $ ip link add type dummy
> > $ ip -6 tunnel add ip6tnl1 mode ip6ip6 remote 2001:db8:ffff:100::2 local 2001:db8:ffff:100::1 hoplimit 1 tclass 0x0 dev dummy0
> > add tunnel "ip6tnl0" failed: File exists
> >
> > dev parameter must be used to specify the device to which
> > the tunnel is binded, and not the tunnel itself.
> >
> > Reported-by: Jianwen Ji <jiji@redhat.com>
> > Reviewed-by: Matteo Croce <mcroce@redhat.com>
> > Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
> > ---
> >  ip/ip6tunnel.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
> > diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
> > index 56fd3466ed062..999408ed801b1 100644
> > --- a/ip/ip6tunnel.c
> > +++ b/ip/ip6tunnel.c
> > @@ -298,8 +298,6 @@ static int parse_args(int argc, char **argv, int cmd, struct ip6_tnl_parm2 *p)
> >                 p->link = ll_name_to_index(medium);
> >                 if (!p->link)
> >                         return nodev(medium);
> > -               else
> > -                       strlcpy(p->name, medium, sizeof(p->name));
> NACK
>
> I see that ba126dcad20e6d0e472586541d78bdd1ac4f1123 has broke
> something but that doesn't mean revert of the original fix is correct
> way of fixing it. The original fix is fixing the show and change
> command. Shouldn't you try fixing the add command so that all (show,
> change, and add) work correctly?
>

Hi Mahesh,
Thanks for sharing your opinion. I think I already answered this
replying to your review of patch 2/2 of this series.
To summarize that up I think there is a misunderstanding on the
meaning of "dev" param, and "name" param (which is the default) must
be used instead in the cases highlighted in your commit.

Regards,
Andrea

> >         }
> >         return 0;
> >  }
> > --
> > 2.20.1
> >

^ permalink raw reply

* Re: [RESEND] test_verifier #13 fails on arm64: "retval 65507 != -29"
From: Paolo Pisati @ 2019-07-10 10:02 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Paolo Pisati, Daniel Borkmann, Alexei Starovoitov, Martin Lau,
	Song Liu, bpf@vger.kernel.org, netdev
In-Reply-To: <68248069-bcf6-69dd-b0a9-f4ec11e50092@fb.com>

On Mon, Jul 01, 2019 at 09:51:25PM +0000, Yonghong Song wrote:
> 
> Below is the test case.
> {
>          "valid read map access into a read-only array 2",
>          .insns = {
>          BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
>          BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
>          BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
>          BPF_LD_MAP_FD(BPF_REG_1, 0),
>          BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
> BPF_FUNC_map_lookup_elem),
>          BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
> 
>          BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
>          BPF_MOV64_IMM(BPF_REG_2, 4),
>          BPF_MOV64_IMM(BPF_REG_3, 0),
>          BPF_MOV64_IMM(BPF_REG_4, 0),
>          BPF_MOV64_IMM(BPF_REG_5, 0),
>          BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
>                       BPF_FUNC_csum_diff),
>          BPF_EXIT_INSN(),
>          },
>          .prog_type = BPF_PROG_TYPE_SCHED_CLS,
>          .fixup_map_array_ro = { 3 },
>          .result = ACCEPT,
>          .retval = -29,
> },
> 
> The issue may be with helper bpf_csum_diff().
> Maybe you can check bpf_csum_diff() helper return value
> to confirm and take a further look at bpf_csum_diff implementations
> between x64 and amd64.

Indeed, the different result comes from csum_partial() or, more precisely,
do_csum().

x86-64 uses an asm optimized version residing in arch/x86/lib/csum-partial_64.c,
while the generic version is in lib/checksum.c.

I replaced the x86-64 csum_partial() / do_csum() code, with the one in
lib/checksum.c and by doing so i reproduced the same error on x86-64 (thus, it's
not an arch dependent issue).

I added some debugging to bpf_csum_diff(), and here are the results with different
checksum implementation code:

http://paste.debian.net/1091037/

lib/checksum.c:
...
[  206.084537] ____bpf_csum_diff from_size: 1 to_size: 0
[  206.085274] ____bpf_csum_diff from[0]: 28
[  206.085276] ____bpf_csum_diff diff[0]: 4294967267
[  206.085277] ____bpf_csum_diff diff_size: 4 seed: 0

After csum_partial() call:

[  206.086059] ____bpf_csum_diff csum: 65507 - 0xffe3

arch/x86/lib/csum-partial_64.c
...
[   40.467308] ____bpf_csum_diff from_size: 1 to_size: 0
[   40.468141] ____bpf_csum_diff from[0]: 28
[   40.468143] ____bpf_csum_diff diff[0]: 4294967267
[   40.468144] ____bpf_csum_diff diff_size: 4 seed: 0

After csum_partial() call:

[   40.468937] ____bpf_csum_diff csum: -29 - 0xffffffe3

One thing that i noticed, x86-64 csum-partial_64.c::do_csum() doesn't reduce the
calculated checksum to 16bit before returning it (unless the input value is
odd - *):

arch/x86/lib/csum-partial_64.c::do_csum()
		...
        if (unlikely(odd)) { 
                result = from32to16(result);
                result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
        }
        return result;
}

contrary to all the other do_csum() implementations (that i could understand):

lib/checksum.c::do_csum()
arch/alpha/lib/checksum.c::do_csum()
arch/parisc/lib/checksum.c::do_csum()

Apparently even ia64 does the folding (arch/ia64/lib/do_csum.S see a comment right
before .do_csum_exit:), and arch/c6x/lib/csum_64plus.S too (see
arch/c6x/lib/csum_64plus.S).

Funnily enough, if i change do_csum() for x86-64, folding the
checksum to 16 bit (following all the other implementations):

--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -112,8 +112,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned
len)
        if (len & 1)
                result += *buff;
        result = add32_with_carry(result>>32, result & 0xffffffff); 
+       result = from32to16(result);
        if (unlikely(odd)) { 
-               result = from32to16(result);
                result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
        }
        return result;

then, the x86-64 result match the others: 65507 or 0xffe3.

As a last attempt, i tried running the bpf test_verifier on an armhf platform,
and i got a completely different number:

[   57.667999] ____bpf_csum_diff from_size: 1 to_size: 0
[   57.668016] ____bpf_csum_diff from[0]: 28
[   57.668028] ____bpf_csum_diff diff[0]: 4294967267
[   57.668039] ____bpf_csum_diff diff_size: 4 seed: 0

After csum_partial() call:

[   57.668052] ____bpf_csum_diff::2002 csum: 131042 - 0x0001ffe2

Not sure what to make of these number, but i have a question: whats is the
correct checksum of the memory chunk passed to csum_partial()? Is it really -29?

Because, at least 2 other implementations i tested (the arm assembly code and
the c implementation in lib/checksum.c) computes a different value, so either
there's a bug in checksum calcution (2 out of 3???), or we are interpreting the
returned value from csum_partial() somehow wrongly.

*: originally, the x86-64 did the 16bit folding, but the logic was changed to
what we have today during a big rewrite - search for:

commit 3ef076bb685a461bbaff37a1f06010fc4d7ce733
Author: Andi Kleen <ak@suse.de>
Date:   Fri Jun 13 04:27:34 2003 -0700

    [PATCH] x86-64 merge

in this historic repo:

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
-- 
bye,
p.

^ permalink raw reply

* Re: [PATCH 00/12] treewide: Fix GENMASK misuses
From: Russell King - ARM Linux admin @ 2019-07-10  9:43 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Joe Perches, Andrew Morton, Patrick Venture, Nancy Yuen,
	Benjamin Fair, Andrew Jeffery, openbmc, linux-kernel,
	linux-aspeed, linux-arm-kernel, linux-amlogic, netdev,
	linux-mediatek, linux-stm32, linux-wireless, linux-media,
	linux-iio, devel, alsa-devel, linux-mmc, dri-devel
In-Reply-To: <5fa1fa6998332642c49e2d5209193ffe2713f333.camel@sipsolutions.net>

On Wed, Jul 10, 2019 at 11:17:31AM +0200, Johannes Berg wrote:
> On Tue, 2019-07-09 at 22:04 -0700, Joe Perches wrote:
> > These GENMASK uses are inverted argument order and the
> > actual masks produced are incorrect.  Fix them.
> > 
> > Add checkpatch tests to help avoid more misuses too.
> > 
> > Joe Perches (12):
> >   checkpatch: Add GENMASK tests
> 
> IMHO this doesn't make a lot of sense as a checkpatch test - just throw
> in a BUILD_BUG_ON()?

My personal take on this is that GENMASK() is really not useful, it's
just pure obfuscation and leads to exactly these kinds of mistakes.

Yes, I fully understand the argument that you can just specify the
start and end bits, and it _in theory_ makes the code more readable.

However, the problem is when writing code.  GENMASK(a, b).  Is a the
starting bit or ending bit?  Is b the number of bits?  It's confusing
and causes mistakes resulting in incorrect code.  A BUILD_BUG_ON()
can catch some of the cases, but not all of them.

For example:

	GENMASK(6, 2)

would satisify the requirement that a > b, so a BUILD_BUG_ON() will
not trigger, but was the author meaning 0x3c or 0xc0?

Personally, I've decided I am _not_ going to use GENMASK() in my code
because I struggle to get the macro arguments correct - I'm _much_
happier, and it is way more reliable for me to write the mask in hex
notation.

I think this is where use of a ternary operator would come in use.  The
normal way of writing a number of bits tends to be "a:b", so if GENMASK
took something like GENMASK(6:2), then I'd have less issue with it,
because it's argument is then in a familiar notation.

Yes, I'm sure that someone will point out that the GENMASK arguments
are just in the same order, but that doesn't prevent _me_ frequently
getting it wrong - and that's the point.  The macro seems to me to
cause more problems than it solves.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox