Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH mlx5-next 0/5] Improve mlx5 live migration driver
From: Yishai Hadas @ 2022-04-27  9:31 UTC (permalink / raw)
  To: alex.williamson, jgg, saeedm
  Cc: kvm, netdev, kuba, leonro, yishaih, maorg, cohuck

This series improves mlx5 live migration driver in few aspects as of
below.

Refactor to enable running migration commands in parallel over the PF
command interface.

To achieve that we exposed from mlx5_core an API to let the VF be
notified before that the PF command interface goes down/up. (e.g. PF
reload upon health recovery).

Once having the above functionality in place mlx5 vfio doesn't need any
more to obtain the global PF lock upon using the command interface but
can rely on the above mechanism to be in sync with the PF.

This can enable parallel VFs migration over the PF command interface
from kernel driver point of view.

In addition,
Moved to use the PF async command mode for the SAVE state command.
This enables returning earlier to user space upon issuing successfully
the command and improve latency by let things run in parallel.

Alex, as this series touches mlx5_core we may need to send this in a
pull request format to VFIO to avoid conflicts before acceptance.

Yishai

Yishai Hadas (5):
  vfio/mlx5: Reorganize the VF is migratable code
  net/mlx5: Expose mlx5_sriov_blocking_notifier_register /  unregister
    APIs
  vfio/mlx5: Manage the VF attach/detach callback from the PF
  vfio/mlx5: Refactor to enable VFs migration in parallel
  vfio/mlx5: Run the SAVE state command in an async mode

 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  65 ++++-
 drivers/vfio/pci/mlx5/cmd.c                   | 229 +++++++++++++-----
 drivers/vfio/pci/mlx5/cmd.h                   |  50 +++-
 drivers/vfio/pci/mlx5/main.c                  | 133 +++++-----
 include/linux/mlx5/driver.h                   |  12 +
 5 files changed, 358 insertions(+), 131 deletions(-)

-- 
2.18.1

^ permalink raw reply

* Re: [PATCH perf/core 1/5] libbpf: Add bpf_program__set_insns function
From: Jiri Olsa @ 2022-04-27  8:42 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Daniel Borkmann, Jiri Olsa, Arnaldo Carvalho de Melo,
	Alexei Starovoitov, Andrii Nakryiko, linux-perf-use., Networking,
	bpf, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Peter Zijlstra, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, Ian Rogers
In-Reply-To: <CAEf4Bza42-aN7dZAWsH1H5KNMhSZh6nUj0WQ5MkOkNjBq2At_A@mail.gmail.com>

On Tue, Apr 26, 2022 at 08:58:12AM -0700, Andrii Nakryiko wrote:
> On Mon, Apr 25, 2022 at 11:57 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Apr 25, 2022 at 11:19:09PM -0700, Andrii Nakryiko wrote:
> > > On Mon, Apr 25, 2022 at 9:22 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > >
> > > > On 4/22/22 12:00 PM, Jiri Olsa wrote:
> > > > > Adding bpf_program__set_insns that allows to set new
> > > > > instructions for program.
> > > > >
> > > > > Also moving bpf_program__attach_kprobe_multi_opts on
> > > > > the proper name sorted place in map file.
> > >
> > > would make sense to fix it as a separate patch, it has nothing to do
> > > with bpf_program__set_insns() API itself
> >
> > np
> >
> > >
> > > > >
> > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > > ---
> > > > >   tools/lib/bpf/libbpf.c   |  8 ++++++++
> > > > >   tools/lib/bpf/libbpf.h   | 12 ++++++++++++
> > > > >   tools/lib/bpf/libbpf.map |  3 ++-
> > > > >   3 files changed, 22 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > > > > index 809fe209cdcc..284790d81c1b 100644
> > > > > --- a/tools/lib/bpf/libbpf.c
> > > > > +++ b/tools/lib/bpf/libbpf.c
> > > > > @@ -8457,6 +8457,14 @@ size_t bpf_program__insn_cnt(const struct bpf_program *prog)
> > > > >       return prog->insns_cnt;
> > > > >   }
> > > > >
> > > > > +void bpf_program__set_insns(struct bpf_program *prog,
> > > > > +                         struct bpf_insn *insns, size_t insns_cnt)
> > > > > +{
> > > > > +     free(prog->insns);
> > > > > +     prog->insns = insns;
> > > > > +     prog->insns_cnt = insns_cnt;
> > >
> > > let's not store user-provided pointer here. Please realloc prog->insns
> > > as necessary and copy over insns into it.
> > >
> > > Also let's at least add the check for prog->loaded and return -EBUSY
> > > in such a case. And of course this API should return int, not void.
> >
> > ok, will change
> >
> > >
> > > > > +}
> > > > > +
> > > > >   int bpf_program__set_prep(struct bpf_program *prog, int nr_instances,
> > > > >                         bpf_program_prep_t prep)
> > > > >   {
> > > > > diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> > > > > index 05dde85e19a6..b31ad58d335f 100644
> > > > > --- a/tools/lib/bpf/libbpf.h
> > > > > +++ b/tools/lib/bpf/libbpf.h
> > > > > @@ -323,6 +323,18 @@ struct bpf_insn;
> > > > >    * different.
> > > > >    */
> > > > >   LIBBPF_API const struct bpf_insn *bpf_program__insns(const struct bpf_program *prog);
> > > > > +
> > > > > +/**
> > > > > + * @brief **bpf_program__set_insns()** can set BPF program's underlying
> > > > > + * BPF instructions.
> > > > > + * @param prog BPF program for which to return instructions
> > > > > + * @param insn a pointer to an array of BPF instructions
> > > > > + * @param insns_cnt number of `struct bpf_insn`'s that form
> > > > > + * specified BPF program
> > > > > + */
> > >
> > > This API makes me want to cry... but I can't come up with anything
> > > better for perf's use case.
> > >
> 
> So thinking about this some more. If we make libbpf not to close maps
> and prog FDs on BPF program load failure automatically and instead
> doing it in bpf_object__close(), which would seem to be a totally fine
> semantics and won't break any reasonable application as they always
> have to call bpf_object__close() anyways to clean up all the
> resources; we wouldn't need this horror of bpf_program__set_insns().
> Your BPF program would fail to load, but you'll get its fully prepared
> instructions with bpf_program__insns(), then you can just append
> correct preamble. Meanwhile, all the maps will be created (they are
> always created before the first program load), so all the FDs will be
> correct.
> 
> This is certainly advanced knowledge of libbpf behavior, but the use
> case you are trying to solve is also very unique and advanced (and I
> wouldn't recommend anyone trying to do this anyways). WDYT? Would that
> work?

hm, so verifier will fail after all maps are set up during the walk
of the program instructions.. I guess that could work, I'll give it
a try, should be easy change in libbpf (like below) and also in perf

but still the bpf_program__set_insns seems less horror to me

jirka


---
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index c8df74e5f658..1eb75d4231ff 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -7577,19 +7577,6 @@ static int bpf_object_load(struct bpf_object *obj, int extra_log_level, const ch
 	obj->btf_vmlinux = NULL;
 
 	obj->loaded = true; /* doesn't matter if successfully or not */
-
-	if (err)
-		goto out;
-
-	return 0;
-out:
-	/* unpin any maps that were auto-pinned during load */
-	for (i = 0; i < obj->nr_maps; i++)
-		if (obj->maps[i].pinned && !obj->maps[i].reused)
-			bpf_map__unpin(&obj->maps[i], NULL);
-
-	bpf_object_unload(obj);
-	pr_warn("failed to load object '%s'\n", obj->path);
 	return libbpf_err(err);
 }
 

^ permalink raw reply related

* Re: [PATCH v2 net-next 07/10] net: dsa: request drivers to perform FDB isolation
From: Hans Schultz @ 2022-04-27  8:38 UTC (permalink / raw)
  To: Vladimir Oltean, Andrew Lunn
  Cc: Hans Schultz, netdev@vger.kernel.org, Jakub Kicinski,
	David S. Miller, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	UNGLinuxDriver@microchip.com, Sean Wang, Landen Chao,
	DENG Qingfang, Claudiu Manoil, Alexandre Belloni, Linus Walleij,
	Alvin Šipraga, George McCollister
In-Reply-To: <20220426231755.7yhvabefzbyiaj4o@skbuf>

On tis, apr 26, 2022 at 23:17, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> On Tue, Apr 26, 2022 at 06:14:23PM +0200, Andrew Lunn wrote:
>> > > @@ -941,23 +965,29 @@ struct dsa_switch_ops {
>> > >  	 * Forwarding database
>> > >  	 */
>> > >  	int	(*port_fdb_add)(struct dsa_switch *ds, int port,
>> > > -				const unsigned char *addr, u16 vid);
>> > > +				const unsigned char *addr, u16 vid,
>> > > +				struct dsa_db db);
>> > 
>> > Hi! Wouldn't it be better to have a struct that has all the functions
>> > parameters in one instead of adding further parameters to these
>> > functions?
>> > 
>> > I am asking because I am also needing to add a parameter to
>> > port_fdb_add(), and it would be more future oriented to have a single
>> > function parameter as a struct, so that it is easier to add parameters
>> > to these functions without havíng to change the prototype of the
>> > function every time.
>> 
>> Hi Hans
>> 
>> Please trim the text to only what is relevant when replying. It is
>> easy to miss comments when having to Page Down, Page Down, Page down,
>> to find comments.
>> 
>>    Andrew
>
> Agreed, had to scroll too much.
>
> Hans, what extra argument do you want to add to port_fdb_add?
> A static/dynamic, I suppose, similar to what exists in port_fdb_dump?

Yes, a static/dynamic bool, or could be called a flag field.

>
> But we surely wouldn't pass _all_ parameters of port_fdb_add through
> some giant struct args_of_port_fdb_add, would we?  Not ds, port, db,
> just what is naturally grouped together as an FDB entry: addr, vid,
> maybe your new static/dynamic thing.
>
> If we group the addr and vid in port_fdb_add into a structure that
> represents an FDB entry, what do we do about those in port_fdb_del?
> Group those as well, maybe for consistency?

I think the 'old' interface that several other functions use should have
one struct... e.g. port, addr and vid. But somehow it would be good to
have something more dynamic. There could be two layer of structs, but
generally i think that for these op functions now in relation to fdb
should only have structs as parameters in a logical way that is
expandable and thus future oriented.

Something else to consider is what do switchcore drivers that don't use
'include/net/dsa.h' do and why?

>
> Same question for port_fdb_dump and its dsa_fdb_dump_cb_t: would you
> change it for uniformity, or would you keep it the way it is to reduce
> the churn? I mean it's a perfectly viable candidate for argument
> grouping, but your stated goal _is_ to reduce churn.

I think port_fdb_dump() is maybe the last one that I would change, but
all those functions where you have added the struct dsa_db would be 
candidates.

>
> But if we add the static/dynamic boolean to this structure, does it make
> sense on deletion? And if it doesn't, why have we changed the prototype
> of port_fdb_del to include it?
>

True a static/dynamic boolean doesn't make much sense on deletion, only
if it came in because it is part of a generic struct.

> Restated: do we want to treat the "static/dynamic" info as a property of
> an FDB entry (i.e. a member of the structure), or as the way in which a
> certain FDB entry can be added to hardware (case in which it is relevant
> only to _add and to _dump)?  After all, an FDB entry for {addr, vid}
> learned statically, and another FDB entry for the same {addr, vid} but
> learned dynamically, are fundamentally the same object.
>

I cannot answer for the workings of all switchcores, but for my sake I
use a debug tool to show the age of a dynamic entry in the ATU, so I
don't think that it has much relevance outside of that.

> And if we don't go with a big struct args_of_port_fdb_add (which would
> be silly if we did), who guarantees that the argument list of port_fdb_add
> won't grow in the future anyway? Like in the example I just gave above,
> where "static/dynamic" doesn't fully appear to group naturally with
> "addr" and "vid", and would probably still be a separate boolean,
> rendering the whole point moot.
>
> And even grouping only those last 2 together is maybe debatable due to
> practical reasons - where do we declare this small structure? We have a
> struct switchdev_notifier_fdb_info with some more stuff that we
> deliberately do not want to expose, and {addr, vid} are all that remain.
>
> Although maybe there are benefits to having a small {addr, vid} structure
> defined somewhere publicly, too, and referenced consistently by driver
> code. Like for example to avoid bad patterns from proliferating.
> Currently we have "const unsigned char *addr, u16 vid", so on a 64 bit
> machine, this is a pointer plus an unsigned short, 10 bytes, padded up
> by the compiler, maybe to 16. But the Ethernet addresses are 6 bytes,
> those are shorter than the pointer itself, so on a 64-bit machine,
> having "unsigned char addr[ETH_ALEN], u16 vid" makes a lot more space,
> saves some real memory.

I see that there is definitions for 64bit mac addresses out there, which
might also be needed to be supported at some point?

>
> Anyway, I'm side tracking. If you want to make changes, propose a
> patch, but come up with a more real argument than "reducing churn"
> (while effectively producing churn).

Unfortunately I don't have the time to make such a patch suggestion for
some time to come as I also have other patch sets coming up, and I
should study a bit what your patch set with the dsa_db is about also, so
maybe I must just add the bool to port_fdb_add() for now.

>
> To give you a counter example, phylink_mac_config() used to have the
> opposite problem, the arguments were all neatly packed into a struct
> phylink_link_state, but as the kerneldocs from include/linux/phylink.h
> put it, not all members of that structure were guaranteed to contain
> valid values. So there were bugs due to people not realizing this, and
> consequently, phylink_mac_link_up() took the opposite approach, of
> explicitly passing all known parameters of the resolved link state as
> individual arguments. Now there are 10 arguments to that function, but
> somehow at least this appears to have worked better, in the sense that
> there isn't an explanatory note saying what's valid and what isn't.

Yes, I can see the danger of it, but something like phylink is also
different as it is more hardware related, which has a slower development
cycle than feature/protocol stuff.

^ permalink raw reply

* Re: [PATCH net 1/2] ptp: ptp_clockmatrix: Add PTP_CLK_REQ_EXTTS support
From: kernel test robot @ 2022-04-27  8:31 UTC (permalink / raw)
  To: Min Li, richardcochran, lee.jones
  Cc: kbuild-all, linux-kernel, netdev, Min Li
In-Reply-To: <1651001574-32457-1-git-send-email-min.li.xe@renesas.com>

Hi Min,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Min-Li/ptp-ptp_clockmatrix-Add-PTP_CLK_REQ_EXTTS-support/20220427-033506
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git acb16b395c3f3d7502443e0c799c2b42df645642
config: i386-randconfig-a013-20220425 (https://download.01.org/0day-ci/archive/20220427/202204271653.N6Y5H0EO-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.2.0-20) 11.2.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/afadc4edd1bf64b40cb61b38dedf67354baeb147
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Min-Li/ptp-ptp_clockmatrix-Add-PTP_CLK_REQ_EXTTS-support/20220427-033506
        git checkout afadc4edd1bf64b40cb61b38dedf67354baeb147
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: drivers/ptp/ptp_clockmatrix.o: in function `idtcm_enable_channel':
>> ptp_clockmatrix.c:(.text+0x2394): undefined reference to `__udivdi3'
>> ld: ptp_clockmatrix.c:(.text+0x23b0): undefined reference to `__udivdi3'

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply

* Re: [PATCH net 6/7] tcp: increase source port perturb table to 2^16
From: Willy Tarreau @ 2022-04-27  8:19 UTC (permalink / raw)
  To: David Laight
  Cc: netdev@vger.kernel.org, David Miller, Jakub Kicinski,
	Eric Dumazet, Moshe Kol, Yossi Gilad, Amit Klein,
	linux-kernel@vger.kernel.org
In-Reply-To: <7372f788762140d496c157813b0173e5@AcuMS.aculab.com>

On Wed, Apr 27, 2022 at 08:07:29AM +0000, David Laight wrote:
> From: Willy Tarreau
> > Sent: 27 April 2022 07:53
> > 
> > Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately
> > identify a client by forcing it to emit only 40 times more connections
> > than there are entries in the table_perturb[] table. The previous two
> > improvements consisting in resalting the secret every 10s and adding
> > randomness to each port selection only slightly improved the situation,
> > and the current value of 2^8 was too small as it's not very difficult
> > to make a client emit 10k connections in less than 10 seconds.
> > 
> > Thus we're increasing the perturb table from 2^8 to 2^16 so that the the
> > same precision now requires 2.6M connections, which is more difficult in
> > this time frame and harder to hide as a background activity. The impact
> > is that the table now uses 256 kB instead of 1 kB, which could mostly
> > affect devices making frequent outgoing connections. However such
> > components usually target a small set of destinations (load balancers,
> > database clients, perf assessment tools), and in practice only a few
> > entries will be visited, like before.
> 
> Increasing the table size has a bigger impact on anyone trying
> to get the kernel to run in a limited memory footprint.
> 
> All these large tables (often hash tables) soon add up.

We know, and the initial approach that required a significantly larger
table and an extra table per namespace was a no-go. All these are part
of the reasons for the effort it took to refine the solution in order
to avoid such unacceptable costs. The way it was made here makes it easy
to patch the value for small systems if needed and leaves the door open
for a configurable setting in the future if that was estimated necessary
for certain environments.

Willy

^ permalink raw reply

* Re: [PATCH] ath9k: hif_usb: simplify if-if to if-else
From: Toke Høiland-Jørgensen @ 2022-04-27  8:09 UTC (permalink / raw)
  To: Wan Jiabing, Kalle Valo, David S. Miller, Jakub Kicinski,
	Paolo Abeni, linux-wireless, netdev, linux-kernel
  Cc: kael_w, Wan Jiabing
In-Reply-To: <20220424094441.104937-1-wanjiabing@vivo.com>

Wan Jiabing <wanjiabing@vivo.com> writes:

> Use if and else instead of if(A) and if (!A).
>
> Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>

Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>

^ permalink raw reply

* RE: [PATCH net 6/7] tcp: increase source port perturb table to 2^16
From: David Laight @ 2022-04-27  8:07 UTC (permalink / raw)
  To: 'Willy Tarreau', netdev@vger.kernel.org
  Cc: David Miller, Jakub Kicinski, Eric Dumazet, Moshe Kol,
	Yossi Gilad, Amit Klein, linux-kernel@vger.kernel.org
In-Reply-To: <20220427065233.2075-7-w@1wt.eu>

From: Willy Tarreau
> Sent: 27 April 2022 07:53
> 
> Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately
> identify a client by forcing it to emit only 40 times more connections
> than there are entries in the table_perturb[] table. The previous two
> improvements consisting in resalting the secret every 10s and adding
> randomness to each port selection only slightly improved the situation,
> and the current value of 2^8 was too small as it's not very difficult
> to make a client emit 10k connections in less than 10 seconds.
> 
> Thus we're increasing the perturb table from 2^8 to 2^16 so that the the
> same precision now requires 2.6M connections, which is more difficult in
> this time frame and harder to hide as a background activity. The impact
> is that the table now uses 256 kB instead of 1 kB, which could mostly
> affect devices making frequent outgoing connections. However such
> components usually target a small set of destinations (load balancers,
> database clients, perf assessment tools), and in practice only a few
> entries will be visited, like before.

Increasing the table size has a bigger impact on anyone trying
to get the kernel to run in a limited memory footprint.

All these large tables (often hash tables) soon add up.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply

* [PATCH] libbpf: fix returnvar.cocci warnings
From: kernel test robot @ 2022-04-27  8:04 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: kbuild-all, Linux Memory Management List, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, netdev, bpf, linux-kernel
In-Reply-To: <202204271656.OTIj2QNJ-lkp@intel.com>

From: kernel test robot <lkp@intel.com>

tools/lib/bpf/relo_core.c:1064:8-11: Unneeded variable: "len". Return "0" on line 1086


 Remove unneeded variable used to store return value.

Generated by: scripts/coccinelle/misc/returnvar.cocci

Fixes: b58af63aab11 ("libbpf: Refactor CO-RE relo human description formatting routine")
CC: Andrii Nakryiko <andrii@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head:   f02ac5c95dfd45d2f50ecc68d79177de326c668c
commit: b58af63aab11e4ae00fe96de9505759cfdde8ee9 [6746/7265] libbpf: Refactor CO-RE relo human description formatting routine
:::::: branch date: 2 hours ago
:::::: commit date: 9 hours ago

 tools/lib/bpf/relo_core.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/tools/lib/bpf/relo_core.c
+++ b/tools/lib/bpf/relo_core.c
@@ -1061,7 +1061,7 @@ static int bpf_core_format_spec(char *bu
 	const struct btf_enum *e;
 	const char *s;
 	__u32 type_id;
-	int i, len = 0;
+	int i;
 
 #define append_buf(fmt, args...)				\
 	({							\
@@ -1083,7 +1083,7 @@ static int bpf_core_format_spec(char *bu
 		   type_id, btf_kind_str(t), str_is_empty(s) ? "<anon>" : s);
 
 	if (core_relo_is_type_based(spec->relo_kind))
-		return len;
+		return 0;
 
 	if (core_relo_is_enumval_based(spec->relo_kind)) {
 		t = skip_mods_and_typedefs(spec->btf, type_id, NULL);
@@ -1091,7 +1091,7 @@ static int bpf_core_format_spec(char *bu
 		s = btf__name_by_offset(spec->btf, e->name_off);
 
 		append_buf("::%s = %u", s, e->val);
-		return len;
+		return 0;
 	}
 
 	if (core_relo_is_field_based(spec->relo_kind)) {
@@ -1110,10 +1110,10 @@ static int bpf_core_format_spec(char *bu
 			append_buf(" @ offset %u.%u)", spec->bit_offset / 8, spec->bit_offset % 8);
 		else
 			append_buf(" @ offset %u)", spec->bit_offset / 8);
-		return len;
+		return 0;
 	}
 
-	return len;
+	return 0;
 #undef append_buf
 }
 

^ permalink raw reply

* Re: [PATCH net-next 7/7] net: phy: smsc: Cope with hot-removal in interrupt handler
From: Oliver Neukum @ 2022-04-27  7:46 UTC (permalink / raw)
  To: Lukas Wunner, Steve Glendinning, UNGLinuxDriver, Oliver Neukum,
	David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, linux-usb, Andre Edich, Oleksij Rempel, Martyn Welch,
	Gabriel Hojda, Christoph Fritz, Lino Sanfilippo,
	Philipp Rosenberger, Heiner Kallweit, Andrew Lunn, Russell King
In-Reply-To: <62b8375ae008035bcaa85c348ea4aa80c519bb07.1651037513.git.lukas@wunner.de>



On 27.04.22 07:48, Lukas Wunner wrote:
> If reading the Interrupt Source Flag register fails with -ENODEV, then
> the PHY has been hot-removed and the correct response is to bail out
> instead of throwing a WARN splat and attempting to suspend the PHY.
> The PHY should be stopped in due course anyway as the kernel
> asynchronously tears down the device.
>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> ---
>  drivers/net/phy/smsc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
> index a521d48b22a7..35bff7fd234c 100644
> --- a/drivers/net/phy/smsc.c
> +++ b/drivers/net/phy/smsc.c
> @@ -91,7 +91,9 @@ static irqreturn_t smsc_phy_handle_interrupt(struct phy_device *phydev)
>  
>  	irq_status = phy_read(phydev, MII_LAN83C185_ISF);
>  	if (irq_status < 0) {
> -		phy_error(phydev);
> +		if (irq_status != -ENODEV)
> +			phy_error(phydev);
> +
>  		return IRQ_NONE;
>  	}
>  
Hi,

picking a small nit, if you get ENODEV, you have no idea whether the irq
was caused
by the phy. Strictly speaking you should return IRQ_HANDLED in that case.

    Regards
        Oliver


^ permalink raw reply

* Re: [PATCH net-next 1/7] usbnet: Run unregister_netdev() before unbind() again
From: Oliver Neukum @ 2022-04-27  7:45 UTC (permalink / raw)
  To: Lukas Wunner, Steve Glendinning, UNGLinuxDriver, Oliver Neukum,
	David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, linux-usb, Andre Edich, Oleksij Rempel, Martyn Welch,
	Gabriel Hojda, Christoph Fritz, Lino Sanfilippo,
	Philipp Rosenberger, Heiner Kallweit, Andrew Lunn, Russell King
In-Reply-To: <ca65428c57c0b15ead00c92decb9b4a01a0fa81f.1651037513.git.lukas@wunner.de>



On 27.04.22 07:48, Lukas Wunner wrote:
> Commit 2c9d6c2b871d ("usbnet: run unbind() before unregister_netdev()")
> sought to fix a use-after-free on disconnect of USB Ethernet adapters.
>
>
Hi,

thank you for doing this. We have finally come at least to a partial
solution for this issue.

    Regards
        Oliver


^ permalink raw reply

* Re: [PATCH net-next 1/7] usbnet: Run unregister_netdev() before unbind() again
From: Oliver Neukum @ 2022-04-27  7:44 UTC (permalink / raw)
  To: Lukas Wunner, Steve Glendinning, UNGLinuxDriver, Oliver Neukum,
	David S. Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, linux-usb, Andre Edich, Oleksij Rempel, Martyn Welch,
	Gabriel Hojda, Christoph Fritz, Lino Sanfilippo,
	Philipp Rosenberger, Heiner Kallweit, Andrew Lunn, Russell King
In-Reply-To: <ca65428c57c0b15ead00c92decb9b4a01a0fa81f.1651037513.git.lukas@wunner.de>



On 27.04.22 07:48, Lukas Wunner wrote:
> Commit 2c9d6c2b871d ("usbnet: run unbind() before unregister_netdev()")
> sought to fix a use-after-free on disconnect of USB Ethernet adapters.
>
> It turns out that a different fix is necessary to address the issue:
> https://lore.kernel.org/netdev/18b3541e5372bc9b9fc733d422f4e698c089077c.1650177997.git.lukas@wunner.de/
>
> So the commit was not necessary.
>
> The commit made binding and unbinding of USB Ethernet asymmetrical:
> Before, usbnet_probe() first invoked the ->bind() callback and then
> register_netdev().  usbnet_disconnect() mirrored that by first invoking
> unregister_netdev() and then ->unbind().
>
> Since the commit, the order in usbnet_disconnect() is reversed and no
> longer mirrors usbnet_probe().
>
> One consequence is that a PHY disconnected (and stopped) in ->unbind()
> is afterwards stopped once more by unregister_netdev() as it closes the
> netdev before unregistering.  That necessitates a contortion in ->stop()
> because the PHY may only be stopped if it hasn't already been
> disconnected.
>
> Reverting the commit allows making the call to phy_stop() unconditional
> in ->stop().
>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: Oleksij Rempel <o.rempel@pengutronix.de>
> Cc: Martyn Welch <martyn.welch@collabora.com>
> Cc: Andrew Lunn <andrew@lunn.ch>
>
Acked-by: Oliver Neukum <oneukum@suse.com>


^ permalink raw reply

* Re: [PATCH net-next 00/11] mlxsw: extend line card model by devices and info
From: Jiri Pirko @ 2022-04-27  7:37 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, Ido Schimmel, Ido Schimmel, netdev, davem, pabeni,
	jiri, petrm, dsahern, mlxsw
In-Reply-To: <YmgOuUPy9Digezvh@lunn.ch>

Tue, Apr 26, 2022 at 05:24:41PM CEST, andrew@lunn.ch wrote:
>> Does not make sense to me at all. Line cards are detachable PHY sets in
>> essence, very basic functionality. They does not have buffers, health
>> and params, I don't think so. 
>
>Ido recently added support to ethtool to upgrade the flash in SFPs.
>They are far from simple devices. Some of the GPON ones have linux
>running in them, that you can log in to.
>
>The real question is, can you do everything you need to do via
>existing uAPIs like ethtool and hwmon.

No, it does not fit. Ethtool works with netdevices which are
instantiated for separate port. The disconnection between what we can do
with netdev as a handle and how the devices are modeled was the reason
for devlink introduction in the past.

>
>	Andrew

^ permalink raw reply

* Re: [PATCH net-next 00/11] mlxsw: extend line card model by devices and info
From: Jiri Pirko @ 2022-04-27  7:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, Ido Schimmel, netdev, davem, pabeni, jiri, petrm,
	dsahern, andrew, mlxsw
In-Reply-To: <20220426075133.53562a2e@kernel.org>

Tue, Apr 26, 2022 at 04:51:33PM CEST, kuba@kernel.org wrote:
>On Tue, 26 Apr 2022 16:00:10 +0200 Jiri Pirko wrote:
>> >> But you have to somehow link the component to the particular gearbox on
>> >> particular line card. Say, you need to flash GB on line card 8. This is
>> >> basically providing a way to expose this relationship to user.
>> >> 
>> >> Also, the "lc info" shows the FW version for gearboxes. As Ido
>> >> mentioned, the GB versions could be listed in "devlink dev info" in
>> >> theory. But then, you need to somehow expose the relationship with
>> >> line card as well.  
>> >
>> >Why would the automation which comes to update the firmware care 
>> >at all where the component is? Humans can see what the component 
>> >is by looking at the name.  
>> 
>> The relationship-by-name sounds a bit fragile to me. The names of
>> components are up to the individual drivers.
>
>I asked you how the automation will operate. You must answer questions
>if you want to have a discussion. Automation is the relevant part.

Automation, not sure. It would probably just see type of gearbox and
flash it. Not sure I understand the question, perhaps you could explain?
Plus, the possibility is to auto-flash the GB from driver directly.


>You're not designing an interface for SDK users but for end users.

Sure, that is the aim of this API. Human end user. That is why I wanted
the user to see the relationships between devlink dev, line cards and
the gearboxes on them. If you want to limit the visibility, sure, just
tell me how.


>
>> >If we do need to know (*if*!) you can list FW components as a lc
>> >attribute, no need for new commands and objects.  
>> 
>> There is no new command for that, only one nested attribute which
>> carries the device list added to the existing command. They are no new
>> objects, they are just few nested values.
>
>DEVLINK_CMD_LINECARD_INFO_GET

Okay, that is not only to expose devices. That is also to expose info
about linecards, like HW revision, INI version etc. Where else to put
it? I can perhaps embed it into devlink dev info, but I thought separate
command would be more suitable. object cmd, object info cmd. It is
more clear I believe.


>
>> >IMHO we should either keep lc objects simple and self contained or 
>> >give them a devlink instance. Creating sub-objects from them is very  
>> 
>> Give them a devlink instance? I don't understand how. LC is not a
>> separate device, far from that. That does not make any sense to me.
>
>You can put a name of another devlink instance as an attribute of a lc.
>See below.
>
>> >worrying. If there is _any_ chance we'll need per-lc health reporters 
>> >or sbs or params(🤢) etc. etc. - let's bite the bullet _now_ and create
>> >full devlink sub-instances!  
>> 
>> Does not make sense to me at all. Line cards are detachable PHY sets in
>> essence, very basic functionality. They does not have buffers, health
>> and params, I don't think so. 
>
>I guess the definition of a "line card" has become somewhat murky over
>the years, since the olden days of serial lines.
>
>Perhaps David and others can enlighten us but what I'm used to hearing
>about as a line card these days in a chassis system is a full-on switch.
>Chassis being effectively a Clos network in a box, the main difference
>being the line cards talk cells to the backplane, not full packets.

That is a different device model comparing what we have.
These are like "nested switches". If they are separate devices with
everything, they can be modeled as separate devlink device instances.
No problem. Different case though.


>
>Back in my Netronome days we called those detachable front panel gear
>boxes "phy mods". Those had nowhere near the complexity of a real line
>card. Sounds like that's more aligned with what you have.

Yep.


>
>To summarize, since your definition of a line card is a bit special,
>the less uAPI we add to fit your definition we add the better.

Tell me where to cut it so it still makes sense.


>
>> >> I don't see any simpler iface than this.  
>> >
>> >Based on the assumptions you've made, maybe, but the uAPI should
>> >abstract away the irrelevant details. I'm questioning the assumptions.  
>> 
>> Is the FW version of gearbox on a line card irrelevand detail?
>
>Not what I said.

That is why I'm asking :)


>
>> If so, how does the user know if/when to flash it?
>> If not, where would you list it if devices nest is not the correct place?
>
>Let me mock up what I had in mind for you since it did not come thru 
>in the explanation:
>
>$ devlink dev info show pci/0000:01:00.0
>    versions:
>        fixed:
>          hw.revision 0
>          lc2.hw.revision a
>          lc8.hw.revision b
>        running:
>          ini.version 4
>          lc2.gearbox 1.1.3
>          lc8.gearbox 1.2.3

Would be rather:

          lc2.gearbox0 1.1.3
          lc2.gearbox1 1.2.4
          lc8.gearbox0 1.2.3

Okay, I see. So instead of having clear api with relationships and
clear human+machine readability we have squahed indexes into strings.
I fail to see the benefit, other than no-api-extension :/ On contrary.


>
>$ devlink lc show pci/0000:01:00.0 lc 8
>pci/0000:01:00.0:
>  lc 8 state active type 16x100G
>    supported_types:
>      16x100G
>    versions: 
>      lc8.hw.revision (a) 
>      lc8.gearbox (1.2.3)
>
>Where the data in the brackets is optionally fetched thru the existing
>"dev info" API, but rendered together by the user space.

Quite odd. I find it questionable to say at least to mix multiple
command netlink outputs into one output. The processing of it would
be a small nightmare considering the way how the netlink message
processing works in iproute2 :/


>
>> >> There are 4 gearboxes on the line card. They share the same flash. So
>> >> if you flash gearbox 0, the rest will use the same FW.  
>> >
>> >o_0 so the FW component is called lcX_dev0 and yet it applies to _all_
>> >devices, not just dev0?! Looking at the output above I thought other
>> >devices simply don't have FW ("flashable false").  
>> 
>> Yes, device 0 is "flash master" (RW). 1-3 are RO. I know it is a bit
>> confusing. Maybe Andy's suggestion of "shared" flag of some sort might
>> help.
>> 
>> >> I'm exposing them for the sake of completeness. Also, the interface
>> >> needs to be designed as a list anyway, as different line cards may
>> >> have separate flash per gearbox.
>> >> 
>> >> What's is the harm in exposing devices 1-3? If you insist, we can hide
>> >> them.  
>> >
>> >Well, they are unnecessary (this is uAPI), and coming from the outside
>> >I misinterpreted what the information reported means, so yeah, I'd
>> >classify it as harmful :(  
>> 
>> UAPI is the "devices nest". It has to be list one way or another
>> (we may need to expose more gearboxes anyway). So what is differently
>> harmful with having list [0] or list [0,1,2,3] ?

^ permalink raw reply

* Re: [PATCH] Revert "mt76: mt7921: enable aspm by default"
From: Thorsten Leemhuis @ 2022-04-27  7:32 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Philippe Schenker, linux-wireless, Felix Fietkau, David S. Miller,
	Deren Wu, Jakub Kicinski, Lorenzo Bianconi, Matthias Brugger,
	Paolo Abeni, Ryder Lee, Sean Wang, Shayne Chen, YN Chen,
	linux-arm-kernel, linux-kernel, linux-mediatek, netdev,
	regressions@lists.linux.dev
In-Reply-To: <87o816oaez.fsf@kernel.org>

On 12.04.22 12:36, Kalle Valo wrote:
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> [...]
>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
>> reports on my table. I can only look briefly into most of them and lack
>> knowledge about most of the areas they concern. I thus unfortunately
>> will sometimes get things wrong or miss something important. I hope
>> that's not the case here; if you think it is, don't hesitate to tell me
>> in a public reply, it's in everyone's interest to set the public record
>> straight.
> BTW, maybe you could add that boilerplace text after P.S. into the
> signature (ie. under "-- " line)? That way your mails would more
> readable and make it more clear that you didn't write the boilerplate
> text specifically for this mail.

Late reply:

FYI, I thought back and forth about the boilerplace text and how to
handle that when I started using it. I deliberately decided against
putting it under a "-- " line, as that wouldn't work well for some of
the mails I write -- for example those where I deliberately use
top-posting (which I hate and kinda feels wrong, but nevertheless right
at the same time) to make this as easy to grasp as possible.

After your comment I have thought about it again for a while but in the
end for now decided to mostly stick to the approach I used, but your
comment made me shorten the text a bit.

Ciao, Thorsten

^ permalink raw reply

* Re: [PATCH] wil6210: simplify if-if to if-else
From: Kalle Valo @ 2022-04-27  7:29 UTC (permalink / raw)
  To: Wan Jiabing
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Johannes Berg,
	linux-wireless, netdev, linux-kernel, kael_w, Wan Jiabing
In-Reply-To: <20220424094552.105466-1-wanjiabing@vivo.com>

Wan Jiabing <wanjiabing@vivo.com> wrote:

> Use if and else instead of if(A) and if (!A).
> 
> Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
> Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com>

Patch applied to ath-next branch of ath.git, thanks.

a5f3aed5889e wil6210: simplify if-if to if-else

-- 
https://patchwork.kernel.org/project/linux-wireless/patch/20220424094552.105466-1-wanjiabing@vivo.com/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


^ permalink raw reply

* Re: [PATCH] ath10k: simplify if-if to if-else
From: Kalle Valo @ 2022-04-27  7:28 UTC (permalink / raw)
  To: Wan Jiabing
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, ath10k,
	linux-wireless, netdev, linux-kernel, kael_w, Wan Jiabing
In-Reply-To: <20220424094522.105262-1-wanjiabing@vivo.com>

Wan Jiabing <wanjiabing@vivo.com> wrote:

> Use if and else instead of if(A) and if (!A).
> 
> Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
> Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com>

Patch applied to ath-next branch of ath.git, thanks.

7471f7d273ac ath10k: simplify if-if to if-else

-- 
https://patchwork.kernel.org/project/linux-wireless/patch/20220424094522.105262-1-wanjiabing@vivo.com/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


^ permalink raw reply

* Re: [PATCH v2 3/3] ARM: dts: imx6qdl-sr-som: update phy configuration for som revision 1.9
From: Josua Mayer @ 2022-04-27  7:15 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn
  Cc: netdev, alvaro.karsz, Rob Herring, Krzysztof Kozlowski, Shawn Guo,
	Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	NXP Linux Team
In-Reply-To: <YmFoKm0UvrSgD7kp@shell.armlinux.org.uk>

Hi Russell,

Am 21.04.22 um 17:20 schrieb Russell King (Oracle):
> On Thu, Apr 21, 2022 at 03:30:38PM +0200, Andrew Lunn wrote:
>>> The only other ways around this that I can see would be to have some
>>> way to flag in DT that the PHYs are "optional" - if they're not found
>>> while probing the hardware, then don't whinge about them. Or have
>>> u-boot discover which address the PHY is located, and update the DT
>>> blob passed to the kernel to disable the PHY addresses that aren't
>>> present. Or edit the DT to update the node name and reg property. Or
>>> something along those lines.
>> uboot sounds like the best option. I don't know if we currently
>> support the status property for PHYs. Maybe the .dtsi file should have
>> them all status = "disabled"; and uboot can flip the populated ones to
>> "okay". Or maybe the other way around to handle older bootloaders.
> ... which would immediately regress the networking on all SolidRun iMX6
> platforms when booting "new" DT with existing u-boot, so clearly that
> isn't a possible solution.

So to summarize - you don't want to see a third phy spamming the console 
with probe errors ...

I think a combination of the suggestions would be doable:
- Add the new phy to dt, with status disabled
- keep the existing phys unchanged
- after probing in u-boot, disable the two old entries, and enable the 
new one

It is not very convenient since that means changes to u-boot are necessary,
but it can be done - and won't break existing users only updating Linux.

sincerely
Josua Mayer


^ permalink raw reply

* Re: [PATCH net] SUNRPC: Fix local socket leak in xs_local_setup_socket()
From: wanghai (M) @ 2022-04-27  7:15 UTC (permalink / raw)
  To: Trond Myklebust, anna@kernel.org, pabeni@redhat.com,
	davem@davemloft.net, chuck.lever@oracle.com, kuba@kernel.org
  Cc: linux-nfs@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <d013bdc75085e380250cb79edf2b27680cbc9f7e.camel@hammerspace.com>


在 2022/4/27 2:51, Trond Myklebust 写道:
> On Tue, 2022-04-26 at 21:20 +0800, Wang Hai wrote:
>> If the connection to a local endpoint in xs_local_setup_socket()
>> fails,
>> fput() is missing in the error path, which will result in a socket
>> leak.
>> It can be reproduced in simple script below.
>>
>> while true
>> do
>>          systemctl stop rpcbind.service
>>          systemctl stop rpc-statd.service
>>          systemctl stop nfs-server.service
>>
>>          systemctl restart rpcbind.service
>>          systemctl restart rpc-statd.service
>>          systemctl restart nfs-server.service
>> done
>>
>> When executing the script, you can observe that the
>> "cat /proc/net/unix | wc -l" count keeps growing.
>>
>> Add the missing fput(), and restore transport to old socket.
>>
>> Signed-off-by: Wang Hai <wanghai38@huawei.com>
>> ---
>>   net/sunrpc/xprtsock.c | 20 ++++++++++++++++++--
>>   1 file changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index 0f39e08ee580..7219c545385e 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -1819,6 +1819,9 @@ static int xs_local_finish_connecting(struct
>> rpc_xprt *xprt,
>>   {
>>          struct sock_xprt *transport = container_of(xprt, struct
>> sock_xprt,
>>                                                                       
>>     xprt);
>> +       struct socket *trans_sock = NULL;
>> +       struct sock *trans_inet = NULL;
>> +       int ret;
>>   
>>          if (!transport->inet) {
>>                  struct sock *sk = sock->sk;
>> @@ -1835,6 +1838,9 @@ static int xs_local_finish_connecting(struct
>> rpc_xprt *xprt,
>>   
>>                  xprt_clear_connected(xprt);
>>   
>> +               trans_sock = transport->sock;
>> +               trans_inet = transport->inet;
>> +
> Both values are NULL here
Got it, thanks
>
>>                  /* Reset to new socket */
>>                  transport->sock = sock;
>>                  transport->inet = sk;
>> @@ -1844,7 +1850,14 @@ static int xs_local_finish_connecting(struct
>> rpc_xprt *xprt,
>>   
>>          xs_stream_start_connect(transport);
>>   
>> -       return kernel_connect(sock, xs_addr(xprt), xprt->addrlen, 0);
>> +       ret = kernel_connect(sock, xs_addr(xprt), xprt->addrlen, 0);
>> +       /* Restore to old socket */
>> +       if (ret && trans_inet) {
>> +               transport->sock = trans_sock;
>> +               transport->inet = trans_inet;
>> +       }
>> +
>> +       return ret;
>>   }
>>   
>>   /**
>> @@ -1887,7 +1900,7 @@ static int xs_local_setup_socket(struct
>> sock_xprt *transport)
>>                  xprt->stat.connect_time += (long)jiffies -
>>                                             xprt->stat.connect_start;
>>                  xprt_set_connected(xprt);
>> -               break;
>> +               goto out;
>>          case -ENOBUFS:
>>                  break;
>>          case -ENOENT:
>> @@ -1904,6 +1917,9 @@ static int xs_local_setup_socket(struct
>> sock_xprt *transport)
>>                                  xprt-
>>> address_strings[RPC_DISPLAY_ADDR]);
>>          }
>>   
>> +       transport->file = NULL;
>> +       fput(filp);
> Please just call xprt_force_disconnect() so that this can be cleaned up
> from 	a safe context.

Hi, Trond

Thank you for your advice, I tried this, but it doesn't seem to

work and an error is reported. I'll analyze why this happens

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 0f39e08ee580..3d1387b2cfbf 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1887,7 +1887,7 @@ static int xs_local_setup_socket(struct sock_xprt 
*transport)
                 xprt->stat.connect_time += (long)jiffies -
xprt->stat.connect_start;
                 xprt_set_connected(xprt);
-               break;
+               goto out;
         case -ENOBUFS:
                 break;
         case -ENOENT:
@@ -1904,6 +1904,8 @@ static int xs_local_setup_socket(struct sock_xprt 
*transport)
xprt->address_strings[RPC_DISPLAY_ADDR]);
         }

+       xprt_force_disconnect(xprt);
+
  out:
         xprt_clear_connecting(xprt);
         xprt_wake_pending_tasks(xprt, status);


[ 2541.763895][ T8289] ------------[ cut here ]------------
[ 2541.765829][ T8289] WARNING: CPU: 0 PID: 8289 at 
kernel/workqueue.c:1499 __queue_work+0x72a/0x810
[ 2541.768862][ T8289] Modules linked in:
[ 2541.770085][ T8289] CPU: 0 PID: 8289 Comm: gssproxy Tainted: G        
W         5.17.0+ #762
[ 2541.772724][ T8289] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 2541.773788][ T8289] RIP: 0010:__queue_work+0x72a/0x810
[ 2541.773788][ T8289] Code: 48 c7 c7 f8 7b b8 84 c6 05 b1 f4 39 04 01 
e8 ad 65 05 00 e9 7f fe ff ff e8 33 94 11 00 4c 8b 33 e9 ff f9 ff ff e8 
26 94 11 00 <0f> 0b e9 d2 fa ff ff e8 1a 94 11 00 4c 8d 7b 68 41 83 cc 
02 e9 aa
[ 2541.773788][ T8289] RSP: 0018:ffffc900083dfb20 EFLAGS: 00010093
[ 2541.773788][ T8289] RAX: 0000000000000000 RBX: ffff8881002a7900 RCX: 
0000000000000000
[ 2541.773788][ T8289] RDX: ffff88824e091b40 RSI: ffffffff8119be6a RDI: 
ffffc900083dfb07
[ 2541.773788][ T8289] RBP: ffffc900083dfb60 R08: 0000000000000001 R09: 
0000000000000000
[ 2541.773788][ T8289] R10: 0000000000000000 R11: 6e75732f74656e5b R12: 
0000000000000000
[ 2541.773788][ T8289] R13: ffff88811a284668 R14: ffff888237c2d440 R15: 
ffff888243141c00
[ 2541.773788][ T8289] FS:  00007f3bb3f9dc40(0000) 
GS:ffff888237c00000(0000) knlGS:0000000000000000
[ 2541.773788][ T8289] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2541.773788][ T8289] CR2: 00007f3bb04a72e0 CR3: 00000002602c5000 CR4: 
00000000000006f0
[ 2541.773788][ T8289] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 2541.773788][ T8289] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[ 2541.773788][ T8289] Call Trace:
[ 2541.773788][ T8289]  <TASK>
[ 2541.773788][ T8289]  queue_work_on+0x88/0x90
[ 2541.773788][ T8289]  xprt_schedule_autoclose_locked+0x7a/0xb0
[ 2541.773788][ T8289]  xprt_force_disconnect+0x53/0x150
[ 2541.773788][ T8289]  xs_local_setup_socket+0x131/0x3e0
[ 2541.823215][ T8289]  xs_setup_local+0x24b/0x280
[ 2541.823215][ T8289]  xprt_create_transport+0xb0/0x340
[ 2541.823215][ T8289]  rpc_create+0x104/0x2b0
[ 2541.823215][ T8289]  gssp_rpc_create+0x93/0xe0
[ 2541.823215][ T8289]  set_gssp_clnt+0xd9/0x230
[ 2541.823215][ T8289]  write_gssp+0xb9/0x130
[ 2541.823215][ T8289]  ? lock_acquire+0x1de/0x2f0
[ 2541.823215][ T8289]  proc_reg_write+0xd2/0x110
[ 2541.823215][ T8289]  ? set_gss_proxy+0x1d0/0x1d0
[ 2541.823215][ T8289]  ? proc_reg_compat_ioctl+0x100/0x100
[ 2541.823215][ T8289]  vfs_write+0x11d/0x4b0
[ 2541.841496][ T8289]  ksys_write+0xe0/0x130
[ 2541.841496][ T8289]  __x64_sys_write+0x23/0x30
[ 2541.841496][ T8289]  do_syscall_64+0x34/0xb0
[ 2541.841496][ T8289]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2541.841496][ T8289] RIP: 0033:0x7f3bb0811280
[ 2541.841496][ T8289] Code: 00 c3 0f 1f 84 00 00 00 00 00 48 8b 05 c1 
8c 20 00 c3 0f 1f 84 00 00 00 00 00 83 3d 09 cf 20 00 00 75 10 b8 01 00 
00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 6e fd ff ff 48 
89 04 24
[ 2541.841496][ T8289] RSP: 002b:00007ffc59024c98 EFLAGS: 00000246 
ORIG_RAX: 0000000000000001
[ 2541.841496][ T8289] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 
00007f3bb0811280
[ 2541.841496][ T8289] RDX: 0000000000000001 RSI: 00007ffc59024ca6 RDI: 
0000000000000009
[ 2541.841496][ T8289] RBP: 0000000000000000 R08: 0000000000000020 R09: 
0000000000000060
[ 2541.841496][ T8289] R10: 0000561545627900 R11: 0000000000000246 R12: 
0000561545630580
[ 2541.841496][ T8289] R13: 00007ffc59024fd0 R14: 0000000000000000 R15: 
0000000000000000
[ 2541.841496][ T8289]  </TASK>
[ 2541.841496][ T8289] irq event stamp: 0
[ 2541.841496][ T8289] hardirqs last  enabled at (0): 
[<0000000000000000>] 0x0
[ 2541.841496][ T8289] hardirqs last disabled at (0): 
[<ffffffff81165e25>] copy_process+0xb35/0x2410
[ 2541.841496][ T8289] softirqs last  enabled at (0): 
[<ffffffff81165e25>] copy_process+0xb35/0x2410
[ 2541.841496][ T8289] softirqs last disabled at (0): 
[<0000000000000000>] 0x0
[ 2541.841496][ T8289] ---[ end trace 0000000000000000 ]---
>> +
>>   out:
>>          xprt_clear_connecting(xprt);
>>          xprt_wake_pending_tasks(xprt, status);

-- 
Wang Hai


^ permalink raw reply related

* [PATCH bpf-next] net: bpf: support direct packet access in tracing program
From: menglong8.dong @ 2022-04-27  7:06 UTC (permalink / raw)
  To: ast
  Cc: daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, rostedt, mingo, netdev, bpf, linux-kernel, Menglong Dong,
	Jiang Biao, Hao Peng

From: Menglong Dong <imagedong@tencent.com>

For now, eBPF program of type TRACING is able to access the arguments
of the function or raw_tracepoint directly, which makes data access
easier and efficient. And we can also output the raw data in skb to
user space in tracing program by bpf_skb_output() helper.

However, we still can't access the packet data in 'struct sk_buff'
directly and have to use the helper bpf_probe_read_kernel() to analyse
packet data.

Network tools, which based on eBPF TRACING, often do packet analyse
works in tracing program for filtering, statistics, etc. For example,
we want to trace abnormal skb free through 'kfree_skb' tracepoint with
special ip address or tcp port.

In this patch, 2 helpers are introduced: bpf_skb_get_header() and
bpf_skb_get_end(). The pointer returned by bpf_skb_get_header() has
the same effect with the 'data' in 'struct __sk_buff', and
bpf_skb_get_end() has the same effect with the 'data_end'.

Therefore, we can now access packet data in tracing program in this
way:

  SEC("fentry/icmp_rcv")
  int BPF_PROG(tracing_open, struct sk_buff* skb)
  {
  	void *data, *data_end;
  	struct ethhdr *eth;

  	data = bpf_skb_get_header(skb, BPF_SKB_HEADER_MAC);
  	data_end = bpf_skb_get_end(skb);

  	if (!data || !data_end)
  		return 0;

  	if (data + sizeof(*eth) > data_end)
  		return 0;

  	eth = data;
  	bpf_printk("proto:%d\n", bpf_ntohs(eth->h_proto));

  	return 0;
  }

With any positive reply, I'll complete the selftests programs.

Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
 include/linux/bpf.h      |  4 +++
 include/uapi/linux/bpf.h | 29 ++++++++++++++++++++
 kernel/bpf/verifier.c    |  6 +++++
 kernel/trace/bpf_trace.c | 58 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 97 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bdb5298735ce..69dff736331a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -419,6 +419,8 @@ enum bpf_return_type {
 	RET_PTR_TO_ALLOC_MEM,		/* returns a pointer to dynamically allocated memory */
 	RET_PTR_TO_MEM_OR_BTF_ID,	/* returns a pointer to a valid memory or a btf_id */
 	RET_PTR_TO_BTF_ID,		/* returns a pointer to a btf_id */
+	RET_PTR_TO_PACKET,		/* returns a pointer to packet */
+	RET_PTR_TO_PACKET_END,		/* returns a pointer to skb->data + headlen */
 	__BPF_RET_TYPE_MAX,
 
 	/* Extended ret_types. */
@@ -428,6 +430,8 @@ enum bpf_return_type {
 	RET_PTR_TO_SOCK_COMMON_OR_NULL	= PTR_MAYBE_NULL | RET_PTR_TO_SOCK_COMMON,
 	RET_PTR_TO_ALLOC_MEM_OR_NULL	= PTR_MAYBE_NULL | MEM_ALLOC | RET_PTR_TO_ALLOC_MEM,
 	RET_PTR_TO_BTF_ID_OR_NULL	= PTR_MAYBE_NULL | RET_PTR_TO_BTF_ID,
+	RET_PTR_TO_PACKET_OR_NULL	= PTR_MAYBE_NULL | RET_PTR_TO_PACKET,
+	RET_PTR_TO_PACKET_END_OR_NULL	= PTR_MAYBE_NULL | RET_PTR_TO_PACKET_END,
 
 	/* This must be the last entry. Its purpose is to ensure the enum is
 	 * wide enough to hold the higher bits reserved for bpf_type_flag.
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d14b10b85e51..841f6e7216f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5143,6 +5143,27 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * void *bpf_skb_get_header(struct sk_buff *skb, u32 flags)
+ *	Description
+ *		Get packet header from skb in tracing program, which can
+ *		be access (read) directly. According to the *flags*,
+ *		different packet header is returned:
+ *
+ *			**BPF_SKB_HEADER_MAC**: get mac (L2) header
+ *			**BPF_SKB_HEADER_NETWORK**: get network (L3) header
+ *			**BPF_SKB_HEADER_TRANSPORT**:
+ *				get transport (L4) header
+ *	Return
+ *		The pointer to packet header on success, NULL on fail.
+ *
+ * void *bpf_skb_get_end(struct sk_buff *skb)
+ *	Description
+ *		Get packet head end pointer from skb in tracing program,
+ *		which is equal to *data_end* in *struct __sk_buff*.
+ *	Return
+ *		The pointer to packet head end on success, and NULL on
+ *		failing.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5339,6 +5360,8 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(skb_get_header),		\
+	FN(skb_get_end),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5537,6 +5560,12 @@ enum {
 	 */
 };
 
+enum {
+	BPF_SKB_HEADER_MAC,
+	BPF_SKB_HEADER_NETWORK,
+	BPF_SKB_HEADER_TRANSPORT,
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9c1a02b82ecd..caf4e09cc114 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6877,6 +6877,12 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		 */
 		regs[BPF_REG_0].btf = btf_vmlinux;
 		regs[BPF_REG_0].btf_id = ret_btf_id;
+	} else if (base_type(ret_type) == RET_PTR_TO_PACKET) {
+		mark_reg_known_zero(env, regs, BPF_REG_0);
+		regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
+	} else if (base_type(ret_type) == RET_PTR_TO_PACKET_END) {
+		mark_reg_known_zero(env, regs, BPF_REG_0);
+		regs[BPF_REG_0].type = PTR_TO_PACKET_END | ret_flag;
 	} else {
 		verbose(env, "unknown return type %u of func %s#%d\n",
 			base_type(ret_type), func_id_name(func_id), func_id);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index b26f3da943de..6f2cd30aac07 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -810,6 +810,60 @@ static const struct bpf_func_proto bpf_current_task_under_cgroup_proto = {
 	.arg2_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_2(bpf_skb_get_header, struct sk_buff *, skb, u32, flags)
+{
+	void *header = NULL;
+
+	if (!skb)
+		return (unsigned long)NULL;
+
+	switch (flags) {
+	case BPF_SKB_HEADER_MAC:
+		if (skb_mac_header_was_set(skb) && skb->mac_header)
+			header = skb_mac_header(skb);
+		break;
+	case BPF_SKB_HEADER_TRANSPORT:
+		if (skb_transport_header_was_set(skb) &&
+		    skb->transport_header)
+			header = skb_transport_header(skb);
+		break;
+	case BPF_SKB_HEADER_NETWORK:
+		if (skb->network_header)
+			header = skb_network_header(skb);
+		break;
+	default:
+		break;
+	}
+
+	return (unsigned long)header;
+}
+
+BTF_ID_LIST_SINGLE(bpf_get_skb_ids, struct, sk_buff);
+static const struct bpf_func_proto bpf_skb_get_header_proto = {
+	.func           = bpf_skb_get_header,
+	.gpl_only       = false,
+	.ret_type       = RET_PTR_TO_PACKET_OR_NULL,
+	.arg1_type      = ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &bpf_get_skb_ids[0],
+	.arg2_type      = ARG_ANYTHING,
+};
+
+BPF_CALL_1(bpf_skb_get_end, struct sk_buff *, skb)
+{
+	if (!skb)
+		return (unsigned long)NULL;
+
+	return (unsigned long)skb->data + skb_headlen(skb);
+}
+
+static const struct bpf_func_proto bpf_skb_get_end_proto = {
+	.func           = bpf_skb_get_end,
+	.gpl_only       = false,
+	.ret_type       = RET_PTR_TO_PACKET_END_OR_NULL,
+	.arg1_type      = ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &bpf_get_skb_ids[0],
+};
+
 struct send_signal_irq_work {
 	struct irq_work irq_work;
 	struct task_struct *task;
@@ -1282,6 +1336,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_branch_snapshot_proto;
 	case BPF_FUNC_find_vma:
 		return &bpf_find_vma_proto;
+	case BPF_FUNC_skb_get_header:
+		return &bpf_skb_get_header_proto;
+	case BPF_FUNC_skb_get_end:
+		return &bpf_skb_get_end_proto;
 	case BPF_FUNC_trace_vprintk:
 		return bpf_get_trace_vprintk_proto();
 	default:
-- 
2.36.0


^ permalink raw reply related

* Re: [PATCH v2 2/3] net: phy: adin: add support for clock output
From: Josua Mayer @ 2022-04-27  7:06 UTC (permalink / raw)
  To: kernel test robot, netdev
  Cc: kbuild-all, alvaro.karsz, Michael Hennerich, Andrew Lunn,
	Heiner Kallweit, Russell King, Jakub Kicinski, Paolo Abeni
In-Reply-To: <202204211324.qgcPMycQ-lkp@intel.com>

\o/

I am going to fix this by using NULL in v3.
Is there any other feedback I should take into account on this patch?

- Josua Mayer

Am 21.04.22 um 09:45 schrieb kernel test robot:
> Hi Josua,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on robh/for-next]
> [also build test WARNING on net/master net-next/master v5.18-rc3 next-20220420]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Josua-Mayer/dt-bindings-net-adin-document-phy-clock-output-properties/20220419-192719
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
> config: openrisc-randconfig-s032-20220420 (https://download.01.org/0day-ci/archive/20220421/202204211324.qgcPMycQ-lkp@intel.com/config)
> compiler: or1k-linux-gcc (GCC) 11.2.0
> reproduce:
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # apt-get install sparse
>          # sparse version: v0.6.4-dirty
>          # https://github.com/intel-lab-lkp/linux/commit/74d856f1c89a6534fd58889f20ad4b481b8191c9
>          git remote add linux-review https://github.com/intel-lab-lkp/linux
>          git fetch --no-tags linux-review Josua-Mayer/dt-bindings-net-adin-document-phy-clock-output-properties/20220419-192719
>          git checkout 74d856f1c89a6534fd58889f20ad4b481b8191c9
>          # save the config file
>          mkdir build_dir && cp config build_dir/.config
>          COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=openrisc SHELL=/bin/bash drivers/net/phy/
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
>
> sparse warnings: (new ones prefixed by >>)
>>> drivers/net/phy/adin.c:448:27: sparse: sparse: Using plain integer as NULL pointer
> vim +448 drivers/net/phy/adin.c
>
>     444	
>     445	static int adin_config_clk_out(struct phy_device *phydev)
>     446	{
>     447		struct device *dev = &phydev->mdio.dev;
>   > 448		const char *val = 0;
>     449		u8 sel = 0;
>     450	
>     451		device_property_read_string(dev, "adi,phy-output-clock", &val);
>     452		if(!val) {
>     453			/* property not present, do not enable GP_CLK pin */
>     454		} else if(strcmp(val, "25mhz-reference") == 0) {
>     455			sel |= ADIN1300_GE_CLK_CFG_25;
>     456		} else if(strcmp(val, "125mhz-free-running") == 0) {
>     457			sel |= ADIN1300_GE_CLK_CFG_FREE_125;
>     458		} else if(strcmp(val, "125mhz-recovered") == 0) {
>     459			sel |= ADIN1300_GE_CLK_CFG_RCVR_125;
>     460		} else if(strcmp(val, "adaptive-free-running") == 0) {
>     461			sel |= ADIN1300_GE_CLK_CFG_HRT_FREE;
>     462		} else if(strcmp(val, "adaptive-recovered") == 0) {
>     463			sel |= ADIN1300_GE_CLK_CFG_HRT_RCVR;
>     464		} else {
>     465			phydev_err(phydev, "invalid adi,phy-output-clock\n");
>     466			return -EINVAL;
>     467		}
>     468	
>     469		if(device_property_read_bool(dev, "adi,phy-output-reference-clock"))
>     470			sel |= ADIN1300_GE_CLK_CFG_REF_EN;
>     471	
>     472		return phy_modify_mmd(phydev, MDIO_MMD_VEND1, ADIN1300_GE_CLK_CFG_REG,
>     473				      ADIN1300_GE_CLK_CFG_MASK, sel);
>     474	}
>     475	
>

^ permalink raw reply

* Aw: Re: [RFC v1 1/3] net: dsa: mt753x: make reset optional
From: Frank Wunderlich @ 2022-04-27  7:05 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Frank Wunderlich, linux-mediatek, linux-rockchip, Rob Herring,
	Krzysztof Kozlowski, Heiko Stuebner, Sean Wang, Landen Chao,
	DENG Qingfang, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Matthias Brugger,
	Peter Geis, devicetree, linux-arm-kernel, linux-kernel, netdev
In-Reply-To: <20220426235713.engzue7ujwqjdyjc@skbuf>

Hi,

you're right, reset is already optional...

idk why it was failing before the patch...i guess i had always defined the reset in dts on switch-side and dropped it same time with this patch.

Reset on both nodes (gmac+switch) blocks in switch driver because of exclusive (error message "could not get our reset line") and after dropping the reset on gmac-side the mdio-bus does not come up after switch driver resets gmac+switch (in loop with edefer).

> Gesendet: Mittwoch, 27. April 2022 um 01:57 Uhr
> Von: "Vladimir Oltean" <olteanv@gmail.com>
> On Tue, Apr 26, 2022 at 03:49:22PM +0200, Frank Wunderlich wrote:
> > From: Frank Wunderlich <frank-w@public-files.de>
> >
> > Currently a reset line is required, but on BPI-R2-Pro board
> > this reset is shared with the gmac and prevents the switch to
> > be initialized because mdio is not ready fast enough after
> > the reset.
> >
> > So make the reset optional to allow shared reset lines.
>
> What does it mean "to allow shared reset lines"? Allow as in "allow them
> to sit there, unused"?

for switch part unused right, but reset-line is used by gmac. If switch does the reset, it resets the gmac too and so the mdio goes down. It took longer to get up than the wait-poll after the reset and so i got mdio read errors.

> > -	} else {
> > +	} else if (priv->reset) {
>
> I don't really understand this patch. gpiod_set_value_cansleep() can
> tolerate NULL GPIO descriptors.

had not looked for NULL-tolerance, so i precautionary added the check.

> >  		gpiod_set_value_cansleep(priv->reset, 0);
> >  		usleep_range(1000, 1100);
> >  		gpiod_set_value_cansleep(priv->reset, 1);

> > @@ -3272,8 +3272,7 @@ mt7530_probe(struct mdio_device *mdiodev)
> >  		priv->reset = devm_gpiod_get_optional(&mdiodev->dev, "reset",
> >  						      GPIOD_OUT_LOW);
> >  		if (IS_ERR(priv->reset)) {
> > -			dev_err(&mdiodev->dev, "Couldn't get our reset line\n");
> > -			return PTR_ERR(priv->reset);
> > +			dev_warn(&mdiodev->dev, "Couldn't get our reset line\n");
>
> I certainly don't understand why you're suppressing the pointer-encoded
> errors here. The function used is devm_gpiod_get_optional(), which
> returns NULL for a missing reset-gpios, not IS_ERR(something). The
> IS_ERR(something) is actually important to not ignore, maybe it's
> IS_ERR(-EPROBE_DEFER). And this change breaks waiting for the descriptor
> to become available.

you're right...the intention was to not leave the probe function if not reset was defined...but yes, devm_gpiod_get_optional is called so reset is already optional.

> So what doesn't work without this patch, exactly?

reverted the Patch in my repo and it is still working :)

just ignore it. something went wrong during my tests...

sorry for the inconvenience.

regards Frank

^ permalink raw reply

* [PATCH net 6/7] tcp: increase source port perturb table to 2^16
From: Willy Tarreau @ 2022-04-27  6:52 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Eric Dumazet, Moshe Kol,
	Yossi Gilad, Amit Klein, linux-kernel, Willy Tarreau
In-Reply-To: <20220427065233.2075-1-w@1wt.eu>

Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately
identify a client by forcing it to emit only 40 times more connections
than there are entries in the table_perturb[] table. The previous two
improvements consisting in resalting the secret every 10s and adding
randomness to each port selection only slightly improved the situation,
and the current value of 2^8 was too small as it's not very difficult
to make a client emit 10k connections in less than 10 seconds.

Thus we're increasing the perturb table from 2^8 to 2^16 so that the the
same precision now requires 2.6M connections, which is more difficult in
this time frame and harder to hide as a background activity. The impact
is that the table now uses 256 kB instead of 1 kB, which could mostly
affect devices making frequent outgoing connections. However such
components usually target a small set of destinations (load balancers,
database clients, perf assessment tools), and in practice only a few
entries will be visited, like before.

A live test at 1 million connections per second showed no performance
difference from the previous value.

Reported-by: Moshe Kol <moshe.kol@mail.huji.ac.il>
Reported-by: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 net/ipv4/inet_hashtables.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index d746e5656baf..f30c50aaf8e2 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -726,11 +726,12 @@ EXPORT_SYMBOL_GPL(inet_unhash);
  * Note that we use 32bit integers (vs RFC 'short integers')
  * because 2^16 is not a multiple of num_ephemeral and this
  * property might be used by clever attacker.
- * RFC claims using TABLE_LENGTH=10 buckets gives an improvement,
- * we use 256 instead to really give more isolation and
- * privacy, this only consumes 1 KB of kernel memory.
+ * RFC claims using TABLE_LENGTH=10 buckets gives an improvement, though
+ * attacks were since demonstrated, thus we use 65536 instead to really
+ * give more isolation and privacy, at the expense of 256kB of kernel
+ * memory.
  */
-#define INET_TABLE_PERTURB_SHIFT 8
+#define INET_TABLE_PERTURB_SHIFT 16
 #define INET_TABLE_PERTURB_SIZE (1 << INET_TABLE_PERTURB_SHIFT)
 static u32 *table_perturb;

-- 
2.17.5

^ permalink raw reply related

* [PATCH net 7/7] tcp: drop the hash_32() part from the index calculation
From: Willy Tarreau @ 2022-04-27  6:52 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Eric Dumazet, Moshe Kol,
	Yossi Gilad, Amit Klein, linux-kernel, Willy Tarreau
In-Reply-To: <20220427065233.2075-1-w@1wt.eu>

In commit 190cc82489f4 ("tcp: change source port randomizarion at
connect() time"), the table_perturb[] array was introduced and an
index was taken from the port_offset via hash_32(). But it turns
out that hash_32() performs a multiplication while the input here
comes from the output of SipHash in secure_seq, that is well
distributed enough to avoid the need for yet another hash.

Suggested-by: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 net/ipv4/inet_hashtables.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index f30c50aaf8e2..a75e66d7eee6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -778,7 +778,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 	net_get_random_once(table_perturb,
 			    INET_TABLE_PERTURB_SIZE * sizeof(*table_perturb));
-	index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT);
+	index = port_offset & (INET_TABLE_PERTURB_SIZE - 1);
 
 	offset = (READ_ONCE(table_perturb[index]) + (port_offset >> 32)) % remaining;
 	/* In first pass we try ports of @low parity.
-- 
2.17.5


^ permalink raw reply related

* [PATCH net 5/7] tcp: dynamically allocate the perturb table used by source ports
From: Willy Tarreau @ 2022-04-27  6:52 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Eric Dumazet, Moshe Kol,
	Yossi Gilad, Amit Klein, linux-kernel, Willy Tarreau
In-Reply-To: <20220427065233.2075-1-w@1wt.eu>

We'll need to further increase the size of this table and it's likely
that at some point its size will not be suitable anymore for a static
table. Let's allocate it on boot from inet_hashinfo2_init(), which is
called from tcp_init().

Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 net/ipv4/inet_hashtables.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index f58c5caf3130..d746e5656baf 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -731,7 +731,8 @@ EXPORT_SYMBOL_GPL(inet_unhash);
  * privacy, this only consumes 1 KB of kernel memory.
  */
 #define INET_TABLE_PERTURB_SHIFT 8
-static u32 table_perturb[1 << INET_TABLE_PERTURB_SHIFT];
+#define INET_TABLE_PERTURB_SIZE (1 << INET_TABLE_PERTURB_SHIFT)
+static u32 *table_perturb;
 
 int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		struct sock *sk, u64 port_offset,
@@ -774,7 +775,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	if (likely(remaining > 1))
 		remaining &= ~1U;
 
-	net_get_random_once(table_perturb, sizeof(table_perturb));
+	net_get_random_once(table_perturb,
+			    INET_TABLE_PERTURB_SIZE * sizeof(*table_perturb));
 	index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT);
 
 	offset = (READ_ONCE(table_perturb[index]) + (port_offset >> 32)) % remaining;
@@ -910,6 +912,12 @@ void __init inet_hashinfo2_init(struct inet_hashinfo *h, const char *name,
 					    low_limit,
 					    high_limit);
 	init_hashinfo_lhash2(h);
+
+	/* this one is used for source ports of outgoing connections */
+	table_perturb = kmalloc_array(INET_TABLE_PERTURB_SIZE,
+				      sizeof(*table_perturb), GFP_KERNEL);
+	if (!table_perturb)
+		panic("TCP: failed to alloc table_perturb");
 }
 
 int inet_hashinfo2_init_mod(struct inet_hashinfo *h)
-- 
2.17.5


^ permalink raw reply related

* [PATCH net 4/7] tcp: add small random increments to the source port
From: Willy Tarreau @ 2022-04-27  6:52 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Eric Dumazet, Moshe Kol,
	Yossi Gilad, Amit Klein, linux-kernel, Willy Tarreau
In-Reply-To: <20220427065233.2075-1-w@1wt.eu>

Here we're randomly adding between 0 and 7 random increments to the
selected source port in order to add some noise in the source port
selection that will make the next port less predictable.

With the default port range of 32768-60999 this means a worst case
reuse scenario of 14116/8=1764 connections between two consecutive
uses of the same port, with an average of 14116/4.5=3137. This code
was stressed at more than 800000 connections per second to a fixed
target with all connections closed by the client using RSTs (worst
condition) and only 2 connections failed among 13 billion, despite
the hash being reseeded every 10 seconds, indicating a perfectly
safe situation.

Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 net/ipv4/inet_hashtables.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 747f272da25b..f58c5caf3130 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -831,11 +831,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	return -EADDRNOTAVAIL;

 ok:
-	/* If our first attempt found a candidate, skip next candidate
-	 * in 1/16 of cases to add some noise.
+	/* Here we want to add a little bit of randomness to the next source
+	 * port that will be chosen. We use a max() with a random here so that
+	 * on low contention the randomness is maximal and on high contention
+	 * it may be inexistent.
 	 */
-	if (!i && !(prandom_u32() % 16))
-		i = 2;
+	i = max_t(int, i, (prandom_u32() & 7) * 2);
 	WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);

 	/* Head lock still held and bh's disabled */
-- 
2.17.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox