Netdev List

Netdev List
 help / color / mirror / Atom feed

* RE: [PATCH net-next 4/5] dsa: Microchip KSZ switches SPI devicetree configuration
From: Woojung.Huh @ 2017-05-07  1:04 UTC (permalink / raw)
  To: sergei.shtylyov, andrew, f.fainelli, vivien.didelot
  Cc: netdev, davem, UNGLinuxDriver
In-Reply-To: <1ade6f7e-7a28-db47-3811-d35d90f57b56@cogentembedded.com>

Hi Sergei

>> +             spi1: spi@f8008000 {
>> +             pinctrl-0 = <&pinctrl_spi_ksz>;
>> +             cs-gpios = <&pioC 25 0>;
>> +             id = <1>;
>> +             status = "okay";
>
>    4 lines above should be indented more to the right.
>> +             };
>> +             };
>
>    Unmatched }?

Will fix both.

- Woojung

^ permalink raw reply

* Re: [PATCH iproute2] tc: bpf: add ppc64 and sparc64 to list of archs with eBPF support
From: Daniel Borkmann @ 2017-05-07 15:20 UTC (permalink / raw)
  To: Alexander Alemayhu, netdev; +Cc: stephen
In-Reply-To: <1494102610-19853-1-git-send-email-alexander@alemayhu.com>

On 05/06/2017 10:30 PM, Alexander Alemayhu wrote:
> sparc64 support was added in 7a12b5031c6b (sparc64: Add eBPF JIT., 2017-04-17)[0]
> and ppc64 in 156d0e290e96 (powerpc/ebpf/jit: Implement JIT compiler for extended BPF, 2016-06-22)[1].
>
> [0]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7a12b5031c6b
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=156d0e290e96
> Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* [PATCH net] bpf: don't let ldimm64 leak map addresses on unprivileged
From: Daniel Borkmann @ 2017-05-07 22:04 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, jannh, kafai, netdev, Daniel Borkmann

The patch fixes two things at once:

1) It checks the env->allow_ptr_leaks and only prints the map address to
   the log if we have the privileges to do so, otherwise it just dumps 0
   as we would when kptr_restrict is enabled on %pK. Given the latter is
   off by default and not every distro sets it, I don't want to rely on
   this, hence the 0 by default for unprivileged.

2) Printing of ldimm64 in the verifier log is currently broken in that
   we don't print the full immediate, but only the 32 bit part of the
   first insn part for ldimm64. Thus, fix this up as well; it's okay to
   access, since we verified all ldimm64 earlier already (including just
   constants) through replace_map_fd_with_map_ptr().

Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification log)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/verifier.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c2ff608..5d19a02 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -298,7 +298,8 @@ static void print_verifier_state(struct bpf_verifier_state *state)
 	[BPF_EXIT >> 4] = "exit",
 };
 
-static void print_bpf_insn(struct bpf_insn *insn)
+static void print_bpf_insn(const struct bpf_verifier_env *env,
+			   const struct bpf_insn *insn)
 {
 	u8 class = BPF_CLASS(insn->code);
 
@@ -362,9 +363,19 @@ static void print_bpf_insn(struct bpf_insn *insn)
 				insn->code,
 				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
 				insn->src_reg, insn->imm);
-		} else if (BPF_MODE(insn->code) == BPF_IMM) {
-			verbose("(%02x) r%d = 0x%x\n",
-				insn->code, insn->dst_reg, insn->imm);
+		} else if (BPF_MODE(insn->code) == BPF_IMM &&
+			   BPF_SIZE(insn->code) == BPF_DW) {
+			/* At this point, we already made sure that the second
+			 * part of the ldimm64 insn is accessible.
+			 */
+			u64 imm = ((u64)(insn + 1)->imm << 32) | (u32)insn->imm;
+			bool map_ptr = insn->src_reg == BPF_PSEUDO_MAP_FD;
+
+			if (map_ptr && !env->allow_ptr_leaks)
+				imm = 0;
+
+			verbose("(%02x) r%d = 0x%llx\n", insn->code,
+				insn->dst_reg, (unsigned long long)imm);
 		} else {
 			verbose("BUG_ld_%02x\n", insn->code);
 			return;
@@ -2853,7 +2864,7 @@ static int do_check(struct bpf_verifier_env *env)
 
 		if (log_level) {
 			verbose("%d: ", insn_idx);
-			print_bpf_insn(insn);
+			print_bpf_insn(env, insn);
 		}
 
 		err = ext_analyzer_insn_hook(env, insn_idx, prev_insn_idx);
-- 
1.9.3

^ permalink raw reply related

* Re: [PATCH iproute2] tc: bpf: add ppc64 and sparc64 to list of archs with eBPF support
From: David Miller @ 2017-05-07  0:23 UTC (permalink / raw)
  To: alexander; +Cc: netdev, daniel, stephen
In-Reply-To: <1494102610-19853-1-git-send-email-alexander@alemayhu.com>

From: Alexander Alemayhu <alexander@alemayhu.com>
Date: Sat,  6 May 2017 22:30:10 +0200

> sparc64 support was added in 7a12b5031c6b (sparc64: Add eBPF JIT., 2017-04-17)[0]
> and ppc64 in 156d0e290e96 (powerpc/ebpf/jit: Implement JIT compiler for extended BPF, 2016-06-22)[1].
> 
> [0]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7a12b5031c6b
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=156d0e290e96
> Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [PATCH net] bpf: don't let ldimm64 leak map addresses on unprivileged
From: Daniel Borkmann @ 2017-05-07 22:51 UTC (permalink / raw)
  To: Jann Horn; +Cc: David S. Miller, Alexei Starovoitov, kafai, netdev
In-Reply-To: <CAG48ez0S92fLmX5fEa1=HNC3wF+SYizske5MjiJDpW4=tLg6Uw@mail.gmail.com>

On 05/08/2017 12:26 AM, Jann Horn wrote:
> On Mon, May 8, 2017 at 12:04 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> The patch fixes two things at once:
>>
>> 1) It checks the env->allow_ptr_leaks and only prints the map address to
>>     the log if we have the privileges to do so, otherwise it just dumps 0
>>     as we would when kptr_restrict is enabled on %pK. Given the latter is
>>     off by default and not every distro sets it, I don't want to rely on
>>     this, hence the 0 by default for unprivileged.
>>
>> 2) Printing of ldimm64 in the verifier log is currently broken in that
>>     we don't print the full immediate, but only the 32 bit part of the
>>     first insn part for ldimm64. Thus, fix this up as well; it's okay to
>>     access, since we verified all ldimm64 earlier already (including just
>>     constants) through replace_map_fd_with_map_ptr().
>>
>> Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification log)")
>> Reported-by: Jann Horn <jannh@google.com>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> [...]
>> @@ -362,9 +363,19 @@ static void print_bpf_insn(struct bpf_insn *insn)
>>                                  insn->code,
>>                                  bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
>>                                  insn->src_reg, insn->imm);
>> -               } else if (BPF_MODE(insn->code) == BPF_IMM) {
>> -                       verbose("(%02x) r%d = 0x%x\n",
>> -                               insn->code, insn->dst_reg, insn->imm);
>> +               } else if (BPF_MODE(insn->code) == BPF_IMM &&
>> +                          BPF_SIZE(insn->code) == BPF_DW) {
>> +                       /* At this point, we already made sure that the second
>> +                        * part of the ldimm64 insn is accessible.
>> +                        */
>> +                       u64 imm = ((u64)(insn + 1)->imm << 32) | (u32)insn->imm;
>> +                       bool map_ptr = insn->src_reg == BPF_PSEUDO_MAP_FD;
>> +
>> +                       if (map_ptr && !env->allow_ptr_leaks)
>> +                               imm = 0;
>> +
>> +                       verbose("(%02x) r%d = 0x%llx\n", insn->code,
>> +                               insn->dst_reg, (unsigned long long)imm);
>>                  } else {
>>                          verbose("BUG_ld_%02x\n", insn->code);
>>                          return;
>
> You replaced the `BPF_MODE(insn->code) == BPF_IMM` branch with a
> `BPF_MODE(insn->code) == BPF_IMM && BPF_SIZE(insn->code) == BPF_DW`
> branch. Doesn't that break printing normal immediates?

What do you mean by 'normal' immediates? You mean loads of imm into
register, right? The ldimm64 is kind of special treated; for imms
fitting into 32 bit, there is BPF_MOV64_IMM() and BPF_MOV32_IMM()
otherwise.

F.e. see the jumptable in __bpf_prog_run(), which is the interpreter.
All BPF_LD instructions that we have are:

static const void *jumptable[256] = {
   [...]
   [BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
   [BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
   [BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
   [BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
   [BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
   [BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
   [BPF_LD | BPF_IMM | BPF_DW] = &&LD_IMM_DW,
};

In the print_bpf_insn() under class == BPF_LD, the BPF_ABS and BPF_IND
are separately handled (load of packet data from skb), and the BPF_IMM
is the one we're fixing, which only has BPF_DW as an option. I added it
there since we really only want to see BPF_DW in this branch due to the
double imm access.

^ permalink raw reply

* Re: [PATCH net] bpf: don't let ldimm64 leak map addresses on unprivileged
From: Jann Horn @ 2017-05-07 22:54 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: David S. Miller, Alexei Starovoitov, kafai, netdev
In-Reply-To: <590FA4D6.4060206@iogearbox.net>

On Mon, May 8, 2017 at 12:51 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 05/08/2017 12:26 AM, Jann Horn wrote:
>>
>> On Mon, May 8, 2017 at 12:04 AM, Daniel Borkmann <daniel@iogearbox.net>
>> wrote:
>>>
>>> The patch fixes two things at once:
>>>
>>> 1) It checks the env->allow_ptr_leaks and only prints the map address to
>>>     the log if we have the privileges to do so, otherwise it just dumps 0
>>>     as we would when kptr_restrict is enabled on %pK. Given the latter is
>>>     off by default and not every distro sets it, I don't want to rely on
>>>     this, hence the 0 by default for unprivileged.
>>>
>>> 2) Printing of ldimm64 in the verifier log is currently broken in that
>>>     we don't print the full immediate, but only the 32 bit part of the
>>>     first insn part for ldimm64. Thus, fix this up as well; it's okay to
>>>     access, since we verified all ldimm64 earlier already (including just
>>>     constants) through replace_map_fd_with_map_ptr().
>>>
>>> Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification
>>> log)")
>>> Reported-by: Jann Horn <jannh@google.com>
>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>>
>> [...]
>>>
>>> @@ -362,9 +363,19 @@ static void print_bpf_insn(struct bpf_insn *insn)
>>>                                  insn->code,
>>>                                  bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>>                                  insn->src_reg, insn->imm);
>>> -               } else if (BPF_MODE(insn->code) == BPF_IMM) {
>>> -                       verbose("(%02x) r%d = 0x%x\n",
>>> -                               insn->code, insn->dst_reg, insn->imm);
>>> +               } else if (BPF_MODE(insn->code) == BPF_IMM &&
>>> +                          BPF_SIZE(insn->code) == BPF_DW) {
>>> +                       /* At this point, we already made sure that the
>>> second
>>> +                        * part of the ldimm64 insn is accessible.
>>> +                        */
>>> +                       u64 imm = ((u64)(insn + 1)->imm << 32) |
>>> (u32)insn->imm;
>>> +                       bool map_ptr = insn->src_reg ==
>>> BPF_PSEUDO_MAP_FD;
>>> +
>>> +                       if (map_ptr && !env->allow_ptr_leaks)
>>> +                               imm = 0;
>>> +
>>> +                       verbose("(%02x) r%d = 0x%llx\n", insn->code,
>>> +                               insn->dst_reg, (unsigned long long)imm);
>>>                  } else {
>>>                          verbose("BUG_ld_%02x\n", insn->code);
>>>                          return;
>>
>>
>> You replaced the `BPF_MODE(insn->code) == BPF_IMM` branch with a
>> `BPF_MODE(insn->code) == BPF_IMM && BPF_SIZE(insn->code) == BPF_DW`
>> branch. Doesn't that break printing normal immediates?
>
>
> What do you mean by 'normal' immediates? You mean loads of imm into
> register, right? The ldimm64 is kind of special treated; for imms
> fitting into 32 bit, there is BPF_MOV64_IMM() and BPF_MOV32_IMM()
> otherwise.
>
> F.e. see the jumptable in __bpf_prog_run(), which is the interpreter.
> All BPF_LD instructions that we have are:
>
> static const void *jumptable[256] = {
>   [...]
>   [BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
>   [BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
>   [BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
>   [BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
>   [BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
>   [BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
>   [BPF_LD | BPF_IMM | BPF_DW] = &&LD_IMM_DW,
> };
>
> In the print_bpf_insn() under class == BPF_LD, the BPF_ABS and BPF_IND
> are separately handled (load of packet data from skb), and the BPF_IMM
> is the one we're fixing, which only has BPF_DW as an option. I added it
> there since we really only want to see BPF_DW in this branch due to the
> double imm access.

Ah, right, I missed that. Nevermind.

^ permalink raw reply

* Re: [PATCH] net: dsa: loop: Check for memory allocation failure
From: Florian Fainelli @ 2017-05-07 23:22 UTC (permalink / raw)
  To: Christophe JAILLET, andrew, vivien.didelot
  Cc: netdev, linux-kernel, kernel-janitors
In-Reply-To: <20170506052945.2639-1-christophe.jaillet@wanadoo.fr>

Le 05/05/17 à 22:29, Christophe JAILLET a écrit :
> If 'devm_kzalloc' fails, a NULL pointer will be dereferenced.
> Return -ENOMEM instead, as done for some other memory allocation just a
> few lines above.
> 
> Fixes: 98cd1552ea27 ("net: dsa: Mock-up driver")
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Acked-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 2/5] phy: micrel: add Microchip KSZ 9477 Switch PHY support
From: Andrew Lunn @ 2017-05-08  0:12 UTC (permalink / raw)
  To: Woojung.Huh; +Cc: f.fainelli, vivien.didelot, netdev, davem, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D40A5C2CA@CHN-SV-EXMX02.mchp-main.com>

On Sun, May 07, 2017 at 01:02:31AM +0000, Woojung.Huh@microchip.com wrote:
> >> +}, {
> >> +     .phy_id         = PHY_ID_KSZ9477,
> >> +     .phy_id_mask    = MICREL_PHY_ID_MASK,
> >> +     .name           = "Microchip KSZ9477",
> >> +     .features       = PHY_GBIT_FEATURES,
> >> +     .flags          = PHY_HAS_MAGICANEG,
> >
> >Is this magic still used anywhere? I could not find anything.
> >
> Hi Andrew,
> 
> Really no usage in define and flags.
> Will double check and remove it.

Hi Woojung

I just created a patch to remove PHY_HAS_MAGICANEG from the kernel.
Once net-next has reopened when the merge window closes, i will submit
it. So please don't add it here.

    Thanks
	Andrew

^ permalink raw reply

* Re: [PATCH v4 net-next 04/10] net/ncsi: Ethtool operation to get NCSI topology
From: Gavin Shan @ 2017-05-08  0:19 UTC (permalink / raw)
  To: David Laight
  Cc: 'Gavin Shan', Stephen Hemminger, netdev@vger.kernel.org,
	joe@perches.com, kubakici@wp.pl, f.fainelli@gmail.com,
	davem@davemloft.net
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DCFFE535E@AcuExch.aculab.com>

On Thu, May 04, 2017 at 09:31:20AM +0000, David Laight wrote:
>From: Gavin Shan
>> Sent: 04 May 2017 07:16
>> On Wed, May 03, 2017 at 10:19:44PM -0700, Stephen Hemminger wrote:
>> >On Wed,  3 May 2017 14:44:35 +1000
>> >Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>...
>> >> +{
>> >> +	struct ethtool_ncsi_channels *enc;
>> >> +	short nr_channels;
>> >Should be __u16 or unsigned not short.
>> >
>> 
>> Nope, It's for signed number. User expects to get number of available
>> channels when negative number is passed in. When it's positive, it's
>> going to get the channels' information.
>
>Why 16 bits?
>You are just making life hard for the compiler and possibly generating
>random padding.
>

It's because there are 256 NCSI channels to maximal degree. So 16-bits
is the minial data width to hold it in signed format. Yes, I think
__s32 would be better in this case. However, I would like to discard
the negotiation mechanism in next respin.

>I guess the user is expected to pass -1 first to get the number of
>channels, then allocate an appropriate sized array and call again
>specifying the number of channels?
>

It's correct.

>What happens if the number of channels changes between the two requests?
>

There are only one case the number changes from zero to x. In previous call,
zero is returned and userspace will get nothing. When x channels are probed,
it's stable and won't change. I don't see any problem because of it.

In next respin, I'll pass 256 entries directly. Each entry will have a flag
to indicate it's valid or not. No negotiation will be needed.

>I'd also suggest passing the size of each entry (in at least one direction).
>That way additional channel information can be added.
>

why? we have another command (ETHTOOL_GNCSICINFO) to retrieve information
about the specified channel.

Cheers,
Gavin

^ permalink raw reply

* Re: [PATCH v4 net-next 08/10] net/ncsi: Support NCSI packet generation
From: Gavin Shan @ 2017-05-08  0:25 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Gavin Shan, netdev, joe, kubakici, f.fainelli, davem
In-Reply-To: <20170504120035.GQ8029@lunn.ch>

On Thu, May 04, 2017 at 02:00:35PM +0200, Andrew Lunn wrote:
>On Thu, May 04, 2017 at 04:31:57PM +1000, Gavin Shan wrote:
>> On Wed, May 03, 2017 at 02:52:54PM +0200, Andrew Lunn wrote:
>> >On Wed, May 03, 2017 at 02:44:39PM +1000, Gavin Shan wrote:
>> >> This introduces /sys/kernel/debug/ncsi/eth0/pkt. The debugfs entry
>> >> can accept parameters to produce NCSI command packet. The received
>> >> NCSI response packet is dumped on read. Below is an example to send
>> >> CIS command and dump its response.
>> >> 
>> >>    # echo CIS,0,0 > /sys/kernel/debug/ncsi/eth0/pkt
>> >>    # cat /sys/kernel/debug/ncsi/eth0/pkt
>> >>    NCSI response [CIS] packet received
>> >> 
>> >>    00 01 dd 80 00 0004 0000 0000
>> >
>> >Could this be done with a raw socket for Tx and
>> >libpcap/tcpdump/wireshart for Rx?
>> >
>> 
>> Andrew, it's really good question. Unfortunately, I don't think it can
>> be done solely by raw socket to transmit NCSI command packet because the
>> [packet sequence number] field in the packet can't be same to any used ones.
>> Otherwise the remote NIC will be confused and start to reponse abnormally.
>
>Just to make sure i'm on the same page. We are talking about:
>
>https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.0.1.pdf
>
>and the sequence number is the Instance ID. This is an 8-bit number,
>and if it receives a message with the previously used IID, it should
>assume it is a re-transmit of the request and send back the old
>response. It is a very simple scheme, no windowing, only one
>outstanding command/response, just one response buffered in case of a
>re-transmit.
>
>> We could reserve some sequence number to be used by raw socket.
>
>I don't think that works. You can only have one command
>'inflight'. Packets from a raw socket would have to go through your
>state machine. Which makes it complex.
>
>libpcap should however still work. So you should probably do all the
>receive side in user space.
>

Andrew, yeah, we're on same page about the sequence number. Yes, I
agree it's going to make thing complex to transmit packet through
raw socket. So lets keep using debugfs as we had.

I need to dig how libpcap receives packets. It's appreciated if you
can give some hints about that. However, I don't see the benefit to
receive packets by libpcap, could you claim?

Cheers,
Gavin

^ permalink raw reply

* Re: [PATCH v4 net-next 08/10] net/ncsi: Support NCSI packet generation
From: Andrew Lunn @ 2017-05-08  0:56 UTC (permalink / raw)
  To: Gavin Shan; +Cc: netdev, joe, kubakici, f.fainelli, davem
In-Reply-To: <20170508002512.GA6012@gwshan>

> I need to dig how libpcap receives packets. It's appreciated if you
> can give some hints about that. However, I don't see the benefit to
> receive packets by libpcap, could you claim?

The base interface should already be doing it for you. Try it! Run
tcpdump or wireshark and you should see the NCSI packets going
out/coming in. Look for the ethertype. Neither tcpdump or wireshark
appear to have dissectors for NCSI, but it should be simple to
write. I've written them before, and it is easy.

Doing it in userspace will give you a much nice environment to work
in. Wireshark can link the response to the request, do a much more
detailed decode than what you want to do in kernel space, and in
general it is safer. Wrongly decoding and printing protocols in kernel
space can lead to a remote kernel vulnerability. Getting it wrong in
user space 'just' allows a remote hack of a user account.

     Andrew

^ permalink raw reply

* RE: FEC on i.MX 7 transmit queue timeout
From: Andy Duan @ 2017-05-08  2:13 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Stefan Agner, festevam@gmail.com, netdev@vger.kernel.org,
	netdev-owner@vger.kernel.org
In-Reply-To: <20170505122330.GA23432@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch> Sent: Friday, May 05, 2017 8:24 PM
>To: Andy Duan <fugang.duan@nxp.com>
>Cc: Stefan Agner <stefan@agner.ch>; festevam@gmail.com;
>netdev@vger.kernel.org; netdev-owner@vger.kernel.org
>Subject: Re: FEC on i.MX 7 transmit queue timeout
>
>> No, it is not workaround. As i said, quque1 and queue2 are for AVB
>> paths have higher priority in transmition.
>
>Does this higher priority result in the low priority queue being starved? Is that
>why the timer goes off? What happens when somebody does use AVB. Are
>we back to the same problem? This is what seems to make is sounds like a
>work around, not a fix.
>
>     Andrew
Yes, queue0 may be blocked by queue1 and queue2, then the queue0 watchdog time maybe triggered.
If somebody use AVB quque1 and queue2, the remaining bandwidth is for queue0, for example, in 100Mbps system, quque1 cost 50Mbps bandwidth and queue2 cost 50Mbps bandwidth for audio and video streaming, then queue0 (best effort) has 0 bandwidth that limit user case cannot have  asynchronous frames (IP(tcp/udp)) on networking. Of course these is extreme case. 
In essentially,  asynchronous frames (IP) go queue0 for the original design. To do these just implement .ndo_select_queue() callback in driver like fsl tree.

Regards,
Andy

^ permalink raw reply

* [PATCH net] ipv6: make sure dev is not NULL before call ip6_frag_reasm
From: Hangbin Liu @ 2017-05-08  3:09 UTC (permalink / raw)
  To: netdev; +Cc: Hangbin Liu

Since ip6_frag_reasm() will call __in6_dev_get(dev), which will access
dev->ip6_ptr. We need to make sure dev is not NULL.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 net/ipv6/reassembly.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e1da5b8..e3ebd62 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -348,7 +348,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 		fq->q.flags |= INET_FRAG_FIRST_IN;
 	}
 
-	if (fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
+	if (dev && fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
 	    fq->q.meat == fq->q.len) {
 		int res;
 		unsigned long orefdst = skb->_skb_refdst;
-- 
2.5.5

^ permalink raw reply related

* RE: [PATCH net v4 00/12] Fix possbile memleaks when fail to register_netdevice
From: Gao Feng @ 2017-05-08  4:18 UTC (permalink / raw)
  To: 'David Miller'
  Cc: jiri, mareklindner, sw, a, kuznet, jmorris, yoshfuji, kaber,
	steffen.klassert, herbert, netdev
In-Reply-To: <20170507.182550.695836233888168365.davem@davemloft.net>

Hi David,

> From: David Miller [mailto:davem@davemloft.net]
> From: gfree.wind@foxmail.com
> Date: Tue,  2 May 2017 13:58:42 +0800
> 
> > These following drivers allocate kinds of resources in its ndo_init
> > func, free some of them or all in the destructor func. Then there is
> > one memleak that some errors happen after register_netdevice invokes
> > the ndo_init callback. Because only the ndo_uninit callback is invoked
> > in the error handler of register_netdevice, but destructor not.
> >
> > In my original approach, I tried to free the resources in the newlink
> > func when fail to register_netdevice, like destructor did except not
> > free the net_dev. This method is not good when destructor is changed,
> > and the memleak could be not fixed when there is no newlink callback.
> >
> > Now create one new func used to free the resources in the destructor,
> > and the ndo_uninit func also could invokes it when fail to register
> > the net_device by comparing the dev->reg_state with
> > NETREG_UNINITIALIZED.  If there is no existing ndo_uninit, just add
> > one.
> >
> > This solution doesn't only make sure free all resources in any case,
> > but also follows the original desgin that some resources could be kept
> > until the destructor executes normally after register the device
> > successfully.
> 
> Device private teardown is in two stages for the following reason.
> The issue is that netdev_ops->ndo_init() allocates two types of resources.
> 
> One type is OK to release during destruction before the netdev refs goes
to
> zero.  This is what netdev_ops->ndo_uninit() is for.
> 
> The second type is for releasing things which are not safe to drop until
the very
> last netdev reference disappears.  This is what
> netdev->destructor() is for.
> 
> If you look around there are hacks in place all over to try and deal with
this
> issue.  Basically, look for code that checks the return value of
> register_netdev() and if an error is indicated it does some local driver
state
> freeing.  Bonding is one example.  It is trying to deal with the problem
this
> patch set is targetting.

Yes. When I was fixing this bug, I had found the bond had deal with this
issue, frees the resources by itself.

> 
> What really needs to happen is we must divorce the logic of
> dev->destructor() from free_netdev().
> 
> That way we can do the free_netdev in the unregister netdevice path after
> calling netdev->destructor().
> 
> Then the only issue callers of register_netdevice() need to be aware of is
that if
> an error is returned, that caller must call free_netdev().  Which has been
the
> case for decades.
> 
> So I would fix this as follows:
> 
> 1) Rename netdev->destructor() into "netdev->priv_destructor()"  It
>    performs all post ndo_ops->ndo_uninit() cleanups, _except_
>    free_netdev().  Update all drivers with netdev->destructor().
> 
> 2) Add a boolean state to netdev, which indicates if free_netdev()
>    should be performed after netdev->priv_destructor() during
>    unregister.
> 
> That provides all of the flexibility necessary to fix this bug in the
core.
> 
> In register_netdevice() if something after ndo_ops->ndo_init() succeeeds,
we
> invoke _both_ ndo_ops->ndo_uninit() and
> netdev->priv_destructor().  We do not look at the netdev "needs
> free_netdev()" boolean.
> 
> In netdev_run_todo(), where we have the one and only
> netdev->destructor() call, change it to:
> 
> 	if (dev->priv_destructor)
> 		dev->priv_destructor(dev);
> 	if (dev->needs_free_netdev)
> 		free_netdev(dev);
> 
> That fixes the bug in all cases.  And makes the purpose and logic
extremely
> clear.  Also, no internal state tests leak into the drivers.
> 
> Finally, drivers that try to cover up this issue, such as bonding, need to
be
> changed to no try and free up device private state if their invocation of
> register_netdevice() fails.
> 
> Thanks.

Ok. It is better to fix it by changing the framework of net_dev.
I was afraid that I would bring other bugs if I changed it.
Because it would affect too many codes.

I would like to see you fix it by yourself.

Best Regards
Feng

^ permalink raw reply

* [PATCH net] vit: check nla_put_* return value
From: Hangbin Liu @ 2017-05-08  6:36 UTC (permalink / raw)
  To: netdev; +Cc: Hangbin Liu

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 net/ipv4/ip_vti.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 4097741..4ec9aff 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -546,12 +546,13 @@ static int vti_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	struct ip_tunnel *t = netdev_priv(dev);
 	struct ip_tunnel_parm *p = &t->parms;
 
-	nla_put_u32(skb, IFLA_VTI_LINK, p->link);
-	nla_put_be32(skb, IFLA_VTI_IKEY, p->i_key);
-	nla_put_be32(skb, IFLA_VTI_OKEY, p->o_key);
-	nla_put_in_addr(skb, IFLA_VTI_LOCAL, p->iph.saddr);
-	nla_put_in_addr(skb, IFLA_VTI_REMOTE, p->iph.daddr);
-	nla_put_u32(skb, IFLA_VTI_FWMARK, t->fwmark);
+	if (nla_put_u32(skb, IFLA_VTI_LINK, p->link) ||
+	    nla_put_be32(skb, IFLA_VTI_IKEY, p->i_key) ||
+	    nla_put_be32(skb, IFLA_VTI_OKEY, p->o_key) ||
+	    nla_put_in_addr(skb, IFLA_VTI_LOCAL, p->iph.saddr) ||
+	    nla_put_in_addr(skb, IFLA_VTI_REMOTE, p->iph.daddr) ||
+	    nla_put_u32(skb, IFLA_VTI_FWMARK, t->fwmark))
+		return -EMSGSIZE;
 
 	return 0;
 }
-- 
2.5.5

^ permalink raw reply related

* Re: [PATCH v3 net-next 01/11] net: stmmac: prepare dma op mode config for multiple queues
From: Jan Kiszka @ 2017-05-08  6:56 UTC (permalink / raw)
  To: Joao Pinto, davem
  Cc: peppe.cavallaro, alexandre.torgue, netdev,
	Linux Kernel Mailing List
In-Reply-To: <efde5ec3d6873d0b2ad5b806f230f5804257915d.1489575025.git.jpinto@synopsys.com>

On 2017-03-15 12:04, Joao Pinto wrote:
> This patch prepares DMA Operation Mode configuration for multiple queues.
> The work consisted on breaking the DMA operation Mode configuration function
> into RX and TX scope and adapting its mechanism in stmmac_main.
> 
> Signed-off-by: Joao Pinto <jpinto@synopsys.com>
> ---
> changes v1->v3:
> - Just to keep up the patch-set version
> 
>  drivers/net/ethernet/stmicro/stmmac/common.h      |   3 +
>  drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c  | 118 +++++++++++-----------
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |  82 +++++++++++----
>  3 files changed, 124 insertions(+), 79 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
> index 9f0d26d..13bd3d4 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> @@ -424,6 +424,9 @@ struct stmmac_dma_ops {
>  	 * An invalid value enables the store-and-forward mode */
>  	void (*dma_mode)(void __iomem *ioaddr, int txmode, int rxmode,
>  			 int rxfifosz);
> +	void (*dma_rx_mode)(void __iomem *ioaddr, int mode, u32 channel,
> +			    int fifosz);
> +	void (*dma_tx_mode)(void __iomem *ioaddr, int mode, u32 channel);
>  	/* To track extra statistic (if supported) */
>  	void (*dma_diagnostic_fr) (void *data, struct stmmac_extra_stats *x,
>  				   void __iomem *ioaddr);
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
> index 6ac6b26..6285e8a 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
> @@ -182,70 +182,26 @@ static void dwmac4_rx_watchdog(void __iomem *ioaddr, u32 riwt)
>  		writel(riwt, ioaddr + DMA_CHAN_RX_WATCHDOG(i));
>  }
>  
> -static void dwmac4_dma_chan_op_mode(void __iomem *ioaddr, int txmode,
> -				    int rxmode, u32 channel, int rxfifosz)
> +static void dwmac4_dma_rx_chan_op_mode(void __iomem *ioaddr, int mode,
> +				       u32 channel, int fifosz)
>  {
> -	unsigned int rqs = rxfifosz / 256 - 1;
> -	u32 mtl_tx_op, mtl_rx_op, mtl_rx_int;
> -
> -	/* Following code only done for channel 0, other channels not yet
> -	 * supported.
> -	 */
> -	mtl_tx_op = readl(ioaddr + MTL_CHAN_TX_OP_MODE(channel));
> -
> -	if (txmode == SF_DMA_MODE) {
> -		pr_debug("GMAC: enable TX store and forward mode\n");
> -		/* Transmit COE type 2 cannot be done in cut-through mode. */
> -		mtl_tx_op |= MTL_OP_MODE_TSF;
> -	} else {
> -		pr_debug("GMAC: disabling TX SF (threshold %d)\n", txmode);
> -		mtl_tx_op &= ~MTL_OP_MODE_TSF;
> -		mtl_tx_op &= MTL_OP_MODE_TTC_MASK;
> -		/* Set the transmit threshold */
> -		if (txmode <= 32)
> -			mtl_tx_op |= MTL_OP_MODE_TTC_32;
> -		else if (txmode <= 64)
> -			mtl_tx_op |= MTL_OP_MODE_TTC_64;
> -		else if (txmode <= 96)
> -			mtl_tx_op |= MTL_OP_MODE_TTC_96;
> -		else if (txmode <= 128)
> -			mtl_tx_op |= MTL_OP_MODE_TTC_128;
> -		else if (txmode <= 192)
> -			mtl_tx_op |= MTL_OP_MODE_TTC_192;
> -		else if (txmode <= 256)
> -			mtl_tx_op |= MTL_OP_MODE_TTC_256;
> -		else if (txmode <= 384)
> -			mtl_tx_op |= MTL_OP_MODE_TTC_384;
> -		else
> -			mtl_tx_op |= MTL_OP_MODE_TTC_512;
> -	}
> -	/* For an IP with DWC_EQOS_NUM_TXQ == 1, the fields TXQEN and TQS are RO
> -	 * with reset values: TXQEN on, TQS == DWC_EQOS_TXFIFO_SIZE.
> -	 * For an IP with DWC_EQOS_NUM_TXQ > 1, the fields TXQEN and TQS are R/W
> -	 * with reset values: TXQEN off, TQS 256 bytes.
> -	 *
> -	 * Write the bits in both cases, since it will have no effect when RO.
> -	 * For DWC_EQOS_NUM_TXQ > 1, the top bits in MTL_OP_MODE_TQS_MASK might
> -	 * be RO, however, writing the whole TQS field will result in a value
> -	 * equal to DWC_EQOS_TXFIFO_SIZE, just like for DWC_EQOS_NUM_TXQ == 1.
> -	 */
> -	mtl_tx_op |= MTL_OP_MODE_TXQEN | MTL_OP_MODE_TQS_MASK;
> -	writel(mtl_tx_op, ioaddr +  MTL_CHAN_TX_OP_MODE(channel));
> +	unsigned int rqs = fifosz / 256 - 1;
> +	u32 mtl_rx_op, mtl_rx_int;
>  
>  	mtl_rx_op = readl(ioaddr + MTL_CHAN_RX_OP_MODE(channel));
>  
> -	if (rxmode == SF_DMA_MODE) {
> +	if (mode == SF_DMA_MODE) {
>  		pr_debug("GMAC: enable RX store and forward mode\n");
>  		mtl_rx_op |= MTL_OP_MODE_RSF;
>  	} else {
> -		pr_debug("GMAC: disable RX SF mode (threshold %d)\n", rxmode);
> +		pr_debug("GMAC: disable RX SF mode (threshold %d)\n", mode);
>  		mtl_rx_op &= ~MTL_OP_MODE_RSF;
>  		mtl_rx_op &= MTL_OP_MODE_RTC_MASK;
> -		if (rxmode <= 32)
> +		if (mode <= 32)
>  			mtl_rx_op |= MTL_OP_MODE_RTC_32;
> -		else if (rxmode <= 64)
> +		else if (mode <= 64)
>  			mtl_rx_op |= MTL_OP_MODE_RTC_64;
> -		else if (rxmode <= 96)
> +		else if (mode <= 96)
>  			mtl_rx_op |= MTL_OP_MODE_RTC_96;
>  		else
>  			mtl_rx_op |= MTL_OP_MODE_RTC_128;
> @@ -255,7 +211,7 @@ static void dwmac4_dma_chan_op_mode(void __iomem *ioaddr, int txmode,
>  	mtl_rx_op |= rqs << MTL_OP_MODE_RQS_SHIFT;
>  
>  	/* enable flow control only if each channel gets 4 KiB or more FIFO */
> -	if (rxfifosz >= 4096) {
> +	if (fifosz >= 4096) {
>  		unsigned int rfd, rfa;
>  
>  		mtl_rx_op |= MTL_OP_MODE_EHFC;
> @@ -266,7 +222,7 @@ static void dwmac4_dma_chan_op_mode(void __iomem *ioaddr, int txmode,
>  		 * Set Threshold for Deactivating Flow Control to min 1 frame,
>  		 * i.e. 1500 bytes.
>  		 */
> -		switch (rxfifosz) {
> +		switch (fifosz) {
>  		case 4096:
>  			/* This violates the above formula because of FIFO size
>  			 * limit therefore overflow may occur in spite of this.
> @@ -306,11 +262,49 @@ static void dwmac4_dma_chan_op_mode(void __iomem *ioaddr, int txmode,
>  	       ioaddr + MTL_CHAN_INT_CTRL(channel));
>  }
>  
> -static void dwmac4_dma_operation_mode(void __iomem *ioaddr, int txmode,
> -				      int rxmode, int rxfifosz)
> +static void dwmac4_dma_tx_chan_op_mode(void __iomem *ioaddr, int mode,
> +				       u32 channel)
>  {
> -	/* Only Channel 0 is actually configured and used */
> -	dwmac4_dma_chan_op_mode(ioaddr, txmode, rxmode, 0, rxfifosz);
> +	u32 mtl_tx_op = readl(ioaddr + MTL_CHAN_TX_OP_MODE(channel));
> +
> +	if (mode == SF_DMA_MODE) {
> +		pr_debug("GMAC: enable TX store and forward mode\n");
> +		/* Transmit COE type 2 cannot be done in cut-through mode. */
> +		mtl_tx_op |= MTL_OP_MODE_TSF;
> +	} else {
> +		pr_debug("GMAC: disabling TX SF (threshold %d)\n", mode);
> +		mtl_tx_op &= ~MTL_OP_MODE_TSF;
> +		mtl_tx_op &= MTL_OP_MODE_TTC_MASK;
> +		/* Set the transmit threshold */
> +		if (mode <= 32)
> +			mtl_tx_op |= MTL_OP_MODE_TTC_32;
> +		else if (mode <= 64)
> +			mtl_tx_op |= MTL_OP_MODE_TTC_64;
> +		else if (mode <= 96)
> +			mtl_tx_op |= MTL_OP_MODE_TTC_96;
> +		else if (mode <= 128)
> +			mtl_tx_op |= MTL_OP_MODE_TTC_128;
> +		else if (mode <= 192)
> +			mtl_tx_op |= MTL_OP_MODE_TTC_192;
> +		else if (mode <= 256)
> +			mtl_tx_op |= MTL_OP_MODE_TTC_256;
> +		else if (mode <= 384)
> +			mtl_tx_op |= MTL_OP_MODE_TTC_384;
> +		else
> +			mtl_tx_op |= MTL_OP_MODE_TTC_512;
> +	}
> +	/* For an IP with DWC_EQOS_NUM_TXQ == 1, the fields TXQEN and TQS are RO
> +	 * with reset values: TXQEN on, TQS == DWC_EQOS_TXFIFO_SIZE.
> +	 * For an IP with DWC_EQOS_NUM_TXQ > 1, the fields TXQEN and TQS are R/W
> +	 * with reset values: TXQEN off, TQS 256 bytes.
> +	 *
> +	 * Write the bits in both cases, since it will have no effect when RO.
> +	 * For DWC_EQOS_NUM_TXQ > 1, the top bits in MTL_OP_MODE_TQS_MASK might
> +	 * be RO, however, writing the whole TQS field will result in a value
> +	 * equal to DWC_EQOS_TXFIFO_SIZE, just like for DWC_EQOS_NUM_TXQ == 1.
> +	 */
> +	mtl_tx_op |= MTL_OP_MODE_TXQEN | MTL_OP_MODE_TQS_MASK;
> +	writel(mtl_tx_op, ioaddr +  MTL_CHAN_TX_OP_MODE(channel));
>  }
>  
>  static void dwmac4_get_hw_feature(void __iomem *ioaddr,
> @@ -387,7 +381,8 @@ const struct stmmac_dma_ops dwmac4_dma_ops = {
>  	.init = dwmac4_dma_init,
>  	.axi = dwmac4_dma_axi,
>  	.dump_regs = dwmac4_dump_dma_regs,
> -	.dma_mode = dwmac4_dma_operation_mode,
> +	.dma_rx_mode = dwmac4_dma_rx_chan_op_mode,
> +	.dma_tx_mode = dwmac4_dma_tx_chan_op_mode,
>  	.enable_dma_irq = dwmac4_enable_dma_irq,
>  	.disable_dma_irq = dwmac4_disable_dma_irq,
>  	.start_tx = dwmac4_dma_start_tx,
> @@ -409,7 +404,8 @@ const struct stmmac_dma_ops dwmac410_dma_ops = {
>  	.init = dwmac4_dma_init,
>  	.axi = dwmac4_dma_axi,
>  	.dump_regs = dwmac4_dump_dma_regs,
> -	.dma_mode = dwmac4_dma_operation_mode,
> +	.dma_rx_mode = dwmac4_dma_rx_chan_op_mode,
> +	.dma_tx_mode = dwmac4_dma_tx_chan_op_mode,
>  	.enable_dma_irq = dwmac410_enable_dma_irq,
>  	.disable_dma_irq = dwmac4_disable_dma_irq,
>  	.start_tx = dwmac4_dma_start_tx,
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index ec363e1..c4e4a53 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1285,14 +1285,20 @@ static void stmmac_mac_enable_rx_queues(struct stmmac_priv *priv)
>   */
>  static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
>  {
> +	u32 rx_channels_count = priv->plat->rx_queues_to_use;
> +	u32 tx_channels_count = priv->plat->tx_queues_to_use;
>  	int rxfifosz = priv->plat->rx_fifo_size;
> +	u32 txmode = 0;
> +	u32 rxmode = 0;
> +	u32 chan = 0;
>  
>  	if (rxfifosz == 0)
>  		rxfifosz = priv->dma_cap.rx_fifo_size;
>  
> -	if (priv->plat->force_thresh_dma_mode)
> -		priv->hw->dma->dma_mode(priv->ioaddr, tc, tc, rxfifosz);
> -	else if (priv->plat->force_sf_dma_mode || priv->plat->tx_coe) {
> +	if (priv->plat->force_thresh_dma_mode) {
> +		txmode = tc;
> +		rxmode = tc;
> +	} else if (priv->plat->force_sf_dma_mode || priv->plat->tx_coe) {
>  		/*
>  		 * In case of GMAC, SF mode can be enabled
>  		 * to perform the TX COE in HW. This depends on:
> @@ -1300,12 +1306,26 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
>  		 * 2) There is no bugged Jumbo frame support
>  		 *    that needs to not insert csum in the TDES.
>  		 */
> -		priv->hw->dma->dma_mode(priv->ioaddr, SF_DMA_MODE, SF_DMA_MODE,
> -					rxfifosz);
> +		txmode = SF_DMA_MODE;
> +		rxmode = SF_DMA_MODE;
>  		priv->xstats.threshold = SF_DMA_MODE;
> -	} else
> -		priv->hw->dma->dma_mode(priv->ioaddr, tc, SF_DMA_MODE,
> +	} else {
> +		txmode = tc;
> +		rxmode = SF_DMA_MODE;
> +	}
> +
> +	/* configure all channels */
> +	if (priv->synopsys_id >= DWMAC_CORE_4_00) {
> +		for (chan = 0; chan < rx_channels_count; chan++)
> +			priv->hw->dma->dma_rx_mode(priv->ioaddr, rxmode, chan,
> +						   rxfifosz);
> +
> +		for (chan = 0; chan < tx_channels_count; chan++)
> +			priv->hw->dma->dma_tx_mode(priv->ioaddr, txmode, chan);
> +	} else {
> +		priv->hw->dma->dma_mode(priv->ioaddr, txmode, rxmode,
>  					rxfifosz);
> +	}
>  }
>  
>  /**
> @@ -1444,6 +1464,34 @@ static void stmmac_tx_err(struct stmmac_priv *priv)
>  }
>  
>  /**
> + *  stmmac_set_dma_operation_mode - Set DMA operation mode by channel
> + *  @priv: driver private structure
> + *  @txmode: TX operating mode
> + *  @rxmode: RX operating mode
> + *  @chan: channel index
> + *  Description: it is used for configuring of the DMA operation mode in
> + *  runtime in order to program the tx/rx DMA thresholds or Store-And-Forward
> + *  mode.
> + */
> +static void stmmac_set_dma_operation_mode(struct stmmac_priv *priv, u32 txmode,
> +					  u32 rxmode, u32 chan)
> +{
> +	int rxfifosz = priv->plat->rx_fifo_size;
> +
> +	if (rxfifosz == 0)
> +		rxfifosz = priv->dma_cap.rx_fifo_size;
> +
> +	if (priv->synopsys_id >= DWMAC_CORE_4_00) {
> +		priv->hw->dma->dma_rx_mode(priv->ioaddr, rxmode, chan,
> +					   rxfifosz);
> +		priv->hw->dma->dma_tx_mode(priv->ioaddr, txmode, chan);
> +	} else {
> +		priv->hw->dma->dma_mode(priv->ioaddr, txmode, rxmode,
> +					rxfifosz);
> +	}
> +}
> +
> +/**
>   * stmmac_dma_interrupt - DMA ISR
>   * @priv: driver private structure
>   * Description: this is the DMA ISR. It is called by the main ISR.
> @@ -1452,11 +1500,8 @@ static void stmmac_tx_err(struct stmmac_priv *priv)
>   */
>  static void stmmac_dma_interrupt(struct stmmac_priv *priv)
>  {
> +	u32 chan = STMMAC_CHAN0;
>  	int status;
> -	int rxfifosz = priv->plat->rx_fifo_size;
> -
> -	if (rxfifosz == 0)
> -		rxfifosz = priv->dma_cap.rx_fifo_size;
>  
>  	status = priv->hw->dma->dma_interrupt(priv->ioaddr, &priv->xstats);
>  	if (likely((status & handle_rx)) || (status & handle_tx)) {
> @@ -1471,11 +1516,12 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
>  		    (tc <= 256)) {
>  			tc += 64;
>  			if (priv->plat->force_thresh_dma_mode)
> -				priv->hw->dma->dma_mode(priv->ioaddr, tc, tc,
> -							rxfifosz);
> +				stmmac_set_dma_operation_mode(priv->ioaddr,
> +							      tc, tc, chan);
>  			else
> -				priv->hw->dma->dma_mode(priv->ioaddr, tc,
> -							SF_DMA_MODE, rxfifosz);
> +				stmmac_set_dma_operation_mode(priv->ioaddr, tc,
> +							     SF_DMA_MODE, chan);
> +
>  			priv->xstats.threshold = tc;
>  		}
>  	} else if (unlikely(status == tx_hard_error))
> @@ -1749,6 +1795,9 @@ static void stmmac_mtl_configuration(struct stmmac_priv *priv)
>  	/* Enable MAC RX Queues */
>  	if (rx_queues_count > 1 && priv->hw->mac->rx_queue_enable)
>  		stmmac_mac_enable_rx_queues(priv);
> +
> +	/* Set the HW DMA mode and the COE */
> +	stmmac_dma_operation_mode(priv);
>  }
>  
>  /**
> @@ -1812,9 +1861,6 @@ static int stmmac_hw_setup(struct net_device *dev, bool init_ptp)
>  	else
>  		stmmac_set_mac(priv->ioaddr, true);
>  
> -	/* Set the HW DMA mode and the COE */
> -	stmmac_dma_operation_mode(priv);
> -
>  	stmmac_mmc_setup(priv);
>  
>  	if (init_ptp) {
> 

Starting with this patch, the stmmac-based network adapters of the Intel
Quark SoC stop working. I'm getting an IP via DHCP, I can ping, but TCP
connections can no longer be established.

Moving on a few patches (didn't bisect the exact one yet), the TX
watchdog starts to fire, and DHCP fails completely. And if I go to
current master in Linus tree (reverting an unrelated boot regression), I
even get a crash in stmmac_xmit.

Here are some details about the hw from dma_cap POV, if this helps:

==============================
        DMA HW features
==============================
        10/100 Mbps: Y
        1000 Mbps: N
        Half duplex: Y
        Hash Filter: Y
        Multiple MAC address registers: N
        PCS (TBI/SGMII/RTBI PHY interfaces): N
        SMA (MDIO) Interface: Y
        PMT Remote wake up: N
        PMT Magic Frame: N
        RMON module: Y
        IEEE 1588-2002 Time Stamp: N
        IEEE 1588-2008 Advanced Time Stamp: Y
        802.3az - Energy-Efficient Ethernet (EEE): N
        AV features: N
        Checksum Offload in TX: Y
        IP Checksum Offload (type1) in RX: N
        IP Checksum Offload (type2) in RX: Y
        RXFIFO > 2048bytes: Y
        Number of Additional RX channel: 0
        Number of Additional TX channel: 0
        Enhanced descriptors: Y

Given the number of different failure modes, my feeling is that there
are multiple regressions coming with these patches...

I've tested on the IOT2000 board, but I suspect the Galileo Gen2 will be
affected equally. If you don't have access to any such device, let me
know what I can debug for you.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply

* Re: [RFC PATCH 0/3] udp: scalability improvements
From: Paolo Abeni @ 2017-05-08  7:29 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller, Eric Dumazet
In-Reply-To: <CALx6S349K0_61uKs0Aje4f-8yjkQdLgFTxgF=Xd1jOyS0uddVQ@mail.gmail.com>

On Sat, 2017-05-06 at 16:09 -0700, Tom Herbert wrote:
> On Sat, May 6, 2017 at 1:42 PM, Paolo Abeni <pabeni@redhat.com> wrote:
> > This patch series implement an idea suggested by Eric Dumazet to
> > reduce the contention of the udp sk_receive_queue lock when the socket is
> > under flood.
> > 
> > An ancillary queue is added to the udp socket, and the socket always
> > tries first to read packets from such queue. If it's empty, we splice
> > the content from sk_receive_queue into the ancillary queue.
> > 
> > The first patch introduces some helpers to keep the udp code small, and the
> > following two implement the ancillary queue strategy. The code is split
> > to hopefully help the reviewing process.
> > 
> > The measured overall gain under udp flood is in the 20-35% range depending on
> > the numa layout and the number of ingress queue used by the relevant nic.
> > 
> 
> Certainly sounds good, but can you give real reproducible performance
> numbers including the test that was run?

You are right, and I'm sorry, the cover letter was too terse.

I used pktgen as sender, with 64 bytes packets, random src port on an
host b2b connected via a 10Gbs link with the dut.

On the receiver I used the udp_sink program by Jesper (https://github.c
om/netoptimizer/network-testing/blob/master/src/udp_sink.c) and I
configured an h/w l4 rx hash, so that I could control the number of
ingress nic rx queues hit by the udp traffic via ethtool -L.

The udp_sink program was bound to the first idle cpu, to get more
stable numbers.

Using a single numa note as receiver, I got the following:

nic rx queues		vanilla			patched kernel
1			1820 kpps		1900 kpps
2			1950 kpps		2500 kpps
16			1670 kpps		2120 kpps

When using a single nic rx queue I also enabled busy polling;
elsewhere, in my scenario, the bh processing becames the bottle-neck
and this produces large artifacts in the measured performances (e.g.
improving the udp sink run time, decreases the overall tput, since more
action from the scheduler comes into play).

Cheers,

Paolo

^ permalink raw reply

* [PATCH net] xfrm: Fix NETDEV_DOWN with IPSec offload
From: ilant @ 2017-05-08  7:30 UTC (permalink / raw)
  To: David Miller, Steffen Klassert; +Cc: Boris Pismenny, netdev, Ilan Tayari

From: Ilan Tayari <ilant@mellanox.com>

Upon NETDEV_DOWN event, all xfrm_state objects which are bound to
the device are flushed.

The condition for this is wrong, though, testing dev->hw_features
instead of dev->features. If a device has non-user-modifiable
NETIF_F_HW_ESP, then its xfrm_state objects are not flushed,
causing a crash later on after the device is deleted.

Check dev->features instead of dev->hw_features.

Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 8ec8a3fcf8d4..574e6f32f94f 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -170,7 +170,7 @@ static int xfrm_dev_feat_change(struct net_device *dev)
 
 static int xfrm_dev_down(struct net_device *dev)
 {
-	if (dev->hw_features & NETIF_F_HW_ESP)
+	if (dev->features & NETIF_F_HW_ESP)
 		xfrm_dev_state_flush(dev_net(dev), dev, true);
 
 	xfrm_garbage_collect(dev_net(dev));
-- 
2.11.0

^ permalink raw reply related

* [PATCH net v2 0/1] xfrm: Fix NETDEV_DOWN with IPSec offload
From: ilant @ 2017-05-08  7:39 UTC (permalink / raw)
  To: David Miller, Steffen Klassert; +Cc: Boris Pismenny, netdev, Ilan Tayari

From: Ilan Tayari <ilant@mellanox.com>

v1 -> v2: Added Fixes tag

Ilan Tayari (1):
  xfrm: Fix NETDEV_DOWN with IPSec offload

 net/xfrm/xfrm_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net v2 1/1] xfrm: Fix NETDEV_DOWN with IPSec offload
From: ilant @ 2017-05-08  7:39 UTC (permalink / raw)
  To: David Miller, Steffen Klassert; +Cc: Boris Pismenny, netdev, Ilan Tayari
In-Reply-To: <20170508073934.28529-1-ilant@mellanox.com>

From: Ilan Tayari <ilant@mellanox.com>

Upon NETDEV_DOWN event, all xfrm_state objects which are bound to
the device are flushed.

The condition for this is wrong, though, testing dev->hw_features
instead of dev->features. If a device has non-user-modifiable
NETIF_F_HW_ESP, then its xfrm_state objects are not flushed,
causing a crash later on after the device is deleted.

Check dev->features instead of dev->hw_features.

Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 8ec8a3fcf8d4..574e6f32f94f 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -170,7 +170,7 @@ static int xfrm_dev_feat_change(struct net_device *dev)
 
 static int xfrm_dev_down(struct net_device *dev)
 {
-	if (dev->hw_features & NETIF_F_HW_ESP)
+	if (dev->features & NETIF_F_HW_ESP)
 		xfrm_dev_state_flush(dev_net(dev), dev, true);
 
 	xfrm_garbage_collect(dev_net(dev));
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v4 net-next 08/10] net/ncsi: Support NCSI packet generation
From: Gavin Shan @ 2017-05-08  6:27 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Gavin Shan, netdev, joe, kubakici, f.fainelli, davem
In-Reply-To: <20170508005632.GF27709@lunn.ch>

On Mon, May 08, 2017 at 02:56:32AM +0200, Andrew Lunn wrote:
>> I need to dig how libpcap receives packets. It's appreciated if you
>> can give some hints about that. However, I don't see the benefit to
>> receive packets by libpcap, could you claim?
>
>The base interface should already be doing it for you. Try it! Run
>tcpdump or wireshark and you should see the NCSI packets going
>out/coming in. Look for the ethertype. Neither tcpdump or wireshark
>appear to have dissectors for NCSI, but it should be simple to
>write. I've written them before, and it is easy.
>
>Doing it in userspace will give you a much nice environment to work
>in. Wireshark can link the response to the request, do a much more
>detailed decode than what you want to do in kernel space, and in
>general it is safer. Wrongly decoding and printing protocols in kernel
>space can lead to a remote kernel vulnerability. Getting it wrong in
>user space 'just' allows a remote hack of a user account.
>

Andrew, thanks a lot for the hints. Yeah, I think it's implemented based
on AF_PACKET + ETH_P_ALL. NCSI packets has been included.

The output from (libpcap and debugfs) will be different, but I assume it's
not a problem. libpcap usually outputs all Tx/Rx NCSI packets while debugfs
file outputs received NCSI response packets that correspond to the NCSI command
packets sent via the debugfs interface.

In next respin, I'll move the Rx path to libpcap while the Tx is left in
the debugfs interace.

Cheers,
Gavin

^ permalink raw reply

* Re: [PATCH net-next ] tcp: add new parameter tcp_inherit_buffsize to control the initial buffer size for new passive connetion
From: Shan Wei @ 2017-05-08  7:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller
In-Reply-To: <1493992346.7796.43.camel@edumazet-glaptop3.roam.corp.google.com>

at 2017/5/5 21:52
> On Fri, 2017-05-05 at 18:22 +0800,  wrote:
>>
>> 2017-05-02 23:05 GMT+08:00 Eric Dumazet <eric.dumazet@gmail.com>:
>>         
>>         
>>         Hi Shan
>>         
>>         1) Your patch never reached netdev, because it was sent in
>>         HTML format.
>>         
>>         2) During Linus merge window, net-next is closed
>>         
>>         I am not really convinced that we need this with TCP
>>         autotuning anyway.
>>         Initial value of sk_sndbuf and sk_rcvbuf is really a hint.
>>         
>>         How often do you really tweak /proc/sys/net/ipv4 files in
>>         production.
>>
>>
>>
> Again you posted your reply in HTML, so netdev did not get it.
Now, fix it with thunderbird instead of gmail web client.
>> Historically, most products have not adjusted any wmem/rmem
>> parameters.
>> In our experience to adjust the parameters, you can get a better visit
>> experience, faster. 
>>
>> Now more and more products want to change this default value.
>> But for a product, once we adjust the parameters, it is rarely
>> adjusted.
>>
>>
>> We hope that this patch can provide a capacity: adjust the parameters
>> online and quickly take effect.
>>
>> Without this patch, it is time consuming for us to adjust the
>> parameters and need to migrate flows.
>>
>>
>>
>>          
>>         Please provide more information, like what actual values you
>>         change back
>>         and forth.
>>         
>>
>>
>> There is no a fixed value here. big value, will take up more memory
>> resources.
>> But we are also willing to sacrifice memory for performance.
>>
>> Select the parameter value based on the characteristics of the
>> product.
>>
>> For example, for 1MB files, we will set buffsize to 1611646 = (1MB /
>> 1460) * 2244
>> Mss = 1460, SKB_TRUESIZE = 2244
>>
>>
>> With TCP autotuning, reach the buffer size until the cwnd equal 359
>> from 10.
> Looks like a problem with autotuning because you might have HZ=100 or
> HZ=250 maybe ?
Indeed,  HZ=250.  What is the problem about autotuning with HZ=250?

^ permalink raw reply

* net_sched strange in 4.11
From: Anton Ivanov @ 2017-05-08  7:15 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

Hi all,

I was revising some of my old work for UML to prepare it for submission 
and I noticed that skb->xmit_more does not seem to be set any more.

I traced the issue as far as net/sched/sched_generic.c

try_bulk_dequeue_skb() is never invoked (the drivers I am working on are 
dql enabled so that is not the problem).

More interestingly, if I put a breakpoint and debug output into 
dequeue_skb() around line 147 - right before the bulk: tag that skb 
there is always NULL. ???

Similarly, debug in pfifo_fast_dequeue shows only NULLs being dequeued. 
Again - ???

First and foremost, I apologize for the silly question, but how can this 
work at all? I see the skbs showing up at the driver level, why are 
NULLs being returned at qdisc dequeue and where do the skbs at the 
driver level come from?

Second, where should I look to fix it?

A.

^ permalink raw reply

* Re: Network cooling device and how to control NIC speed on thermal condition
From: Waldemar Rymarkiewicz @ 2017-05-08  8:08 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Alan Cox, Florian Fainelli, netdev, linux-kernel
In-Reply-To: <20170428115630.GG13231@lunn.ch>

On 28 April 2017 at 13:56, Andrew Lunn <andrew@lunn.ch> wrote:
> Is that a realistic test? No traffic over the network? If you are
> hitting your thermal limit, to me that means one of two things:
>
> 1) The device is under very heavy load, consuming a lot of power to do
>    what it needs to to.
>
> 2) Your device is idle, no packets are flowing, but your thermal
>    design is wrong, so that it cannot dissipate enough heat.
>
> It seems to me, you are more interested in 1). But your quick test is
> more about 2).

The test was not realistic indeed, but it was rather about showing how
link speed correlates to temperature. In the test, I was not under any
thermal condition. But the same gain on the temperature we can achieve
when we hit hot temperature trip point and does not matter how heavy
the network traffic is. It's not said the source of heat is a heavy
network traffic. There can be several sources of heat on SoC.
However, the fact is that PHYs having active 1G/s link generate much
more heat than having 100M/s link independently from network traffic.

> I would be more interested in do quick tests of switching 8Gbps,
> 4Gbps, 2Gbps, 1Gbps, 512Mbps, 256Bps, ... What effect does this have
> on temperature?
>
>> So, throttling link speed can really help to dissipate heat
>> significantly when the platform is under threat.
>>
>> Renegotiating link speed costs something I agree, it also impacts user
>> experience, but such a thermal condition will not occur often I
>> believe.
>
> It is a heavy handed approach, and you have to be careful. There are
> some devices which don't work properly, e.g. if you try to negotiate
> 1000 half duplex, you might find the link just breaks.

That is a valuable remark. I definitely need to run some interoperability tests.

> Doing this via packet filtering, dropping packets, gives you a much
> finer grained control and is a lot less disruptive. But it assumes
> handling packets is what it causing you heat problems, not the links
> themselves.

Link speed manipulation is considered by me as one of a cooling
method, the way to maintain the temperature along with cpufreq, fan
etc. It's not said, the heat is caused by heavy network traffic
itself. So, packets filtering is not of my interests.

All cooling methods impact host only, but "net cooling" impacts remote
side in addition, which seems to me to be a problem sometimes. Also,
the moment of link renegotiation blocks rx/tx for upper layers, so the
user sees a pause when streaming a video for example. However, if a
system is under a thermal condition, does it really matter?

/Waldek

^ permalink raw reply

* Re: [PATCH net] bpf: don't let ldimm64 leak map addresses on unprivileged
From: Daniel Borkmann @ 2017-05-08  8:44 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, jannh, kafai, netdev
In-Reply-To: <793c517a7d163c613ab886eb02d32efea9f902fd.1494194233.git.daniel@iogearbox.net>

On 05/08/2017 12:04 AM, Daniel Borkmann wrote:
> The patch fixes two things at once:
>
> 1) It checks the env->allow_ptr_leaks and only prints the map address to
>     the log if we have the privileges to do so, otherwise it just dumps 0
>     as we would when kptr_restrict is enabled on %pK. Given the latter is
>     off by default and not every distro sets it, I don't want to rely on
>     this, hence the 0 by default for unprivileged.
>
> 2) Printing of ldimm64 in the verifier log is currently broken in that
>     we don't print the full immediate, but only the 32 bit part of the
>     first insn part for ldimm64. Thus, fix this up as well; it's okay to
>     access, since we verified all ldimm64 earlier already (including just
>     constants) through replace_map_fd_with_map_ptr().

This one is also needed for the log (should come first):

Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")

> Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification log)")
> Reported-by: Jann Horn <jannh@google.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks,
Daniel

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox