* Re: linux-next: build failure after merge of the bpf-next tree
From: Alexei Starovoitov @ 2018-09-07 5:45 UTC (permalink / raw)
To: Björn Töpel
Cc: sfr, Daniel Borkmann, ast, Netdev, linux-next, LKML,
Björn Töpel
In-Reply-To: <CAJ+HfNgRd5ztzwJ9HtgnrDyf8F0ykODY_EGC91d9KiAfDbhcXA@mail.gmail.com>
On Fri, Sep 07, 2018 at 07:21:05AM +0200, Björn Töpel wrote:
> Den fre 7 sep. 2018 kl 02:23 skrev Alexei Starovoitov
> <alexei.starovoitov@gmail.com>:
> >
> > On Fri, Sep 07, 2018 at 10:19:23AM +1000, Stephen Rothwell wrote:
> > > Hi all,
> > >
> > > After merging the bpf-next tree, today's linux-next build (powerpc
> > > ppc64_defconfig) failed like this:
> > >
> > > ERROR: ".xsk_reuseq_swap" [drivers/net/ethernet/intel/i40e/i40e.ko] undefined!
> > > ERROR: ".xsk_reuseq_free" [drivers/net/ethernet/intel/i40e/i40e.ko] undefined!
> > > ERROR: ".xsk_reuseq_prepare" [drivers/net/ethernet/intel/i40e/i40e.ko] undefined
> > >
> > > Caused by commit
> > >
> > > 9654bd10da60 ("i40e: clean zero-copy XDP Rx ring on shutdown/reset")
> > >
> > > CONFIG_XDP_SOCKETS is not set for this build.
> > >
> > > I have used the version of the bfp-next tree from next-20180906 for today.
> >
> > merge conflict and build error...
> > Bjorn, I'm thinking to toss the patches out of bpf-next and reapply
> > cleaned up version of the patches...
> > what do you think?
> >
>
> Yes, do that. I'll get back with a cleaned up v2.
Done.
Unfortunately during interactive rebase that removed your commits git didn't
preserve merge commits that came after yours,
so Jesper's and Yonghong's cover letters are not in the git history.
Yet all patches are in the same order.
Explicit revert with trail in the git history would have been worse.
patchworks + git isn't the most convenient workflow. sigh.
^ permalink raw reply
* Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables
From: Olof Johansson @ 2018-09-07 6:20 UTC (permalink / raw)
To: Herbert Xu
Cc: Eric Dumazet, David Miller, Neil Horman, Marcelo Ricardo Leitner,
Vladislav Yasevich, Alexey Kuznetsov, Hideaki YOSHIFUJI,
linux-crypto, LKML, linux-sctp, netdev, linux-decnet-user,
kernel-team
In-Reply-To: <20180907033257.2nlgiqm2t4jiwhzc@gondor.apana.org.au>
Hi,
On Thu, Sep 6, 2018 at 8:32 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, Sep 06, 2018 at 12:33:58PM -0700, Eric Dumazet wrote:
>> On Thu, Sep 6, 2018 at 12:21 PM Olof Johansson <olof@lixom.net> wrote:
>> >
>> > Today these are all global shared variables per protocol, and in
>> > particular tcp_memory_allocated can get hot on a system with
>> > large number of CPUs and a substantial number of connections.
>> >
>> > Moving it over to a per-cpu variable makes it significantly cheaper,
>> > and the added overhead when summing up the percpu copies is still smaller
>> > than the cost of having a hot cacheline bouncing around.
>>
>> I am curious. We never noticed contention on this variable, at least for TCP.
>
> Yes these variables are heavily amortised so I'm surprised that
> they would cause much contention.
>
>> Please share some numbers with us.
>
> Indeed.
Certainly, just had to collect them again.
This is on a dual xeon box, with ~150-200k TCP connections. I see
about .7% CPU spent in __sk_mem_{reduce,raise}_allocated in the
inlined atomic ops, most of those in reduce.
Call path for reduce is practically all from tcp_write_timer on softirq:
__sk_mem_reduce_allocated
tcp_write_timer
call_timer_fn
run_timer_softirq
__do_softirq
irq_exit
smp_apic_timer_interrupt
apic_timer_interrupt
cpuidle_enter_state
With this patch, I see about .18+.11+.07 = .36% in percpu-related
functions called from the same __sk_mem functions.
Now, that's a halving of cycles samples on that specific setup. The
real difference though, is on another platform where atomics are more
expensive. There, this makes a significant difference. Unfortunately,
I can't share specifics but I think this change stands on its own on
the dual xeon setup as well, maybe with slightly less strong wording
on just how hot the variable/line happens to be.
-Olof
^ permalink raw reply
* Re: [Intel-wired-lan] e1000e driver stuck at 10Mbps after reconnection
From: Camille Bordignon @ 2018-09-07 6:28 UTC (permalink / raw)
To: Neftin, Sasha
Cc: Alexander Duyck, Netdev, intel-wired-lan, David S. Miller,
linux-kernel
In-Reply-To: <c1e2cf13-6bcc-498d-03ea-7c3792d675cb@intel.com>
Le mercredi 08 août 2018 à 18:00:28 (+0300), Neftin, Sasha a écrit :
> On 8/8/2018 17:24, Neftin, Sasha wrote:
> > On 8/7/2018 09:42, Camille Bordignon wrote:
> > > Le lundi 06 août 2018 à 15:45:29 (-0700), Alexander Duyck a écrit :
> > > > On Mon, Aug 6, 2018 at 4:59 AM, Camille Bordignon
> > > > <camille.bordignon@easymile.com> wrote:
> > > > > Hello,
> > > > >
> > > > > Recently we experienced some issues with intel NIC (I219-LM
> > > > > and I219-V).
> > > > > It seems that after a wire reconnection, auto-negotation "fails" and
> > > > > link speed drips to 10 Mbps.
> > > > >
> > > > > From kernel logs:
> > > > > [17616.346150] e1000e: enp0s31f6 NIC Link is Down
> > > > > [17627.003322] e1000e: enp0s31f6 NIC Link is Up 10 Mbps Full
> > > > > Duplex, Flow Control: None
> > > > > [17627.003325] e1000e 0000:00:1f.6 enp0s31f6: 10/100 speed:
> > > > > disabling TSO
> > > > >
> > > > >
> > > > > $ethtool enp0s31f6
> > > > > Settings for enp0s31f6:
> > > > > Supported ports: [ TP ]
> > > > > Supported link modes: 10baseT/Half 10baseT/Full
> > > > > 100baseT/Half 100baseT/Full
> > > > > 1000baseT/Full
> > > > > Supported pause frame use: No
> > > > > Supports auto-negotiation: Yes
> > > > > Supported FEC modes: Not reported
> > > > > Advertised link modes: 10baseT/Half 10baseT/Full
> > > > > 100baseT/Half 100baseT/Full
> > > > > 1000baseT/Full
> > > > > Advertised pause frame use: No
> > > > > Advertised auto-negotiation: Yes
> > > > > Advertised FEC modes: Not reported
> > > > > Speed: 10Mb/s
> > > > > Duplex: Full
> > > > > Port: Twisted Pair
> > > > > PHYAD: 1
> > > > > Transceiver: internal
> > > > > Auto-negotiation: on
> > > > > MDI-X: on (auto)
> > > > > Supports Wake-on: pumbg
> > > > > Wake-on: g
> > > > > Current message level: 0x00000007 (7)
> > > > > drv probe link
> > > > > Link detected: yes
> > > > >
> > > > >
> > > > > Notice that if disconnection last less than about 5 seconds,
> > > > > nothing wrong happens.
> > > > > And if after last failure, disconnection / connection occurs again and
> > > > > last less than 5 seconds, link speed is back to 1000 Mbps.
> > > > >
> > > > > [18075.350678] e1000e: enp0s31f6 NIC Link is Down
> > > > > [18078.716245] e1000e: enp0s31f6 NIC Link is Up 1000 Mbps
> > > > > Full Duplex, Flow Control: None
> > > > >
> > > > > The following patch seems to fix this issue.
> > > > > However I don't clearly understand why.
> > > > >
> > > > > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> > > > > b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > > > index 3ba0c90e7055..763c013960f1 100644
> > > > > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > > > > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > > > @@ -5069,7 +5069,7 @@ static bool e1000e_has_link(struct
> > > > > e1000_adapter *adapter)
> > > > > case e1000_media_type_copper:
> > > > > if (hw->mac.get_link_status) {
> > > > > ret_val = hw->mac.ops.check_for_link(hw);
> > > > > - link_active = !hw->mac.get_link_status;
> > > > > + link_active = false;
> > > > > } else {
> > > > > link_active = true;
> > > > > }
> > > > >
> > > > > Maybe this is related to watchdog task.
> > > > >
> > > > > I've found out this fix by comparing with last commit that works fine :
> > > > > commit 0b76aae741abb9d16d2c0e67f8b1e766576f897d.
> > > > > However I don't know if this information is relevant.
> > > > >
> > > > > Thank you.
> > > > > Camille Bordignon
> > > >
> > > > What kernel were you testing this on? I know there have been a number
> > > > of changes over the past few months in this area and it would be
> > > > useful to know exactly what code base you started out with and what
> > > > the latest version of the kernel is you have tested.
> > > >
> > > > Looking over the code change the net effect of it should be to add a 2
> > > > second delay from the time the link has changed until you actually
> > > > check the speed/duplex configuration. It is possible we could be
> > > > seeing some sort of timing issue and adding the 2 second delay after
> > > > the link event is enough time for things to stabilize and detect the
> > > > link at 1000 instead of 10/100.
> > > >
> > > > - Alex
> > >
> > > We've found out this issue using Fedora 27 (4.17.11-100.fc27.x86_64).
> > >
> > > Then I've tested wth a more recent version of the driver v4.18-rc7 but
> > > behavior looks the same.
> > >
> > > Thanks for you reply.
> > >
> > > Camille Bordignon
> > > _______________________________________________
> > > Intel-wired-lan mailing list
> > > Intel-wired-lan@osuosl.org
> > > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
> > >
> > I've agree with Alex. Let's try add 2s delay after a link event. Please,
> > let us know if it will solve your problem.
> > Also, I would like recommend try work with different link partner and
> > see if you see same problem.
> > _______________________________________________
> > Intel-wired-lan mailing list
> > Intel-wired-lan@osuosl.org
> > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
> Camille,
> My apologies, I wrong understand Alex. Please, do not try add delay. Please,
> check if you see same problem with different link partners.
> Thanks,
> Sasha
Hello,
I recently figured out that neither the previous patch nor commit
0b76aae741abb9d16d2c0e67f8b1e766576f897d fix this issue.
In these cases, after reproducing the issue, when ethernet wire is connected
kernel logs mention full speed (1000 Mbps) but actually it seems it is not.
The problem persists.
I made some simple tests with a web browser and a dowload speed test
site (fast.com). As a result speed was always around 8 Mbps.
And again after a quick ethernet wire disconnection / reconnection, same
test is around 500 Mbps.
I feel very confused.
Camille Bordignon
^ permalink raw reply
* [PATCH] can: rcar_can: convert to SPDX identifiers
From: Kuninori Morimoto @ 2018-09-07 2:00 UTC (permalink / raw)
To: Wolfgang Grandegger, Marc Kleine-Budde
Cc: Linux-Renesas, Linux-Net, linux-can
From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
This patch updates license to use SPDX-License-Identifier
instead of verbose license text.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
drivers/net/can/rcar/Kconfig | 1 +
drivers/net/can/rcar/Makefile | 1 +
drivers/net/can/rcar/rcar_can.c | 6 +-----
drivers/net/can/rcar/rcar_canfd.c | 6 +-----
4 files changed, 4 insertions(+), 10 deletions(-)
diff --git a/drivers/net/can/rcar/Kconfig b/drivers/net/can/rcar/Kconfig
index 7b03a3a..bd5a8fc 100644
--- a/drivers/net/can/rcar/Kconfig
+++ b/drivers/net/can/rcar/Kconfig
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
config CAN_RCAR
tristate "Renesas R-Car CAN controller"
depends on ARCH_RENESAS || ARM
diff --git a/drivers/net/can/rcar/Makefile b/drivers/net/can/rcar/Makefile
index 08de36a..c9185b0 100644
--- a/drivers/net/can/rcar/Makefile
+++ b/drivers/net/can/rcar/Makefile
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
#
# Makefile for the Renesas R-Car CAN & CAN FD controller drivers
#
diff --git a/drivers/net/can/rcar/rcar_can.c b/drivers/net/can/rcar/rcar_can.c
index 11662f4..06f90a0 100644
--- a/drivers/net/can/rcar/rcar_can.c
+++ b/drivers/net/can/rcar/rcar_can.c
@@ -1,12 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0+
/* Renesas R-Car CAN device driver
*
* Copyright (C) 2013 Cogent Embedded, Inc. <source@cogentembedded.com>
* Copyright (C) 2013 Renesas Solutions Corp.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the
- * Free Software Foundation; either version 2 of the License, or (at your
- * option) any later version.
*/
#include <linux/module.h>
diff --git a/drivers/net/can/rcar/rcar_canfd.c b/drivers/net/can/rcar/rcar_canfd.c
index 602c19e..0541000 100644
--- a/drivers/net/can/rcar/rcar_canfd.c
+++ b/drivers/net/can/rcar/rcar_canfd.c
@@ -1,11 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0+
/* Renesas R-Car CAN FD device driver
*
* Copyright (C) 2015 Renesas Electronics Corp.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the
- * Free Software Foundation; either version 2 of the License, or (at your
- * option) any later version.
*/
/* The R-Car CAN FD controller can operate in either one of the below two modes
--
2.7.4
^ permalink raw reply related
* [PATCH] crypto: Adds user space interface for ALG_SET_KEY_TYPE
From: Kalyani Akula @ 2018-09-07 6:40 UTC (permalink / raw)
To: herbert, davem, kstewart, gregkh, tglx, pombredanne, linux-crypto,
linux-kernel, netdev, saratcha
Cc: Kalyani Akula
ALG_SET_KEY_TYPE requires caller to pass the key_type to be used
for AES encryption/decryption.
Sometimes the cipher key will be stored in the device's
hardware (eFuse, BBRAM etc).So,there is a need to specify the information
about the key-type to use it for Encrypt or Decrypt operations.
This patch implements the above requirement.
Signed-off-by: Kalyani Akula <kalyani.akula@xilinx.com>
---
crypto/af_alg.c | 7 +++++++
crypto/algif_skcipher.c | 12 ++++++++++++
crypto/blkcipher.c | 9 +++++++++
crypto/skcipher.c | 18 ++++++++++++++++++
include/crypto/if_alg.h | 2 ++
include/crypto/skcipher.h | 12 +++++++++++-
include/linux/crypto.h | 12 ++++++++++++
include/uapi/linux/if_alg.h | 1 +
8 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index b053179..9ea6b86 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -261,6 +261,13 @@ static int alg_setsockopt(struct socket *sock, int level, int optname,
if (!type->setauthsize)
goto unlock;
err = type->setauthsize(ask->private, optlen);
+ break;
+ case ALG_SET_KEY_TYPE:
+ if (sock->state == SS_CONNECTED)
+ goto unlock;
+ if (!type->setkeytype)
+ goto unlock;
+ err = type->setkeytype(ask->private, optval, optlen);
}
unlock:
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index cfdaab2..9164465 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -320,6 +320,17 @@ static int skcipher_setkey(void *private, const u8 *key, unsigned int keylen)
return crypto_skcipher_setkey(private, key, keylen);
}
+static int skcipher_setkeytype(void *private, const u8 *key,
+ unsigned int keylen)
+{
+ struct skcipher_tfm *tfm = private;
+ int err;
+
+ err = crypto_skcipher_setkeytype(tfm->skcipher, key, keylen);
+
+ return err;
+}
+
static void skcipher_sock_destruct(struct sock *sk)
{
struct alg_sock *ask = alg_sk(sk);
@@ -384,6 +395,7 @@ static int skcipher_accept_parent(void *private, struct sock *sk)
.bind = skcipher_bind,
.release = skcipher_release,
.setkey = skcipher_setkey,
+ .setkeytype = skcipher_setkeytype,
.accept = skcipher_accept_parent,
.accept_nokey = skcipher_accept_parent_nokey,
.ops = &algif_skcipher_ops,
diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c
index f93abf1..3ea0e7f 100644
--- a/crypto/blkcipher.c
+++ b/crypto/blkcipher.c
@@ -408,6 +408,14 @@ static int setkey(struct crypto_tfm *tfm, const u8 *key, unsigned int keylen)
return cipher->setkey(tfm, key, keylen);
}
+static int setkeytype(struct crypto_tfm *tfm, const u8 *key,
+ unsigned int keylen)
+{
+ struct blkcipher_alg *cipher = &tfm->__crt_alg->cra_blkcipher;
+
+ return cipher->setkeytype(tfm, key, keylen);
+}
+
static int async_setkey(struct crypto_ablkcipher *tfm, const u8 *key,
unsigned int keylen)
{
@@ -478,6 +486,7 @@ static int crypto_init_blkcipher_ops_sync(struct crypto_tfm *tfm)
unsigned long addr;
crt->setkey = setkey;
+ crt->setkeytype = setkeytype;
crt->encrypt = alg->encrypt;
crt->decrypt = alg->decrypt;
diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index 0bd8c6c..cb794dd 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -604,6 +604,23 @@ static int skcipher_setkey_blkcipher(struct crypto_skcipher *tfm,
return 0;
}
+static int skcipher_setkeytype_blkcipher(struct crypto_skcipher *tfm,
+ const u8 *key, unsigned int keylen)
+{
+ struct crypto_blkcipher **ctx = crypto_skcipher_ctx(tfm);
+ struct crypto_blkcipher *blkcipher = *ctx;
+ int err;
+
+ crypto_blkcipher_clear_flags(blkcipher, ~0);
+ crypto_blkcipher_set_flags(blkcipher, crypto_skcipher_get_flags(tfm) &
+ CRYPTO_TFM_REQ_MASK);
+ err = crypto_blkcipher_setkeytype(blkcipher, key, keylen);
+ crypto_skcipher_set_flags(tfm, crypto_blkcipher_get_flags(blkcipher) &
+ CRYPTO_TFM_RES_MASK);
+
+ return err;
+}
+
static int skcipher_crypt_blkcipher(struct skcipher_request *req,
int (*crypt)(struct blkcipher_desc *,
struct scatterlist *,
@@ -670,6 +687,7 @@ static int crypto_init_skcipher_ops_blkcipher(struct crypto_tfm *tfm)
tfm->exit = crypto_exit_skcipher_ops_blkcipher;
skcipher->setkey = skcipher_setkey_blkcipher;
+ skcipher->setkeytype = skcipher_setkeytype_blkcipher;
skcipher->encrypt = skcipher_encrypt_blkcipher;
skcipher->decrypt = skcipher_decrypt_blkcipher;
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index 482461d..202298e 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -51,6 +51,8 @@ struct af_alg_type {
void *(*bind)(const char *name, u32 type, u32 mask);
void (*release)(void *private);
int (*setkey)(void *private, const u8 *key, unsigned int keylen);
+ int (*setkeytype)(void *private, const u8 *keytype,
+ unsigned int keylen);
int (*accept)(void *private, struct sock *sk);
int (*accept_nokey)(void *private, struct sock *sk);
int (*setauthsize)(void *private, unsigned int authsize);
diff --git a/include/crypto/skcipher.h b/include/crypto/skcipher.h
index 2f327f0..54f6752 100644
--- a/include/crypto/skcipher.h
+++ b/include/crypto/skcipher.h
@@ -54,7 +54,9 @@ struct skcipher_givcrypt_request {
struct crypto_skcipher {
int (*setkey)(struct crypto_skcipher *tfm, const u8 *key,
- unsigned int keylen);
+ unsigned int keylen);
+ int (*setkeytype)(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int keylen);
int (*encrypt)(struct skcipher_request *req);
int (*decrypt)(struct skcipher_request *req);
@@ -125,6 +127,8 @@ struct crypto_skcipher {
struct skcipher_alg {
int (*setkey)(struct crypto_skcipher *tfm, const u8 *key,
unsigned int keylen);
+ int (*setkeytype)(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int keylen);
int (*encrypt)(struct skcipher_request *req);
int (*decrypt)(struct skcipher_request *req);
int (*init)(struct crypto_skcipher *tfm);
@@ -401,6 +405,12 @@ static inline int crypto_skcipher_setkey(struct crypto_skcipher *tfm,
return tfm->setkey(tfm, key, keylen);
}
+static inline int crypto_skcipher_setkeytype(struct crypto_skcipher *tfm,
+ const u8 *key, unsigned int keylen)
+{
+ return tfm->setkeytype(tfm, key, keylen);
+}
+
static inline unsigned int crypto_skcipher_default_keysize(
struct crypto_skcipher *tfm)
{
diff --git a/include/linux/crypto.h b/include/linux/crypto.h
index e8839d3..bda8380 100644
--- a/include/linux/crypto.h
+++ b/include/linux/crypto.h
@@ -292,6 +292,8 @@ struct ablkcipher_alg {
struct blkcipher_alg {
int (*setkey)(struct crypto_tfm *tfm, const u8 *key,
unsigned int keylen);
+ int (*setkeytype)(struct crypto_tfm *tfm, const u8 *keytype,
+ unsigned int keylen);
int (*encrypt)(struct blkcipher_desc *desc,
struct scatterlist *dst, struct scatterlist *src,
unsigned int nbytes);
@@ -563,6 +565,8 @@ struct blkcipher_tfm {
void *iv;
int (*setkey)(struct crypto_tfm *tfm, const u8 *key,
unsigned int keylen);
+ int (*setkeytype)(struct crypto_tfm *tfm, const u8 *key,
+ unsigned int keylen);
int (*encrypt)(struct blkcipher_desc *desc, struct scatterlist *dst,
struct scatterlist *src, unsigned int nbytes);
int (*decrypt)(struct blkcipher_desc *desc, struct scatterlist *dst,
@@ -1281,6 +1285,14 @@ static inline int crypto_blkcipher_setkey(struct crypto_blkcipher *tfm,
key, keylen);
}
+static inline int crypto_blkcipher_setkeytype(struct crypto_blkcipher *tfm,
+ const u8 *key,
+ unsigned int keylen)
+{
+ return crypto_blkcipher_crt(tfm)->setkeytype(crypto_blkcipher_tfm(tfm),
+ key, keylen);
+}
+
/**
* crypto_blkcipher_encrypt() - encrypt plaintext
* @desc: reference to the block cipher handle with meta data
diff --git a/include/uapi/linux/if_alg.h b/include/uapi/linux/if_alg.h
index bc2bcde..aa31b18 100644
--- a/include/uapi/linux/if_alg.h
+++ b/include/uapi/linux/if_alg.h
@@ -35,6 +35,7 @@ struct af_alg_iv {
#define ALG_SET_OP 3
#define ALG_SET_AEAD_ASSOCLEN 4
#define ALG_SET_AEAD_AUTHSIZE 5
+#define ALG_SET_KEY_TYPE 6
/* Operations */
#define ALG_OP_DECRYPT 0
--
1.9.5
This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.
^ permalink raw reply related
* [PATCH] ethernet: renesas: convert to SPDX identifiers
From: Kuninori Morimoto @ 2018-09-07 2:02 UTC (permalink / raw)
To: Sergei Shtylyov, David S. Miller, Mark Brown
Cc: Geert Uytterhoeven, Robin Murphy, netdev, linux-renesas-soc
From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
This patch updates license to use SPDX-License-Identifier
instead of verbose license text.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
drivers/net/ethernet/renesas/Kconfig | 1 +
drivers/net/ethernet/renesas/Makefile | 1 +
drivers/net/ethernet/renesas/ravb_ptp.c | 6 +-----
3 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/renesas/Kconfig b/drivers/net/ethernet/renesas/Kconfig
index f3f7477..bb0ebdf 100644
--- a/drivers/net/ethernet/renesas/Kconfig
+++ b/drivers/net/ethernet/renesas/Kconfig
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
#
# Renesas device configuration
#
diff --git a/drivers/net/ethernet/renesas/Makefile b/drivers/net/ethernet/renesas/Makefile
index a05102a..f21ab8c 100644
--- a/drivers/net/ethernet/renesas/Makefile
+++ b/drivers/net/ethernet/renesas/Makefile
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
#
# Makefile for the Renesas device drivers.
#
diff --git a/drivers/net/ethernet/renesas/ravb_ptp.c b/drivers/net/ethernet/renesas/ravb_ptp.c
index eede70e..0721b5c 100644
--- a/drivers/net/ethernet/renesas/ravb_ptp.c
+++ b/drivers/net/ethernet/renesas/ravb_ptp.c
@@ -1,13 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0+
/* PTP 1588 clock using the Renesas Ethernet AVB
*
* Copyright (C) 2013-2015 Renesas Electronics Corporation
* Copyright (C) 2015 Renesas Solutions Corp.
* Copyright (C) 2015-2016 Cogent Embedded, Inc. <source@cogentembedded.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
*/
#include "ravb.h"
--
2.7.4
^ permalink raw reply related
* Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables
From: Eric Dumazet @ 2018-09-07 7:03 UTC (permalink / raw)
To: Olof Johansson
Cc: Herbert Xu, David Miller, Neil Horman, Marcelo Ricardo Leitner,
Vladislav Yasevich, Alexey Kuznetsov, Hideaki YOSHIFUJI,
linux-crypto, LKML, linux-sctp, netdev, linux-decnet-user,
kernel-team, Yuchung Cheng, Neal Cardwell
In-Reply-To: <CAOesGMgRrb4D2S_qWwgo00iNxbCL9EEGfhD5Ji-2HMWuZeq0Yw@mail.gmail.com>
On Thu, Sep 6, 2018 at 11:20 PM Olof Johansson <olof@lixom.net> wrote:
>
> Hi,
>
> On Thu, Sep 6, 2018 at 8:32 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > On Thu, Sep 06, 2018 at 12:33:58PM -0700, Eric Dumazet wrote:
> >> On Thu, Sep 6, 2018 at 12:21 PM Olof Johansson <olof@lixom.net> wrote:
> >> >
> >> > Today these are all global shared variables per protocol, and in
> >> > particular tcp_memory_allocated can get hot on a system with
> >> > large number of CPUs and a substantial number of connections.
> >> >
> >> > Moving it over to a per-cpu variable makes it significantly cheaper,
> >> > and the added overhead when summing up the percpu copies is still smaller
> >> > than the cost of having a hot cacheline bouncing around.
> >>
> >> I am curious. We never noticed contention on this variable, at least for TCP.
> >
> > Yes these variables are heavily amortised so I'm surprised that
> > they would cause much contention.
> >
> >> Please share some numbers with us.
> >
> > Indeed.
>
> Certainly, just had to collect them again.
>
> This is on a dual xeon box, with ~150-200k TCP connections. I see
> about .7% CPU spent in __sk_mem_{reduce,raise}_allocated in the
> inlined atomic ops, most of those in reduce.
>
> Call path for reduce is practically all from tcp_write_timer on softirq:
>
> __sk_mem_reduce_allocated
> tcp_write_timer
> call_timer_fn
> run_timer_softirq
> __do_softirq
> irq_exit
> smp_apic_timer_interrupt
> apic_timer_interrupt
> cpuidle_enter_state
>
> With this patch, I see about .18+.11+.07 = .36% in percpu-related
> functions called from the same __sk_mem functions.
>
> Now, that's a halving of cycles samples on that specific setup. The
> real difference though, is on another platform where atomics are more
> expensive. There, this makes a significant difference. Unfortunately,
> I can't share specifics but I think this change stands on its own on
> the dual xeon setup as well, maybe with slightly less strong wording
> on just how hot the variable/line happens to be.
Problem is : we have platforms with more than 100 cpus, and
sk_memory_allocated() cost will be too expensive,
especially if the host is under memory pressure, since all cpus will
touch their private counter.
per cpu variables do not really scale, they were ok 10 years ago when
no more than 16 cpus were the norm.
I would prefer change TCP to not aggressively call
__sk_mem_reduce_allocated() from tcp_write_timer()
Ideally only tcp_retransmit_timer() should attempt to reduce forward
allocations, after recurring timeout.
Note that after 20c64d5cd5a2bdcdc8982a06cb05e5e1bd851a3d ("net: avoid
sk_forward_alloc overflows")
we have better control over sockets having huge forward allocations.
Something like :
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 7fdf222a0bdfe9775970082f6b5dcdcc82b2ae1a..7e2e17cde9b6a9be835edfac26b64f4ce9411538
100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -505,6 +505,8 @@ void tcp_retransmit_timer(struct sock *sk)
mib_idx = LINUX_MIB_TCPTIMEOUTS;
}
__NET_INC_STATS(sock_net(sk), mib_idx);
+ } else {
+ sk_mem_reclaim(sk);
}
tcp_enter_loss(sk);
@@ -576,11 +578,11 @@ void tcp_write_timer_handler(struct sock *sk)
if (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) ||
!icsk->icsk_pending)
- goto out;
+ return;
if (time_after(icsk->icsk_timeout, jiffies)) {
sk_reset_timer(sk, &icsk->icsk_retransmit_timer,
icsk->icsk_timeout);
- goto out;
+ return;
}
tcp_mstamp_refresh(tcp_sk(sk));
@@ -602,9 +604,6 @@ void tcp_write_timer_handler(struct sock *sk)
tcp_probe_timer(sk);
break;
}
-
-out:
- sk_mem_reclaim(sk);
}
^ permalink raw reply
* Re: [PATCH v2 net-next 3/4] net: make listified RX functions return number of good packets
From: Eric Dumazet @ 2018-09-07 2:27 UTC (permalink / raw)
To: Edward Cree, davem; +Cc: linux-net-drivers, netdev
In-Reply-To: <d601bec7-0eaa-133d-f021-a340e38c45ed@solarflare.com>
On 09/06/2018 07:26 AM, Edward Cree wrote:
> Signed-off-by: Edward Cree <ecree@solarflare.com>
Lack of changelog here ?
I do not know what is a good packet.
You are adding a lot of conditional expressions, that cpu
will mispredict quite often.
Typical micro benchmarks wont really notice.
^ permalink raw reply
* Re: [PATCH v2 net-next 0/4] net: batched receive in GRO path
From: Eric Dumazet @ 2018-09-07 2:32 UTC (permalink / raw)
To: Edward Cree, davem; +Cc: linux-net-drivers, netdev
In-Reply-To: <c1e79c86-56ae-98c6-8dc0-c227f91ee9bc@solarflare.com>
On 09/06/2018 07:24 AM, Edward Cree wrote:
> This series listifies part of GRO processing, in a manner which allows those
> packets which are not GROed (i.e. for which dev_gro_receive returns
> GRO_NORMAL) to be passed on to the listified regular receive path.
> I have not listified dev_gro_receive() itself, or the per-protocol GRO
> callback, since GRO's need to hold packets on lists under napi->gro_hash
> makes keeping the packets on other lists awkward, and since the GRO control
> block state of held skbs can refer only to one 'new' skb at a time.
> Nonetheless the batching of the calling code yields some performance gains
> in the GRO case as well.
>
> Herewith the performance figures obtained in a NetPerf TCP stream test (with
> four streams, and irqs bound to a single core):
> net-next: 7.166 Gbit/s (sigma 0.435)
> after #2: 7.715 Gbit/s (sigma 0.145) = datum + 7.7%
> after #4: 7.890 Gbit/s (sigma 0.217) = datum + 10.1%
> (Note that the 'net-next' results were distinctly bimodal, with two results
> of about 8 Gbit/s and the remaining ten around 7 Gbit/s. I don't have a
> good explanation for this.)
> And with GRO disabled through ethtool -K (thus simulating traffic which is
> not GRO-able but, being TCP, is still passed to the GRO entry point):
> net-next: 4.756 Gbit/s (sigma 0.240)
> after #4: 5.355 Gbit/s (sigma 0.232) = datum + 12.6%
>
> v2: Rebased on latest net-next. Removed RFC tags. Otherwise unchanged
> owing to lack of comments on v1.
>
Your performance numbers are not convincing, since TCP stream test should
get nominal GRO gains.
Adding this complexity and icache pressure needs more experimental results.
What about RPC workloads (eg 100 concurrent netperf -t TCP_RR -- -r 8000,8000 )
Thanks.
^ permalink raw reply
* Re: [PATCH] [RFC v2] Drop all 00-INDEX files from Documentation/
From: Greg Kroah-Hartman @ 2018-09-07 7:21 UTC (permalink / raw)
To: Daniel Vetter
Cc: Mark Rutland, Linux MIPS Mailing List,
Linux Fbdev development list, Jan Kandziora,
Radim Krčmář, kvm, Linux Doc Mailing List,
Peter Zijlstra, James Hogan, Mark Brown, Henrik Austad,
Will Deacon, dri-devel, Masahiro Yamada, devicetree,
Paul Mackerras, Henrik Austad, Pavel Machek, H. Peter Anvin,
Evgeniy Polyakov, Ian Kent, linux-s390, Paul Moore
In-Reply-To: <CAKMK7uHoeB89-VVS8qVaoNiP_0waHHJ=dFCUgXkRDTnRkXz69g@mail.gmail.com>
On Thu, Sep 06, 2018 at 11:39:42PM +0200, Daniel Vetter wrote:
> On Thu, Sep 6, 2018 at 6:01 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Thu, 6 Sep 2018 09:58:04 -0600
> > Jonathan Corbet <corbet@lwn.net> wrote:
> >
> >> Thanks,
> >>
> >> jon (who is increasingly inclined to apply this patch)
> >
> > As Colin Kaepernick now says... "Just do it!"
> >
> > ;-)
>
> +1
>
> But I'm biased, I'm part of the party that is responsible for the new
> shiny documentation system ...
I am not responsible for any of the new shiny documentation system, and
I think this is a good idea:
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply
* Re: [PATCH v2 2/2] net: ethernet: i40evf: fix potential build error
From: kbuild test robot @ 2018-09-07 7:25 UTC (permalink / raw)
To: Wang Dongsheng
Cc: kbuild-all, jeffrey.t.kirsher, sergei.shtylyov, jacob.e.keller,
davem, intel-wired-lan, netdev, linux-kernel, Wang Dongsheng
In-Reply-To: <1536114430-21356-2-git-send-email-dongsheng.wang@hxt-semitech.com>
[-- Attachment #1: Type: text/plain, Size: 1122 bytes --]
Hi Wang,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on jkirsher-next-queue/dev-queue]
[cannot apply to v4.19-rc2 next-20180906]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Wang-Dongsheng/net-ethernet-i40e-fix-build-error/20180907-063122
base: https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git dev-queue
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
drivers/net/ethernet/intel/i40evf/i40evf_ethtool.o: In function `__i40e_add_stat_strings':
>> i40evf_ethtool.c:(.text+0xeb0): multiple definition of `__i40e_add_stat_strings'
drivers/net/ethernet/intel/i40e/i40e_ethtool.o:i40e_ethtool.c:(.text+0x5e20): first defined here
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 64499 bytes --]
^ permalink raw reply
* Re: [PATCH net-next 09/11] tuntap: accept an array of XDP buffs through sendmsg()
From: Jason Wang @ 2018-09-07 7:33 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180906134834-mutt-send-email-mst@kernel.org>
On 2018年09月07日 01:51, Michael S. Tsirkin wrote:
> On Thu, Sep 06, 2018 at 12:05:24PM +0800, Jason Wang wrote:
>> This patch implement TUN_MSG_PTR msg_control type. This type allows
>> the caller to pass an array of XDP buffs to tuntap through ptr field
>> of the tun_msg_control. If an XDP program is attached, tuntap can run
>> XDP program directly. If not, tuntap will build skb and do a fast
>> receiving since part of the work has been done by vhost_net.
>>
>> This will avoid lots of indirect calls thus improves the icache
>> utilization and allows to do XDP batched flushing when doing XDP
>> redirection.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Is most of the benefit in batched flushing or skipping
> indirect calls? Because if it's flushing we can gain
> most of it easily by adding an analog of xmit_more.
>
Should be both. XDP_DROP doesn't flush but it gives obvious improvement.
Thanks
^ permalink raw reply
* Re: [PATCH 2/7] mark root hnode explicitly
From: Cong Wang @ 2018-09-07 2:57 UTC (permalink / raw)
To: Al Viro; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <20180906105933.GU19965@ZenIV.linux.org.uk>
On Thu, Sep 6, 2018 at 3:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> index 3f985f29ef30..d14048e38b5c 100644
> --- a/net/sched/cls_u32.c
> +++ b/net/sched/cls_u32.c
> @@ -84,6 +84,7 @@ struct tc_u_hnode {
> int refcnt;
> unsigned int divisor;
> struct idr handle_idr;
> + bool is_root;
> struct rcu_head rcu;
> u32 flags;
> /* The 'ht' field MUST be the last field in structure to allow for
> @@ -377,6 +378,7 @@ static int u32_init(struct tcf_proto *tp)
> root_ht->refcnt++;
> root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x80000000;
> root_ht->prio = tp->prio;
> + root_ht->is_root = true;
> idr_init(&root_ht->handle_idr);
>
> if (tp_c == NULL) {
> @@ -693,7 +695,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
> goto out;
> }
>
> - if (root_ht == ht) {
> + if (ht->is_root) {
What's wrong with comparing pointers with root ht?
> NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
> return -EINVAL;
> }
> @@ -795,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
> NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
> return -EINVAL;
> }
> + if (ht_down->is_root) {
root ht is saved in tp->root, so you can compare ht_down with it too,
if you want.
If this check is all what you need, you don't need an extra flag.
^ permalink raw reply
* Re: [PATCH net-next 11/11] vhost_net: batch submitting XDP buffers to underlayer sockets
From: Jason Wang @ 2018-09-07 7:41 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180906122857-mutt-send-email-mst@kernel.org>
On 2018年09月07日 00:46, Michael S. Tsirkin wrote:
> On Thu, Sep 06, 2018 at 12:05:26PM +0800, Jason Wang wrote:
>> This patch implements XDP batching for vhost_net. The idea is first to
>> try to do userspace copy and build XDP buff directly in vhost. Instead
>> of submitting the packet immediately, vhost_net will batch them in an
>> array and submit every 64 (VHOST_NET_BATCH) packets to the under layer
>> sockets through msg_control of sendmsg().
>>
>> When XDP is enabled on the TUN/TAP, TUN/TAP can process XDP inside a
>> loop without caring GUP thus it can do batch map flushing. When XDP is
>> not enabled or not supported, the underlayer socket need to build skb
>> and pass it to network core. The batched packet submission allows us
>> to do batching like netif_receive_skb_list() in the future.
>>
>> This saves lots of indirect calls for better cache utilization. For
>> the case that we can't so batching e.g when sndbuf is limited or
>> packet size is too large, we will go for usual one packet per
>> sendmsg() way.
>>
>> Doing testpmd on various setups gives us:
>>
>> Test /+pps%
>> XDP_DROP on TAP /+44.8%
>> XDP_REDIRECT on TAP /+29%
>> macvtap (skb) /+26%
>>
>> Netperf tests shows obvious improvements for small packet transmission:
>>
>> size/session/+thu%/+normalize%
>> 64/ 1/ +2%/ 0%
>> 64/ 2/ +3%/ +1%
>> 64/ 4/ +7%/ +5%
>> 64/ 8/ +8%/ +6%
>> 256/ 1/ +3%/ 0%
>> 256/ 2/ +10%/ +7%
>> 256/ 4/ +26%/ +22%
>> 256/ 8/ +27%/ +23%
>> 512/ 1/ +3%/ +2%
>> 512/ 2/ +19%/ +14%
>> 512/ 4/ +43%/ +40%
>> 512/ 8/ +45%/ +41%
>> 1024/ 1/ +4%/ 0%
>> 1024/ 2/ +27%/ +21%
>> 1024/ 4/ +38%/ +73%
>> 1024/ 8/ +15%/ +24%
>> 2048/ 1/ +10%/ +7%
>> 2048/ 2/ +16%/ +12%
>> 2048/ 4/ 0%/ +2%
>> 2048/ 8/ 0%/ +2%
>> 4096/ 1/ +36%/ +60%
>> 4096/ 2/ -11%/ -26%
>> 4096/ 4/ 0%/ +14%
>> 4096/ 8/ 0%/ +4%
>> 16384/ 1/ -1%/ +5%
>> 16384/ 2/ 0%/ +2%
>> 16384/ 4/ 0%/ -3%
>> 16384/ 8/ 0%/ +4%
>> 65535/ 1/ 0%/ +10%
>> 65535/ 2/ 0%/ +8%
>> 65535/ 4/ 0%/ +1%
>> 65535/ 8/ 0%/ +3%
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>> drivers/vhost/net.c | 164 ++++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 151 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index fb01ce6d981c..1dd4239cbff8 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -116,6 +116,7 @@ struct vhost_net_virtqueue {
>> * For RX, number of batched heads
>> */
>> int done_idx;
>> + int batched_xdp;
> Pls add a comment documenting what does this new field do.
Ok.
>
>> /* an array of userspace buffers info */
>> struct ubuf_info *ubuf_info;
>> /* Reference counting for outstanding ubufs.
>> @@ -123,6 +124,7 @@ struct vhost_net_virtqueue {
>> struct vhost_net_ubuf_ref *ubufs;
>> struct ptr_ring *rx_ring;
>> struct vhost_net_buf rxq;
>> + struct xdp_buff xdp[VHOST_NET_BATCH];
> This is a bit too much to have inline. 64 entries 48 bytes each, and we
> have 2 of these structures, that's already >4K.
Let me allocate it elsewhere.
>
>> };
>>
>> struct vhost_net {
>> @@ -338,6 +340,11 @@ static bool vhost_sock_zcopy(struct socket *sock)
>> sock_flag(sock->sk, SOCK_ZEROCOPY);
>> }
>>
>> +static bool vhost_sock_xdp(struct socket *sock)
>> +{
>> + return sock_flag(sock->sk, SOCK_XDP);
> what if an xdp program is attached while this processing
> is going on? Flag value will be wrong - is this safe
> and why? Pls add a comment.
Ok, let me add comment to explain. It's safe and may just lead few
packets to be dropped if XDP program want to extend packet's header.
>
>> +}
>> +
>> /* In case of DMA done not in order in lower device driver for some reason.
>> * upend_idx is used to track end of used idx, done_idx is used to track head
>> * of used idx. Once lower device DMA done contiguously, we will signal KVM
>> @@ -444,10 +451,36 @@ static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
>> nvq->done_idx = 0;
>> }
>>
>> +static void vhost_tx_batch(struct vhost_net *net,
>> + struct vhost_net_virtqueue *nvq,
>> + struct socket *sock,
>> + struct msghdr *msghdr)
>> +{
>> + struct tun_msg_ctl ctl = {
>> + .type = nvq->batched_xdp << 16 | TUN_MSG_PTR,
>> + .ptr = nvq->xdp,
>> + };
>> + int err;
>> +
>> + if (nvq->batched_xdp == 0)
>> + goto signal_used;
>> +
>> + msghdr->msg_control = &ctl;
>> + err = sock->ops->sendmsg(sock, msghdr, 0);
>> + if (unlikely(err < 0)) {
>> + vq_err(&nvq->vq, "Fail to batch sending packets\n");
>> + return;
>> + }
>> +
>> +signal_used:
>> + vhost_net_signal_used(nvq);
>> + nvq->batched_xdp = 0;
>> +}
>> +
>> static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>> struct vhost_net_virtqueue *nvq,
>> unsigned int *out_num, unsigned int *in_num,
>> - bool *busyloop_intr)
>> + struct msghdr *msghdr, bool *busyloop_intr)
>> {
>> struct vhost_virtqueue *vq = &nvq->vq;
>> unsigned long uninitialized_var(endtime);
>> @@ -455,8 +488,9 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>> out_num, in_num, NULL, NULL);
>>
>> if (r == vq->num && vq->busyloop_timeout) {
>> + /* Flush batched packets first */
>> if (!vhost_sock_zcopy(vq->private_data))
>> - vhost_net_signal_used(nvq);
>> + vhost_tx_batch(net, nvq, vq->private_data, msghdr);
>> preempt_disable();
>> endtime = busy_clock() + vq->busyloop_timeout;
>> while (vhost_can_busy_poll(endtime)) {
>> @@ -512,7 +546,7 @@ static int get_tx_bufs(struct vhost_net *net,
>> struct vhost_virtqueue *vq = &nvq->vq;
>> int ret;
>>
>> - ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, busyloop_intr);
>> + ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, msg, busyloop_intr);
>>
>> if (ret < 0 || ret == vq->num)
>> return ret;
>> @@ -540,6 +574,83 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
>> !vhost_vq_avail_empty(vq->dev, vq);
>> }
>>
>> +#define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
> I wonder whether NET_IP_ALIGN make sense for XDP.
XDP is not the only consumer, socket may build skb based on this.
>
>> +
>> +static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>> + struct iov_iter *from)
>> +{
>> + struct vhost_virtqueue *vq = &nvq->vq;
>> + struct socket *sock = vq->private_data;
>> + struct page_frag *alloc_frag = ¤t->task_frag;
>> + struct virtio_net_hdr *gso;
>> + struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
>> + size_t len = iov_iter_count(from);
>> + int headroom = vhost_sock_xdp(sock) ? XDP_PACKET_HEADROOM : 0;
>> + int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> + int pad = SKB_DATA_ALIGN(VHOST_NET_RX_PAD + headroom + nvq->sock_hlen);
>> + int sock_hlen = nvq->sock_hlen;
>> + void *buf;
>> + int copied;
>> +
>> + if (unlikely(len < nvq->sock_hlen))
>> + return -EFAULT;
>> +
>> + if (SKB_DATA_ALIGN(len + pad) +
>> + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) > PAGE_SIZE)
>> + return -ENOSPC;
>> +
>> + buflen += SKB_DATA_ALIGN(len + pad);
>> + alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
>> + if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
>> + return -ENOMEM;
>> +
>> + buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
>> +
>> + /* We store two kinds of metadata in the header which will be
>> + * used for XDP_PASS to do build_skb():
>> + * offset 0: buflen
>> + * offset sizeof(int): vnet header
> Define a struct for the metadata then?
Ok.
>
>
>> + */
>> + copied = copy_page_from_iter(alloc_frag->page,
>> + alloc_frag->offset + sizeof(int),
>> + sock_hlen, from);
>> + if (copied != sock_hlen)
>> + return -EFAULT;
>> +
>> + gso = (struct virtio_net_hdr *)(buf + sizeof(int));
>> +
>> + if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
>> + vhost16_to_cpu(vq, gso->csum_start) +
>> + vhost16_to_cpu(vq, gso->csum_offset) + 2 >
>> + vhost16_to_cpu(vq, gso->hdr_len)) {
>> + gso->hdr_len = cpu_to_vhost16(vq,
>> + vhost16_to_cpu(vq, gso->csum_start) +
>> + vhost16_to_cpu(vq, gso->csum_offset) + 2);
>> +
>> + if (vhost16_to_cpu(vq, gso->hdr_len) > len)
>> + return -EINVAL;
>> + }
>> +
>> + len -= sock_hlen;
>> + copied = copy_page_from_iter(alloc_frag->page,
>> + alloc_frag->offset + pad,
>> + len, from);
>> + if (copied != len)
>> + return -EFAULT;
>> +
>> + xdp->data_hard_start = buf;
>> + xdp->data = buf + pad;
>> + xdp->data_end = xdp->data + len;
>> + *(int *)(xdp->data_hard_start) = buflen;
>> +
>> + get_page(alloc_frag->page);
>> + alloc_frag->offset += buflen;
>> +
>> + ++nvq->batched_xdp;
>> +
>> + return 0;
>> +}
>> +
>> static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>> {
>> struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>> @@ -556,10 +667,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>> size_t len, total_len = 0;
>> int err;
>> int sent_pkts = 0;
>> + bool bulking = (sock->sk->sk_sndbuf == INT_MAX);
> What does bulking mean?
The name is misleading, it means whether we can do batching. For
simplicity, I disable batching is sndbuf is not INT_MAX.
>>
>> for (;;) {
>> bool busyloop_intr = false;
>>
>> + if (nvq->done_idx == VHOST_NET_BATCH)
>> + vhost_tx_batch(net, nvq, sock, &msg);
>> +
>> head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
>> &busyloop_intr);
>> /* On error, stop handling until the next kick. */
>> @@ -577,14 +692,34 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>> break;
>> }
>>
>> - vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
>> - vq->heads[nvq->done_idx].len = 0;
>> -
>> total_len += len;
>> - if (tx_can_batch(vq, total_len))
>> - msg.msg_flags |= MSG_MORE;
>> - else
>> - msg.msg_flags &= ~MSG_MORE;
>> +
>> + /* For simplicity, TX batching is only enabled if
>> + * sndbuf is unlimited.
> What if sndbuf changes while this processing is going on?
We will get the correct sndbuf in the next run of handle_tx(). I think
this is safe.
Thanks
>
>> + */
>> + if (bulking) {
>> + err = vhost_net_build_xdp(nvq, &msg.msg_iter);
>> + if (!err) {
>> + goto done;
>> + } else if (unlikely(err != -ENOSPC)) {
>> + vhost_tx_batch(net, nvq, sock, &msg);
>> + vhost_discard_vq_desc(vq, 1);
>> + vhost_net_enable_vq(net, vq);
>> + break;
>> + }
>> +
>> + /* We can't build XDP buff, go for single
>> + * packet path but let's flush batched
>> + * packets.
>> + */
>> + vhost_tx_batch(net, nvq, sock, &msg);
>> + msg.msg_control = NULL;
>> + } else {
>> + if (tx_can_batch(vq, total_len))
>> + msg.msg_flags |= MSG_MORE;
>> + else
>> + msg.msg_flags &= ~MSG_MORE;
>> + }
>>
>> /* TODO: Check specific error and bomb out unless ENOBUFS? */
>> err = sock->ops->sendmsg(sock, &msg, len);
>> @@ -596,15 +731,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>> if (err != len)
>> pr_debug("Truncated TX packet: len %d != %zd\n",
>> err, len);
>> - if (++nvq->done_idx >= VHOST_NET_BATCH)
>> - vhost_net_signal_used(nvq);
>> +done:
>> + vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
>> + vq->heads[nvq->done_idx].len = 0;
>> + ++nvq->done_idx;
>> if (vhost_exceeds_weight(++sent_pkts, total_len)) {
>> vhost_poll_queue(&vq->poll);
>> break;
>> }
>> }
>>
>> - vhost_net_signal_used(nvq);
>> + vhost_tx_batch(net, nvq, sock, &msg);
>> }
>>
>> static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>> @@ -1111,6 +1248,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>> n->vqs[i].ubuf_info = NULL;
>> n->vqs[i].upend_idx = 0;
>> n->vqs[i].done_idx = 0;
>> + n->vqs[i].batched_xdp = 0;
>> n->vqs[i].vhost_hlen = 0;
>> n->vqs[i].sock_hlen = 0;
>> n->vqs[i].rx_ring = NULL;
>> --
>> 2.17.1
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH 2/7] mark root hnode explicitly
From: Al Viro @ 2018-09-07 3:04 UTC (permalink / raw)
To: Cong Wang; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <CAM_iQpUDpKB2y0Yy5YJjj7MALGsDLJKpdXYZkydmCo1N_uo=FA@mail.gmail.com>
On Thu, Sep 06, 2018 at 07:57:25PM -0700, Cong Wang wrote:
> > - if (root_ht == ht) {
> > + if (ht->is_root) {
>
>
> What's wrong with comparing pointers with root ht?
The fact that there may be more than one tcf_proto sharing tp->data.
> > NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
> > return -EINVAL;
> > }
> > @@ -795,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
> > NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
> > return -EINVAL;
> > }
> > + if (ht_down->is_root) {
>
> root ht is saved in tp->root, so you can compare ht_down with it too,
> if you want.
>
> If this check is all what you need, you don't need an extra flag.
Again, *which* tp? We can trivially check that we are not linking to/deleting
our own root, sure. But there's nothing to stop doing the same via another
tcf_proto...
^ permalink raw reply
* Re: [PATCH 2/7] mark root hnode explicitly
From: Cong Wang @ 2018-09-07 3:23 UTC (permalink / raw)
To: Al Viro; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <20180907030438.GX19965@ZenIV.linux.org.uk>
On Thu, Sep 6, 2018 at 8:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Sep 06, 2018 at 07:57:25PM -0700, Cong Wang wrote:
>
> > > - if (root_ht == ht) {
> > > + if (ht->is_root) {
> >
> >
> > What's wrong with comparing pointers with root ht?
>
> The fact that there may be more than one tcf_proto sharing tp->data.
Hmm? root ht is from tp->root, not tp->data.
Also, this very important information is missing in your one-line changelog...
>
> > > NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
> > > return -EINVAL;
> > > }
> > > @@ -795,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
> > > NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
> > > return -EINVAL;
> > > }
> > > + if (ht_down->is_root) {
> >
> > root ht is saved in tp->root, so you can compare ht_down with it too,
> > if you want.
> >
> > If this check is all what you need, you don't need an extra flag.
>
> Again, *which* tp? We can trivially check that we are not linking to/deleting
Pretty sure there is a 'tp' in u32_set_parms() parameter list.
Are you saying it is not what you want? If so, why?
More importantly, why this information is again missing in your
changelog? This patch is definitely not trivial, it deserves a detailed
changelog.
> our own root, sure. But there's nothing to stop doing the same via another
> tcf_proto...
To my best knowledge, the place where you set ->is_root=true
is precisely same with where we set tp->root=root_ht, and it doesn't
change after set. What am I missing here?
^ permalink raw reply
* Re: [PATCH] PCI: Reprogram bridge prefetch registers on resume
From: Daniel Drake @ 2018-09-07 8:06 UTC (permalink / raw)
To: Sinan Kaya
Cc: Mika Westerberg, Linux PM, linux-pci-u79uwXL29TY76Z2rM5mHXA,
Wysocki, Rafael J, nic_swsd-Rasf1IRRPZFBDgjK7y7TUQ, Keith Busch,
netdev-u79uwXL29TY76Z2rM5mHXA,
nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Bjorn Helgaas,
Andy Shevchenko, Linux Upstreaming Team,
davem-fT/PcQaiUtIeIZ0/mPfg9Q, Jon Derrick,
hkallweit1-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <3b37e4fd-c793-bd52-7d70-950f846a7d5a-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
On Fri, Sep 7, 2018 at 2:40 PM, Sinan Kaya <okaya@kernel.org> wrote:
> On 9/6/2018 10:36 PM, Daniel Drake wrote:
>>
>> + if (pci_dev->class == PCI_CLASS_BRIDGE_PCI << 8)
>> + pci_setup_bridge_mmio_pref(pci_dev);
>
>
> This should probably some kind of a quirk rather than default
> for the listed card as it sounds like you are dealing with
> broken hardware.
With that approach there's a sizeable list that your quirk list is
incomplete or out of date.
And when the bug bites, it's extremely cryptic. We've spent months
working on this issue and only found this magic register write mostly
through a stroke of good luck. Separately there's been a flurry of
mails around the r8169 MSI-X problem but as far as I can see nobody
suggested even looking at the values of the parent bridge prefetch
registers. And even if they did, they'd probably have said "values are
fine, nothing to see here" (exactly as we did 4 months ago when Nvidia
mentioned these registers as a possible cause - oops!).
So here I'm instead following a suggestion from Bjorn, after also
having confirmed the windows behaviour:
https://marc.info/?l=linux-pci&m=153574276126484&w=2
> Can we tell whether Windows rewrites this register unconditionally at
> resume-time? If so, it may be more robust for Linux to do the same.
> The whole thing is black magic, which I hate, but if it's our only
> choice, it may be better to have this applied everywhere so we don't
> keep stubbing our toes on new systems that require the quirk.
Also, we just spoke to Asus BIOS engineers who told us that the BIOS
does already restore the PCI bridge prefetch registers on resume. I
guess this is why the other registers like the (non-zero) prefetch
base address lower 32 bits do have the right value on resume even
before my patch. It sounds like a more subtle bug related to register
write timing or sequence, in that case it will be harder to define who
is responsible for the breakage and hence under which conditions the
quirk should apply.
Daniel
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
^ permalink raw reply
* Re: [PATCH net-next 07/11] tuntap: move XDP flushing out of tun_do_xdp()
From: Jason Wang @ 2018-09-07 3:31 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180906132209-mutt-send-email-mst@kernel.org>
On 2018年09月07日 01:48, Michael S. Tsirkin wrote:
> On Thu, Sep 06, 2018 at 12:05:22PM +0800, Jason Wang wrote:
>> This will allow adding batch flushing on top.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>> drivers/net/tun.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 21b125020b3b..ff1cbf3ebd50 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1646,7 +1646,6 @@ static u32 tun_do_xdp(struct tun_struct *tun,
>> switch (act) {
>> case XDP_REDIRECT:
>> *err = xdp_do_redirect(tun->dev, xdp, xdp_prog);
>> - xdp_do_flush_map();
>> if (*err)
>> break;
>> goto out;
>> @@ -1735,6 +1734,9 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>> act = tun_do_xdp(tun, tfile, xdp_prog, &xdp, &err);
>> if (err)
>> goto err_xdp;
>> +
>> + if (act == XDP_REDIRECT)
>> + xdp_do_flush_map();
>> if (act != XDP_PASS)
>> goto out;
> At this point the switch statement which used to contain all XDP things
> seems to be gone completely. Just rewrite with a bunch of if statements
> and all xdp handling spread out to where it makes sense?
But tun_do_xdp() will be reused in the batching path.
Thanks
>
>> --
>> 2.17.1
^ permalink raw reply
* Re: [PATCH net-next v3 1/2] netlink: ipv4 igmp join notifications
From: Roopa Prabhu @ 2018-09-07 3:40 UTC (permalink / raw)
To: Patrick Ruddy; +Cc: netdev, Jiří Pírko, Stephen Hemminger
In-Reply-To: <20180906091056.21109-1-pruddy@vyatta.att-mail.com>
On Thu, Sep 6, 2018 at 2:10 AM, Patrick Ruddy
<pruddy@vyatta.att-mail.com> wrote:
> Some userspace applications need to know about IGMP joins from the
> kernel for 2 reasons:
> 1. To allow the programming of multicast MAC filters in hardware
> 2. To form a multicast FORUS list for non link-local multicast
> groups to be sent to the kernel and from there to the interested
> party.
> (1) can be fulfilled but simply sending the hardware multicast MAC
> address to be programmed but (2) requires the L3 address to be sent
> since this cannot be constructed from the MAC address whereas the
> reverse translation is a standard library function.
>
> This commit provides addition and deletion of multicast addresses
> using the RTM_NEWMDB and RTM_DELMDB messages with AF_INET. It also
> provides the RTM_GETMDB extension to allow multicast join state to
> be read from the kernel.
>
> Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
> ---
> v3 rework to use RTM_***MDB messages as per review comments.
Patrick, this version seems to be using RTM_***MDB msgs with the
RTM_*ADDR format.
We cant do that...because existing RTM_MDB users will be confused.
My request was to evaluate RTM_***MDB msg format. see
nlmsg_populate_mdb_fill for details.
If you can wait a day or two I can share some experimental code that
moves high level RTM_*MDB msg handling into net/core/rtnetlink.c
similar to RTM_*FDB
>
> net/ipv4/igmp.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 139 insertions(+)
>
> diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
> index 4da39446da2d..aed819e2ea93 100644
> --- a/net/ipv4/igmp.c
> +++ b/net/ipv4/igmp.c
> @@ -86,6 +86,7 @@
> #include <linux/inetdevice.h>
> #include <linux/igmp.h>
> #include <linux/if_arp.h>
> +#include <net/netlink.h>
> #include <linux/rtnetlink.h>
> #include <linux/times.h>
> #include <linux/pkt_sched.h>
> @@ -1385,6 +1386,91 @@ static void ip_mc_hash_remove(struct in_device *in_dev,
> }
>
>
> +static int fill_addr(struct sk_buff *skb, struct net_device *dev, __be32 addr,
> + int type, unsigned int flags)
> +{
> + struct nlmsghdr *nlh;
> + struct ifaddrmsg *ifm;
> +
> + nlh = nlmsg_put(skb, 0, 0, type, sizeof(*ifm), flags);
> + if (!nlh)
> + return -EMSGSIZE;
> +
> + ifm = nlmsg_data(nlh);
> + ifm->ifa_family = AF_INET;
> + ifm->ifa_prefixlen = 32;
> + ifm->ifa_flags = IFA_F_PERMANENT;
> + ifm->ifa_scope = RT_SCOPE_LINK;
> + ifm->ifa_index = dev->ifindex;
> +
> + if (nla_put_in_addr(skb, IFA_ADDRESS, addr))
> + goto nla_put_failure;
> + nlmsg_end(skb, nlh);
> + return 0;
> +
> +nla_put_failure:
> + nlmsg_cancel(skb, nlh);
> + return -EMSGSIZE;
> +}
> +
> +static inline size_t addr_nlmsg_size(void)
> +{
> + return NLMSG_ALIGN(sizeof(struct ifaddrmsg))
> + + nla_total_size(sizeof(__be32));
> +}
> +
> +static void ip_mc_addr_notify(struct net_device *dev, __be32 addr, int type)
> +{
> + struct net *net = dev_net(dev);
> + struct sk_buff *skb;
> + int err = -ENOBUFS;
> +
> + skb = nlmsg_new(addr_nlmsg_size(), GFP_ATOMIC);
> + if (!skb)
> + goto errout;
> +
> + err = fill_addr(skb, dev, addr, type, 0);
> + if (err < 0) {
> + WARN_ON(err == -EMSGSIZE);
> + kfree_skb(skb);
> + goto errout;
> + }
> + rtnl_notify(skb, net, 0, RTNLGRP_MDB, NULL, GFP_ATOMIC);
> + return;
> +errout:
> + if (err < 0)
> + rtnl_set_sk_err(net, RTNLGRP_MDB, err);
> +}
> +
> +int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb,
> + struct net_device *dev)
> +{
> + int s_idx;
> + int idx = 0;
> + struct ip_mc_list *im;
> + struct in_device *in_dev;
> +
> + ASSERT_RTNL();
> +
> + s_idx = cb->args[2];
> + in_dev = __in_dev_get_rtnl(dev);
> +
> + for_each_pmc_rtnl(in_dev, im) {
> + if (idx < s_idx)
> + continue;
> + if (fill_addr(skb, dev, im->multiaddr, RTM_NEWMDB,
> + NLM_F_MULTI) < 0)
> + goto done;
> + nl_dump_check_consistent(cb, nlmsg_hdr(skb));
> + idx++;
> + }
> +
> + done:
> + cb->args[2] = idx;
> +
> + return skb->len;
> +}
> +
> /*
> * A socket has joined a multicast group on device dev.
> */
> @@ -1430,6 +1516,8 @@ static void __ip_mc_inc_group(struct in_device *in_dev, __be32 addr,
> igmpv3_del_delrec(in_dev, im);
> #endif
> igmp_group_added(im);
> +
> + ip_mc_addr_notify(in_dev->dev, addr, RTM_NEWMDB);
> if (!in_dev->dead)
> ip_rt_multicast_event(in_dev);
> out:
> @@ -1661,6 +1749,8 @@ void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
> in_dev->mc_count--;
> igmp_group_dropped(i);
> ip_mc_clear_src(i);
> + ip_mc_addr_notify(in_dev->dev, addr,
> + RTM_DELMDB);
>
> if (!in_dev->dead)
> ip_rt_multicast_event(in_dev);
> @@ -3051,6 +3141,53 @@ static struct notifier_block igmp_notifier = {
> .notifier_call = igmp_netdev_event,
> };
>
> +static int igmp_mc_dump_ifaddrs(struct sk_buff *skb,
> + struct netlink_callback *cb)
> +{
> + struct net *net = sock_net(skb->sk);
> + int h, s_h;
> + int idx, s_idx;
> + struct net_device *dev;
> + struct in_device *in_dev;
> + struct hlist_head *head;
> +
> + s_h = cb->args[0];
> + idx = cb->args[1];
> + s_idx = idx;
> +
> + for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
> + idx = 0;
> + head = &net->dev_index_head[h];
> + rcu_read_lock();
> + cb->seq = atomic_read(&net->ipv4.dev_addr_genid) ^
> + net->dev_base_seq;
> + hlist_for_each_entry_rcu(dev, head, index_hlist) {
> + if (idx < s_idx)
> + goto cont;
> + if (h > s_h || idx > s_idx)
> + cb->args[2] = 0;
> + in_dev = __in_dev_get_rcu(dev);
> + if (!in_dev)
> + goto cont;
> +
> + /* loop over multicast addresses */
> + if (ip_mc_dump_ifaddr(skb, cb, dev) < 0) {
> + rcu_read_unlock();
> + goto done;
> + }
> +cont:
> + idx++;
> + }
> + rcu_read_unlock();
> + }
> +
> +done:
> + cb->args[0] = h;
> + cb->args[1] = idx;
> +
> + return skb->len;
> +}
> +
> int __init igmp_mc_init(void)
> {
> #if defined(CONFIG_PROC_FS)
> @@ -3064,6 +3201,8 @@ int __init igmp_mc_init(void)
> goto reg_notif_fail;
> return 0;
>
> + rtnl_register(PF_INET, RTM_GETMDB, NULL, igmp_mc_dump_ifaddrs, 0);
> +
> reg_notif_fail:
> unregister_pernet_subsys(&igmp_net_ops);
> return err;
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH 2/7] mark root hnode explicitly
From: Al Viro @ 2018-09-07 3:49 UTC (permalink / raw)
To: Cong Wang; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <CAM_iQpV7H8v3Oebftd0_aHzwwU_Phr3-7a=Un2-TRmuuZr7VOw@mail.gmail.com>
On Thu, Sep 06, 2018 at 08:23:36PM -0700, Cong Wang wrote:
> Pretty sure there is a 'tp' in u32_set_parms() parameter list.
>
> Are you saying it is not what you want? If so, why?
>
> More importantly, why this information is again missing in your
> changelog? This patch is definitely not trivial, it deserves a detailed
> changelog.
>
>
> > our own root, sure. But there's nothing to stop doing the same via another
> > tcf_proto...
>
> To my best knowledge, the place where you set ->is_root=true
> is precisely same with where we set tp->root=root_ht, and it doesn't
> change after set. What am I missing here?
The fact that there can be two (or more) different tcf_proto instances sharing
->data, but not ->root. And since ->data is shared, u32_get() on one tp
will be able to return you ->root of *ANOTHER* one. So comparison with
tp->root doesn't protect you. Try this on mainline:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: u32 divisor 1
tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 801: u32
and watch the fun as soon as you get an incoming packet on eth0. That panic
is fixed by 1/7, but you get "Not allowed to delete root node" for removing
_your_ root, with "Can not delete in-use filter" for other's root (as in the
last line of the reproducer).
^ permalink raw reply
* [PATCH] netfilter: masquerade: don't flush all conntracks if only one address deleted on device
From: Tan Hu @ 2018-09-07 8:33 UTC (permalink / raw)
To: pablo, kadlec, fw, davem, kuznet, yoshfuji
Cc: netfilter-devel, coreteam, netdev, linux-kernel, zhong.weidong,
jiang.biao2
We configured iptables as below, which only allowed incoming data on
established connections:
iptables -t mangle -A PREROUTING -m state --state ESTABLISHED -j ACCEPT
iptables -t mangle -P PREROUTING DROP
When deleting a secondary address, current masquerade implements would
flush all conntracks on this device. All the established connections on
primary address also be deleted, then subsequent incoming data on the
connections would be dropped wrongly because it was identified as NEW
connection.
So when an address was delete, it should only flush connections related
with the address.
Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
---
net/ipv4/netfilter/nf_nat_masquerade_ipv4.c | 22 +++++++++++++++++++---
net/ipv6/netfilter/nf_nat_masquerade_ipv6.c | 19 ++++++++++++++++---
2 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c b/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c
index ad3aeff..a9d5e01 100644
--- a/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c
+++ b/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c
@@ -104,12 +104,26 @@ static int masq_device_event(struct notifier_block *this,
return NOTIFY_DONE;
}
+static int inet_cmp(struct nf_conn *ct, void *ptr)
+{
+ struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
+ struct net_device *dev = ifa->ifa_dev->dev;
+ struct nf_conntrack_tuple *tuple;
+
+ if (!device_cmp(ct, (void *)(long)dev->ifindex))
+ return 0;
+
+ tuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+
+ return ifa->ifa_address == tuple->dst.u3.ip;
+}
+
static int masq_inet_event(struct notifier_block *this,
unsigned long event,
void *ptr)
{
struct in_device *idev = ((struct in_ifaddr *)ptr)->ifa_dev;
- struct netdev_notifier_info info;
+ struct net *net = dev_net(idev->dev);
/* The masq_dev_notifier will catch the case of the device going
* down. So if the inetdev is dead and being destroyed we have
@@ -119,8 +133,10 @@ static int masq_inet_event(struct notifier_block *this,
if (idev->dead)
return NOTIFY_DONE;
- netdev_notifier_info_init(&info, idev->dev);
- return masq_device_event(this, event, &info);
+ if (event == NETDEV_DOWN)
+ nf_ct_iterate_cleanup_net(net, inet_cmp, ptr, 0, 0);
+
+ return NOTIFY_DONE;
}
static struct notifier_block masq_dev_notifier = {
diff --git a/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c b/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c
index e6eb7cf..3e4bf22 100644
--- a/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c
+++ b/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c
@@ -87,18 +87,30 @@ static struct notifier_block masq_dev_notifier = {
struct masq_dev_work {
struct work_struct work;
struct net *net;
+ struct in6_addr addr;
int ifindex;
};
+static int inet_cmp(struct nf_conn *ct, void *work)
+{
+ struct masq_dev_work *w = (struct masq_dev_work *)work;
+ struct nf_conntrack_tuple *tuple;
+
+ if (!device_cmp(ct, (void *)(long)w->ifindex))
+ return 0;
+
+ tuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+
+ return ipv6_addr_equal(&w->addr, &tuple->dst.u3.in6);
+}
+
static void iterate_cleanup_work(struct work_struct *work)
{
struct masq_dev_work *w;
- long index;
w = container_of(work, struct masq_dev_work, work);
- index = w->ifindex;
- nf_ct_iterate_cleanup_net(w->net, device_cmp, (void *)index, 0, 0);
+ nf_ct_iterate_cleanup_net(w->net, inet_cmp, (void *)w, 0, 0);
put_net(w->net);
kfree(w);
@@ -147,6 +159,7 @@ static int masq_inet_event(struct notifier_block *this,
INIT_WORK(&w->work, iterate_cleanup_work);
w->ifindex = dev->ifindex;
w->net = net;
+ w->addr = ifa->addr;
schedule_work(&w->work);
return NOTIFY_DONE;
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH 2/7] mark root hnode explicitly
From: Cong Wang @ 2018-09-07 4:14 UTC (permalink / raw)
To: Al Viro; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <20180907034958.GZ19965@ZenIV.linux.org.uk>
On Thu, Sep 6, 2018 at 8:50 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Sep 06, 2018 at 08:23:36PM -0700, Cong Wang wrote:
>
> > Pretty sure there is a 'tp' in u32_set_parms() parameter list.
> >
> > Are you saying it is not what you want? If so, why?
> >
> > More importantly, why this information is again missing in your
> > changelog? This patch is definitely not trivial, it deserves a detailed
> > changelog.
> >
> >
> > > our own root, sure. But there's nothing to stop doing the same via another
> > > tcf_proto...
> >
> > To my best knowledge, the place where you set ->is_root=true
> > is precisely same with where we set tp->root=root_ht, and it doesn't
> > change after set. What am I missing here?
>
> The fact that there can be two (or more) different tcf_proto instances sharing
> ->data, but not ->root. And since ->data is shared, u32_get() on one tp
They have to share tp->data, that is how we link hashtables together.
> will be able to return you ->root of *ANOTHER* one. So comparison with
> tp->root doesn't protect you. Try this on mainline:
Hmm, it is not u32_get(), it is u32_lookup_ht() which could get another
root ht... I see.
>
> tc qdisc add dev eth0 ingress
> tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: u32 divisor 1
> tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: u32 divisor 1
> tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 801: u32
>
> and watch the fun as soon as you get an incoming packet on eth0. That panic
> is fixed by 1/7, but you get "Not allowed to delete root node" for removing
> _your_ root, with "Can not delete in-use filter" for other's root (as in the
> last line of the reproducer).
Sure, please consider:
1. adding such a test case to tools/testing/selftests/tc-testing/
2. adding it in your changelog
This would save a lot of time for both of us. I don't need to ask you for
this if it is in the changelog, you don't have to explain it again.
This is a win-win.
^ permalink raw reply
* Re: [PATCH 6/7] get rid of tc_u_common ->rcu
From: Cong Wang @ 2018-09-07 4:18 UTC (permalink / raw)
To: Al Viro; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <20180905190414.5477-6-viro@ZenIV.linux.org.uk>
On Wed, Sep 5, 2018 at 12:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> From: Al Viro <viro@zeniv.linux.org.uk>
>
> unused
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> net/sched/cls_u32.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> index 8a1a573487bd..be9240ae1417 100644
> --- a/net/sched/cls_u32.c
> +++ b/net/sched/cls_u32.c
> @@ -98,7 +98,6 @@ struct tc_u_common {
> int refcnt;
> struct idr handle_idr;
> struct hlist_node hnode;
> - struct rcu_head rcu;
> };
Just FYI:
This was added when RCU was introduced to u32 fast path,
it looks like on fast path we never touch tc_u_common, we
only use tp->root, all the rest are slow paths with RTNL lock,
so it is probably fine to just remove it rather than converting
that kfree() to kfree_rcu().
^ permalink raw reply
* Re: [PATCH net-next v2] openvswitch: Derive IP protocol number for IPv6 later frags
From: Pravin Shelar @ 2018-09-07 4:22 UTC (permalink / raw)
To: Yi-Hung Wei; +Cc: Linux Kernel Network Developers, William Tu
In-Reply-To: <1536100421-16360-1-git-send-email-yihung.wei@gmail.com>
On Tue, Sep 4, 2018 at 3:37 PM Yi-Hung Wei <yihung.wei@gmail.com> wrote:
>
> Currently, OVS only parses the IP protocol number for the first
> IPv6 fragment, but sets the IP protocol number for the later fragments
> to be NEXTHDF_FRAGMENT. This patch tries to derive the IP protocol
> number for the IPV6 later frags so that we can match that.
>
> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Thanks.
^ permalink raw reply
* Re: [PATCH 6/7] get rid of tc_u_common ->rcu
From: Al Viro @ 2018-09-07 4:28 UTC (permalink / raw)
To: Cong Wang; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <CAM_iQpWnydTkLYwmZq2gQPnNBTS6E0iNNgy1zsmYFcXkKJAnTw@mail.gmail.com>
On Thu, Sep 06, 2018 at 09:18:47PM -0700, Cong Wang wrote:
> On Wed, Sep 5, 2018 at 12:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > From: Al Viro <viro@zeniv.linux.org.uk>
> >
> > unused
> >
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > ---
> > net/sched/cls_u32.c | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> > index 8a1a573487bd..be9240ae1417 100644
> > --- a/net/sched/cls_u32.c
> > +++ b/net/sched/cls_u32.c
> > @@ -98,7 +98,6 @@ struct tc_u_common {
> > int refcnt;
> > struct idr handle_idr;
> > struct hlist_node hnode;
> > - struct rcu_head rcu;
> > };
>
> Just FYI:
>
> This was added when RCU was introduced to u32 fast path,
> it looks like on fast path we never touch tc_u_common, we
> only use tp->root, all the rest are slow paths with RTNL lock,
> so it is probably fine to just remove it rather than converting
> that kfree() to kfree_rcu().
*nod*
In any case, if u32_classify() grows that dereference, we can always
re-add ->rcu at the same time.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox