* [net-next 03/11] fm10k: reduce the scope of qv local variable
From: Jeff Kirsher @ 2019-08-01 22:25 UTC (permalink / raw)
To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190801222548.15975-1-jeffrey.t.kirsher@intel.com>
From: Jacob Keller <jacob.e.keller@intel.com>
Reduce the scope of the qv vector pointer local variable in the
fm10k_set_coalesce function.
This was detected by cppcheck and resolves the following style warning
produced by that tool:
[fm10k_ethtool.c:658]: (style) The scope of the variable 'qv' can be
reduced.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
index 7b9440c0aee1..1f7e4a8f4557 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
@@ -650,7 +650,6 @@ static int fm10k_set_coalesce(struct net_device *dev,
struct ethtool_coalesce *ec)
{
struct fm10k_intfc *interface = netdev_priv(dev);
- struct fm10k_q_vector *qv;
u16 tx_itr, rx_itr;
int i;
@@ -676,7 +675,8 @@ static int fm10k_set_coalesce(struct net_device *dev,
/* update q_vectors */
for (i = 0; i < interface->num_q_vectors; i++) {
- qv = interface->q_vector[i];
+ struct fm10k_q_vector *qv = interface->q_vector[i];
+
qv->tx.itr = tx_itr;
qv->rx.itr = rx_itr;
}
--
2.21.0
^ permalink raw reply related
* [net-next 04/11] fm10k: reduce the scope of local err variable
From: Jeff Kirsher @ 2019-08-01 22:25 UTC (permalink / raw)
To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190801222548.15975-1-jeffrey.t.kirsher@intel.com>
From: Jacob Keller <jacob.e.keller@intel.com>
Reduce the scope of the local err variable in the fm10k_iov_alloc_data
function.
This was detected by cppcheck and resolves the following style warning
produced by that tool:
[fm10k_iov.c:426]: (style) The scope of the variable 'err' can be reduced.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
index 8de77155f2e7..afe1fafd2447 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_iov.c
@@ -1,5 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
-/* Copyright(c) 2013 - 2018 Intel Corporation. */
+/* Copyright(c) 2013 - 2019 Intel Corporation. */
#include "fm10k.h"
#include "fm10k_vf.h"
@@ -426,7 +426,7 @@ static s32 fm10k_iov_alloc_data(struct pci_dev *pdev, int num_vfs)
struct fm10k_iov_data *iov_data = interface->iov_data;
struct fm10k_hw *hw = &interface->hw;
size_t size;
- int i, err;
+ int i;
/* return error if iov_data is already populated */
if (iov_data)
@@ -452,6 +452,7 @@ static s32 fm10k_iov_alloc_data(struct pci_dev *pdev, int num_vfs)
/* loop through vf_info structures initializing each entry */
for (i = 0; i < num_vfs; i++) {
struct fm10k_vf_info *vf_info = &iov_data->vf_info[i];
+ int err;
/* Record VF VSI value */
vf_info->vsi = i + 1;
--
2.21.0
^ permalink raw reply related
* linux-next: Fixes tag needs some work in the net-next tree
From: Stephen Rothwell @ 2019-08-01 22:54 UTC (permalink / raw)
To: David Miller, Networking
Cc: Linux Next Mailing List, Linux Kernel Mailing List, Jon Maloy
[-- Attachment #1: Type: text/plain, Size: 432 bytes --]
Hi all,
In commit
7c5b42055964 ("tipc: reduce risk of wakeup queue starvation")
Fixes tag
Fixes: 365ad35 ("tipc: reduce risk of user starvation during link congestion")
has these problem(s):
- SHA1 should be at least 12 digits long
Can be fixed by setting core.abbrev to 12 (or more) or (for git v2.11
or later) just making sure it is not set (or set to "auto").
--
Cheers,
Stephen Rothwell
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* [PATCH net] gve: Fix case where desc_cnt and data_cnt can get out of sync
From: Catherine Sullivan @ 2019-08-01 23:07 UTC (permalink / raw)
To: netdev; +Cc: Catherine Sullivan, Sagi Shahar
desc_cnt and data_cnt should always be equal. In the case of a dropped
packet desc_cnt was still getting updated (correctly), data_cnt
was not. To eliminate this bug and prevent it from recurring this
patch combines them into one ring level cnt.
Signed-off-by: Catherine Sullivan <csully@google.com>
Reviewed-by: Sagi Shahar <sagis@google.com>
---
drivers/net/ethernet/google/gve/gve.h | 8 ++---
drivers/net/ethernet/google/gve/gve_ethtool.c | 4 +--
drivers/net/ethernet/google/gve/gve_rx.c | 34 ++++++++-----------
3 files changed, 20 insertions(+), 26 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 92372dc43be8..ebc37e256922 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -31,9 +31,6 @@
struct gve_rx_desc_queue {
struct gve_rx_desc *desc_ring; /* the descriptor ring */
dma_addr_t bus; /* the bus for the desc_ring */
- u32 cnt; /* free-running total number of completed packets */
- u32 fill_cnt; /* free-running total number of descriptors posted */
- u32 mask; /* masks the cnt to the size of the ring */
u8 seqno; /* the next expected seqno for this desc*/
};
@@ -60,8 +57,6 @@ struct gve_rx_data_queue {
dma_addr_t data_bus; /* dma mapping of the slots */
struct gve_rx_slot_page_info *page_info; /* page info of the buffers */
struct gve_queue_page_list *qpl; /* qpl assigned to this queue */
- u32 mask; /* masks the cnt to the size of the ring */
- u32 cnt; /* free-running total number of completed packets */
};
struct gve_priv;
@@ -73,6 +68,9 @@ struct gve_rx_ring {
struct gve_rx_data_queue data;
u64 rbytes; /* free-running bytes received */
u64 rpackets; /* free-running packets received */
+ u32 cnt; /* free-running total number of completed packets */
+ u32 fill_cnt; /* free-running total number of descs and buffs posted */
+ u32 mask; /* masks the cnt and fill_cnt to the size of the ring */
u32 q_num; /* queue index */
u32 ntfy_id; /* notification block index */
struct gve_queue_resources *q_resources; /* head and tail pointer idx */
diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
index 26540b856541..d8fa816f4473 100644
--- a/drivers/net/ethernet/google/gve/gve_ethtool.c
+++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
@@ -138,8 +138,8 @@ gve_get_ethtool_stats(struct net_device *netdev,
for (ring = 0; ring < priv->rx_cfg.num_queues; ring++) {
struct gve_rx_ring *rx = &priv->rx[ring];
- data[i++] = rx->desc.cnt;
- data[i++] = rx->desc.fill_cnt;
+ data[i++] = rx->cnt;
+ data[i++] = rx->fill_cnt;
}
} else {
i += priv->rx_cfg.num_queues * NUM_GVE_RX_CNTS;
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index 1914b8350da7..59564ac99d2a 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -37,7 +37,7 @@ static void gve_rx_free_ring(struct gve_priv *priv, int idx)
rx->data.qpl = NULL;
kvfree(rx->data.page_info);
- slots = rx->data.mask + 1;
+ slots = rx->mask + 1;
bytes = sizeof(*rx->data.data_ring) * slots;
dma_free_coherent(dev, bytes, rx->data.data_ring,
rx->data.data_bus);
@@ -64,7 +64,7 @@ static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
/* Allocate one page per Rx queue slot. Each page is split into two
* packet buffers, when possible we "page flip" between the two.
*/
- slots = rx->data.mask + 1;
+ slots = rx->mask + 1;
rx->data.page_info = kvzalloc(slots *
sizeof(*rx->data.page_info), GFP_KERNEL);
@@ -111,7 +111,7 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
rx->q_num = idx;
slots = priv->rx_pages_per_qpl;
- rx->data.mask = slots - 1;
+ rx->mask = slots - 1;
/* alloc rx data ring */
bytes = sizeof(*rx->data.data_ring) * slots;
@@ -125,7 +125,7 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
err = -ENOMEM;
goto abort_with_slots;
}
- rx->desc.fill_cnt = filled_pages;
+ rx->fill_cnt = filled_pages;
/* Ensure data ring slots (packet buffers) are visible. */
dma_wmb();
@@ -156,8 +156,8 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
err = -ENOMEM;
goto abort_with_q_resources;
}
- rx->desc.mask = slots - 1;
- rx->desc.cnt = 0;
+ rx->mask = slots - 1;
+ rx->cnt = 0;
rx->desc.seqno = 1;
gve_rx_add_to_block(priv, idx);
@@ -213,7 +213,7 @@ void gve_rx_write_doorbell(struct gve_priv *priv, struct gve_rx_ring *rx)
{
u32 db_idx = be32_to_cpu(rx->q_resources->db_index);
- iowrite32be(rx->desc.fill_cnt, &priv->db_bar2[db_idx]);
+ iowrite32be(rx->fill_cnt, &priv->db_bar2[db_idx]);
}
static enum pkt_hash_types gve_rss_type(__be16 pkt_flags)
@@ -273,7 +273,7 @@ static void gve_rx_flip_buff(struct gve_rx_slot_page_info *page_info,
}
static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
- netdev_features_t feat)
+ netdev_features_t feat, u32 idx)
{
struct gve_rx_slot_page_info *page_info;
struct gve_priv *priv = rx->gve;
@@ -282,14 +282,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
struct sk_buff *skb;
int pagecount;
u16 len;
- u32 idx;
/* drop this packet */
if (unlikely(rx_desc->flags_seq & GVE_RXF_ERR))
return true;
len = be16_to_cpu(rx_desc->len) - GVE_RX_PAD;
- idx = rx->data.cnt & rx->data.mask;
page_info = &rx->data.page_info[idx];
/* gvnic can only receive into registered segments. If the buffer
@@ -340,8 +338,6 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
if (!skb)
return true;
- rx->data.cnt++;
-
if (likely(feat & NETIF_F_RXCSUM)) {
/* NIC passes up the partial sum */
if (rx_desc->csum)
@@ -370,7 +366,7 @@ static bool gve_rx_work_pending(struct gve_rx_ring *rx)
__be16 flags_seq;
u32 next_idx;
- next_idx = rx->desc.cnt & rx->desc.mask;
+ next_idx = rx->cnt & rx->mask;
desc = rx->desc.desc_ring + next_idx;
flags_seq = desc->flags_seq;
@@ -385,8 +381,8 @@ bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
{
struct gve_priv *priv = rx->gve;
struct gve_rx_desc *desc;
- u32 cnt = rx->desc.cnt;
- u32 idx = cnt & rx->desc.mask;
+ u32 cnt = rx->cnt;
+ u32 idx = cnt & rx->mask;
u32 work_done = 0;
u64 bytes = 0;
@@ -401,10 +397,10 @@ bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
rx->q_num, GVE_SEQNO(desc->flags_seq),
rx->desc.seqno);
bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
- if (!gve_rx(rx, desc, feat))
+ if (!gve_rx(rx, desc, feat, idx))
gve_schedule_reset(priv);
cnt++;
- idx = cnt & rx->desc.mask;
+ idx = cnt & rx->mask;
desc = rx->desc.desc_ring + idx;
rx->desc.seqno = gve_next_seqno(rx->desc.seqno);
work_done++;
@@ -417,8 +413,8 @@ bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
rx->rpackets += work_done;
rx->rbytes += bytes;
u64_stats_update_end(&rx->statss);
- rx->desc.cnt = cnt;
- rx->desc.fill_cnt += work_done;
+ rx->cnt = cnt;
+ rx->fill_cnt += work_done;
/* restock desc ring slots */
dma_wmb(); /* Ensure descs are visible before ringing doorbell */
--
2.22.0.770.g0f2c4a37fd-goog
^ permalink raw reply related
* Re: [PATCH net-next v5 5/6] flow_offload: support get flow_block immediately
From: Jakub Kicinski @ 2019-08-01 23:11 UTC (permalink / raw)
To: wenxu; +Cc: jiri, pablo, fw, netfilter-devel, netdev, John Hurley
In-Reply-To: <1564628627-10021-6-git-send-email-wenxu@ucloud.cn>
On Thu, 1 Aug 2019 11:03:46 +0800, wenxu@ucloud.cn wrote:
> From: wenxu <wenxu@ucloud.cn>
>
> The new flow-indr-block can't get the tcf_block
> directly. It provide a callback list to find the flow_block immediately
> when the device register and contain a ingress block.
>
> Signed-off-by: wenxu <wenxu@ucloud.cn>
First of all thanks for splitting the series up into more patches,
it is easier to follow the logic now!
> @@ -328,6 +348,7 @@ struct flow_indr_block_dev {
>
> INIT_LIST_HEAD(&indr_dev->cb_list);
> indr_dev->dev = dev;
> + flow_get_default_block(indr_dev);
> if (rhashtable_insert_fast(&indr_setup_block_ht, &indr_dev->ht_node,
> flow_indr_setup_block_ht_params)) {
> kfree(indr_dev);
I wonder if it's still practical to keep the block information in the
indr_dev structure at all. The way this used to work was:
[hash table of devices] --------------
| | netdev |
| | refcnt |
indir_dev[tun0]| ------ | cached block | ---- [ TC block ]
| | callbacks | .
| -------------- \__ [cb, cb_priv, cb_ident]
| [cb, cb_priv, cb_ident]
| --------------
| | netdev |
| | refcnt |
indir_dev[tun1]| ------ | cached block | ---- [ TC block ]
| | callbacks |.
----------------- -------------- \__ [cb, cb_priv, cb_ident]
[cb, cb_priv, cb_ident]
In the example above we have two tunnels tun0 and tun1, each one has a
indr_dev structure allocated, and for each one of them two drivers
registered for callbacks (hence the callbacks list has two entries).
We used to cache the TC block in the indr_dev structure, but now that
there are multiple subsytems using the indr_dev we either have to have
a list of cached blocks (with entries for each subsystem) or just always
iterate over the subsystems :(
After all the same device may have both a TC block and a NFT block.
I think always iterating would be easier:
The indr_dev struct would no longer have the block pointer, instead
when new driver registers for the callback instead of:
if (indr_dev->ing_cmd_cb)
indr_dev->ing_cmd_cb(indr_dev->dev, indr_dev->flow_block,
indr_block_cb->cb, indr_block_cb->cb_priv,
FLOW_BLOCK_BIND);
We'd have something like the loop in flow_get_default_block():
for each (subsystem)
subsystem->handle_new_indir_cb(indr_dev, cb);
And then per-subsystem logic would actually call the cb. Or:
for each (subsystem)
block = get_default_block(indir_dev)
indr_dev->ing_cmd_cb(...)
I hope this makes sense.
Also please double check nft unload logic has an RCU synchronization
point, I'm not 100% confident rcu_barrier() implies synchronize_rcu().
Perhaps someone more knowledgeable can chime in :)
^ permalink raw reply
* Re: [PATCH 2/2] cnic: Use refcount_t for refcount
From: Michael Chan @ 2019-08-01 23:15 UTC (permalink / raw)
To: Chuhong Yuan; +Cc: David S . Miller, Netdev, open list
In-Reply-To: <CANhBUQ1J8hXBZv4x3pJhG_08ZS1zR=9Uj2EUta2sgtyND_QKPw@mail.gmail.com>
On Wed, Jul 31, 2019 at 7:22 PM Chuhong Yuan <hslester96@gmail.com> wrote:
>
> Michael Chan <michael.chan@broadcom.com> 于2019年8月1日周四 上午1:58写道:
> >
> > On Wed, Jul 31, 2019 at 5:22 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> >
> > > static void cnic_ctx_wr(struct cnic_dev *dev, u32 cid_addr, u32 off, u32 val)
> > > @@ -494,7 +494,7 @@ int cnic_register_driver(int ulp_type, struct cnic_ulp_ops *ulp_ops)
> > > }
> > > read_unlock(&cnic_dev_lock);
> > >
> > > - atomic_set(&ulp_ops->ref_count, 0);
> > > + refcount_set(&ulp_ops->ref_count, 0);
> > > rcu_assign_pointer(cnic_ulp_tbl[ulp_type], ulp_ops);
> > > mutex_unlock(&cnic_lock);
> > >
> >
> > Willem's comment applies here too. The driver needs to be modified to
> > count from 1 to use refcount_t.
> >
> > Thanks.
>
> I have revised this problem but find the other two refcounts -
> cnic_dev::ref_count
> and cnic_sock::ref_count have no set.
> I am not sure where to initialize them to 1.
>
> Besides, should ulp_ops->ref_count be set to 0 when unregistered?
I will send a patch to fix up the initialization of all the atomic ref
counts. After that, you can add your patch to convert to refcount_t.
^ permalink raw reply
* Re: [v2,0/2] tools: bpftool: add net attach/detach command to attach XDP prog
From: Jakub Kicinski @ 2019-08-01 23:21 UTC (permalink / raw)
To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190801081133.13200-1-danieltimlee@gmail.com>
On Thu, 1 Aug 2019 17:11:31 +0900, Daniel T. Lee wrote:
> Currently, bpftool net only supports dumping progs attached on the
> interface. To attach XDP prog on interface, user must use other tool
> (eg. iproute2). By this patch, with `bpftool net attach/detach`, user
> can attach/detach XDP prog on interface.
>
> $ ./bpftool prog
> ...
> 208: xdp name xdp_prog1 tag ad822e38b629553f gpl
> loaded_at 2019-07-28T18:03:11+0900 uid 0
> ...
> $ ./bpftool net attach id 208 xdpdrv enp6s0np1
> $ ./bpftool net
> xdp:
> enp6s0np1(5) driver id 208
> ...
> $ ./bpftool net detach xdpdrv enp6s0np1
> $ ./bpftool net
> xdp:
> ...
>
> While this patch only contains support for XDP, through `net
> attach/detach`, bpftool can further support other prog attach types.
>
> XDP attach/detach tested on Mellanox ConnectX-4 and Netronome Agilio.
Please provide documentation for man pages, and bash completions.
^ permalink raw reply
* Re: [v2,1/2] tools: bpftool: add net attach command to attach XDP on interface
From: Jakub Kicinski @ 2019-08-01 23:36 UTC (permalink / raw)
To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190801081133.13200-2-danieltimlee@gmail.com>
On Thu, 1 Aug 2019 17:11:32 +0900, Daniel T. Lee wrote:
> By this commit, using `bpftool net attach`, user can attach XDP prog on
> interface. New type of enum 'net_attach_type' has been made, as stated at
> cover-letter, the meaning of 'attach' is, prog will be attached on interface.
>
> BPF prog will be attached through libbpf 'bpf_set_link_xdp_fd'.
>
> Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> ---
> Changes in v2:
> - command 'load' changed to 'attach' for the consistency
> - 'NET_ATTACH_TYPE_XDP_DRIVE' changed to 'NET_ATTACH_TYPE_XDP_DRIVER'
>
> tools/bpf/bpftool/net.c | 107 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 106 insertions(+), 1 deletion(-)
>
> diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
> index 67e99c56bc88..f3b57660b303 100644
> --- a/tools/bpf/bpftool/net.c
> +++ b/tools/bpf/bpftool/net.c
> @@ -55,6 +55,35 @@ struct bpf_attach_info {
> __u32 flow_dissector_id;
> };
>
> +enum net_attach_type {
> + NET_ATTACH_TYPE_XDP,
> + NET_ATTACH_TYPE_XDP_GENERIC,
> + NET_ATTACH_TYPE_XDP_DRIVER,
> + NET_ATTACH_TYPE_XDP_OFFLOAD,
> + __MAX_NET_ATTACH_TYPE
> +};
> +
> +static const char * const attach_type_strings[] = {
> + [NET_ATTACH_TYPE_XDP] = "xdp",
> + [NET_ATTACH_TYPE_XDP_GENERIC] = "xdpgeneric",
> + [NET_ATTACH_TYPE_XDP_DRIVER] = "xdpdrv",
> + [NET_ATTACH_TYPE_XDP_OFFLOAD] = "xdpoffload",
> + [__MAX_NET_ATTACH_TYPE] = NULL,
Not sure if the terminator is necessary,
ARRAY_SIZE(attach_type_strings) should suffice?
> +};
> +
> +static enum net_attach_type parse_attach_type(const char *str)
> +{
> + enum net_attach_type type;
> +
> + for (type = 0; type < __MAX_NET_ATTACH_TYPE; type++) {
> + if (attach_type_strings[type] &&
> + is_prefix(str, attach_type_strings[type]))
> + return type;
> + }
> +
> + return __MAX_NET_ATTACH_TYPE;
> +}
> +
> static int dump_link_nlmsg(void *cookie, void *msg, struct nlattr **tb)
> {
> struct bpf_netdev_t *netinfo = cookie;
> @@ -223,6 +252,77 @@ static int query_flow_dissector(struct bpf_attach_info *attach_info)
> return 0;
> }
>
> +static int parse_attach_args(int argc, char **argv, int *progfd,
> + enum net_attach_type *attach_type, int *ifindex)
> +{
> + if (!REQ_ARGS(3))
> + return -EINVAL;
> +
> + *progfd = prog_parse_fd(&argc, &argv);
> + if (*progfd < 0)
> + return *progfd;
> +
> + *attach_type = parse_attach_type(*argv);
> + if (*attach_type == __MAX_NET_ATTACH_TYPE) {
> + p_err("invalid net attach/detach type");
> + return -EINVAL;
You should close the progfd on error paths.
> + }
Hm. I'm not too sure about the ordering of arguments, type should
probably be right after attach.
If we ever add tc attach support or some other hook, that's more
fundamental part of the command than the program. So I think:
bpftool net attach xdp id xyz dev ethN
> + NEXT_ARG();
> + if (!REQ_ARGS(1))
> + return -EINVAL;
Error message needed here.
> + *ifindex = if_nametoindex(*argv);
> + if (!*ifindex) {
> + p_err("Invalid ifname");
"ifname" is not mentioned in help, it'd be best to keep this error
message consistent with bpftool prog load.
> + return -EINVAL;
> + }
Please require the dev keyword before the interface name.
That'll make it feel closer to prog load syntax.
> + return 0;
> +}
> +
> +static int do_attach_detach_xdp(int *progfd, enum net_attach_type *attach_type,
> + int *ifindex)
> +{
> + __u32 flags;
> + int err;
> +
> + flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
Please add this as an option so that user can decide whether overwrite
is allowed or not.
> + if (*attach_type == NET_ATTACH_TYPE_XDP_GENERIC)
> + flags |= XDP_FLAGS_SKB_MODE;
> + if (*attach_type == NET_ATTACH_TYPE_XDP_DRIVER)
> + flags |= XDP_FLAGS_DRV_MODE;
> + if (*attach_type == NET_ATTACH_TYPE_XDP_OFFLOAD)
> + flags |= XDP_FLAGS_HW_MODE;
> +
> + err = bpf_set_link_xdp_fd(*ifindex, *progfd, flags);
> +
> + return err;
no need for the err variable here.
> +}
> +
> +static int do_attach(int argc, char **argv)
> +{
> + enum net_attach_type attach_type;
> + int err, progfd, ifindex;
> +
> + err = parse_attach_args(argc, argv, &progfd, &attach_type, &ifindex);
> + if (err)
> + return err;
Probably not the best idea to move this out into a helper.
> + if (is_prefix("xdp", attach_type_strings[attach_type]))
> + err = do_attach_detach_xdp(&progfd, &attach_type, &ifindex);
Hm. We either need an error to be reported if it's not xdp or since we
only accept XDP now perhaps the if() is superfluous?
> + if (err < 0) {
> + p_err("link set %s failed", attach_type_strings[attach_type]);
"link set"? So you are familiar with iproute2 syntax! :)
> + return -1;
> + }
> +
> + if (json_output)
> + jsonw_null(json_wtr);
> +
> + return 0;
> +}
> +
> static int do_show(int argc, char **argv)
> {
> struct bpf_attach_info attach_info = {};
> @@ -305,13 +405,17 @@ static int do_help(int argc, char **argv)
>
> fprintf(stderr,
> "Usage: %s %s { show | list } [dev <devname>]\n"
> + " %s %s attach PROG LOAD_TYPE <devname>\n"
> " %s %s help\n"
> + "\n"
> + " " HELP_SPEC_PROGRAM "\n"
> + " LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n"
ATTACH_TYPE now?
Perhaps a new line before the "Note"?
> "Note: Only xdp and tc attachments are supported now.\n"
> " For progs attached to cgroups, use \"bpftool cgroup\"\n"
> " to dump program attachments. For program types\n"
> " sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
> " consult iproute2.\n",
> - bin_name, argv[-2], bin_name, argv[-2]);
> + bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
>
> return 0;
> }
> @@ -319,6 +423,7 @@ static int do_help(int argc, char **argv)
> static const struct cmd cmds[] = {
> { "show", do_show },
> { "list", do_show },
> + { "attach", do_attach },
> { "help", do_help },
> { 0 }
> };
^ permalink raw reply
* Re: [PATCH] net: sched: use temporary variable for actions indexes
From: Cong Wang @ 2019-08-01 23:41 UTC (permalink / raw)
To: dmitrolin
Cc: Linux Kernel Network Developers, David Miller, Jiri Pirko,
Jamal Hadi Salim, Vlad Buslov
In-Reply-To: <1564664571-31508-1-git-send-email-dmitrolin@mellanox.com>
On Thu, Aug 1, 2019 at 6:03 AM <dmitrolin@mellanox.com> wrote:
>
> From: Dmytro Linkin <dmitrolin@mellanox.com>
>
> Currently init call of all actions (except ipt) init their 'parm'
> structure as a direct pointer to nla data in skb. This leads to race
> condition when some of the filter actions were initialized successfully
> (and were assigned with idr action index that was written directly
> into nla data), but then were deleted and retried (due to following
> action module missing or classifier-initiated retry), in which case
> action init code tries to insert action to idr with index that was
> assigned on previous iteration. During retry the index can be reused
> by another action that was inserted concurrently, which causes
> unintended action sharing between filters.
> To fix described race condition, save action idr index to temporary
> stack-allocated variable instead on nla data.
>
> Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action")
> Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
This is a sad side-effect we have to deal with for this retry logic,
we have to restore all global status in each retry loop. :-(
Thanks.
^ permalink raw reply
* [PATCH v5 1/3] mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
From: john.hubbard @ 2019-08-01 23:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Viro, Björn Töpel, Boaz Harrosh,
Christoph Hellwig, Daniel Vetter, Dan Williams, Dave Chinner,
David Airlie, David S . Miller, Ilya Dryomov, Jan Kara,
Jason Gunthorpe, Jens Axboe, Jérôme Glisse,
Johannes Thumshirn, Magnus Karlsson, Matthew Wilcox,
Miklos Szeredi, Ming Lei, Sage Weil, Santosh Shilimkar, Yan Zheng,
netdev, dri-devel, linux-mm, linux-rdma, bpf, LKML, John Hubbard,
Ira Weiny
In-Reply-To: <20190801234735.2149-1-jhubbard@nvidia.com>
From: John Hubbard <jhubbard@nvidia.com>
Provide more capable variation of put_user_pages_dirty_lock(),
and delete put_user_pages_dirty(). This is based on the
following:
1. Lots of call sites become simpler if a bool is passed
into put_user_page*(), instead of making the call site
choose which put_user_page*() variant to call.
2. Christoph Hellwig's observation that set_page_dirty_lock()
is usually correct, and set_page_dirty() is usually a
bug, or at least questionable, within a put_user_page*()
calling chain.
This leads to the following API choices:
* put_user_pages_dirty_lock(page, npages, make_dirty)
* There is no put_user_pages_dirty(). You have to
hand code that, in the rare case that it's
required.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
drivers/infiniband/core/umem.c | 5 +-
drivers/infiniband/hw/hfi1/user_pages.c | 5 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 13 +--
drivers/infiniband/hw/usnic/usnic_uiom.c | 5 +-
drivers/infiniband/sw/siw/siw_mem.c | 18 +---
include/linux/mm.h | 5 +-
mm/gup.c | 115 +++++++++------------
7 files changed, 60 insertions(+), 106 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 08da840ed7ee..965cf9dea71a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,10 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
page = sg_page_iter_page(&sg_iter);
- if (umem->writable && dirty)
- put_user_pages_dirty_lock(&page, 1);
- else
- put_user_page(page);
+ put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
}
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index b89a9b9aef7a..469acb961fbd 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,10 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
size_t npages, bool dirty)
{
- if (dirty)
- put_user_pages_dirty_lock(p, npages);
- else
- put_user_pages(p, npages);
+ put_user_pages_dirty_lock(p, npages, dirty);
if (mm) { /* during close after signal, mm can be NULL */
atomic64_sub(npages, &mm->pinned_vm);
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index bfbfbb7e0ff4..26c1fb8d45cc 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -37,15 +37,6 @@
#include "qib.h"
-static void __qib_release_user_pages(struct page **p, size_t num_pages,
- int dirty)
-{
- if (dirty)
- put_user_pages_dirty_lock(p, num_pages);
- else
- put_user_pages(p, num_pages);
-}
-
/**
* qib_map_page - a safety wrapper around pci_map_page()
*
@@ -124,7 +115,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
return 0;
bail_release:
- __qib_release_user_pages(p, got, 0);
+ put_user_pages_dirty_lock(p, got, false);
bail:
atomic64_sub(num_pages, ¤t->mm->pinned_vm);
return ret;
@@ -132,7 +123,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
void qib_release_user_pages(struct page **p, size_t num_pages)
{
- __qib_release_user_pages(p, num_pages, 1);
+ put_user_pages_dirty_lock(p, num_pages, true);
/* during close after signal, mm can be NULL */
if (current->mm)
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 0b0237d41613..62e6ffa9ad78 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -75,10 +75,7 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
for_each_sg(chunk->page_list, sg, chunk->nents, i) {
page = sg_page(sg);
pa = sg_phys(sg);
- if (dirty)
- put_user_pages_dirty_lock(&page, 1);
- else
- put_user_page(page);
+ put_user_pages_dirty_lock(&page, 1, dirty);
usnic_dbg("pa: %pa\n", &pa);
}
kfree(chunk);
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index 67171c82b0c4..2284966e4499 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -60,20 +60,6 @@ struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int stag_index)
return NULL;
}
-static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
- bool dirty)
-{
- struct page **p = chunk->plist;
-
- while (num_pages--) {
- if (!PageDirty(*p) && dirty)
- put_user_pages_dirty_lock(p, 1);
- else
- put_user_page(*p);
- p++;
- }
-}
-
void siw_umem_release(struct siw_umem *umem, bool dirty)
{
struct mm_struct *mm_s = umem->owning_mm;
@@ -82,8 +68,8 @@ void siw_umem_release(struct siw_umem *umem, bool dirty)
for (i = 0; num_pages; i++) {
int to_free = min_t(int, PAGES_PER_CHUNK, num_pages);
- siw_free_plist(&umem->page_chunk[i], to_free,
- umem->writable && dirty);
+ put_user_pages_dirty_lock(&umem->page_chunk[i], to_free,
+ umem->writable && dirty);
kfree(umem->page_chunk[i].plist);
num_pages -= to_free;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..9759b6a24420 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1057,8 +1057,9 @@ static inline void put_user_page(struct page *page)
put_page(page);
}
-void put_user_pages_dirty(struct page **pages, unsigned long npages);
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+ bool make_dirty);
+
void put_user_pages(struct page **pages, unsigned long npages);
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..7fefd7ab02c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,85 +29,70 @@ struct follow_page_context {
unsigned int page_mask;
};
-typedef int (*set_dirty_func_t)(struct page *page);
-
-static void __put_user_pages_dirty(struct page **pages,
- unsigned long npages,
- set_dirty_func_t sdf)
-{
- unsigned long index;
-
- for (index = 0; index < npages; index++) {
- struct page *page = compound_head(pages[index]);
-
- /*
- * Checking PageDirty at this point may race with
- * clear_page_dirty_for_io(), but that's OK. Two key cases:
- *
- * 1) This code sees the page as already dirty, so it skips
- * the call to sdf(). That could happen because
- * clear_page_dirty_for_io() called page_mkclean(),
- * followed by set_page_dirty(). However, now the page is
- * going to get written back, which meets the original
- * intention of setting it dirty, so all is well:
- * clear_page_dirty_for_io() goes on to call
- * TestClearPageDirty(), and write the page back.
- *
- * 2) This code sees the page as clean, so it calls sdf().
- * The page stays dirty, despite being written back, so it
- * gets written back again in the next writeback cycle.
- * This is harmless.
- */
- if (!PageDirty(page))
- sdf(page);
-
- put_user_page(page);
- }
-}
-
/**
- * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
- * @pages: array of pages to be marked dirty and released.
+ * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
+ * @pages: array of pages to be maybe marked dirty, and definitely released.
* @npages: number of pages in the @pages array.
+ * @make_dirty: whether to mark the pages dirty
*
* "gup-pinned page" refers to a page that has had one of the get_user_pages()
* variants called on that page.
*
* For each page in the @pages array, make that page (or its head page, if a
- * compound page) dirty, if it was previously listed as clean. Then, release
- * the page using put_user_page().
+ * compound page) dirty, if @make_dirty is true, and if the page was previously
+ * listed as clean. In any case, releases all pages using put_user_page(),
+ * possibly via put_user_pages(), for the non-dirty case.
*
* Please see the put_user_page() documentation for details.
*
- * set_page_dirty(), which does not lock the page, is used here.
- * Therefore, it is the caller's responsibility to ensure that this is
- * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), put_user_page().
*
*/
-void put_user_pages_dirty(struct page **pages, unsigned long npages)
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+ bool make_dirty)
{
- __put_user_pages_dirty(pages, npages, set_page_dirty);
-}
-EXPORT_SYMBOL(put_user_pages_dirty);
+ unsigned long index;
-/**
- * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
- * @pages: array of pages to be marked dirty and released.
- * @npages: number of pages in the @pages array.
- *
- * For each page in the @pages array, make that page (or its head page, if a
- * compound page) dirty, if it was previously listed as clean. Then, release
- * the page using put_user_page().
- *
- * Please see the put_user_page() documentation for details.
- *
- * This is just like put_user_pages_dirty(), except that it invokes
- * set_page_dirty_lock(), instead of set_page_dirty().
- *
- */
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
-{
- __put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+ /*
+ * TODO: this can be optimized for huge pages: if a series of pages is
+ * physically contiguous and part of the same compound page, then a
+ * single operation to the head page should suffice.
+ */
+
+ if (!make_dirty) {
+ put_user_pages(pages, npages);
+ return;
+ }
+
+ for (index = 0; index < npages; index++) {
+ struct page *page = compound_head(pages[index]);
+ /*
+ * Checking PageDirty at this point may race with
+ * clear_page_dirty_for_io(), but that's OK. Two key
+ * cases:
+ *
+ * 1) This code sees the page as already dirty, so it
+ * skips the call to set_page_dirty(). That could happen
+ * because clear_page_dirty_for_io() called
+ * page_mkclean(), followed by set_page_dirty().
+ * However, now the page is going to get written back,
+ * which meets the original intention of setting it
+ * dirty, so all is well: clear_page_dirty_for_io() goes
+ * on to call TestClearPageDirty(), and write the page
+ * back.
+ *
+ * 2) This code sees the page as clean, so it calls
+ * set_page_dirty(). The page stays dirty, despite being
+ * written back, so it gets written back again in the
+ * next writeback cycle. This is harmless.
+ */
+ if (!PageDirty(page))
+ set_page_dirty_lock(page);
+ put_user_page(page);
+ }
}
EXPORT_SYMBOL(put_user_pages_dirty_lock);
--
2.22.0
^ permalink raw reply related
* [PATCH v5 3/3] net/xdp: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-08-01 23:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Viro, Björn Töpel, Boaz Harrosh,
Christoph Hellwig, Daniel Vetter, Dan Williams, Dave Chinner,
David Airlie, David S . Miller, Ilya Dryomov, Jan Kara,
Jason Gunthorpe, Jens Axboe, Jérôme Glisse,
Johannes Thumshirn, Magnus Karlsson, Matthew Wilcox,
Miklos Szeredi, Ming Lei, Sage Weil, Santosh Shilimkar, Yan Zheng,
netdev, dri-devel, linux-mm, linux-rdma, bpf, LKML, John Hubbard
In-Reply-To: <20190801234735.2149-1-jhubbard@nvidia.com>
From: John Hubbard <jhubbard@nvidia.com>
For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().
This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
net/xdp/xdp_umem.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 83de74ca729a..17c4b3d3dc34 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -166,14 +166,7 @@ void xdp_umem_clear_dev(struct xdp_umem *umem)
static void xdp_umem_unpin_pages(struct xdp_umem *umem)
{
- unsigned int i;
-
- for (i = 0; i < umem->npgs; i++) {
- struct page *page = umem->pgs[i];
-
- set_page_dirty_lock(page);
- put_page(page);
- }
+ put_user_pages_dirty_lock(umem->pgs, umem->npgs, true);
kfree(umem->pgs);
umem->pgs = NULL;
--
2.22.0
^ permalink raw reply related
* [PATCH v5 2/3] drivers/gpu/drm/via: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-08-01 23:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Viro, Björn Töpel, Boaz Harrosh,
Christoph Hellwig, Daniel Vetter, Dan Williams, Dave Chinner,
David Airlie, David S . Miller, Ilya Dryomov, Jan Kara,
Jason Gunthorpe, Jens Axboe, Jérôme Glisse,
Johannes Thumshirn, Magnus Karlsson, Matthew Wilcox,
Miklos Szeredi, Ming Lei, Sage Weil, Santosh Shilimkar, Yan Zheng,
netdev, dri-devel, linux-mm, linux-rdma, bpf, LKML, John Hubbard
In-Reply-To: <20190801234735.2149-1-jhubbard@nvidia.com>
From: John Hubbard <jhubbard@nvidia.com>
For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().
This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").
Also reverse the order of a comparison, in order to placate
checkpatch.pl.
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
drivers/gpu/drm/via/via_dmablit.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 062067438f1d..b5b5bf0ba65e 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -171,7 +171,6 @@ via_map_blit_for_device(struct pci_dev *pdev,
static void
via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t *vsg)
{
- struct page *page;
int i;
switch (vsg->state) {
@@ -186,13 +185,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t *vsg)
kfree(vsg->desc_pages);
/* fall through */
case dr_via_pages_locked:
- for (i = 0; i < vsg->num_pages; ++i) {
- if (NULL != (page = vsg->pages[i])) {
- if (!PageReserved(page) && (DMA_FROM_DEVICE == vsg->direction))
- SetPageDirty(page);
- put_page(page);
- }
- }
+ put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
+ (vsg->direction == DMA_FROM_DEVICE));
/* fall through */
case dr_via_pages_alloc:
vfree(vsg->pages);
--
2.22.0
^ permalink raw reply related
* [PATCH v5 0/3] mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
From: john.hubbard @ 2019-08-01 23:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Viro, Björn Töpel, Boaz Harrosh,
Christoph Hellwig, Daniel Vetter, Dan Williams, Dave Chinner,
David Airlie, David S . Miller, Ilya Dryomov, Jan Kara,
Jason Gunthorpe, Jens Axboe, Jérôme Glisse,
Johannes Thumshirn, Magnus Karlsson, Matthew Wilcox,
Miklos Szeredi, Ming Lei, Sage Weil, Santosh Shilimkar, Yan Zheng,
netdev, dri-devel, linux-mm, linux-rdma, bpf, LKML, John Hubbard
From: John Hubbard <jhubbard@nvidia.com>
Changes since v4:
* Christophe Hellwig's review applied: deleted siw_free_plist() and
__qib_release_user_pages(), now that put_user_pages_dirty_lock() does
what those routines were doing.
* Applied Bjorn's ACK for net/xdp, and Christophe's Reviewed-by for patch
#1.
Changes since v3:
* Fixed an unused variable warning in siw_mem.c
Changes since v2:
* Critical bug fix: remove a stray "break;" from the new routine.
Changes since v1:
* Instead of providing __put_user_pages(), add an argument to
put_user_pages_dirty_lock(), and delete put_user_pages_dirty().
This is based on the following points:
1. Lots of call sites become simpler if a bool is passed
into put_user_page*(), instead of making the call site
choose which put_user_page*() variant to call.
2. Christoph Hellwig's observation that set_page_dirty_lock()
is usually correct, and set_page_dirty() is usually a
bug, or at least questionable, within a put_user_page*()
calling chain.
* Added the Infiniband driver back to the patch series, because it is
a caller of put_user_pages_dirty_lock().
Unchanged parts from the v1 cover letter (except for the diffstat):
Notes about the remaining patches to come:
There are about 50+ patches in my tree [2], and I'll be sending out the
remaining ones in a few more groups:
* The block/bio related changes (Jerome mostly wrote those, but I've
had to move stuff around extensively, and add a little code)
* mm/ changes
* other subsystem patches
* an RFC that shows the current state of the tracking patch set. That
can only be applied after all call sites are converted, but it's
good to get an early look at it.
This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").
John Hubbard (3):
mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
drivers/gpu/drm/via: convert put_page() to put_user_page*()
net/xdp: convert put_page() to put_user_page*()
drivers/gpu/drm/via/via_dmablit.c | 10 +-
drivers/infiniband/core/umem.c | 5 +-
drivers/infiniband/hw/hfi1/user_pages.c | 5 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 13 +--
drivers/infiniband/hw/usnic/usnic_uiom.c | 5 +-
drivers/infiniband/sw/siw/siw_mem.c | 18 +---
include/linux/mm.h | 5 +-
mm/gup.c | 115 +++++++++------------
net/xdp/xdp_umem.c | 9 +-
9 files changed, 63 insertions(+), 122 deletions(-)
--
2.22.0
^ permalink raw reply
* Re: [PATCH v3 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Alexei Starovoitov @ 2019-08-01 23:50 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, netdev, ast, daniel, yhs, songliubraving, andrii.nakryiko,
kernel-team
In-Reply-To: <20190801064803.2519675-3-andriin@fb.com>
On Wed, Jul 31, 2019 at 11:47:53PM -0700, Andrii Nakryiko wrote:
> This patch implements the core logic for BPF CO-RE offsets relocations.
> Every instruction that needs to be relocated has corresponding
> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> to match recorded "local" relocation spec against potentially many
> compatible "target" types, creating corresponding spec. Details of the
> algorithm are noted in corresponding comments in the code.
>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> Acked-by: Song Liu <songliubraving@fb.com>
...
> + if (btf_is_composite(t)) {
> + const struct btf_member *m = (void *)(t + 1);
> + __u32 offset;
> +
> + if (access_idx >= BTF_INFO_VLEN(t->info))
> + return -EINVAL;
> +
> + m = &m[access_idx];
> +
> + if (BTF_INFO_KFLAG(t->info)) {
> + if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> + return -EINVAL;
> + offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> + } else {
> + offset = m->offset;
> + }
very similar logic exists in btf_dump.c
probably makes sense to make a common helper at some point.
> +static size_t bpf_core_essential_name_len(const char *name)
> +{
> + size_t n = strlen(name);
> + int i = n - 3;
> +
> + while (i > 0) {
> + if (name[i] == '_' && name[i + 1] == '_' && name[i + 2] == '_')
> + return i;
> + i--;
> + }
> + return n;
> +}
that's a bit of an eye irritant. How about?
size_t n = strlen(name);
int i, cnt = 0;
for (i = n - 1; i >= 0; i--) {
if (name[i] == '_') {
cnt++;
} else {
if (cnt == 3)
return i + 1;
cnt = 0;
}
}
return n;
> + case BTF_KIND_ARRAY: {
> + const struct btf_array *loc_a, *targ_a;
> +
> + loc_a = (void *)(local_type + 1);
> + targ_a = (void *)(targ_type + 1);
> + local_id = loc_a->type;
> + targ_id = targ_a->type;
can we add a helper like:
const struct btf_array *btf_array(cosnt struct btf_type *t)
{
return (const struct btf_array *)(t + 1);
}
then above will be:
case BTF_KIND_ARRAY: {
local_id = btf_array(local_type)->type;
targ_id = btf_array(targ_type)->type;
and a bunch of code in btf.c and btf_dump.c would be cleaner as well?
> + goto recur;
> + }
> + default:
> + pr_warning("unexpected kind %d relocated, local [%d], target [%d]\n",
> + kind, local_id, targ_id);
> + return 0;
> + }
> +}
> +
> +/*
> + * Given single high-level named field accessor in local type, find
> + * corresponding high-level accessor for a target type. Along the way,
> + * maintain low-level spec for target as well. Also keep updating target
> + * offset.
> + *
> + * Searching is performed through recursive exhaustive enumeration of all
> + * fields of a struct/union. If there are any anonymous (embedded)
> + * structs/unions, they are recursively searched as well. If field with
> + * desired name is found, check compatibility between local and target types,
> + * before returning result.
> + *
> + * 1 is returned, if field is found.
> + * 0 is returned if no compatible field is found.
> + * <0 is returned on error.
> + */
> +static int bpf_core_match_member(const struct btf *local_btf,
> + const struct bpf_core_accessor *local_acc,
> + const struct btf *targ_btf,
> + __u32 targ_id,
> + struct bpf_core_spec *spec,
> + __u32 *next_targ_id)
> +{
> + const struct btf_type *local_type, *targ_type;
> + const struct btf_member *local_member, *m;
> + const char *local_name, *targ_name;
> + __u32 local_id;
> + int i, n, found;
> +
> + targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> + if (!targ_type)
> + return -EINVAL;
> + if (!btf_is_composite(targ_type))
> + return 0;
> +
> + local_id = local_acc->type_id;
> + local_type = btf__type_by_id(local_btf, local_id);
> + local_member = (void *)(local_type + 1);
> + local_member += local_acc->idx;
> + local_name = btf__name_by_offset(local_btf, local_member->name_off);
> +
> + n = BTF_INFO_VLEN(targ_type->info);
> + m = (void *)(targ_type + 1);
new btf_member() helper?
> + for (i = 0; i < n; i++, m++) {
> + __u32 offset;
> +
> + /* bitfield relocations not supported */
> + if (BTF_INFO_KFLAG(targ_type->info)) {
> + if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> + continue;
> + offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> + } else {
> + offset = m->offset;
> + }
> + if (offset % 8)
> + continue;
same bit of code again?
definitely could use a helper.
> + for (i = 0; i < local_spec->len; i++, local_acc++, targ_acc++) {
> + targ_type = skip_mods_and_typedefs(targ_spec->btf, targ_id,
> + &targ_id);
> + if (!targ_type)
> + return -EINVAL;
> +
> + if (local_acc->name) {
> + matched = bpf_core_match_member(local_spec->btf,
> + local_acc,
> + targ_btf, targ_id,
> + targ_spec, &targ_id);
> + if (matched <= 0)
> + return matched;
> + } else {
> + /* for i=0, targ_id is already treated as array element
> + * type (because it's the original struct), for others
> + * we should find array element type first
> + */
> + if (i > 0) {
> + const struct btf_array *a;
> +
> + if (!btf_is_array(targ_type))
> + return 0;
> +
> + a = (void *)(targ_type + 1);
> + if (local_acc->idx >= a->nelems)
> + return 0;
am I reading it correctly that the local spec requested out-of-bounds
index in the target array type?
Why this is 'ignore' instead of -EINVAL?
> +/*
> + * Probe few well-known locations for vmlinux kernel image and try to load BTF
> + * data out of it to use for target BTF.
> + */
> +static struct btf *bpf_core_find_kernel_btf(void)
> +{
> + const char *locations[] = {
> + "/lib/modules/%1$s/vmlinux-%1$s",
> + "/usr/lib/modules/%1$s/kernel/vmlinux",
> + };
> + char path[PATH_MAX + 1];
> + struct utsname buf;
> + struct btf *btf;
> + int i, err;
> +
> + err = uname(&buf);
> + if (err) {
> + pr_warning("failed to uname(): %d\n", err);
defensive programming ?
I think uname() can fail only if &buf points to non-existing page like null.
> + return ERR_PTR(err);
> + }
> +
> + for (i = 0; i < ARRAY_SIZE(locations); i++) {
> + snprintf(path, PATH_MAX, locations[i], buf.release);
> + pr_debug("attempting to load kernel BTF from '%s'\n", path);
I think this debug message would have been more useful after access().
> +
> + if (access(path, R_OK))
> + continue;
> +
> + btf = btf__parse_elf(path, NULL);
> + if (IS_ERR(btf))
> + continue;
> +
> + pr_debug("successfully loaded kernel BTF from '%s'\n", path);
> + return btf;
> + }
> +
> + pr_warning("failed to find valid kernel BTF\n");
> + return ERR_PTR(-ESRCH);
> +}
> +
> +/* Output spec definition in the format:
> + * [<type-id>] (<type-name>) + <raw-spec> => <offset>@<spec>,
> + * where <spec> is a C-syntax view of recorded field access, e.g.: x.a[3].b
> + */
> +static void bpf_core_dump_spec(int level, const struct bpf_core_spec *spec)
> +{
> + const struct btf_type *t;
> + const char *s;
> + __u32 type_id;
> + int i;
> +
> + type_id = spec->spec[0].type_id;
> + t = btf__type_by_id(spec->btf, type_id);
> + s = btf__name_by_offset(spec->btf, t->name_off);
> + libbpf_print(level, "[%u] (%s) + ", type_id, s);
imo extra []() don't improve readability of the dump.
> +
> + for (i = 0; i < spec->raw_len; i++)
> + libbpf_print(level, "%d%s", spec->raw_spec[i],
> + i == spec->raw_len - 1 ? " => " : ":");
> +
> + libbpf_print(level, "%u @ &x", spec->offset);
> +
> + for (i = 0; i < spec->len; i++) {
> + if (spec->spec[i].name)
> + libbpf_print(level, ".%s", spec->spec[i].name);
> + else
> + libbpf_print(level, "[%u]", spec->spec[i].idx);
> + }
> +
> +}
> +
> +static size_t bpf_core_hash_fn(const void *key, void *ctx)
> +{
> + return (size_t)key;
> +}
> +
> +static bool bpf_core_equal_fn(const void *k1, const void *k2, void *ctx)
> +{
> + return k1 == k2;
> +}
> +
> +static void *u32_to_ptr(__u32 x)
> +{
> + return (void *)(uintptr_t)x;
> +}
u32 to pointer on 64-bit arch?!
That surely needs a comment.
> +
> +/*
> + * CO-RE relocate single instruction.
> + *
> + * The outline and important points of the algorithm:
> + * 1. For given local type, find corresponding candidate target types.
> + * Candidate type is a type with the same "essential" name, ignoring
> + * everything after last triple underscore (___). E.g., `sample`,
> + * `sample___flavor_one`, `sample___flavor_another_one`, are all candidates
> + * for each other. Names with triple underscore are referred to as
> + * "flavors" and are useful, among other things, to allow to
> + * specify/support incompatible variations of the same kernel struct, which
> + * might differ between different kernel versions and/or build
> + * configurations.
> + *
> + * N.B. Struct "flavors" could be generated by bpftool's BTF-to-C
> + * converter, when deduplicated BTF of a kernel still contains more than
> + * one different types with the same name. In that case, ___2, ___3, etc
> + * are appended starting from second name conflict. But start flavors are
> + * also useful to be defined "locally", in BPF program, to extract same
> + * data from incompatible changes between different kernel
> + * versions/configurations. For instance, to handle field renames between
> + * kernel versions, one can use two flavors of the struct name with the
> + * same common name and use conditional relocations to extract that field,
> + * depending on target kernel version.
there are actual kernel types that have ___ in the name.
Ex: struct lmc___media
We probably need to revisit this 'flavor' convention.
> + for (i = 0, j = 0; i < cand_ids->len; i++) {
> + cand_id = cand_ids->data[i];
> + cand_type = btf__type_by_id(targ_btf, cand_id);
> + cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> +
> + err = bpf_core_spec_match(&local_spec, targ_btf,
> + cand_id, &cand_spec);
> + if (err < 0) {
> + pr_warning("prog '%s': relo #%d: failed to match spec ",
> + prog_name, relo_idx);
> + bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> + libbpf_print(LIBBPF_WARN,
> + " to candidate #%d [%d] (%s): %d\n",
> + i, cand_id, cand_name, err);
> + return err;
> + }
> + if (err == 0) {
> + pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> + prog_name, relo_idx, i, cand_id, cand_name);
> + bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> + libbpf_print(LIBBPF_DEBUG, "\n");
> + continue;
> + }
> +
> + pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> + prog_name, relo_idx, i);
did you mention that you're going to make a helper for this debug dumps?
^ permalink raw reply
* [PATCH net-next] selftests: Add l2tp tests
From: David Ahern @ 2019-08-01 23:54 UTC (permalink / raw)
To: davem; +Cc: netdev, David Ahern
From: David Ahern <dsahern@gmail.com>
Add IPv4 and IPv6 l2tp tests. Current set is over IP and with
IPsec.
Signed-off-by: David Ahern <dsahern@gmail.com>
---
The ipsec tests expose a netdev refcount leak that I have not had
time to track down, but the tests themselves are good.
tools/testing/selftests/net/l2tp.sh | 382 ++++++++++++++++++++++++++++++++++++
1 file changed, 382 insertions(+)
create mode 100644 tools/testing/selftests/net/l2tp.sh
diff --git a/tools/testing/selftests/net/l2tp.sh b/tools/testing/selftests/net/l2tp.sh
new file mode 100644
index 000000000000..5782433886fc
--- /dev/null
+++ b/tools/testing/selftests/net/l2tp.sh
@@ -0,0 +1,382 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# L2TPv3 tunnel between 2 hosts
+#
+# host-1 | router | host-2
+# | |
+# lo l2tp | | l2tp lo
+# 172.16.101.1 172.16.1.1 | | 172.16.1.2 172.16.101.2
+# fc00:101::1 fc00:1::1 | | fc00:1::2 fc00:101::2
+# | |
+# eth0 | | eth0
+# 10.1.1.1 | | 10.1.2.1
+# 2001:db8:1::1 | | 2001:db8:2::1
+
+VERBOSE=0
+PAUSE_ON_FAIL=no
+
+which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || ping6=$(which ping)
+
+################################################################################
+#
+log_test()
+{
+ local rc=$1
+ local expected=$2
+ local msg="$3"
+
+ if [ ${rc} -eq ${expected} ]; then
+ printf "TEST: %-60s [ OK ]\n" "${msg}"
+ nsuccess=$((nsuccess+1))
+ else
+ ret=1
+ nfail=$((nfail+1))
+ printf "TEST: %-60s [FAIL]\n" "${msg}"
+ if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
+ echo
+ echo "hit enter to continue, 'q' to quit"
+ read a
+ [ "$a" = "q" ] && exit 1
+ fi
+ fi
+}
+
+run_cmd()
+{
+ local ns
+ local cmd
+ local out
+ local rc
+
+ ns="$1"
+ shift
+ cmd="$*"
+
+ if [ "$VERBOSE" = "1" ]; then
+ printf " COMMAND: $cmd\n"
+ fi
+
+ out=$(eval ip netns exec ${ns} ${cmd} 2>&1)
+ rc=$?
+ if [ "$VERBOSE" = "1" -a -n "$out" ]; then
+ echo " $out"
+ fi
+
+ [ "$VERBOSE" = "1" ] && echo
+
+ return $rc
+}
+
+################################################################################
+# create namespaces and interconnects
+
+create_ns()
+{
+ local ns=$1
+ local addr=$2
+ local addr6=$3
+
+ [ -z "${addr}" ] && addr="-"
+ [ -z "${addr6}" ] && addr6="-"
+
+ ip netns add ${ns}
+
+ ip -netns ${ns} link set lo up
+ if [ "${addr}" != "-" ]; then
+ ip -netns ${ns} addr add dev lo ${addr}
+ fi
+ if [ "${addr6}" != "-" ]; then
+ ip -netns ${ns} -6 addr add dev lo ${addr6}
+ fi
+
+ ip -netns ${ns} ro add unreachable default metric 8192
+ ip -netns ${ns} -6 ro add unreachable default metric 8192
+
+ ip netns exec ${ns} sysctl -qw net.ipv4.ip_forward=1
+ ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1
+ ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.forwarding=1
+ ip netns exec ${ns} sysctl -qw net.ipv6.conf.default.forwarding=1
+ ip netns exec ${ns} sysctl -qw net.ipv6.conf.default.accept_dad=0
+}
+
+# create veth pair to connect namespaces and apply addresses.
+connect_ns()
+{
+ local ns1=$1
+ local ns1_dev=$2
+ local ns1_addr=$3
+ local ns1_addr6=$4
+ local ns2=$5
+ local ns2_dev=$6
+ local ns2_addr=$7
+ local ns2_addr6=$8
+
+ ip -netns ${ns1} li add ${ns1_dev} type veth peer name tmp
+ ip -netns ${ns1} li set ${ns1_dev} up
+ ip -netns ${ns1} li set tmp netns ${ns2} name ${ns2_dev}
+ ip -netns ${ns2} li set ${ns2_dev} up
+
+ if [ "${ns1_addr}" != "-" ]; then
+ ip -netns ${ns1} addr add dev ${ns1_dev} ${ns1_addr}
+ ip -netns ${ns2} addr add dev ${ns2_dev} ${ns2_addr}
+ fi
+
+ if [ "${ns1_addr6}" != "-" ]; then
+ ip -netns ${ns1} addr add dev ${ns1_dev} ${ns1_addr6}
+ ip -netns ${ns2} addr add dev ${ns2_dev} ${ns2_addr6}
+ fi
+}
+
+################################################################################
+# test setup
+
+cleanup()
+{
+ local ns
+
+ for ns in host-1 host-2 router
+ do
+ ip netns del ${ns} 2>/dev/null
+ done
+}
+
+setup_l2tp_ipv4()
+{
+ #
+ # configure l2tpv3 tunnel on host-1
+ #
+ ip -netns host-1 l2tp add tunnel tunnel_id 1041 peer_tunnel_id 1042 \
+ encap ip local 10.1.1.1 remote 10.1.2.1
+ ip -netns host-1 l2tp add session name l2tp4 tunnel_id 1041 \
+ session_id 1041 peer_session_id 1042
+ ip -netns host-1 link set dev l2tp4 up
+ ip -netns host-1 addr add dev l2tp4 172.16.1.1 peer 172.16.1.2
+
+ #
+ # configure l2tpv3 tunnel on host-2
+ #
+ ip -netns host-2 l2tp add tunnel tunnel_id 1042 peer_tunnel_id 1041 \
+ encap ip local 10.1.2.1 remote 10.1.1.1
+ ip -netns host-2 l2tp add session name l2tp4 tunnel_id 1042 \
+ session_id 1042 peer_session_id 1041
+ ip -netns host-2 link set dev l2tp4 up
+ ip -netns host-2 addr add dev l2tp4 172.16.1.2 peer 172.16.1.1
+
+ #
+ # add routes to loopback addresses
+ #
+ ip -netns host-1 ro add 172.16.101.2/32 via 172.16.1.2
+ ip -netns host-2 ro add 172.16.101.1/32 via 172.16.1.1
+}
+
+setup_l2tp_ipv6()
+{
+ #
+ # configure l2tpv3 tunnel on host-1
+ #
+ ip -netns host-1 l2tp add tunnel tunnel_id 1061 peer_tunnel_id 1062 \
+ encap ip local 2001:db8:1::1 remote 2001:db8:2::1
+ ip -netns host-1 l2tp add session name l2tp6 tunnel_id 1061 \
+ session_id 1061 peer_session_id 1062
+ ip -netns host-1 link set dev l2tp6 up
+ ip -netns host-1 addr add dev l2tp6 fc00:1::1 peer fc00:1::2
+
+ #
+ # configure l2tpv3 tunnel on host-2
+ #
+ ip -netns host-2 l2tp add tunnel tunnel_id 1062 peer_tunnel_id 1061 \
+ encap ip local 2001:db8:2::1 remote 2001:db8:1::1
+ ip -netns host-2 l2tp add session name l2tp6 tunnel_id 1062 \
+ session_id 1062 peer_session_id 1061
+ ip -netns host-2 link set dev l2tp6 up
+ ip -netns host-2 addr add dev l2tp6 fc00:1::2 peer fc00:1::1
+
+ #
+ # add routes to loopback addresses
+ #
+ ip -netns host-1 -6 ro add fc00:101::2/128 via fc00:1::2
+ ip -netns host-2 -6 ro add fc00:101::1/128 via fc00:1::1
+}
+
+setup()
+{
+ # start clean
+ cleanup
+
+ set -e
+ create_ns host-1 172.16.101.1/32 fc00:101::1/128
+ create_ns host-2 172.16.101.2/32 fc00:101::2/128
+ create_ns router
+
+ connect_ns host-1 eth0 10.1.1.1/24 2001:db8:1::1/64 \
+ router eth1 10.1.1.2/24 2001:db8:1::2/64
+
+ connect_ns host-2 eth0 10.1.2.1/24 2001:db8:2::1/64 \
+ router eth2 10.1.2.2/24 2001:db8:2::2/64
+
+ ip -netns host-1 ro add 10.1.2.0/24 via 10.1.1.2
+ ip -netns host-1 -6 ro add 2001:db8:2::/64 via 2001:db8:1::2
+
+ ip -netns host-2 ro add 10.1.1.0/24 via 10.1.2.2
+ ip -netns host-2 -6 ro add 2001:db8:1::/64 via 2001:db8:2::2
+
+ setup_l2tp_ipv4
+ setup_l2tp_ipv6
+ set +e
+}
+
+setup_ipsec()
+{
+ #
+ # IPv4
+ #
+ run_cmd host-1 ip xfrm policy add \
+ src 10.1.1.1 dst 10.1.2.1 dir out \
+ tmpl proto esp mode transport
+
+ run_cmd host-1 ip xfrm policy add \
+ src 10.1.2.1 dst 10.1.1.1 dir in \
+ tmpl proto esp mode transport
+
+ run_cmd host-2 ip xfrm policy add \
+ src 10.1.1.1 dst 10.1.2.1 dir in \
+ tmpl proto esp mode transport
+
+ run_cmd host-2 ip xfrm policy add \
+ src 10.1.2.1 dst 10.1.1.1 dir out \
+ tmpl proto esp mode transport
+
+ ip -netns host-1 xfrm state add \
+ src 10.1.1.1 dst 10.1.2.1 \
+ spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' \
+ 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+ ip -netns host-1 xfrm state add \
+ src 10.1.2.1 dst 10.1.1.1 \
+ spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' \
+ 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+ ip -netns host-2 xfrm state add \
+ src 10.1.1.1 dst 10.1.2.1 \
+ spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' \
+ 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+ ip -netns host-2 xfrm state add \
+ src 10.1.2.1 dst 10.1.1.1 \
+ spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' \
+ 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+ #
+ # IPV6
+ #
+ run_cmd host-1 ip -6 xfrm policy add \
+ src 2001:db8:1::1 dst 2001:db8:2::1 dir out \
+ tmpl proto esp mode transport
+
+ run_cmd host-1 ip -6 xfrm policy add \
+ src 2001:db8:2::1 dst 2001:db8:1::1 dir in \
+ tmpl proto esp mode transport
+
+ run_cmd host-2 ip -6 xfrm policy add \
+ src 2001:db8:1::1 dst 2001:db8:2::1 dir in \
+ tmpl proto esp mode transport
+
+ run_cmd host-2 ip -6 xfrm policy add \
+ src 2001:db8:2::1 dst 2001:db8:1::1 dir out \
+ tmpl proto esp mode transport
+
+ ip -netns host-1 -6 xfrm state add \
+ src 2001:db8:1::1 dst 2001:db8:2::1 \
+ spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' \
+ 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+ ip -netns host-1 -6 xfrm state add \
+ src 2001:db8:2::1 dst 2001:db8:1::1 \
+ spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' \
+ 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+ ip -netns host-2 -6 xfrm state add \
+ src 2001:db8:1::1 dst 2001:db8:2::1 \
+ spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' \
+ 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+ ip -netns host-2 -6 xfrm state add \
+ src 2001:db8:2::1 dst 2001:db8:1::1 \
+ spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' \
+ 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+}
+
+teardown_ipsec()
+{
+ run_cmd host-1 ip xfrm state flush
+ run_cmd host-1 ip xfrm policy flush
+ run_cmd host-2 ip xfrm state flush
+ run_cmd host-2 ip xfrm policy flush
+}
+
+################################################################################
+# generate traffic through tunnel for various cases
+
+run_ping()
+{
+ local desc="$1"
+
+ run_cmd host-1 ping -c1 -w1 172.16.1.2
+ log_test $? 0 "IPv4 basic L2TP tunnel ${desc}"
+
+ run_cmd host-1 ping -c1 -w1 -I 172.16.101.1 172.16.101.2
+ log_test $? 0 "IPv4 route through L2TP tunnel ${desc}"
+
+ run_cmd host-1 ${ping6} -c1 -w1 fc00:1::2
+ log_test $? 0 "IPv6 basic L2TP tunnel ${desc}"
+
+ run_cmd host-1 ${ping6} -c1 -w1 -I fc00:101::1 fc00:101::2
+ log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
+}
+
+run_tests()
+{
+ local desc
+
+ setup
+ run_ping
+
+ setup_ipsec
+ run_ping "- with IPsec"
+ run_cmd host-1 ping -c1 -w1 172.16.1.2
+ log_test $? 0 "IPv4 basic L2TP tunnel ${desc}"
+
+ run_cmd host-1 ping -c1 -w1 -I 172.16.101.1 172.16.101.2
+ log_test $? 0 "IPv4 route through L2TP tunnel ${desc}"
+
+ run_cmd host-1 ${ping6} -c1 -w1 fc00:1::2
+ log_test $? 0 "IPv6 basic L2TP tunnel - with IPsec"
+
+ run_cmd host-1 ${ping6} -c1 -w1 -I fc00:101::1 fc00:101::2
+ log_test $? 0 "IPv6 route through L2TP tunnel - with IPsec"
+
+ teardown_ipsec
+ run_ping "- after IPsec teardown"
+}
+
+################################################################################
+# main
+
+declare -i nfail=0
+declare -i nsuccess=0
+
+while getopts :pv o
+do
+ case $o in
+ p) PAUSE_ON_FAIL=yes;;
+ v) VERBOSE=$(($VERBOSE + 1));;
+ *) exit 1;;
+ esac
+done
+
+run_tests
+cleanup
+
+printf "\nTests passed: %3d\n" ${nsuccess}
+printf "Tests failed: %3d\n" ${nfail}
--
2.11.0
^ permalink raw reply related
* Re: [net-next 2/9] i40e: make visible changed vf mac on host
From: Shannon Nelson @ 2019-08-02 0:11 UTC (permalink / raw)
To: Jeff Kirsher, davem
Cc: Aleksandr Loktionov, netdev, nhorman, sassmann, Andrew Bowers
In-Reply-To: <20190801205149.4114-3-jeffrey.t.kirsher@intel.com>
On 8/1/19 1:51 PM, Jeff Kirsher wrote:
> From: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>
> This patch makes changed VM mac address visible on host via
> ip link show command. This problem is fixed by copying last
> unicast mac filter to vf->default_lan_addr.addr. Without
> this patch if VF MAC was not set from host side and
> if you run ip link show $pf, on host side you'd always
> see a zero MAC, not the real VF MAC that VF assigned to
> itself.
>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> index 02b09a8ad54c..21f7ac514d1f 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> @@ -2629,6 +2629,9 @@ static int i40e_vc_add_mac_addr_msg(struct i40e_vf *vf, u8 *msg)
> } else {
> vf->num_mac++;
> }
> + if (is_valid_ether_addr(al->list[i].addr))
> + ether_addr_copy(vf->default_lan_addr.addr,
> + al->list[i].addr);
> }
> }
> spin_unlock_bh(&vsi->mac_filter_hash_lock);
Since this copy is done inside the for-loop, it looks like you are
copying every address in the list, not just the last one. This seems
wasteful and unnecessary.
Since it is possible, altho' unlikely, that the filter sync that happens
a little later could fail, might it be better to do the copy after you
know that the sync has succeeded?
Why is the last mac chosen for display rather than the first? Is there
anything special about the last mac as opposed to the first mac?
sln
^ permalink raw reply
* Re: [PATCH net-next 00/15] net: Add functional tests for L3 and L4
From: Alexei Starovoitov @ 2019-08-02 0:19 UTC (permalink / raw)
To: David Ahern; +Cc: davem, netdev, David Ahern
In-Reply-To: <20190801185648.27653-1-dsahern@kernel.org>
On Thu, Aug 01, 2019 at 11:56:33AM -0700, David Ahern wrote:
> From: David Ahern <dsahern@gmail.com>
>
> This is a port the functional test cases created during the development
> of the VRF feature. It covers various permutations of icmp, tcp and udp
> for IPv4 and IPv6 including negative tests.
Thanks a lot for doing this!
Is there expected output ?
All tests suppose to pass on the latest net-next?
I'm seeing:
./fcnal-test.sh
...
SYSCTL: net.ipv4.raw_l3mdev_accept=0
TEST: ping out - ns-B IP [ OK ]
TEST: ping out, device bind - ns-B IP [ OK ]
TEST: ping out, address bind - ns-B IP [FAIL]
TEST: ping out - ns-B loopback IP [ OK ]
TEST: ping out, device bind - ns-B loopback IP [ OK ]
TEST: ping out, address bind - ns-B loopback IP [FAIL]
TEST: ping in - ns-A IP [ OK ]
TEST: ping in - ns-A loopback IP [ OK ]
TEST: ping local - ns-A IP [ OK ]
TEST: ping local - ns-A loopback IP [ OK ]
TEST: ping local - loopback [ OK ]
TEST: ping local, device bind - ns-A IP [ OK ]
TEST: ping local, device bind - ns-A loopback IP [FAIL]
TEST: ping local, device bind - loopback [FAIL]
with -v I see:
COMMAND: ip netns exec ns-A ping -c1 -w1 -I 172.16.2.1 172.16.1.2
ping: unknown iface 172.16.2.1
TEST: ping out, address bind - ns-B IP [FAIL]
Any tips for further debug?
Do you really need 'sleep 1' everywhere?
It makes them so slow to run...
What happens if you just remove it ? Tests will fail? Why?
^ permalink raw reply
* Re: [PATCH net 0/2] flow_offload hardware priority fixes
From: Jakub Kicinski @ 2019-08-02 0:20 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: netfilter-devel, davem, netdev, marcelo.leitner, jiri, wenxu,
saeedm, paulb, gerlitz.or
In-Reply-To: <20190801112817.24976-1-pablo@netfilter.org>
On Thu, 1 Aug 2019 13:28:15 +0200, Pablo Neira Ayuso wrote:
> Please, apply, thank you.
I'm still waiting for a reply.
Perhaps since Pablo doesn't want to talk to me someone else can explain
to me why we want to seemingly diverge from the software model?
^ permalink raw reply
* Re: [PATCH net v3] net: bridge: move vlan init/deinit to NETDEV_REGISTER/UNREGISTER
From: Nikolay Aleksandrov @ 2019-08-02 0:38 UTC (permalink / raw)
To: netdev; +Cc: roopa, davem, bridge, michael-dev
In-Reply-To: <20190731224955.10908-1-nikolay@cumulusnetworks.com>
On 8/1/19 1:49 AM, Nikolay Aleksandrov wrote:
[snip]
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=204389
>
> Reported-by: michael-dev <michael-dev@fami-braun.de>
> Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid")
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
> I tried a few different approaches to resolve this but they were all
> unsuitable for some kernels, this approach can go to stables easily
> and IMO is the way this had to be done from the start. Alternatively
> we could move only the br_vlan_add and pair it with a br_vlan_del of
> default_pvid on the same events, but I don't think it hurts to move
> the whole init/deinit there as it'd help older stable releases as well.
>
> I also tested the br_vlan_init error handling after the move by always
> returning errors from all over it. Since errors at NETDEV_REGISTER cause
> NETDEV_UNREGISTER we can deinit vlans properly for all cases regardless
> why it happened (e.g. device destruction or init error).
>
> net/bridge/br.c | 5 ++++-
> net/bridge/br_device.c | 10 ----------
> net/bridge/br_private.h | 20 +++++---------------
> net/bridge/br_vlan.c | 25 ++++++++++++++++++-------
> 4 files changed, 27 insertions(+), 33 deletions(-)
>
Self-NAK, after thinking more about how to best handle this and running more
tests I believe it'll be better to go with the alternative I suggested above -
to move out only the default pvid add out of br_vlan_init to NETDEV_REGSITER
and pair it with br_vlan_delete in NETDEV_UNREGISTER. That way we'll split
the init/deinit in 2 steps, but we'll keep the current order and will reduce
the churn for this fix, functionally it should be equivalent as that is the
problematic part of the init which has to be done after the netdev has been
registered.
I'll spin v4 tomorrow after running more tests with it.
^ permalink raw reply
* Re: [net v1 PATCH 4/4] net: fix bpf_xdp_adjust_head regression for generic-XDP
From: Jakub Kicinski @ 2019-08-02 0:44 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, David S. Miller, xdp-newbies, Daniel Borkmann,
brandon.cazander, Alexei Starovoitov
In-Reply-To: <156468243184.27559.7002090473019021952.stgit@firesoul>
On Thu, 01 Aug 2019 20:00:31 +0200, Jesper Dangaard Brouer wrote:
> When generic-XDP was moved to a later processing step by commit
> 458bf2f224f0 ("net: core: support XDP generic on stacked devices.")
> a regression was introduced when using bpf_xdp_adjust_head.
>
> The issue is that after this commit the skb->network_header is now
> changed prior to calling generic XDP and not after. Thus, if the header
> is changed by XDP (via bpf_xdp_adjust_head), then skb->network_header
> also need to be updated again. Fix by calling skb_reset_network_header().
>
> Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.")
> Reported-by: Brandon Cazander <brandon.cazander@multapplied.net>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Out of curiosity what was your conclusion regarding resetting the
transport header as well?
^ permalink raw reply
* Re: [PATCH net-next 1/2] net: phy: broadcom: set features explicitly for BCM54616S
From: Tao Ren @ 2019-08-02 1:29 UTC (permalink / raw)
To: Heiner Kallweit, Andrew Lunn
Cc: Florian Fainelli, David S . Miller, Arun Parameswaran,
Justin Chen, Vladimir Oltean, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, Andrew Jeffery,
openbmc@lists.ozlabs.org
In-Reply-To: <88f4d709-d9bb-943c-37a9-aeebe8ca0ebc@fb.com>
On 7/31/19 10:20 PM, Tao Ren wrote:
> On 7/30/19 11:00 PM, Tao Ren wrote:
>> On 7/30/19 10:53 PM, Heiner Kallweit wrote:
>>> On 31.07.2019 02:12, Tao Ren wrote:
>>>> On 7/29/19 11:00 PM, Heiner Kallweit wrote:
>>>>> On 30.07.2019 07:05, Tao Ren wrote:
>>>>>> On 7/29/19 8:35 PM, Andrew Lunn wrote:
>>>>>>> On Mon, Jul 29, 2019 at 05:25:32PM -0700, Tao Ren wrote:
>>>>>>>> BCM54616S feature "PHY_GBIT_FEATURES" was removed by commit dcdecdcfe1fc
>>>>>>>> ("net: phy: switch drivers to use dynamic feature detection"). As dynamic
>>>>>>>> feature detection doesn't work when BCM54616S is working in RGMII-Fiber
>>>>>>>> mode (different sets of MII Control/Status registers being used), let's
>>>>>>>> set "PHY_GBIT_FEATURES" for BCM54616S explicitly.
>>>>>>>
>>>>>>> Hi Tao
>>>>>>>
>>>>>>> What exactly does it get wrong?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Andrew
>>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> BCM54616S is set to RGMII-Fiber (1000Base-X) mode on my platform, and none of the features (1000BaseT/100BaseT/10BaseT) can be detected by genphy_read_abilities(), because the PHY only reports 1000BaseX_Full|Half ability in this mode.
>>>>>>
>>>>> Are you going to use the PHY in copper or fibre mode?
>>>>> In case you use fibre mode, why do you need the copper modes set as supported?
>>>>> Or does the PHY just start in fibre mode and you want to switch it to copper mode?
>>>>
>>>> Hi Heiner,
>>>>
>>>> The phy starts in fiber mode and that's the mode I want.
>>>> My observation is: phydev->link is always 0 (Link status bit is never set in MII_BMSR) by using dynamic ability detection on my machine. I checked phydev->supported and it's set to "AutoNeg | TP | MII | Pause | Asym_Pause" by dynamic ability detection. Is it normal/expected? Or maybe the fix should go to different places? Thank you for your help.
>>>>
>>>
>>> Not sure whether you stated already which kernel version you're using.
>>> There's a brand-new extension to auto-detect 1000BaseX:
>>> f30e33bcdab9 ("net: phy: Add more 1000BaseX support detection")
>>> It's included in the 5.3-rc series.
>>
>> I'm running kernel 5.2.0. Thank you for the sharing and I didn't know the patch. Let me check it out.
>
> I applied above patch and ca72efb6bdc7 ("net: phy: Add detection of 1000BaseX link mode support") to my 5.2.0 tree but got following warning when booting up my machine:
>
> "PHY advertising (0,00000200,000062c0) more modes than genphy supports, some modes not advertised".
>
> The BCM54616S PHY on my machine only reports 1000-X features in RGMII->1000Base-KX mode. Is it a known problem?
>
> Anyways let me see if I missed some dependency/follow-up patches..
Let's ignore the patch ("net: phy: broadcom: set features explicitly for BCM54616S"): as Heiner pointed out, it doesn't make sense to turn on copper features for fiber mode (even though it "works" in my environment). I will work out new patch if 1000bx-auto-detection patches cannot solve my problem.
Thank you all for spending time on this.
Cheers,
Tao
^ permalink raw reply
* [PATCH 00/34] put_user_pages(): miscellaneous call sites
From: john.hubbard @ 2019-08-02 2:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
xen-devel, John Hubbard
From: John Hubbard <jhubbard@nvidia.com>
Hi,
These are best characterized as miscellaneous conversions: many (not all)
call sites that don't involve biovec or iov_iter, nor mm/. It also leaves
out a few call sites that require some more work. These are mostly pretty
simple ones.
It's probably best to send all of these via Andrew's -mm tree, assuming
that there are no significant merge conflicts with ongoing work in other
trees (which I doubt, given that these are small changes).
These patches apply to the latest linux.git. Patch #1 is also already in
Andrew's tree, but given the broad non-linux-mm Cc list, I thought it
would be more convenient to just include that patch here, so that people
can use linux.git as the base--even though these are probably destined
for linux-mm.
This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions"). That commit
has an extensive description of the problem and the planned steps to
solve it, but the highlites are:
1) Provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.
And a few references, also from that commit:
[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
Ira Weiny (1):
fs/binfmt_elf: convert put_page() to put_user_page*()
John Hubbard (33):
mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
net/rds: convert put_page() to put_user_page*()
net/ceph: convert put_page() to put_user_page*()
x86/kvm: convert put_page() to put_user_page*()
drm/etnaviv: convert release_pages() to put_user_pages()
drm/i915: convert put_page() to put_user_page*()
drm/radeon: convert put_page() to put_user_page*()
media/ivtv: convert put_page() to put_user_page*()
media/v4l2-core/mm: convert put_page() to put_user_page*()
genwqe: convert put_page() to put_user_page*()
scif: convert put_page() to put_user_page*()
vmci: convert put_page() to put_user_page*()
rapidio: convert put_page() to put_user_page*()
oradax: convert put_page() to put_user_page*()
staging/vc04_services: convert put_page() to put_user_page*()
drivers/tee: convert put_page() to put_user_page*()
vfio: convert put_page() to put_user_page*()
fbdev/pvr2fb: convert put_page() to put_user_page*()
fsl_hypervisor: convert put_page() to put_user_page*()
xen: convert put_page() to put_user_page*()
fs/exec.c: convert put_page() to put_user_page*()
orangefs: convert put_page() to put_user_page*()
uprobes: convert put_page() to put_user_page*()
futex: convert put_page() to put_user_page*()
mm/frame_vector.c: convert put_page() to put_user_page*()
mm/gup_benchmark.c: convert put_page() to put_user_page*()
mm/memory.c: convert put_page() to put_user_page*()
mm/madvise.c: convert put_page() to put_user_page*()
mm/process_vm_access.c: convert put_page() to put_user_page*()
crypt: convert put_page() to put_user_page*()
nfs: convert put_page() to put_user_page*()
goldfish_pipe: convert put_page() to put_user_page*()
kernel/events/core.c: convert put_page() to put_user_page*()
arch/x86/kvm/svm.c | 4 +-
crypto/af_alg.c | 7 +-
drivers/gpu/drm/etnaviv/etnaviv_gem.c | 4 +-
drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 9 +-
drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
drivers/infiniband/core/umem.c | 5 +-
drivers/infiniband/hw/hfi1/user_pages.c | 5 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 5 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 5 +-
drivers/infiniband/sw/siw/siw_mem.c | 10 +-
drivers/media/pci/ivtv/ivtv-udma.c | 14 +--
drivers/media/pci/ivtv/ivtv-yuv.c | 10 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 3 +-
drivers/misc/genwqe/card_utils.c | 17 +--
drivers/misc/mic/scif/scif_rma.c | 17 ++-
drivers/misc/vmw_vmci/vmci_context.c | 2 +-
drivers/misc/vmw_vmci/vmci_queue_pair.c | 11 +-
drivers/platform/goldfish/goldfish_pipe.c | 9 +-
drivers/rapidio/devices/rio_mport_cdev.c | 9 +-
drivers/sbus/char/oradax.c | 2 +-
.../interface/vchiq_arm/vchiq_2835_arm.c | 10 +-
drivers/tee/tee_shm.c | 10 +-
drivers/vfio/vfio_iommu_type1.c | 8 +-
drivers/video/fbdev/pvr2fb.c | 3 +-
drivers/virt/fsl_hypervisor.c | 7 +-
drivers/xen/gntdev.c | 5 +-
drivers/xen/privcmd.c | 7 +-
fs/binfmt_elf.c | 2 +-
fs/binfmt_elf_fdpic.c | 2 +-
fs/exec.c | 2 +-
fs/nfs/direct.c | 4 +-
fs/orangefs/orangefs-bufmap.c | 7 +-
include/linux/mm.h | 5 +-
kernel/events/core.c | 2 +-
kernel/events/uprobes.c | 6 +-
kernel/futex.c | 10 +-
mm/frame_vector.c | 4 +-
mm/gup.c | 115 ++++++++----------
mm/gup_benchmark.c | 2 +-
mm/madvise.c | 2 +-
mm/memory.c | 2 +-
mm/process_vm_access.c | 18 +--
net/ceph/pagevec.c | 8 +-
net/rds/info.c | 5 +-
net/rds/message.c | 2 +-
net/rds/rdma.c | 15 ++-
virt/kvm/kvm_main.c | 4 +-
47 files changed, 151 insertions(+), 266 deletions(-)
--
2.22.0
^ permalink raw reply
* [PATCH 01/34] mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
From: john.hubbard @ 2019-08-02 2:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
xen-devel, John Hubbard, Matthew Wilcox, Christoph Hellwig
In-Reply-To: <20190802021653.4882-1-jhubbard@nvidia.com>
From: John Hubbard <jhubbard@nvidia.com>
Provide more capable variation of put_user_pages_dirty_lock(),
and delete put_user_pages_dirty(). This is based on the
following:
1. Lots of call sites become simpler if a bool is passed
into put_user_page*(), instead of making the call site
choose which put_user_page*() variant to call.
2. Christoph Hellwig's observation that set_page_dirty_lock()
is usually correct, and set_page_dirty() is usually a
bug, or at least questionable, within a put_user_page*()
calling chain.
This leads to the following API choices:
* put_user_pages_dirty_lock(page, npages, make_dirty)
* There is no put_user_pages_dirty(). You have to
hand code that, in the rare case that it's
required.
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
drivers/infiniband/core/umem.c | 5 +-
drivers/infiniband/hw/hfi1/user_pages.c | 5 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 5 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 5 +-
drivers/infiniband/sw/siw/siw_mem.c | 10 +-
include/linux/mm.h | 5 +-
mm/gup.c | 115 +++++++++------------
7 files changed, 58 insertions(+), 92 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 08da840ed7ee..965cf9dea71a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,10 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
page = sg_page_iter_page(&sg_iter);
- if (umem->writable && dirty)
- put_user_pages_dirty_lock(&page, 1);
- else
- put_user_page(page);
+ put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
}
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index b89a9b9aef7a..469acb961fbd 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,10 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
size_t npages, bool dirty)
{
- if (dirty)
- put_user_pages_dirty_lock(p, npages);
- else
- put_user_pages(p, npages);
+ put_user_pages_dirty_lock(p, npages, dirty);
if (mm) { /* during close after signal, mm can be NULL */
atomic64_sub(npages, &mm->pinned_vm);
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index bfbfbb7e0ff4..6bf764e41891 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,10 +40,7 @@
static void __qib_release_user_pages(struct page **p, size_t num_pages,
int dirty)
{
- if (dirty)
- put_user_pages_dirty_lock(p, num_pages);
- else
- put_user_pages(p, num_pages);
+ put_user_pages_dirty_lock(p, num_pages, dirty);
}
/**
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 0b0237d41613..62e6ffa9ad78 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -75,10 +75,7 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
for_each_sg(chunk->page_list, sg, chunk->nents, i) {
page = sg_page(sg);
pa = sg_phys(sg);
- if (dirty)
- put_user_pages_dirty_lock(&page, 1);
- else
- put_user_page(page);
+ put_user_pages_dirty_lock(&page, 1, dirty);
usnic_dbg("pa: %pa\n", &pa);
}
kfree(chunk);
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index 67171c82b0c4..ab83a9cec562 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -63,15 +63,7 @@ struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int stag_index)
static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
bool dirty)
{
- struct page **p = chunk->plist;
-
- while (num_pages--) {
- if (!PageDirty(*p) && dirty)
- put_user_pages_dirty_lock(p, 1);
- else
- put_user_page(*p);
- p++;
- }
+ put_user_pages_dirty_lock(chunk->plist, num_pages, dirty);
}
void siw_umem_release(struct siw_umem *umem, bool dirty)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..9759b6a24420 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1057,8 +1057,9 @@ static inline void put_user_page(struct page *page)
put_page(page);
}
-void put_user_pages_dirty(struct page **pages, unsigned long npages);
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+ bool make_dirty);
+
void put_user_pages(struct page **pages, unsigned long npages);
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..7fefd7ab02c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,85 +29,70 @@ struct follow_page_context {
unsigned int page_mask;
};
-typedef int (*set_dirty_func_t)(struct page *page);
-
-static void __put_user_pages_dirty(struct page **pages,
- unsigned long npages,
- set_dirty_func_t sdf)
-{
- unsigned long index;
-
- for (index = 0; index < npages; index++) {
- struct page *page = compound_head(pages[index]);
-
- /*
- * Checking PageDirty at this point may race with
- * clear_page_dirty_for_io(), but that's OK. Two key cases:
- *
- * 1) This code sees the page as already dirty, so it skips
- * the call to sdf(). That could happen because
- * clear_page_dirty_for_io() called page_mkclean(),
- * followed by set_page_dirty(). However, now the page is
- * going to get written back, which meets the original
- * intention of setting it dirty, so all is well:
- * clear_page_dirty_for_io() goes on to call
- * TestClearPageDirty(), and write the page back.
- *
- * 2) This code sees the page as clean, so it calls sdf().
- * The page stays dirty, despite being written back, so it
- * gets written back again in the next writeback cycle.
- * This is harmless.
- */
- if (!PageDirty(page))
- sdf(page);
-
- put_user_page(page);
- }
-}
-
/**
- * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
- * @pages: array of pages to be marked dirty and released.
+ * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
+ * @pages: array of pages to be maybe marked dirty, and definitely released.
* @npages: number of pages in the @pages array.
+ * @make_dirty: whether to mark the pages dirty
*
* "gup-pinned page" refers to a page that has had one of the get_user_pages()
* variants called on that page.
*
* For each page in the @pages array, make that page (or its head page, if a
- * compound page) dirty, if it was previously listed as clean. Then, release
- * the page using put_user_page().
+ * compound page) dirty, if @make_dirty is true, and if the page was previously
+ * listed as clean. In any case, releases all pages using put_user_page(),
+ * possibly via put_user_pages(), for the non-dirty case.
*
* Please see the put_user_page() documentation for details.
*
- * set_page_dirty(), which does not lock the page, is used here.
- * Therefore, it is the caller's responsibility to ensure that this is
- * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), put_user_page().
*
*/
-void put_user_pages_dirty(struct page **pages, unsigned long npages)
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+ bool make_dirty)
{
- __put_user_pages_dirty(pages, npages, set_page_dirty);
-}
-EXPORT_SYMBOL(put_user_pages_dirty);
+ unsigned long index;
-/**
- * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
- * @pages: array of pages to be marked dirty and released.
- * @npages: number of pages in the @pages array.
- *
- * For each page in the @pages array, make that page (or its head page, if a
- * compound page) dirty, if it was previously listed as clean. Then, release
- * the page using put_user_page().
- *
- * Please see the put_user_page() documentation for details.
- *
- * This is just like put_user_pages_dirty(), except that it invokes
- * set_page_dirty_lock(), instead of set_page_dirty().
- *
- */
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
-{
- __put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+ /*
+ * TODO: this can be optimized for huge pages: if a series of pages is
+ * physically contiguous and part of the same compound page, then a
+ * single operation to the head page should suffice.
+ */
+
+ if (!make_dirty) {
+ put_user_pages(pages, npages);
+ return;
+ }
+
+ for (index = 0; index < npages; index++) {
+ struct page *page = compound_head(pages[index]);
+ /*
+ * Checking PageDirty at this point may race with
+ * clear_page_dirty_for_io(), but that's OK. Two key
+ * cases:
+ *
+ * 1) This code sees the page as already dirty, so it
+ * skips the call to set_page_dirty(). That could happen
+ * because clear_page_dirty_for_io() called
+ * page_mkclean(), followed by set_page_dirty().
+ * However, now the page is going to get written back,
+ * which meets the original intention of setting it
+ * dirty, so all is well: clear_page_dirty_for_io() goes
+ * on to call TestClearPageDirty(), and write the page
+ * back.
+ *
+ * 2) This code sees the page as clean, so it calls
+ * set_page_dirty(). The page stays dirty, despite being
+ * written back, so it gets written back again in the
+ * next writeback cycle. This is harmless.
+ */
+ if (!PageDirty(page))
+ set_page_dirty_lock(page);
+ put_user_page(page);
+ }
}
EXPORT_SYMBOL(put_user_pages_dirty_lock);
--
2.22.0
^ permalink raw reply related
* [PATCH 02/34] net/rds: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-08-02 2:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
xen-devel, John Hubbard, Santosh Shilimkar, David S . Miller
In-Reply-To: <20190802021653.4882-1-jhubbard@nvidia.com>
From: John Hubbard <jhubbard@nvidia.com>
For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().
This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
net/rds/info.c | 5 ++---
net/rds/message.c | 2 +-
net/rds/rdma.c | 15 +++++++--------
3 files changed, 10 insertions(+), 12 deletions(-)
diff --git a/net/rds/info.c b/net/rds/info.c
index 03f6fd56d237..ca6af2889adf 100644
--- a/net/rds/info.c
+++ b/net/rds/info.c
@@ -162,7 +162,6 @@ int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval,
struct rds_info_lengths lens;
unsigned long nr_pages = 0;
unsigned long start;
- unsigned long i;
rds_info_func func;
struct page **pages = NULL;
int ret;
@@ -235,8 +234,8 @@ int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval,
ret = -EFAULT;
out:
- for (i = 0; pages && i < nr_pages; i++)
- put_page(pages[i]);
+ if (pages)
+ put_user_pages(pages, nr_pages);
kfree(pages);
return ret;
diff --git a/net/rds/message.c b/net/rds/message.c
index 50f13f1d4ae0..d7b0d266c437 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -404,7 +404,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
int i;
for (i = 0; i < rm->data.op_nents; i++)
- put_page(sg_page(&rm->data.op_sg[i]));
+ put_user_page(sg_page(&rm->data.op_sg[i]));
mmp = &rm->data.op_mmp_znotifier->z_mmp;
mm_unaccount_pinned_pages(mmp);
ret = -EFAULT;
diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 916f5ec373d8..6762e8696b99 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -162,8 +162,7 @@ static int rds_pin_pages(unsigned long user_addr, unsigned int nr_pages,
pages);
if (ret >= 0 && ret < nr_pages) {
- while (ret--)
- put_page(pages[ret]);
+ put_user_pages(pages, ret);
ret = -EFAULT;
}
@@ -276,7 +275,7 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args,
if (IS_ERR(trans_private)) {
for (i = 0 ; i < nents; i++)
- put_page(sg_page(&sg[i]));
+ put_user_page(sg_page(&sg[i]));
kfree(sg);
ret = PTR_ERR(trans_private);
goto out;
@@ -464,9 +463,10 @@ void rds_rdma_free_op(struct rm_rdma_op *ro)
* to local memory */
if (!ro->op_write) {
WARN_ON(!page->mapping && irqs_disabled());
- set_page_dirty(page);
+ put_user_pages_dirty_lock(&page, 1, true);
+ } else {
+ put_user_page(page);
}
- put_page(page);
}
kfree(ro->op_notifier);
@@ -481,8 +481,7 @@ void rds_atomic_free_op(struct rm_atomic_op *ao)
/* Mark page dirty if it was possibly modified, which
* is the case for a RDMA_READ which copies from remote
* to local memory */
- set_page_dirty(page);
- put_page(page);
+ put_user_pages_dirty_lock(&page, 1, true);
kfree(ao->op_notifier);
ao->op_notifier = NULL;
@@ -867,7 +866,7 @@ int rds_cmsg_atomic(struct rds_sock *rs, struct rds_message *rm,
return ret;
err:
if (page)
- put_page(page);
+ put_user_page(page);
rm->atomic.op_active = 0;
kfree(rm->atomic.op_notifier);
--
2.22.0
^ permalink raw reply related
* [PATCH 03/34] net/ceph: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-08-02 2:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
xen-devel, John Hubbard, Ilya Dryomov, Sage Weil,
David S . Miller
In-Reply-To: <20190802022005.5117-1-jhubbard@nvidia.com>
From: John Hubbard <jhubbard@nvidia.com>
For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().
This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Sage Weil <sage@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
net/ceph/pagevec.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 64305e7056a1..c88fff2ab9bd 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -12,13 +12,7 @@
void ceph_put_page_vector(struct page **pages, int num_pages, bool dirty)
{
- int i;
-
- for (i = 0; i < num_pages; i++) {
- if (dirty)
- set_page_dirty_lock(pages[i]);
- put_page(pages[i]);
- }
+ put_user_pages_dirty_lock(pages, num_pages, dirty);
kvfree(pages);
}
EXPORT_SYMBOL(ceph_put_page_vector);
--
2.22.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox