* Re: [PATCH 20/34] xen: convert put_page() to put_user_page*()
From: John Hubbard @ 2019-08-02 5:48 UTC (permalink / raw)
To: Juergen Gross, john.hubbard, Andrew Morton
Cc: devel, Dave Chinner, Christoph Hellwig, Dan Williams, Ira Weiny,
x86, linux-mm, Dave Hansen, amd-gfx, dri-devel, intel-gfx,
linux-arm-kernel, linux-rpi-kernel, devel, xen-devel,
Boris Ostrovsky, rds-devel, Jérôme Glisse, Jan Kara,
ceph-devel, kvm, linux-block, linux-crypto, linux-fbdev,
linux-fsdevel, LKML, linux-media, linux-nfs, linux-rdma,
linux-xfs, netdev, sparclinux, Jason Gunthorpe
In-Reply-To: <4471e9dc-a315-42c1-0c3c-55ba4eeeb106@suse.com>
On 8/1/19 9:36 PM, Juergen Gross wrote:
> On 02.08.19 04:19, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
...
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 2f5ce7230a43..29e461dbee2d 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c
>> @@ -611,15 +611,10 @@ static int lock_pages(
>> static void unlock_pages(struct page *pages[], unsigned int nr_pages)
>> {
>> - unsigned int i;
>> -
>> if (!pages)
>> return;
>> - for (i = 0; i < nr_pages; i++) {
>> - if (pages[i])
>> - put_page(pages[i]);
>> - }
>> + put_user_pages(pages, nr_pages);
>
> You are not handling the case where pages[i] is NULL here. Or am I
> missing a pending patch to put_user_pages() here?
>
Hi Juergen,
You are correct--this no longer handles the cases where pages[i]
is NULL. It's intentional, though possibly wrong. :)
I see that I should have added my standard blurb to this
commit description. I missed this one, but some of the other patches
have it. It makes the following, possibly incorrect claim:
"This changes the release code slightly, because each page slot in the
page_list[] array is no longer checked for NULL. However, that check
was wrong anyway, because the get_user_pages() pattern of usage here
never allowed for NULL entries within a range of pinned pages."
The way I've seen these page arrays used with get_user_pages(),
things are either done single page, or with a contiguous range. So
unless I'm missing a case where someone is either
a) releasing individual pages within a range (and thus likely messing
up their count of pages they have), or
b) allocating two gup ranges within the same pages[] array, with a
gap between the allocations,
...then it should be correct. If so, then I'll add the above blurb
to this patch's commit description.
If that's not the case (both here, and in 3 or 4 other patches in this
series, then as you said, I should add NULL checks to put_user_pages()
and put_user_pages_dirty_lock().
thanks,
--
John Hubbard
NVIDIA
^ permalink raw reply
* Re: [PATCH 20/34] xen: convert put_page() to put_user_page*()
From: Juergen Gross @ 2019-08-02 6:10 UTC (permalink / raw)
To: John Hubbard, john.hubbard, Andrew Morton
Cc: devel, Dave Chinner, Christoph Hellwig, Dan Williams, Ira Weiny,
x86, linux-mm, Dave Hansen, amd-gfx, dri-devel, intel-gfx,
linux-arm-kernel, linux-rpi-kernel, devel, xen-devel,
Boris Ostrovsky, rds-devel, Jérôme Glisse, Jan Kara,
ceph-devel, kvm, linux-block, linux-crypto, linux-fbdev,
linux-fsdevel, LKML, linux-media, linux-nfs, linux-rdma,
linux-xfs, netdev, sparclinux, Jason Gunthorpe
In-Reply-To: <d5140833-e9ee-beb5-ff0a-2d13a4fe819f@nvidia.com>
[-- Attachment #1: Type: text/plain, Size: 2441 bytes --]
On 02.08.19 07:48, John Hubbard wrote:
> On 8/1/19 9:36 PM, Juergen Gross wrote:
>> On 02.08.19 04:19, john.hubbard@gmail.com wrote:
>>> From: John Hubbard <jhubbard@nvidia.com>
> ...
>>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>>> index 2f5ce7230a43..29e461dbee2d 100644
>>> --- a/drivers/xen/privcmd.c
>>> +++ b/drivers/xen/privcmd.c
>>> @@ -611,15 +611,10 @@ static int lock_pages(
>>> static void unlock_pages(struct page *pages[], unsigned int nr_pages)
>>> {
>>> - unsigned int i;
>>> -
>>> if (!pages)
>>> return;
>>> - for (i = 0; i < nr_pages; i++) {
>>> - if (pages[i])
>>> - put_page(pages[i]);
>>> - }
>>> + put_user_pages(pages, nr_pages);
>>
>> You are not handling the case where pages[i] is NULL here. Or am I
>> missing a pending patch to put_user_pages() here?
>>
>
> Hi Juergen,
>
> You are correct--this no longer handles the cases where pages[i]
> is NULL. It's intentional, though possibly wrong. :)
>
> I see that I should have added my standard blurb to this
> commit description. I missed this one, but some of the other patches
> have it. It makes the following, possibly incorrect claim:
>
> "This changes the release code slightly, because each page slot in the
> page_list[] array is no longer checked for NULL. However, that check
> was wrong anyway, because the get_user_pages() pattern of usage here
> never allowed for NULL entries within a range of pinned pages."
>
> The way I've seen these page arrays used with get_user_pages(),
> things are either done single page, or with a contiguous range. So
> unless I'm missing a case where someone is either
>
> a) releasing individual pages within a range (and thus likely messing
> up their count of pages they have), or
>
> b) allocating two gup ranges within the same pages[] array, with a
> gap between the allocations,
>
> ...then it should be correct. If so, then I'll add the above blurb
> to this patch's commit description.
>
> If that's not the case (both here, and in 3 or 4 other patches in this
> series, then as you said, I should add NULL checks to put_user_pages()
> and put_user_pages_dirty_lock().
In this case it is not correct, but can easily be handled. The NULL case
can occur only in an error case with the pages array filled partially or
not at all.
I'd prefer something like the attached patch here.
Juergen
[-- Attachment #2: gup.patch --]
[-- Type: text/x-patch, Size: 2001 bytes --]
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 2f5ce7230a43..12bd3154126d 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -582,10 +582,11 @@ static long privcmd_ioctl_mmap_batch(
static int lock_pages(
struct privcmd_dm_op_buf kbufs[], unsigned int num,
- struct page *pages[], unsigned int nr_pages)
+ struct page *pages[], unsigned int *nr_pages)
{
- unsigned int i;
+ unsigned int i, free = *nr_pages;
+ *nr_pages = 0;
for (i = 0; i < num; i++) {
unsigned int requested;
int pinned;
@@ -593,35 +594,22 @@ static int lock_pages(
requested = DIV_ROUND_UP(
offset_in_page(kbufs[i].uptr) + kbufs[i].size,
PAGE_SIZE);
- if (requested > nr_pages)
+ if (requested > free)
return -ENOSPC;
pinned = get_user_pages_fast(
(unsigned long) kbufs[i].uptr,
- requested, FOLL_WRITE, pages);
+ requested, FOLL_WRITE, pages + *nr_pages);
if (pinned < 0)
return pinned;
- nr_pages -= pinned;
- pages += pinned;
+ free -= pinned;
+ *nr_pages += pinned;
}
return 0;
}
-static void unlock_pages(struct page *pages[], unsigned int nr_pages)
-{
- unsigned int i;
-
- if (!pages)
- return;
-
- for (i = 0; i < nr_pages; i++) {
- if (pages[i])
- put_page(pages[i]);
- }
-}
-
static long privcmd_ioctl_dm_op(struct file *file, void __user *udata)
{
struct privcmd_data *data = file->private_data;
@@ -681,11 +669,12 @@ static long privcmd_ioctl_dm_op(struct file *file, void __user *udata)
xbufs = kcalloc(kdata.num, sizeof(*xbufs), GFP_KERNEL);
if (!xbufs) {
+ nr_pages = 0;
rc = -ENOMEM;
goto out;
}
- rc = lock_pages(kbufs, kdata.num, pages, nr_pages);
+ rc = lock_pages(kbufs, kdata.num, pages, &nr_pages);
if (rc)
goto out;
@@ -699,7 +688,8 @@ static long privcmd_ioctl_dm_op(struct file *file, void __user *udata)
xen_preemptible_hcall_end();
out:
- unlock_pages(pages, nr_pages);
+ if (pages)
+ put_user_pages(pages, nr_pages);
kfree(xbufs);
kfree(pages);
kfree(kbufs);
^ permalink raw reply related
* [PATCH net-next] cnic: Explicitly initialize all reference counts to 0.
From: Michael Chan @ 2019-08-02 6:17 UTC (permalink / raw)
To: davem; +Cc: netdev, hslester96, Rasesh Mody, GR-Linux-NIC-Dev
The driver is relying on zero'ed allocated memory and does not
explicitly call atomic_set() to initialize the ref counts to 0. Add
these atomic_set() calls so that it will be more straight forward
to convert atomic ref counts to refcount_t.
Reported-by: Chuhong Yuan <hslester96@gmail.com>
Cc: Rasesh Mody <rmody@marvell.com>
Cc: <GR-Linux-NIC-Dev@marvell.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
drivers/net/ethernet/broadcom/cnic.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index 57dc3cb..155599d 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -4096,12 +4096,16 @@ static int cnic_cm_alloc_mem(struct cnic_dev *dev)
{
struct cnic_local *cp = dev->cnic_priv;
u32 port_id;
+ int i;
cp->csk_tbl = kvcalloc(MAX_CM_SK_TBL_SZ, sizeof(struct cnic_sock),
GFP_KERNEL);
if (!cp->csk_tbl)
return -ENOMEM;
+ for (i = 0; i < MAX_CM_SK_TBL_SZ; i++)
+ atomic_set(&cp->csk_tbl[i].ref_count, 0);
+
port_id = prandom_u32();
port_id %= CNIC_LOCAL_PORT_RANGE;
if (cnic_init_id_tbl(&cp->csk_port_tbl, CNIC_LOCAL_PORT_RANGE,
@@ -5480,6 +5484,7 @@ static struct cnic_dev *cnic_alloc_dev(struct net_device *dev,
cdev->unregister_device = cnic_unregister_device;
cdev->iscsi_nl_msg_recv = cnic_iscsi_nl_msg_recv;
cdev->get_fc_npiv_tbl = cnic_get_fc_npiv_tbl;
+ atomic_set(&cdev->ref_count, 0);
cp = cdev->cnic_priv;
cp->dev = cdev;
--
2.5.1
^ permalink raw reply related
* Re: [v2,1/2] tools: bpftool: add net attach command to attach XDP on interface
From: Y Song @ 2019-08-02 6:23 UTC (permalink / raw)
To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190801081133.13200-2-danieltimlee@gmail.com>
On Thu, Aug 1, 2019 at 2:04 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
>
> By this commit, using `bpftool net attach`, user can attach XDP prog on
> interface. New type of enum 'net_attach_type' has been made, as stated at
> cover-letter, the meaning of 'attach' is, prog will be attached on interface.
>
> BPF prog will be attached through libbpf 'bpf_set_link_xdp_fd'.
>
> Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> ---
> Changes in v2:
> - command 'load' changed to 'attach' for the consistency
> - 'NET_ATTACH_TYPE_XDP_DRIVE' changed to 'NET_ATTACH_TYPE_XDP_DRIVER'
>
> tools/bpf/bpftool/net.c | 107 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 106 insertions(+), 1 deletion(-)
>
> diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
> index 67e99c56bc88..f3b57660b303 100644
> --- a/tools/bpf/bpftool/net.c
> +++ b/tools/bpf/bpftool/net.c
> @@ -55,6 +55,35 @@ struct bpf_attach_info {
> __u32 flow_dissector_id;
> };
>
> +enum net_attach_type {
> + NET_ATTACH_TYPE_XDP,
> + NET_ATTACH_TYPE_XDP_GENERIC,
> + NET_ATTACH_TYPE_XDP_DRIVER,
> + NET_ATTACH_TYPE_XDP_OFFLOAD,
> + __MAX_NET_ATTACH_TYPE
> +};
> +
> +static const char * const attach_type_strings[] = {
> + [NET_ATTACH_TYPE_XDP] = "xdp",
> + [NET_ATTACH_TYPE_XDP_GENERIC] = "xdpgeneric",
> + [NET_ATTACH_TYPE_XDP_DRIVER] = "xdpdrv",
> + [NET_ATTACH_TYPE_XDP_OFFLOAD] = "xdpoffload",
> + [__MAX_NET_ATTACH_TYPE] = NULL,
> +};
> +
> +static enum net_attach_type parse_attach_type(const char *str)
> +{
> + enum net_attach_type type;
> +
> + for (type = 0; type < __MAX_NET_ATTACH_TYPE; type++) {
> + if (attach_type_strings[type] &&
> + is_prefix(str, attach_type_strings[type]))
> + return type;
> + }
> +
> + return __MAX_NET_ATTACH_TYPE;
> +}
> +
> static int dump_link_nlmsg(void *cookie, void *msg, struct nlattr **tb)
> {
> struct bpf_netdev_t *netinfo = cookie;
> @@ -223,6 +252,77 @@ static int query_flow_dissector(struct bpf_attach_info *attach_info)
> return 0;
> }
>
> +static int parse_attach_args(int argc, char **argv, int *progfd,
> + enum net_attach_type *attach_type, int *ifindex)
> +{
> + if (!REQ_ARGS(3))
> + return -EINVAL;
> +
> + *progfd = prog_parse_fd(&argc, &argv);
> + if (*progfd < 0)
> + return *progfd;
> +
> + *attach_type = parse_attach_type(*argv);
> + if (*attach_type == __MAX_NET_ATTACH_TYPE) {
> + p_err("invalid net attach/detach type");
> + return -EINVAL;
> + }
> +
> + NEXT_ARG();
> + if (!REQ_ARGS(1))
> + return -EINVAL;
> +
> + *ifindex = if_nametoindex(*argv);
> + if (!*ifindex) {
> + p_err("Invalid ifname");
Do you want to use the full function name "invalid if_nametoindex" here?
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +static int do_attach_detach_xdp(int *progfd, enum net_attach_type *attach_type,
> + int *ifindex)
You can just use plain int as the argument type here.
> +{
> + __u32 flags;
> + int err;
> +
> + flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
> + if (*attach_type == NET_ATTACH_TYPE_XDP_GENERIC)
> + flags |= XDP_FLAGS_SKB_MODE;
> + if (*attach_type == NET_ATTACH_TYPE_XDP_DRIVER)
> + flags |= XDP_FLAGS_DRV_MODE;
> + if (*attach_type == NET_ATTACH_TYPE_XDP_OFFLOAD)
> + flags |= XDP_FLAGS_HW_MODE;
> +
> + err = bpf_set_link_xdp_fd(*ifindex, *progfd, flags);
> +
> + return err;
Just do "return bpf_set_link_xdp_fd(...)" and you do not need variable err.
> +}
> +
> +static int do_attach(int argc, char **argv)
> +{
> + enum net_attach_type attach_type;
> + int err, progfd, ifindex;
> +
> + err = parse_attach_args(argc, argv, &progfd, &attach_type, &ifindex);
> + if (err)
> + return err;
> +
> + if (is_prefix("xdp", attach_type_strings[attach_type]))
> + err = do_attach_detach_xdp(&progfd, &attach_type, &ifindex);
> +
> + if (err < 0) {
> + p_err("link set %s failed", attach_type_strings[attach_type]);
> + return -1;
> + }
The above "if (err < 0)" can be true only under the above "if (is_prefix(..))"
conditions. But compiler may optimize this. So I think current form is okay.
But could you change the return value to "return err" instead of "return -1"?
> +
> + if (json_output)
> + jsonw_null(json_wtr);
> +
> + return 0;
> +}
> +
> static int do_show(int argc, char **argv)
> {
> struct bpf_attach_info attach_info = {};
> @@ -305,13 +405,17 @@ static int do_help(int argc, char **argv)
>
> fprintf(stderr,
> "Usage: %s %s { show | list } [dev <devname>]\n"
> + " %s %s attach PROG LOAD_TYPE <devname>\n"
> " %s %s help\n"
> + "\n"
> + " " HELP_SPEC_PROGRAM "\n"
> + " LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n"
> "Note: Only xdp and tc attachments are supported now.\n"
> " For progs attached to cgroups, use \"bpftool cgroup\"\n"
> " to dump program attachments. For program types\n"
> " sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
> " consult iproute2.\n",
> - bin_name, argv[-2], bin_name, argv[-2]);
> + bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
>
> return 0;
> }
> @@ -319,6 +423,7 @@ static int do_help(int argc, char **argv)
> static const struct cmd cmds[] = {
> { "show", do_show },
> { "list", do_show },
> + { "attach", do_attach },
> { "help", do_help },
> { 0 }
> };
> --
> 2.20.1
>
^ permalink raw reply
* Re: [v2,2/2] tools: bpftool: add net detach command to detach XDP on interface
From: Y Song @ 2019-08-02 6:25 UTC (permalink / raw)
To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190801081133.13200-3-danieltimlee@gmail.com>
On Thu, Aug 1, 2019 at 2:04 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
>
> By this commit, using `bpftool net detach`, the attached XDP prog can
> be detached. Detaching the BPF prog will be done through libbpf
> 'bpf_set_link_xdp_fd' with the progfd set to -1.
>
> Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> ---
> Changes in v2:
> - command 'unload' changed to 'detach' for the consistency
>
> tools/bpf/bpftool/net.c | 55 ++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 54 insertions(+), 1 deletion(-)
>
> diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
> index f3b57660b303..2ae9a613b05c 100644
> --- a/tools/bpf/bpftool/net.c
> +++ b/tools/bpf/bpftool/net.c
> @@ -281,6 +281,31 @@ static int parse_attach_args(int argc, char **argv, int *progfd,
> return 0;
> }
>
> +static int parse_detach_args(int argc, char **argv,
> + enum net_attach_type *attach_type, int *ifindex)
> +{
> + if (!REQ_ARGS(2))
> + return -EINVAL;
> +
> + *attach_type = parse_attach_type(*argv);
> + if (*attach_type == __MAX_NET_ATTACH_TYPE) {
> + p_err("invalid net attach/detach type");
> + return -EINVAL;
> + }
> +
> + NEXT_ARG();
> + if (!REQ_ARGS(1))
> + return -EINVAL;
> +
> + *ifindex = if_nametoindex(*argv);
> + if (!*ifindex) {
> + p_err("Invalid ifname");
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> static int do_attach_detach_xdp(int *progfd, enum net_attach_type *attach_type,
> int *ifindex)
> {
> @@ -323,6 +348,31 @@ static int do_attach(int argc, char **argv)
> return 0;
> }
>
> +static int do_detach(int argc, char **argv)
> +{
> + enum net_attach_type attach_type;
> + int err, progfd, ifindex;
> +
> + err = parse_detach_args(argc, argv, &attach_type, &ifindex);
> + if (err)
> + return err;
> +
> + /* to detach xdp prog */
> + progfd = -1;
> + if (is_prefix("xdp", attach_type_strings[attach_type]))
> + err = do_attach_detach_xdp(&progfd, &attach_type, &ifindex);
Similar to previous patch, parameters no need to be pointer.
> +
> + if (err < 0) {
> + p_err("link set %s failed", attach_type_strings[attach_type]);
> + return -1;
Maybe "return err"?
> + }
> +
> + if (json_output)
> + jsonw_null(json_wtr);
> +
> + return 0;
> +}
> +
> static int do_show(int argc, char **argv)
> {
> struct bpf_attach_info attach_info = {};
> @@ -406,6 +456,7 @@ static int do_help(int argc, char **argv)
> fprintf(stderr,
> "Usage: %s %s { show | list } [dev <devname>]\n"
> " %s %s attach PROG LOAD_TYPE <devname>\n"
> + " %s %s detach LOAD_TYPE <devname>\n"
> " %s %s help\n"
> "\n"
> " " HELP_SPEC_PROGRAM "\n"
> @@ -415,7 +466,8 @@ static int do_help(int argc, char **argv)
> " to dump program attachments. For program types\n"
> " sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
> " consult iproute2.\n",
> - bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
> + bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
> + bin_name, argv[-2]);
>
> return 0;
> }
> @@ -424,6 +476,7 @@ static const struct cmd cmds[] = {
> { "show", do_show },
> { "list", do_show },
> { "attach", do_attach },
> + { "detach", do_detach },
> { "help", do_help },
> { 0 }
> };
> --
> 2.20.1
>
^ permalink raw reply
* Re: [v2,0/2] tools: bpftool: add net attach/detach command to attach XDP prog
From: Y Song @ 2019-08-02 6:26 UTC (permalink / raw)
To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190801081133.13200-1-danieltimlee@gmail.com>
On Thu, Aug 1, 2019 at 1:33 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
>
> Currently, bpftool net only supports dumping progs attached on the
> interface. To attach XDP prog on interface, user must use other tool
> (eg. iproute2). By this patch, with `bpftool net attach/detach`, user
> can attach/detach XDP prog on interface.
>
> $ ./bpftool prog
> ...
> 208: xdp name xdp_prog1 tag ad822e38b629553f gpl
> loaded_at 2019-07-28T18:03:11+0900 uid 0
> ...
> $ ./bpftool net attach id 208 xdpdrv enp6s0np1
> $ ./bpftool net
> xdp:
> enp6s0np1(5) driver id 208
> ...
> $ ./bpftool net detach xdpdrv enp6s0np1
> $ ./bpftool net
> xdp:
> ...
>
> While this patch only contains support for XDP, through `net
> attach/detach`, bpftool can further support other prog attach types.
new bpftool net subcommands are added. I guess doc and bash
completion needs update as well.
>
> XDP attach/detach tested on Mellanox ConnectX-4 and Netronome Agilio.
>
> ---
> Changes in v2:
> - command 'load/unload' changed to 'attach/detach' for the consistency
>
> Daniel T. Lee (2):
> tools: bpftool: add net attach command to attach XDP on interface
> tools: bpftool: add net detach command to detach XDP on interface
>
> tools/bpf/bpftool/net.c | 160 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 159 insertions(+), 1 deletion(-)
>
> --
> 2.20.1
>
^ permalink raw reply
* Re: [PATCH 16/34] drivers/tee: convert put_page() to put_user_page*()
From: Jens Wiklander @ 2019-08-02 6:29 UTC (permalink / raw)
To: john.hubbard
Cc: Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
Dave Hansen, Ira Weiny, Jan Kara, Jason Gunthorpe,
Jérôme Glisse, LKML, amd-gfx, ceph-devel, devel, devel,
dri-devel, intel-gfx, kvm, Linux ARM, linux-block,
open list:HARDWARE RANDOM NUMBER GENERATOR CORE, linux-fbdev,
linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
xen-devel, John Hubbard
In-Reply-To: <20190802022005.5117-17-jhubbard@nvidia.com>
On Fri, Aug 2, 2019 at 4:20 AM <john.hubbard@gmail.com> wrote:
>
> From: John Hubbard <jhubbard@nvidia.com>
>
> For pages that were retained via get_user_pages*(), release those pages
> via the new put_user_page*() routines, instead of via put_page() or
> release_pages().
>
> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions").
>
> Cc: Jens Wiklander <jens.wiklander@linaro.org>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
> drivers/tee/tee_shm.c | 10 ++--------
> 1 file changed, 2 insertions(+), 8 deletions(-)
Acked-by: Jens Wiklander <jens.wiklander@linaro.org>
I suppose you're taking this via your own tree or such.
Thanks,
Jens
>
> diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
> index 2da026fd12c9..c967d0420b67 100644
> --- a/drivers/tee/tee_shm.c
> +++ b/drivers/tee/tee_shm.c
> @@ -31,16 +31,13 @@ static void tee_shm_release(struct tee_shm *shm)
>
> poolm->ops->free(poolm, shm);
> } else if (shm->flags & TEE_SHM_REGISTER) {
> - size_t n;
> int rc = teedev->desc->ops->shm_unregister(shm->ctx, shm);
>
> if (rc)
> dev_err(teedev->dev.parent,
> "unregister shm %p failed: %d", shm, rc);
>
> - for (n = 0; n < shm->num_pages; n++)
> - put_page(shm->pages[n]);
> -
> + put_user_pages(shm->pages, shm->num_pages);
> kfree(shm->pages);
> }
>
> @@ -313,16 +310,13 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, unsigned long addr,
> return shm;
> err:
> if (shm) {
> - size_t n;
> -
> if (shm->id >= 0) {
> mutex_lock(&teedev->mutex);
> idr_remove(&teedev->idr, shm->id);
> mutex_unlock(&teedev->mutex);
> }
> if (shm->pages) {
> - for (n = 0; n < shm->num_pages; n++)
> - put_page(shm->pages[n]);
> + put_user_pages(shm->pages, shm->num_pages);
> kfree(shm->pages);
> }
> }
> --
> 2.22.0
>
^ permalink raw reply
* Re: [PATCH 05/14] gpio: lpc32xx: allow building on non-lpc32xx targets
From: Bartosz Golaszewski @ 2019-08-02 7:10 UTC (permalink / raw)
To: Arnd Bergmann
Cc: soc, arm-soc, Vladimir Zapolskiy, Sylvain Lemieux, Russell King,
Gregory Clement, Linus Walleij, Jason Cooper, Andrew Lunn,
Sebastian Hesselbarth, David S. Miller, Greg Kroah-Hartman,
Alan Stern, Guenter Roeck, linux-gpio, netdev, linux-serial,
linux-usb, LINUXWATCHDOG, Lee Jones, LKML
In-Reply-To: <20190731195713.3150463-6-arnd@arndb.de>
śr., 31 lip 2019 o 22:06 Arnd Bergmann <arnd@arndb.de> napisał(a):
>
> The driver uses hardwire MMIO addresses instead of the data
> that is passed in device tree. Change it over to only
> hardcode the register offset values and allow compile-testing.
>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Hi Arnd,
thanks for working on this.
> ---
> drivers/gpio/Kconfig | 8 +++++
> drivers/gpio/Makefile | 2 +-
> drivers/gpio/gpio-lpc32xx.c | 63 ++++++++++++++++++++++++-------------
> 3 files changed, 50 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
> index bb13c266c329..ae86ee963eae 100644
> --- a/drivers/gpio/Kconfig
> +++ b/drivers/gpio/Kconfig
> @@ -311,6 +311,14 @@ config GPIO_LPC18XX
> Select this option to enable GPIO driver for
> NXP LPC18XX/43XX devices.
>
> +config GPIO_LPC32XX
> + tristate "NXP LPC32XX GPIO support"
> + default ARCH_LPC32XX
> + depends on OF_GPIO && (ARCH_LPC32XX || COMPILE_TEST)
> + help
> + Select this option to enable GPIO driver for
> + NXP LPC32XX devices.
> +
> config GPIO_LYNXPOINT
> tristate "Intel Lynxpoint GPIO support"
> depends on ACPI && X86
> diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile
> index a4e91175c708..87d659ae95eb 100644
> --- a/drivers/gpio/Makefile
> +++ b/drivers/gpio/Makefile
> @@ -74,7 +74,7 @@ obj-$(CONFIG_GPIO_LP3943) += gpio-lp3943.o
> obj-$(CONFIG_GPIO_LP873X) += gpio-lp873x.o
> obj-$(CONFIG_GPIO_LP87565) += gpio-lp87565.o
> obj-$(CONFIG_GPIO_LPC18XX) += gpio-lpc18xx.o
> -obj-$(CONFIG_ARCH_LPC32XX) += gpio-lpc32xx.o
> +obj-$(CONFIG_GPIO_LPC32XX) += gpio-lpc32xx.o
> obj-$(CONFIG_GPIO_LYNXPOINT) += gpio-lynxpoint.o
> obj-$(CONFIG_GPIO_MADERA) += gpio-madera.o
> obj-$(CONFIG_GPIO_MAX3191X) += gpio-max3191x.o
> diff --git a/drivers/gpio/gpio-lpc32xx.c b/drivers/gpio/gpio-lpc32xx.c
> index 24885b3db3d5..548f7cb69386 100644
> --- a/drivers/gpio/gpio-lpc32xx.c
> +++ b/drivers/gpio/gpio-lpc32xx.c
> @@ -16,8 +16,7 @@
> #include <linux/platform_device.h>
> #include <linux/module.h>
>
> -#include <mach/hardware.h>
> -#include <mach/platform.h>
> +#define _GPREG(x) (x)
What purpose does this macro serve?
>
> #define LPC32XX_GPIO_P3_INP_STATE _GPREG(0x000)
> #define LPC32XX_GPIO_P3_OUTP_SET _GPREG(0x004)
> @@ -72,12 +71,12 @@
> #define LPC32XX_GPO_P3_GRP (LPC32XX_GPI_P3_GRP + LPC32XX_GPI_P3_MAX)
>
> struct gpio_regs {
> - void __iomem *inp_state;
> - void __iomem *outp_state;
> - void __iomem *outp_set;
> - void __iomem *outp_clr;
> - void __iomem *dir_set;
> - void __iomem *dir_clr;
> + unsigned long inp_state;
> + unsigned long outp_state;
> + unsigned long outp_set;
> + unsigned long outp_clr;
> + unsigned long dir_set;
> + unsigned long dir_clr;
> };
>
> /*
> @@ -167,14 +166,26 @@ struct lpc32xx_gpio_chip {
> struct gpio_regs *gpio_grp;
> };
>
> +void __iomem *gpio_reg_base;
Any reason why this can't be made part of struct lpc32xx_gpio_chip?
> +
> +static inline u32 gpreg_read(unsigned long offset)
Here and elsewhere: could you please keep the lpc32xx_gpio prefix for
all symbols?
> +{
> + return __raw_readl(gpio_reg_base + offset);
> +}
> +
> +static inline void gpreg_write(u32 val, unsigned long offset)
> +{
> + __raw_writel(val, gpio_reg_base + offset);
> +}
> +
> static void __set_gpio_dir_p012(struct lpc32xx_gpio_chip *group,
> unsigned pin, int input)
> {
> if (input)
> - __raw_writel(GPIO012_PIN_TO_BIT(pin),
> + gpreg_write(GPIO012_PIN_TO_BIT(pin),
> group->gpio_grp->dir_clr);
> else
> - __raw_writel(GPIO012_PIN_TO_BIT(pin),
> + gpreg_write(GPIO012_PIN_TO_BIT(pin),
> group->gpio_grp->dir_set);
> }
>
> @@ -184,19 +195,19 @@ static void __set_gpio_dir_p3(struct lpc32xx_gpio_chip *group,
> u32 u = GPIO3_PIN_TO_BIT(pin);
>
> if (input)
> - __raw_writel(u, group->gpio_grp->dir_clr);
> + gpreg_write(u, group->gpio_grp->dir_clr);
> else
> - __raw_writel(u, group->gpio_grp->dir_set);
> + gpreg_write(u, group->gpio_grp->dir_set);
> }
>
> static void __set_gpio_level_p012(struct lpc32xx_gpio_chip *group,
> unsigned pin, int high)
> {
> if (high)
> - __raw_writel(GPIO012_PIN_TO_BIT(pin),
> + gpreg_write(GPIO012_PIN_TO_BIT(pin),
> group->gpio_grp->outp_set);
> else
> - __raw_writel(GPIO012_PIN_TO_BIT(pin),
> + gpreg_write(GPIO012_PIN_TO_BIT(pin),
> group->gpio_grp->outp_clr);
> }
>
> @@ -206,31 +217,31 @@ static void __set_gpio_level_p3(struct lpc32xx_gpio_chip *group,
> u32 u = GPIO3_PIN_TO_BIT(pin);
>
> if (high)
> - __raw_writel(u, group->gpio_grp->outp_set);
> + gpreg_write(u, group->gpio_grp->outp_set);
> else
> - __raw_writel(u, group->gpio_grp->outp_clr);
> + gpreg_write(u, group->gpio_grp->outp_clr);
> }
>
> static void __set_gpo_level_p3(struct lpc32xx_gpio_chip *group,
> unsigned pin, int high)
> {
> if (high)
> - __raw_writel(GPO3_PIN_TO_BIT(pin), group->gpio_grp->outp_set);
> + gpreg_write(GPO3_PIN_TO_BIT(pin), group->gpio_grp->outp_set);
> else
> - __raw_writel(GPO3_PIN_TO_BIT(pin), group->gpio_grp->outp_clr);
> + gpreg_write(GPO3_PIN_TO_BIT(pin), group->gpio_grp->outp_clr);
> }
>
> static int __get_gpio_state_p012(struct lpc32xx_gpio_chip *group,
> unsigned pin)
> {
> - return GPIO012_PIN_IN_SEL(__raw_readl(group->gpio_grp->inp_state),
> + return GPIO012_PIN_IN_SEL(gpreg_read(group->gpio_grp->inp_state),
> pin);
> }
>
> static int __get_gpio_state_p3(struct lpc32xx_gpio_chip *group,
> unsigned pin)
> {
> - int state = __raw_readl(group->gpio_grp->inp_state);
> + int state = gpreg_read(group->gpio_grp->inp_state);
>
> /*
> * P3 GPIO pin input mapping is not contiguous, GPIOP3-0..4 is mapped
> @@ -242,13 +253,13 @@ static int __get_gpio_state_p3(struct lpc32xx_gpio_chip *group,
> static int __get_gpi_state_p3(struct lpc32xx_gpio_chip *group,
> unsigned pin)
> {
> - return GPI3_PIN_IN_SEL(__raw_readl(group->gpio_grp->inp_state), pin);
> + return GPI3_PIN_IN_SEL(gpreg_read(group->gpio_grp->inp_state), pin);
> }
>
> static int __get_gpo_state_p3(struct lpc32xx_gpio_chip *group,
> unsigned pin)
> {
> - return GPO3_PIN_IN_SEL(__raw_readl(group->gpio_grp->outp_state), pin);
> + return GPO3_PIN_IN_SEL(gpreg_read(group->gpio_grp->outp_state), pin);
> }
>
> /*
> @@ -498,6 +509,10 @@ static int lpc32xx_gpio_probe(struct platform_device *pdev)
> {
> int i;
>
> + gpio_reg_base = devm_platform_ioremap_resource(pdev, 0);
> + if (gpio_reg_base)
> + return -ENXIO;
> +
> for (i = 0; i < ARRAY_SIZE(lpc32xx_gpiochip); i++) {
> if (pdev->dev.of_node) {
> lpc32xx_gpiochip[i].chip.of_xlate = lpc32xx_of_xlate;
> @@ -527,3 +542,7 @@ static struct platform_driver lpc32xx_gpio_driver = {
> };
>
> module_platform_driver(lpc32xx_gpio_driver);
> +
> +MODULE_AUTHOR("Kevin Wells <kevin.wells@nxp.com>");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("GPIO driver for LPC32xx SoC");
> --
> 2.20.0
>
Bart
^ permalink raw reply
* Re: [PATCH v3 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Andrii Nakryiko @ 2019-08-02 7:16 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
Daniel Borkmann, Yonghong Song, Song Liu, Kernel Team
In-Reply-To: <20190801235030.bzssmwzuvzdy7h7t@ast-mbp.dhcp.thefacebook.com>
On Thu, Aug 1, 2019 at 4:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jul 31, 2019 at 11:47:53PM -0700, Andrii Nakryiko wrote:
> > This patch implements the core logic for BPF CO-RE offsets relocations.
> > Every instruction that needs to be relocated has corresponding
> > bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> > to match recorded "local" relocation spec against potentially many
> > compatible "target" types, creating corresponding spec. Details of the
> > algorithm are noted in corresponding comments in the code.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > Acked-by: Song Liu <songliubraving@fb.com>
> ...
> > + if (btf_is_composite(t)) {
> > + const struct btf_member *m = (void *)(t + 1);
> > + __u32 offset;
> > +
> > + if (access_idx >= BTF_INFO_VLEN(t->info))
> > + return -EINVAL;
> > +
> > + m = &m[access_idx];
> > +
> > + if (BTF_INFO_KFLAG(t->info)) {
> > + if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> > + return -EINVAL;
> > + offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> > + } else {
> > + offset = m->offset;
> > + }
>
> very similar logic exists in btf_dump.c
> probably makes sense to make a common helper at some point.
Will add btf_member_bit_offset(type, member) and
btf_member_bit_size(type, member).
>
> > +static size_t bpf_core_essential_name_len(const char *name)
> > +{
> > + size_t n = strlen(name);
> > + int i = n - 3;
> > +
> > + while (i > 0) {
> > + if (name[i] == '_' && name[i + 1] == '_' && name[i + 2] == '_')
> > + return i;
> > + i--;
> > + }
> > + return n;
> > +}
>
> that's a bit of an eye irritant. How about?
> size_t n = strlen(name);
> int i, cnt = 0;
>
> for (i = n - 1; i >= 0; i--) {
> if (name[i] == '_') {
> cnt++;
> } else {
> if (cnt == 3)
> return i + 1;
> cnt = 0;
> }
> }
> return n;
I find this one much harder to read and understand. What's
eye-irritating about that loop?
Your loop will also handle `a____b` differently. My version will
return "a_" as essential name, yours "a____b". Was this intentional on
your part?
I'd rather use this instead, if you hate the first one:
size_t n = strlen(name);
int i;
for (i = n - 3; i > 0; i--) {
if (strncmp(name + i, "___", 3) == 0)
return i;
}
Is this better?
>
> > + case BTF_KIND_ARRAY: {
> > + const struct btf_array *loc_a, *targ_a;
> > +
> > + loc_a = (void *)(local_type + 1);
> > + targ_a = (void *)(targ_type + 1);
> > + local_id = loc_a->type;
> > + targ_id = targ_a->type;
>
> can we add a helper like:
Yes, we can. I was thinking about that, but decided to not expand
patch set. But we do need to extract all those small, but nice
helpers. I'll put them in libbpf_internal.h for now, but I think it
might be good idea to expose them as part of btf.h. Thoughts?
> const struct btf_array *btf_array(cosnt struct btf_type *t)
> {
> return (const struct btf_array *)(t + 1);
> }
>
> then above will be:
> case BTF_KIND_ARRAY: {
> local_id = btf_array(local_type)->type;
> targ_id = btf_array(targ_type)->type;
>
> and a bunch of code in btf.c and btf_dump.c would be cleaner as well?
Yep, some of those are already scattered around btf.c and btf_dump.c,
will clean up and add patch to this patch set.
>
> > + goto recur;
> > + }
> > + default:
> > + pr_warning("unexpected kind %d relocated, local [%d], target [%d]\n",
> > + kind, local_id, targ_id);
> > + return 0;
> > + }
> > +}
> > +
> > +/*
> > + * Given single high-level named field accessor in local type, find
> > + * corresponding high-level accessor for a target type. Along the way,
> > + * maintain low-level spec for target as well. Also keep updating target
> > + * offset.
> > + *
> > + * Searching is performed through recursive exhaustive enumeration of all
> > + * fields of a struct/union. If there are any anonymous (embedded)
> > + * structs/unions, they are recursively searched as well. If field with
> > + * desired name is found, check compatibility between local and target types,
> > + * before returning result.
> > + *
> > + * 1 is returned, if field is found.
> > + * 0 is returned if no compatible field is found.
> > + * <0 is returned on error.
> > + */
> > +static int bpf_core_match_member(const struct btf *local_btf,
> > + const struct bpf_core_accessor *local_acc,
> > + const struct btf *targ_btf,
> > + __u32 targ_id,
> > + struct bpf_core_spec *spec,
> > + __u32 *next_targ_id)
> > +{
> > + const struct btf_type *local_type, *targ_type;
> > + const struct btf_member *local_member, *m;
> > + const char *local_name, *targ_name;
> > + __u32 local_id;
> > + int i, n, found;
> > +
> > + targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> > + if (!targ_type)
> > + return -EINVAL;
> > + if (!btf_is_composite(targ_type))
> > + return 0;
> > +
> > + local_id = local_acc->type_id;
> > + local_type = btf__type_by_id(local_btf, local_id);
> > + local_member = (void *)(local_type + 1);
> > + local_member += local_acc->idx;
> > + local_name = btf__name_by_offset(local_btf, local_member->name_off);
> > +
> > + n = BTF_INFO_VLEN(targ_type->info);
> > + m = (void *)(targ_type + 1);
>
> new btf_member() helper?
>
> > + for (i = 0; i < n; i++, m++) {
> > + __u32 offset;
> > +
> > + /* bitfield relocations not supported */
> > + if (BTF_INFO_KFLAG(targ_type->info)) {
> > + if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> > + continue;
> > + offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> > + } else {
> > + offset = m->offset;
> > + }
> > + if (offset % 8)
> > + continue;
>
> same bit of code again?
> definitely could use a helper.
Different handling (continue here, return error above), but can use
those helpers I mentioned above.
>
> > + for (i = 0; i < local_spec->len; i++, local_acc++, targ_acc++) {
> > + targ_type = skip_mods_and_typedefs(targ_spec->btf, targ_id,
> > + &targ_id);
> > + if (!targ_type)
> > + return -EINVAL;
> > +
> > + if (local_acc->name) {
> > + matched = bpf_core_match_member(local_spec->btf,
> > + local_acc,
> > + targ_btf, targ_id,
> > + targ_spec, &targ_id);
> > + if (matched <= 0)
> > + return matched;
> > + } else {
> > + /* for i=0, targ_id is already treated as array element
> > + * type (because it's the original struct), for others
> > + * we should find array element type first
> > + */
> > + if (i > 0) {
> > + const struct btf_array *a;
> > +
> > + if (!btf_is_array(targ_type))
> > + return 0;
> > +
> > + a = (void *)(targ_type + 1);
> > + if (local_acc->idx >= a->nelems)
> > + return 0;
>
> am I reading it correctly that the local spec requested out-of-bounds
> index in the target array type?
> Why this is 'ignore' instead of -EINVAL?
Similar to any other mismatch (e.g., int in local type vs int64 in
target type). It just makes candidate not matching. Why would that be
error that will stop the whole relocation and subsequently object
loading process?
>
> > +/*
> > + * Probe few well-known locations for vmlinux kernel image and try to load BTF
> > + * data out of it to use for target BTF.
> > + */
> > +static struct btf *bpf_core_find_kernel_btf(void)
> > +{
> > + const char *locations[] = {
> > + "/lib/modules/%1$s/vmlinux-%1$s",
> > + "/usr/lib/modules/%1$s/kernel/vmlinux",
> > + };
> > + char path[PATH_MAX + 1];
> > + struct utsname buf;
> > + struct btf *btf;
> > + int i, err;
> > +
> > + err = uname(&buf);
> > + if (err) {
> > + pr_warning("failed to uname(): %d\n", err);
>
> defensive programming ?
> I think uname() can fail only if &buf points to non-existing page like null.
I haven't checked source for this syscall, but man page specified that
it might return -1 on error.
>
> > + return ERR_PTR(err);
> > + }
> > +
> > + for (i = 0; i < ARRAY_SIZE(locations); i++) {
> > + snprintf(path, PATH_MAX, locations[i], buf.release);
> > + pr_debug("attempting to load kernel BTF from '%s'\n", path);
>
> I think this debug message would have been more useful after access().
Sure, will move.
>
> > +
> > + if (access(path, R_OK))
> > + continue;
> > +
> > + btf = btf__parse_elf(path, NULL);
> > + if (IS_ERR(btf))
> > + continue;
> > +
> > + pr_debug("successfully loaded kernel BTF from '%s'\n", path);
> > + return btf;
> > + }
> > +
> > + pr_warning("failed to find valid kernel BTF\n");
> > + return ERR_PTR(-ESRCH);
> > +}
> > +
> > +/* Output spec definition in the format:
> > + * [<type-id>] (<type-name>) + <raw-spec> => <offset>@<spec>,
> > + * where <spec> is a C-syntax view of recorded field access, e.g.: x.a[3].b
> > + */
> > +static void bpf_core_dump_spec(int level, const struct bpf_core_spec *spec)
> > +{
> > + const struct btf_type *t;
> > + const char *s;
> > + __u32 type_id;
> > + int i;
> > +
> > + type_id = spec->spec[0].type_id;
> > + t = btf__type_by_id(spec->btf, type_id);
> > + s = btf__name_by_offset(spec->btf, t->name_off);
> > + libbpf_print(level, "[%u] (%s) + ", type_id, s);
>
> imo extra []() don't improve readability of the dump.
[<num>] is the general convention I've been using throughout libbpf to
specify type ID, so I'd rather keep it for consistency. I can drop
parens, though, no problem.
>
> > +
> > + for (i = 0; i < spec->raw_len; i++)
> > + libbpf_print(level, "%d%s", spec->raw_spec[i],
> > + i == spec->raw_len - 1 ? " => " : ":");
> > +
> > + libbpf_print(level, "%u @ &x", spec->offset);
> > +
> > + for (i = 0; i < spec->len; i++) {
> > + if (spec->spec[i].name)
> > + libbpf_print(level, ".%s", spec->spec[i].name);
> > + else
> > + libbpf_print(level, "[%u]", spec->spec[i].idx);
> > + }
> > +
> > +}
> > +
> > +static size_t bpf_core_hash_fn(const void *key, void *ctx)
> > +{
> > + return (size_t)key;
> > +}
> > +
> > +static bool bpf_core_equal_fn(const void *k1, const void *k2, void *ctx)
> > +{
> > + return k1 == k2;
> > +}
> > +
> > +static void *u32_to_ptr(__u32 x)
> > +{
> > + return (void *)(uintptr_t)x;
> > +}
>
> u32 to pointer on 64-bit arch?!
> That surely needs a comment.
I should probably call it u32_to_hash_key() to make it obvious it's
conversion to hashmap's generic `void *` key type.
>
> > +
> > +/*
> > + * CO-RE relocate single instruction.
> > + *
> > + * The outline and important points of the algorithm:
> > + * 1. For given local type, find corresponding candidate target types.
> > + * Candidate type is a type with the same "essential" name, ignoring
> > + * everything after last triple underscore (___). E.g., `sample`,
> > + * `sample___flavor_one`, `sample___flavor_another_one`, are all candidates
> > + * for each other. Names with triple underscore are referred to as
> > + * "flavors" and are useful, among other things, to allow to
> > + * specify/support incompatible variations of the same kernel struct, which
> > + * might differ between different kernel versions and/or build
> > + * configurations.
> > + *
> > + * N.B. Struct "flavors" could be generated by bpftool's BTF-to-C
> > + * converter, when deduplicated BTF of a kernel still contains more than
> > + * one different types with the same name. In that case, ___2, ___3, etc
> > + * are appended starting from second name conflict. But start flavors are
> > + * also useful to be defined "locally", in BPF program, to extract same
> > + * data from incompatible changes between different kernel
> > + * versions/configurations. For instance, to handle field renames between
> > + * kernel versions, one can use two flavors of the struct name with the
> > + * same common name and use conditional relocations to extract that field,
> > + * depending on target kernel version.
>
> there are actual kernel types that have ___ in the name.
> Ex: struct lmc___media
> We probably need to revisit this 'flavor' convention.
There are only these:
- lmc___softc
- lmc___media
- lmc___ctl (all three in drivers/net/wan/lmc/lmc_var.h)
- ____ftrace_##name set of structs
I couldn't come up with anything cleaner-looking. I think we can still
keep ___ convention, but:
1. Match only exactly 3 underscores, delimited by non-underscore from
both sides (so similar to your proposed loop above);
2. We can also try matching candidates assuming full name without
___xxx part removed, in addition to current logic. This seems like an
overkill at this point and unlikely to be useful in practice, so I'd
postpone implementing this until we really need it.
What do you think? Which other convention did you have in mind?
>
> > + for (i = 0, j = 0; i < cand_ids->len; i++) {
> > + cand_id = cand_ids->data[i];
> > + cand_type = btf__type_by_id(targ_btf, cand_id);
> > + cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> > +
> > + err = bpf_core_spec_match(&local_spec, targ_btf,
> > + cand_id, &cand_spec);
> > + if (err < 0) {
> > + pr_warning("prog '%s': relo #%d: failed to match spec ",
> > + prog_name, relo_idx);
> > + bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> > + libbpf_print(LIBBPF_WARN,
> > + " to candidate #%d [%d] (%s): %d\n",
> > + i, cand_id, cand_name, err);
> > + return err;
> > + }
> > + if (err == 0) {
> > + pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> > + prog_name, relo_idx, i, cand_id, cand_name);
> > + bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> > + libbpf_print(LIBBPF_DEBUG, "\n");
> > + continue;
> > + }
> > +
> > + pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> > + prog_name, relo_idx, i);
>
> did you mention that you're going to make a helper for this debug dumps?
yeah, I added bpf_core_dump_spec(), but I don't know how to shorten
this further... This output is extremely useful to understand what's
happening and will be invaluable when users will inevitably report
confusing behavior in some cases, so I still want to keep it.
>
^ permalink raw reply
* Re: [Intel-wired-lan] [PATCH bpf-next v4 03/11] libbpf: add flags to umem config
From: Andrii Nakryiko @ 2019-08-02 7:19 UTC (permalink / raw)
To: Björn Töpel
Cc: Kevin Laatz, Netdev, Alexei Starovoitov, Daniel Borkmann,
Björn Töpel, Karlsson, Magnus, Jakub Kicinski,
Jonathan Lemon, Saeed Mahameed, Maxim Mikityanskiy,
Stephen Hemminger, Bruce Richardson, ciara.loftus,
intel-wired-lan, bpf
In-Reply-To: <CAJ+HfNhYe_FgV0tGTLzaFGVSiimVnthgESN8Psdtpxw696w0OQ@mail.gmail.com>
On Thu, Aug 1, 2019 at 12:34 AM Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> On Thu, 1 Aug 2019 at 08:59, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Jul 31, 2019 at 8:21 AM Björn Töpel <bjorn.topel@gmail.com> wrote:
> > >
> > > On Tue, 30 Jul 2019 at 19:43, Kevin Laatz <kevin.laatz@intel.com> wrote:
> > > >
> > > > This patch adds a 'flags' field to the umem_config and umem_reg structs.
> > > > This will allow for more options to be added for configuring umems.
> > > >
> > > > The first use for the flags field is to add a flag for unaligned chunks
> > > > mode. These flags can either be user-provided or filled with a default.
> > > >
> > > > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > > > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> > > >
> > > > ---
> > > > v2:
> > > > - Removed the headroom check from this patch. It has moved to the
> > > > previous patch.
> > > >
> > > > v4:
> > > > - modified chunk flag define
> > > > ---
> >
> > [...]
> >
> > > > diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> > > > index 833a6e60d065..44a03d8c34b9 100644
> > > > --- a/tools/lib/bpf/xsk.h
> > > > +++ b/tools/lib/bpf/xsk.h
> > > > @@ -170,12 +170,14 @@ LIBBPF_API int xsk_socket__fd(const struct xsk_socket *xsk);
> > > > #define XSK_UMEM__DEFAULT_FRAME_SHIFT 12 /* 4096 bytes */
> > > > #define XSK_UMEM__DEFAULT_FRAME_SIZE (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT)
> > > > #define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0
> > > > +#define XSK_UMEM__DEFAULT_FLAGS 0
> > > >
> > > > struct xsk_umem_config {
> > > > __u32 fill_size;
> > > > __u32 comp_size;
> > > > __u32 frame_size;
> > > > __u32 frame_headroom;
> > > > + __u32 flags;
> > >
> > > And the flags addition here, unfortunately, requires symbol versioning
> > > of xsk_umem__create(). That'll be the first in libbpf! :-)
> >
> > xsk_umem_config is passed by pointer to xsk_umem__create(), so this
> > doesn't break ABI, does it?
> >
>
> Old application, dynamically linked to new libbpf.so will crash,
> right? Old application passes old version of xsk_umem_config, and new
> library accesses (non-existing) flag struct member.
I think we have similar problems for all the _xattr type of commands
(as well some of btf stuff accepting extra opts structs). How is this
problem solved in general? Do we version same function multiple times,
one for each added field? It feels like there should be some better
way to handle this...
>
>
> Björn
>
>
> > >
> > >
> > > Björn
> > >
> > > > };
> > > >
> > > > /* Flags for the libbpf_flags field. */
> > > > --
> > > > 2.17.1
> > > >
> > > > _______________________________________________
> > > > Intel-wired-lan mailing list
> > > > Intel-wired-lan@osuosl.org
> > > > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
^ permalink raw reply
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-08-02 7:21 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
LSM List
In-Reply-To: <CALCETrVoZL1YGUxx3kM-d21TWVRKdKw=f2B8aE5wc2zmX1cQ4g@mail.gmail.com>
Hi Andy,
> On Jul 31, 2019, at 12:09 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jul 31, 2019 at 1:10 AM Song Liu <songliubraving@fb.com> wrote:
>>
>>
>>
>>> On Jul 30, 2019, at 1:24 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
>>>>
>>>> Hi Andy,
>>>>
>>>>> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
>>>>>
>>>>> Hi Andy,
>>>>>
>>>>>
>>
>> [...]
>>
>>>>>
>>>>
>>>> I would like more comments on this.
>>>>
>>>> Currently, bpf permission is more or less "root or nothing", which we
>>>> would like to change.
>>>>
>>>> The short term goal is to separate bpf from root, in other words, it is
>>>> "all or nothing". Special user space utilities, such as systemd, would
>>>> benefit from this. Once this is implemented, systemd can call sys_bpf()
>>>> when it is not running as root.
>>>
>>> As generally nasty as Linux capabilities are, this sounds like a good
>>> use for CAP_BPF_ADMIN.
>>
>> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
>> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
>
> It's been done before -- it's not that hard. IMO the main tricky bit
> would be try be somewhat careful about defining exactly what
> CAP_BPF_ADMIN does.
Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
Plumbers conference.
OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
like systemd to do sys_bpf() without root.
>
>>> I don't see why you need to invent a whole new mechanism for this.
>>> The entire cgroup ecosystem outside bpf() does just fine using the
>>> write permission on files in cgroupfs to control access. Why can't
>>> bpf() do the same thing?
>>
>> It is easier to use write permission for BPF_PROG_ATTACH. But it is
>> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
>> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
>> we should have target concept for all these commands. But that is a
>> much bigger project. OTOH, "all or nothing" model allows all these
>> commands at once.
>
> For BPF_PROG_LOAD, I admit I've never understood why permission is
> required at all. I think that CAP_SYS_ADMIN or similar should be
> needed to get is_priv in the verifier, but I think that should mainly
> be useful for tracing, and that requires lots of privilege anyway.
> BPF_MAP_* is probably the trickiest part. One solution would be some
> kind of bpffs, but I'm sure other solutions are possible.
Improving permission management of cgroup_bpf is another good topic to
discuss. However, it is also an overkill for current use case.
Let me get more details about the use case, so that we have a clear
picture about short term and long term goals.
Thanks again for your suggestions.
Song
^ permalink raw reply
* Re: [Intel-wired-lan] [PATCH bpf-next v4 03/11] libbpf: add flags to umem config
From: Björn Töpel @ 2019-08-02 7:26 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Kevin Laatz, Netdev, Alexei Starovoitov, Daniel Borkmann,
Björn Töpel, Karlsson, Magnus, Jakub Kicinski,
Jonathan Lemon, Saeed Mahameed, Maxim Mikityanskiy,
Stephen Hemminger, Bruce Richardson, ciara.loftus,
intel-wired-lan, bpf
In-Reply-To: <CAEf4Bzar-KgCjUEfKVeWzcB77xvXDagZFRQKDvWo1o9-JzCirw@mail.gmail.com>
On Fri, 2 Aug 2019 at 09:19, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Aug 1, 2019 at 12:34 AM Björn Töpel <bjorn.topel@gmail.com> wrote:
> >
[...]
> >
> > Old application, dynamically linked to new libbpf.so will crash,
> > right? Old application passes old version of xsk_umem_config, and new
> > library accesses (non-existing) flag struct member.
>
> I think we have similar problems for all the _xattr type of commands
> (as well some of btf stuff accepting extra opts structs). How is this
> problem solved in general? Do we version same function multiple times,
> one for each added field? It feels like there should be some better
> way to handle this...
>
If the size of the struct was passed as an argument (and extra care is
taken when adding members to the struct), it could be handled w/o
versioning... but that's not the case here. :-( Versioning is a mess
to deal with, so I'd be happy if it could be avoided...
^ permalink raw reply
* Re: [patch net-next 0/3] net: devlink: Finish network namespace support
From: Jiri Pirko @ 2019-08-02 7:42 UTC (permalink / raw)
To: David Ahern; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <de94a881-cb87-e251-d55c-9f40d24b08d5@gmail.com>
Thu, Aug 01, 2019 at 12:31:52AM CEST, dsahern@gmail.com wrote:
>On 7/31/19 4:28 PM, Jakub Kicinski wrote:
>> On Wed, 31 Jul 2019 16:07:31 -0600, David Ahern wrote:
>>> On 7/31/19 4:02 PM, Jakub Kicinski wrote:
>>>> Can you elaborate further? Ports for most purposes are represented by
>>>> netdevices. Devlink port instances expose global topological view of
>>>> the ports which is primarily relevant if you can see the entire ASIC.
>>>> I think the global configuration and global view of resources is still
>>>> the most relevant need, so in your diagram you must account for some
>>>> "all-seeing" instance, e.g.:
>>>>
>>>> namespace 1 | namespace 2 | ... | namespace N
>>>> | | |
>>>> { ports 1 } | { ports 2 } | ... | { ports N }
>>>> | | |
>>>> subdevlink 1 | subdevlink 2 | ... | subdevlink N
>>>> \______ | _______/
>>>> master ASIC devlink
>>>> =================================================
>>>> driver
>>>>
>>>> No?
>>>
>>> sure, there could be a master devlink visible to the user if that makes
>>> sense or the driver can account for it behind the scenes as the sum of
>>> the devlink instances.
>>>
>>> The goal is to allow ports within an asic [1] to be divided across
>>> network namespace where each namespace sees a subset of the ports. This
>>> allows creating multiple logical switches from a single physical asic.
>>>
>>> [1] within constraints imposed by the driver/hardware - for example to
>>> account for resources shared by a set of ports. e.g., front panel ports
>>> 1 - 4 have shared resources and must always be in the same devlink instance.
>>
>> So the ASIC would start out all partitioned? Presumably some would
>> still like to use it non-partitioned? What follows there must be a top
>> level instance to decide on partitioning, and moving resources between
>> sub-instances.
>>
>> Right now I don't think there is much info in devlink ports which would
>> be relevant without full view of the ASIC..
>>
>
>not sure how it would play out. really just 'thinking out loud' about
>the above use case to make sure devlink with proper namespace support
>allows it - or does not prevent it.
I Don't see reason or usecase to have ports or other objects of devlink
in separate namespaces. Devlink and it's objects are one big item,
should be always together.
^ permalink raw reply
* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-08-02 7:43 UTC (permalink / raw)
To: David Ahern; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <087f584d-06c5-f4b9-722b-ccb72ce0e5de@gmail.com>
Wed, Jul 31, 2019 at 09:46:13PM CEST, dsahern@gmail.com wrote:
>On 7/31/19 1:45 PM, Jiri Pirko wrote:
>>> check. e.g., what happens if a resource controller has been configured
>>> for the devlink instance and it is moved to a namespace whose existing
>>> config exceeds those limits?
>>
>> It's moved with all the values. The whole instance is moved.
>>
>
>The values are moved, but the FIB in a namespace could already contain
>more routes than the devlink instance allows.
There is no relation between fib and devlink.
>
^ permalink raw reply
* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-08-02 7:48 UTC (permalink / raw)
To: David Ahern; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <89dc6908-68b8-5b0d-0ef7-1eaf1e4e886b@gmail.com>
Wed, Jul 31, 2019 at 09:58:10PM CEST, dsahern@gmail.com wrote:
>On 7/31/19 1:46 PM, David Ahern wrote:
>> On 7/31/19 1:45 PM, Jiri Pirko wrote:
>>>> check. e.g., what happens if a resource controller has been configured
>>>> for the devlink instance and it is moved to a namespace whose existing
>>>> config exceeds those limits?
>>>
>>> It's moved with all the values. The whole instance is moved.
>>>
>>
>> The values are moved, but the FIB in a namespace could already contain
>> more routes than the devlink instance allows.
>>
>
>From a quick test your recent refactoring to netdevsim broke the
>resource controller. It was, and is intended to be, per network namespace.
unifying devlink instances with network namespace in netdevsim was
really odd. Netdevsim is also a device, like any other. With other
devices, you do not do this so I don't see why to do this with netdevsim.
Now you create netdevsim instance in sysfs, there is proper bus probe
mechanism done, there is a devlink instance created for this device,
there are netdevices and devlink ports created. Same as for the real
hardware.
Honestly, creating a devlink instance per-network namespace
automagically, no relation to netdevsim devices, that is simply wrong.
There should be always 1:1 relationshin between a device and devlink
instance.
^ permalink raw reply
* Re: [PATCH v2 00/11] VSOCK: add vsock_test test suite
From: Stefano Garzarella @ 2019-08-02 7:55 UTC (permalink / raw)
To: Dexuan Cui
Cc: netdev@vger.kernel.org, kvm@vger.kernel.org, Stefan Hajnoczi,
virtualization@lists.linux-foundation.org, David S. Miller,
Jorgen Hansen, linux-kernel@vger.kernel.org
In-Reply-To: <PU1P153MB0169B265ECA51CB0AE1212DEBFDE0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>
On Thu, Aug 01, 2019 at 04:16:37PM +0000, Dexuan Cui wrote:
> > From: Stefano Garzarella <sgarzare@redhat.com>
> > Sent: Thursday, August 1, 2019 8:26 AM
> >
> > The vsock_diag.ko module already has a test suite but the core AF_VSOCK
> > functionality has no tests. This patch series adds several test cases that
> > exercise AF_VSOCK SOCK_STREAM socket semantics (send/recv,
> > connect/accept,
> > half-closed connections, simultaneous connections).
> >
> > Dexuan: Do you think can be useful to test HyperV?
>
> Hi Stefano,
> Thanks! This should be useful, though I have to write the Windows host side
> code to use the test program(s). :-)
>
Oh, yeah, I thought so :-)
Let me know when you'll try to find out if there's a problem.
Thanks,
Stefano
^ permalink raw reply
* Re: [net v1 PATCH 4/4] net: fix bpf_xdp_adjust_head regression for generic-XDP
From: Jesper Dangaard Brouer @ 2019-08-02 7:53 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, David S. Miller, xdp-newbies, Daniel Borkmann,
brandon.cazander, Alexei Starovoitov, brouer
In-Reply-To: <20190801174406.0b554bb9@cakuba.netronome.com>
On Thu, 1 Aug 2019 17:44:06 -0700
Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> On Thu, 01 Aug 2019 20:00:31 +0200, Jesper Dangaard Brouer wrote:
> > When generic-XDP was moved to a later processing step by commit
> > 458bf2f224f0 ("net: core: support XDP generic on stacked devices.")
> > a regression was introduced when using bpf_xdp_adjust_head.
> >
> > The issue is that after this commit the skb->network_header is now
> > changed prior to calling generic XDP and not after. Thus, if the header
> > is changed by XDP (via bpf_xdp_adjust_head), then skb->network_header
> > also need to be updated again. Fix by calling skb_reset_network_header().
> >
> > Fixes: 458bf2f224f0 ("net: core: support XDP generic on stacked devices.")
> > Reported-by: Brandon Cazander <brandon.cazander@multapplied.net>
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>
> Out of curiosity what was your conclusion regarding resetting the
> transport header as well?
Well, I don't know... I need some review, from e.g. Stephen that
changed this... I've added code snippets below signature to helper
reviewers (also helps understand below paragraph).
I think, we perhaps should call skb_reset_transport_header(), as we
change skb->data (via either __skb_pull() or __skb_push()), *BUT* I'm
not sure it is needed/required, as someone/something afterwards still
need to call skb_set_transport_header(), which also calls
skb_reset_transport_header() anyway.
I also want to ask reviewers, if we should move the call to
skb_reset_mac_len() (which Stephen added) in __netif_receive_skb_core()
into netif_receive_generic_xdp() (after the call to
skb_reset_network_header()), (it would be more natural to keep them
together)?
- -
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Code snippet
/* check if bpf_xdp_adjust_head was used */
off = xdp->data - orig_data;
if (off) {
if (off > 0)
__skb_pull(skb, off);
else if (off < 0)
__skb_push(skb, -off);
skb->mac_header += off;
skb_reset_network_header(skb);
}
static inline bool skb_transport_header_was_set(const struct sk_buff *skb)
{
return skb->transport_header != (typeof(skb->transport_header))~0U;
}
static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
{
return skb->head + skb->transport_header;
}
static inline void skb_reset_transport_header(struct sk_buff *skb)
{
skb->transport_header = skb->data - skb->head;
}
static inline void skb_set_transport_header(struct sk_buff *skb,
const int offset)
{
skb_reset_transport_header(skb);
skb->transport_header += offset;
}
^ permalink raw reply
* [PATCH v2 1/2] bnxt_en: Use refcount_t for refcount
From: Chuhong Yuan @ 2019-08-02 8:05 UTC (permalink / raw)
Cc: Michael Chan, David S . Miller, netdev, linux-kernel,
Chuhong Yuan
refcount_t is better for reference counters since its
implementation can prevent overflows.
So convert atomic_t ref counters to refcount_t.
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
---
Changes in v2:
- Convert refcount from 0-base to 1-base.
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c | 8 ++++----
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
index fc77caf0a076..742a8ed200cb 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
@@ -49,7 +49,7 @@ static int bnxt_register_dev(struct bnxt_en_dev *edev, int ulp_id,
return -ENOMEM;
}
- atomic_set(&ulp->ref_count, 0);
+ refcount_set(&ulp->ref_count, 1);
ulp->handle = handle;
rcu_assign_pointer(ulp->ulp_ops, ulp_ops);
@@ -87,7 +87,7 @@ static int bnxt_unregister_dev(struct bnxt_en_dev *edev, int ulp_id)
synchronize_rcu();
ulp->max_async_event_id = 0;
ulp->async_events_bmap = NULL;
- while (atomic_read(&ulp->ref_count) != 0 && i < 10) {
+ while (refcount_read(&ulp->ref_count) != 1 && i < 10) {
msleep(100);
i++;
}
@@ -246,12 +246,12 @@ static int bnxt_send_msg(struct bnxt_en_dev *edev, int ulp_id,
static void bnxt_ulp_get(struct bnxt_ulp *ulp)
{
- atomic_inc(&ulp->ref_count);
+ refcount_inc(&ulp->ref_count);
}
static void bnxt_ulp_put(struct bnxt_ulp *ulp)
{
- atomic_dec(&ulp->ref_count);
+ refcount_dec(&ulp->ref_count);
}
void bnxt_ulp_stop(struct bnxt *bp)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h
index cd78453d0bf0..fc4aa582d190 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h
@@ -52,7 +52,7 @@ struct bnxt_ulp {
u16 max_async_event_id;
u16 msix_requested;
u16 msix_base;
- atomic_t ref_count;
+ refcount_t ref_count;
};
struct bnxt_en_dev {
--
2.20.1
^ permalink raw reply related
* [PATCH v2 2/2] cnic: Use refcount_t for refcount
From: Chuhong Yuan @ 2019-08-02 8:05 UTC (permalink / raw)
Cc: David S . Miller, netdev, linux-kernel, Chuhong Yuan
refcount_t is better for reference counters since its
implementation can prevent overflows.
So convert atomic_t ref counters to refcount_t.
This patch depends on the patch:
"cnic: Explicitly initialize all reference counts to 0."
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
---
Changes in v2:
- Convert refcount from 0-base to 1-base.
drivers/net/ethernet/broadcom/cnic.c | 30 ++++++++++++-------------
drivers/net/ethernet/broadcom/cnic_if.h | 6 ++---
2 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index 155599dcee76..1f59f9606b85 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -141,22 +141,22 @@ static int cnic_uio_close(struct uio_info *uinfo, struct inode *inode)
static inline void cnic_hold(struct cnic_dev *dev)
{
- atomic_inc(&dev->ref_count);
+ refcount_inc(&dev->ref_count);
}
static inline void cnic_put(struct cnic_dev *dev)
{
- atomic_dec(&dev->ref_count);
+ refcount_dec(&dev->ref_count);
}
static inline void csk_hold(struct cnic_sock *csk)
{
- atomic_inc(&csk->ref_count);
+ refcount_inc(&csk->ref_count);
}
static inline void csk_put(struct cnic_sock *csk)
{
- atomic_dec(&csk->ref_count);
+ refcount_dec(&csk->ref_count);
}
static struct cnic_dev *cnic_from_netdev(struct net_device *netdev)
@@ -177,12 +177,12 @@ static struct cnic_dev *cnic_from_netdev(struct net_device *netdev)
static inline void ulp_get(struct cnic_ulp_ops *ulp_ops)
{
- atomic_inc(&ulp_ops->ref_count);
+ refcount_inc(&ulp_ops->ref_count);
}
static inline void ulp_put(struct cnic_ulp_ops *ulp_ops)
{
- atomic_dec(&ulp_ops->ref_count);
+ refcount_dec(&ulp_ops->ref_count);
}
static void cnic_ctx_wr(struct cnic_dev *dev, u32 cid_addr, u32 off, u32 val)
@@ -494,7 +494,7 @@ int cnic_register_driver(int ulp_type, struct cnic_ulp_ops *ulp_ops)
}
read_unlock(&cnic_dev_lock);
- atomic_set(&ulp_ops->ref_count, 0);
+ refcount_set(&ulp_ops->ref_count, 1);
rcu_assign_pointer(cnic_ulp_tbl[ulp_type], ulp_ops);
mutex_unlock(&cnic_lock);
@@ -545,12 +545,12 @@ int cnic_unregister_driver(int ulp_type)
mutex_unlock(&cnic_lock);
synchronize_rcu();
- while ((atomic_read(&ulp_ops->ref_count) != 0) && (i < 20)) {
+ while ((refcount_read(&ulp_ops->ref_count) != 1) && (i < 20)) {
msleep(100);
i++;
}
- if (atomic_read(&ulp_ops->ref_count) != 0)
+ if (refcount_read(&ulp_ops->ref_count) != 1)
pr_warn("%s: Failed waiting for ref count to go to zero\n",
__func__);
return 0;
@@ -3596,7 +3596,7 @@ static int cnic_cm_create(struct cnic_dev *dev, int ulp_type, u32 cid,
}
csk1 = &cp->csk_tbl[l5_cid];
- if (atomic_read(&csk1->ref_count))
+ if (refcount_read(&csk1->ref_count) != 1)
return -EAGAIN;
if (test_and_set_bit(SK_F_INUSE, &csk1->flags))
@@ -3651,7 +3651,7 @@ static int cnic_cm_destroy(struct cnic_sock *csk)
csk_hold(csk);
clear_bit(SK_F_INUSE, &csk->flags);
smp_mb__after_atomic();
- while (atomic_read(&csk->ref_count) != 1)
+ while (refcount_read(&csk->ref_count) != 2)
msleep(1);
cnic_cm_cleanup(csk);
@@ -4104,7 +4104,7 @@ static int cnic_cm_alloc_mem(struct cnic_dev *dev)
return -ENOMEM;
for (i = 0; i < MAX_CM_SK_TBL_SZ; i++)
- atomic_set(&cp->csk_tbl[i].ref_count, 0);
+ refcount_set(&cp->csk_tbl[i].ref_count, 1);
port_id = prandom_u32();
port_id %= CNIC_LOCAL_PORT_RANGE;
@@ -5436,11 +5436,11 @@ static void cnic_free_dev(struct cnic_dev *dev)
{
int i = 0;
- while ((atomic_read(&dev->ref_count) != 0) && i < 10) {
+ while ((refcount_read(&dev->ref_count) != 1) && i < 10) {
msleep(100);
i++;
}
- if (atomic_read(&dev->ref_count) != 0)
+ if (refcount_read(&dev->ref_count) != 1)
netdev_err(dev->netdev, "Failed waiting for ref count to go to zero\n");
netdev_info(dev->netdev, "Removed CNIC device\n");
@@ -5484,7 +5484,7 @@ static struct cnic_dev *cnic_alloc_dev(struct net_device *dev,
cdev->unregister_device = cnic_unregister_device;
cdev->iscsi_nl_msg_recv = cnic_iscsi_nl_msg_recv;
cdev->get_fc_npiv_tbl = cnic_get_fc_npiv_tbl;
- atomic_set(&cdev->ref_count, 0);
+ refcount_set(&cdev->ref_count, 1);
cp = cdev->cnic_priv;
cp->dev = cdev;
diff --git a/drivers/net/ethernet/broadcom/cnic_if.h b/drivers/net/ethernet/broadcom/cnic_if.h
index 789e5c7e9311..5232a05ac7ba 100644
--- a/drivers/net/ethernet/broadcom/cnic_if.h
+++ b/drivers/net/ethernet/broadcom/cnic_if.h
@@ -300,7 +300,7 @@ struct cnic_sock {
#define SK_F_CLOSING 7
#define SK_F_HW_ERR 8
- atomic_t ref_count;
+ refcount_t ref_count;
u32 state;
struct kwqe kwqe1;
struct kwqe kwqe2;
@@ -335,7 +335,7 @@ struct cnic_dev {
#define CNIC_F_CNIC_UP 1
#define CNIC_F_BNX2_CLASS 3
#define CNIC_F_BNX2X_CLASS 4
- atomic_t ref_count;
+ refcount_t ref_count;
u8 mac_addr[ETH_ALEN];
int max_iscsi_conn;
@@ -378,7 +378,7 @@ struct cnic_ulp_ops {
char *data, u16 data_size);
int (*cnic_get_stats)(void *ulp_ctx);
struct module *owner;
- atomic_t ref_count;
+ refcount_t ref_count;
};
int cnic_register_driver(int ulp_type, struct cnic_ulp_ops *ulp_ops);
--
2.20.1
^ permalink raw reply related
* Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites
From: Peter Zijlstra @ 2019-08-02 8:05 UTC (permalink / raw)
To: john.hubbard
Cc: Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
Dave Hansen, Ira Weiny, Jan Kara, Jason Gunthorpe,
Jérôme Glisse, LKML, amd-gfx, ceph-devel, devel, devel,
dri-devel, intel-gfx, kvm, linux-arm-kernel, linux-block,
linux-crypto, linux-fbdev, linux-fsdevel, linux-media, linux-mm,
linux-nfs, linux-rdma, linux-rpi-kernel, linux-xfs, netdev,
rds-devel, sparclinux, x86, xen-devel, John Hubbard
In-Reply-To: <20190802021653.4882-1-jhubbard@nvidia.com>
On Thu, Aug 01, 2019 at 07:16:19PM -0700, john.hubbard@gmail.com wrote:
> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions"). That commit
> has an extensive description of the problem and the planned steps to
> solve it, but the highlites are:
That is one horridly mangled Changelog there :-/ It looks like it's
partially duplicated.
Anyway; no objections to any of that, but I just wanted to mention that
there are other problems with long term pinning that haven't been
mentioned, notably they inhibit compaction.
A long time ago I proposed an interface to mark pages as pinned, such
that we could run compaction before we actually did the pinning.
^ permalink raw reply
* [PATCH bpf-next v4 0/2] net: xdp: XSKMAP improvements
From: Björn Töpel @ 2019-08-02 8:11 UTC (permalink / raw)
To: ast, daniel, netdev
Cc: Björn Töpel, magnus.karlsson, bjorn.topel,
bruce.richardson, songliubraving, bpf
This series (v4 and counting) add two improvements for the XSKMAP,
used by AF_XDP sockets.
1. Automatic cleanup when an AF_XDP socket goes out of scope/is
released. Instead of require that the user manually clears the
"released" state socket from the map, this is done
automatically. Each socket tracks which maps it resides in, and
remove itself from those maps at relase. A notable implementation
change, is that the sockets references the map, instead of the map
referencing the sockets. Which implies that when the XSKMAP is
freed, it is by definition cleared of sockets.
2. The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flag on insert,
which this patch addresses.
Daniel, I (hopefully...) addressed the issues you found in
[1]. Instead of popping the tracked map, it's simply read. Then, the
socket is removed iff it's the same socket, i.e. no updates has
occurred. There are some code comments in the xsk_delete_from_maps()
function as well.
Thanks,
Björn
[1] https://lore.kernel.org/bpf/2417e1ab-16fa-d3ed-564e-1a50c4cb6717@iogearbox.net/
v1->v2: Fixed deadlock and broken cleanup. (Daniel)
v2->v3: Rebased onto bpf-next
v3->v4: {READ, WRITE}_ONCE consistency. (Daniel)
Socket release/map update race. (Daniel)
Björn Töpel (2):
xsk: remove AF_XDP socket from map when the socket is released
xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP
include/net/xdp_sock.h | 18 ++++++
kernel/bpf/xskmap.c | 130 ++++++++++++++++++++++++++++++++++-------
net/xdp/xsk.c | 48 +++++++++++++++
3 files changed, 175 insertions(+), 21 deletions(-)
--
2.20.1
^ permalink raw reply
* [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
From: Björn Töpel @ 2019-08-02 8:11 UTC (permalink / raw)
To: ast, daniel, netdev
Cc: Björn Töpel, magnus.karlsson, bruce.richardson,
songliubraving, bpf
In-Reply-To: <20190802081154.30962-1-bjorn.topel@gmail.com>
From: Björn Töpel <bjorn.topel@intel.com>
When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.
In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.
Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
include/net/xdp_sock.h | 18 +++++++
kernel/bpf/xskmap.c | 113 ++++++++++++++++++++++++++++++++++-------
net/xdp/xsk.c | 48 +++++++++++++++++
3 files changed, 160 insertions(+), 19 deletions(-)
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69796d264f06..066e3ae446a8 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -50,6 +50,16 @@ struct xdp_umem {
struct list_head xsk_list;
};
+/* Nodes are linked in the struct xdp_sock map_list field, and used to
+ * track which maps a certain socket reside in.
+ */
+struct xsk_map;
+struct xsk_map_node {
+ struct list_head node;
+ struct xsk_map *map;
+ struct xdp_sock **map_entry;
+};
+
struct xdp_sock {
/* struct sock must be the first member of struct xdp_sock */
struct sock sk;
@@ -75,6 +85,9 @@ struct xdp_sock {
/* Protects generic receive. */
spinlock_t rx_lock;
u64 rx_dropped;
+ struct list_head map_list;
+ /* Protects map_list */
+ spinlock_t map_list_lock;
};
struct xdp_buff;
@@ -96,6 +109,11 @@ struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+ struct xdp_sock **map_entry);
+int xsk_map_inc(struct xsk_map *map);
+void xsk_map_put(struct xsk_map *map);
+
static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
{
return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9bb96ace9fa1..780639309f6b 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -13,8 +13,71 @@ struct xsk_map {
struct bpf_map map;
struct xdp_sock **xsk_map;
struct list_head __percpu *flush_list;
+ spinlock_t lock; /* Synchronize map updates */
};
+int xsk_map_inc(struct xsk_map *map)
+{
+ struct bpf_map *m = &map->map;
+
+ m = bpf_map_inc(m, false);
+ return IS_ERR(m) ? PTR_ERR(m) : 0;
+}
+
+void xsk_map_put(struct xsk_map *map)
+{
+ bpf_map_put(&map->map);
+}
+
+static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
+ struct xdp_sock **map_entry)
+{
+ struct xsk_map_node *node;
+ int err;
+
+ node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
+ if (!node)
+ return NULL;
+
+ err = xsk_map_inc(map);
+ if (err) {
+ kfree(node);
+ return ERR_PTR(err);
+ }
+
+ node->map = map;
+ node->map_entry = map_entry;
+ return node;
+}
+
+static void xsk_map_node_free(struct xsk_map_node *node)
+{
+ xsk_map_put(node->map);
+ kfree(node);
+}
+
+static void xsk_map_sock_add(struct xdp_sock *xs, struct xsk_map_node *node)
+{
+ spin_lock_bh(&xs->map_list_lock);
+ list_add_tail(&node->node, &xs->map_list);
+ spin_unlock_bh(&xs->map_list_lock);
+}
+
+static void xsk_map_sock_delete(struct xdp_sock *xs,
+ struct xdp_sock **map_entry)
+{
+ struct xsk_map_node *n, *tmp;
+
+ spin_lock_bh(&xs->map_list_lock);
+ list_for_each_entry_safe(n, tmp, &xs->map_list, node) {
+ if (map_entry == n->map_entry) {
+ list_del(&n->node);
+ xsk_map_node_free(n);
+ }
+ }
+ spin_unlock_bh(&xs->map_list_lock);
+}
+
static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
{
struct xsk_map *m;
@@ -34,6 +97,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
return ERR_PTR(-ENOMEM);
bpf_map_init_from_attr(&m->map, attr);
+ spin_lock_init(&m->lock);
cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
cost += sizeof(struct list_head) * num_possible_cpus();
@@ -71,21 +135,9 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
static void xsk_map_free(struct bpf_map *map)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
- int i;
bpf_clear_redirect_map(map);
synchronize_net();
-
- for (i = 0; i < map->max_entries; i++) {
- struct xdp_sock *xs;
-
- xs = m->xsk_map[i];
- if (!xs)
- continue;
-
- sock_put((struct sock *)xs);
- }
-
free_percpu(m->flush_list);
bpf_map_area_free(m->xsk_map);
kfree(m);
@@ -165,7 +217,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
u32 i = *(u32 *)key, fd = *(u32 *)value;
- struct xdp_sock *xs, *old_xs;
+ struct xdp_sock *xs, *old_xs, **entry;
+ struct xsk_map_node *node;
struct socket *sock;
int err;
@@ -192,11 +245,19 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
return -EOPNOTSUPP;
}
- sock_hold(sock->sk);
+ entry = &m->xsk_map[i];
+ node = xsk_map_node_alloc(m, entry);
+ if (IS_ERR(node)) {
+ sockfd_put(sock);
+ return PTR_ERR(node);
+ }
- old_xs = xchg(&m->xsk_map[i], xs);
+ spin_lock_bh(&m->lock);
+ xsk_map_sock_add(xs, node);
+ old_xs = xchg(entry, xs);
if (old_xs)
- sock_put((struct sock *)old_xs);
+ xsk_map_sock_delete(old_xs, entry);
+ spin_unlock_bh(&m->lock);
sockfd_put(sock);
return 0;
@@ -205,19 +266,33 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
static int xsk_map_delete_elem(struct bpf_map *map, void *key)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
- struct xdp_sock *old_xs;
+ struct xdp_sock *old_xs, **map_entry;
int k = *(u32 *)key;
if (k >= map->max_entries)
return -EINVAL;
- old_xs = xchg(&m->xsk_map[k], NULL);
+ spin_lock_bh(&m->lock);
+ map_entry = &m->xsk_map[k];
+ old_xs = xchg(map_entry, NULL);
if (old_xs)
- sock_put((struct sock *)old_xs);
+ xsk_map_sock_delete(old_xs, map_entry);
+ spin_unlock_bh(&m->lock);
return 0;
}
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+ struct xdp_sock **map_entry)
+{
+ spin_lock_bh(&map->lock);
+ if (READ_ONCE(*map_entry) == xs) {
+ WRITE_ONCE(*map_entry, NULL);
+ xsk_map_sock_delete(xs, map_entry);
+ }
+ spin_unlock_bh(&map->lock);
+}
+
const struct bpf_map_ops xsk_map_ops = {
.map_alloc = xsk_map_alloc,
.map_free = xsk_map_free,
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 59b57d708697..c3447bad608a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -362,6 +362,50 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
dev_put(dev);
}
+static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
+ struct xdp_sock ***map_entry)
+{
+ struct xsk_map *map = NULL;
+ struct xsk_map_node *node;
+
+ *map_entry = NULL;
+
+ spin_lock_bh(&xs->map_list_lock);
+ node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
+ node);
+ if (node) {
+ WARN_ON(xsk_map_inc(node->map));
+ map = node->map;
+ *map_entry = node->map_entry;
+ }
+ spin_unlock_bh(&xs->map_list_lock);
+ return map;
+}
+
+static void xsk_delete_from_maps(struct xdp_sock *xs)
+{
+ /* This function removes the current XDP socket from all the
+ * maps it resides in. We need to take extra care here, due to
+ * the two locks involved. Each map has a lock synchronizing
+ * updates to the entries, and each socket has a lock that
+ * synchronizes access to the list of maps (map_list). For
+ * deadlock avoidance the locks need to be taken in the order
+ * "map lock"->"socket map list lock". We start off by
+ * accessing the socket map list, and take a reference to the
+ * map to guarantee existence. Then we ask the map to remove
+ * the socket, which tries to remove the socket from the
+ * map. Note that there might be updates to the map between
+ * xsk_get_map_list_entry() and xsk_map_try_sock_delete().
+ */
+ struct xdp_sock **map_entry = NULL;
+ struct xsk_map *map;
+
+ while ((map = xsk_get_map_list_entry(xs, &map_entry))) {
+ xsk_map_try_sock_delete(map, xs, map_entry);
+ xsk_map_put(map);
+ }
+}
+
static int xsk_release(struct socket *sock)
{
struct sock *sk = sock->sk;
@@ -381,6 +425,7 @@ static int xsk_release(struct socket *sock)
sock_prot_inuse_add(net, sk->sk_prot, -1);
local_bh_enable();
+ xsk_delete_from_maps(xs);
xsk_unbind_dev(xs);
xskq_destroy(xs->rx);
@@ -855,6 +900,9 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
spin_lock_init(&xs->rx_lock);
spin_lock_init(&xs->tx_completion_lock);
+ INIT_LIST_HEAD(&xs->map_list);
+ spin_lock_init(&xs->map_list_lock);
+
mutex_lock(&net->xdp.lock);
sk_add_node_rcu(sk, &net->xdp.list);
mutex_unlock(&net->xdp.lock);
--
2.20.1
^ permalink raw reply related
* [PATCH bpf-next v4 2/2] xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP
From: Björn Töpel @ 2019-08-02 8:11 UTC (permalink / raw)
To: ast, daniel, netdev
Cc: Björn Töpel, magnus.karlsson, bruce.richardson,
songliubraving, bpf
In-Reply-To: <20190802081154.30962-1-bjorn.topel@gmail.com>
From: Björn Töpel <bjorn.topel@intel.com>
The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
an entry. This patch addresses that.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
kernel/bpf/xskmap.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 780639309f6b..8864dfe1d9ef 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -226,8 +226,6 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
return -EINVAL;
if (unlikely(i >= m->map.max_entries))
return -E2BIG;
- if (unlikely(map_flags == BPF_NOEXIST))
- return -EEXIST;
sock = sockfd_lookup(fd, &err);
if (!sock)
@@ -253,14 +251,29 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
}
spin_lock_bh(&m->lock);
+ entry = &m->xsk_map[i];
+ old_xs = READ_ONCE(*entry);
+ if (old_xs && map_flags == BPF_NOEXIST) {
+ err = -EEXIST;
+ goto out;
+ } else if (!old_xs && map_flags == BPF_EXIST) {
+ err = -ENOENT;
+ goto out;
+ }
xsk_map_sock_add(xs, node);
- old_xs = xchg(entry, xs);
+ WRITE_ONCE(*entry, xs);
if (old_xs)
xsk_map_sock_delete(old_xs, entry);
spin_unlock_bh(&m->lock);
sockfd_put(sock);
return 0;
+
+out:
+ spin_unlock_bh(&m->lock);
+ sockfd_put(sock);
+ xsk_map_node_free(node);
+ return err;
}
static int xsk_map_delete_elem(struct bpf_map *map, void *key)
--
2.20.1
^ permalink raw reply related
* Re: [PATCH net-next] net: can: Fix compiling warning
From: Oliver Hartkopp @ 2019-08-02 8:10 UTC (permalink / raw)
To: Mao Wenan, davem, netdev; +Cc: linux-kernel, kernel-janitors
In-Reply-To: <20190802033643.84243-1-maowenan@huawei.com>
On 02/08/2019 05.36, Mao Wenan wrote:
> There are two warings in net/can, fix them by setting bcm_sock_no_ioctlcmd
> and raw_sock_no_ioctlcmd as static.
>
> net/can/bcm.c:1683:5: warning: symbol 'bcm_sock_no_ioctlcmd' was not declared. Should it be static?
> net/can/raw.c:840:5: warning: symbol 'raw_sock_no_ioctlcmd' was not declared. Should it be static?
>
> Fixes: 473d924d7d46 ("can: fix ioctl function removal")
>
> Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Thanks Mao!
Btw. what kind of compiler/make switches are you using so that I can see
these warnings myself the next time?
Best regards,
Oliver
> ---
> net/can/bcm.c | 2 +-
> net/can/raw.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/can/bcm.c b/net/can/bcm.c
> index bf1d0bbecec8..b8a32b4ac368 100644
> --- a/net/can/bcm.c
> +++ b/net/can/bcm.c
> @@ -1680,7 +1680,7 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> return size;
> }
>
> -int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +static int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> unsigned long arg)
> {
> /* no ioctls for socket layer -> hand it down to NIC layer */
> diff --git a/net/can/raw.c b/net/can/raw.c
> index da386f1fa815..a01848ff9b12 100644
> --- a/net/can/raw.c
> +++ b/net/can/raw.c
> @@ -837,7 +837,7 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> return size;
> }
>
> -int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +static int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> unsigned long arg)
> {
> /* no ioctls for socket layer -> hand it down to NIC layer */
>
^ permalink raw reply
* RE: [net-next 2/9] i40e: make visible changed vf mac on host
From: Loktionov, Aleksandr @ 2019-08-02 8:14 UTC (permalink / raw)
To: Shannon Nelson, Kirsher, Jeffrey T, davem@davemloft.net
Cc: netdev@vger.kernel.org, nhorman@redhat.com, sassmann@redhat.com,
Bowers, AndrewX
In-Reply-To: <9a3a4675-b031-7666-f259-978d18b6db19@pensando.io>
Good day Nelson
In 99% cases VF has _only one_ unicast mac anyway, and the last MAC has been chosen because of VF mac address change algo - it marks unicast filter for deletion and appends a new unicast filter to the list.
The implementation has been chosen because of simplicity /* Just 3 more lines to solve the issue */, from one point it may look wasteful for some 1% of VF corner cases.
But from another point of view, more complicated code will affect 99% normal cases. Modern cpu are sensitive to cache thrash by code size and pipeline stalls by conditionals.
Alex
-----Original Message-----
From: Shannon Nelson [mailto:snelson@pensando.io]
Sent: Friday, August 2, 2019 2:11 AM
To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; davem@davemloft.net
Cc: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>; netdev@vger.kernel.org; nhorman@redhat.com; sassmann@redhat.com; Bowers, AndrewX <andrewx.bowers@intel.com>
Subject: Re: [net-next 2/9] i40e: make visible changed vf mac on host
On 8/1/19 1:51 PM, Jeff Kirsher wrote:
> From: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>
> This patch makes changed VM mac address visible on host via ip link
> show command. This problem is fixed by copying last unicast mac filter
> to vf->default_lan_addr.addr. Without this patch if VF MAC was not set
> from host side and if you run ip link show $pf, on host side you'd
> always see a zero MAC, not the real VF MAC that VF assigned to itself.
>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> index 02b09a8ad54c..21f7ac514d1f 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> @@ -2629,6 +2629,9 @@ static int i40e_vc_add_mac_addr_msg(struct i40e_vf *vf, u8 *msg)
> } else {
> vf->num_mac++;
> }
> + if (is_valid_ether_addr(al->list[i].addr))
> + ether_addr_copy(vf->default_lan_addr.addr,
> + al->list[i].addr);
> }
> }
> spin_unlock_bh(&vsi->mac_filter_hash_lock);
Since this copy is done inside the for-loop, it looks like you are copying every address in the list, not just the last one. This seems wasteful and unnecessary.
Since it is possible, altho' unlikely, that the filter sync that happens a little later could fail, might it be better to do the copy after you know that the sync has succeeded?
Why is the last mac chosen for display rather than the first? Is there anything special about the last mac as opposed to the first mac?
sln
--------------------------------------------------------------------
Intel Technology Poland sp. z o.o.
ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | Kapital zakladowy 200.000 PLN.
Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i moze zawierac informacje poufne. W razie przypadkowego otrzymania tej wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; jakiekolwiek
przegladanie lub rozpowszechnianie jest zabronione.
This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by
others is strictly prohibited.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox