* Re: [PATCH net-next v8 2/3] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
From: Jamal Hadi Salim @ 2017-04-28 1:22 UTC (permalink / raw)
To: Jiri Pirko
Cc: davem, xiyou.wangcong, eric.dumazet, netdev, Simon Horman,
Benjamin LaHaise
In-Reply-To: <20170427063039.GB1870@nanopsycho.orion>
On 17-04-27 02:30 AM, Jiri Pirko wrote:
> Wed, Apr 26, 2017 at 10:07:08PM CEST, jhs@mojatatu.com wrote:
>> On 17-04-26 09:56 AM, Jiri Pirko wrote:
>>> Wed, Apr 26, 2017 at 03:14:38PM CEST, jhs@mojatatu.com wrote:
>> I think to have flags at that level is useful but it
>> is a different hierarchy level. I am not sure the
>> "actions dump large messages" is a fit for that level.
>
> Jamal, the idea is to have exactly what you want to have. Only does not
> use NLA_U32 attr for that but a special attr NLA_FLAGS which would have
> well defined semantics and set of helpers to work with and enforce it.
>
> Then, this could be easily reused in other subsystem that uses netlink
>
Maybe I am misunderstanding:
Recall, this is what it looks like with this patchset:
<nlh><subsytem-header>[TCA_ROOT_XXXX]
TCA_ROOT_XXX is very subsystem specific. classifiers, qdiscs and many
subsystems defined their own semantics for that TLV level. This specific
"dump max" is very very specific to actions. They were crippled by the
fact you could only send 32 at a time - this allows more to be sent.
I thought initially you meant:
<nlh>[NLA_XXX]<subsytem-header>[TCA_ROOT_XXXX]
I think at the NLA_XXX you could fit netlink wide TLVs - but if i said
"do a large dump" it is of no use to any other subsystem.
cheers,
jamal
^ permalink raw reply
* prog ID and next steps. Was: [RFC net-next 0/2] Introduce bpf_prog ID and iteration
From: Alexei Starovoitov @ 2017-04-28 1:11 UTC (permalink / raw)
To: Hannes Frederic Sowa, Martin KaFai Lau, netdev
Cc: Daniel Borkmann, kernel-team, David S. Miller,
Jesper Dangaard Brouer, John Fastabend, Thomas Graf
In-Reply-To: <e81805c8-0499-a0a5-b788-0168947d9b8c@stressinduktion.org>
On 4/27/17 6:36 AM, Hannes Frederic Sowa wrote:
> On 27.04.2017 08:24, Martin KaFai Lau wrote:
>> This patchset introduces the bpf_prog ID and a new bpf cmd to
>> iterate all bpf_prog in the system.
>>
>> It is still incomplete. The idea can be extended to bpf_map.
>>
>> Martin KaFai Lau (2):
>> bpf: Introduce bpf_prog ID
>> bpf: Test for bpf_prog ID and BPF_PROG_GET_NEXT_ID
>
> Thanks Martin, I like the approach.
>
> I think the progid is also much more suitable to be used in kallsyms
> because it handles collisions correctly and let's correctly walk the
> chain (for example imaging loading two identical programs but install
> them at different hooks, kallsysms doesn't allow to find out which
> program is installed where).
i disagree re: kallsyms. The goal of prog_tag is to let program writers
understand which program is running in a stable way.
id is assigned dynamically and not suitable for that purpose.
> It would help a lot if you could pass the prog_id back during program
> creation, otherwise it will be kind of difficult to get a hold on which
> program is where. ;)
yes, but not a creation time. bpf_prog_load command will keep returning
an FD and all operations on programs will be allowed with FD only.
Think of this 'ID' as program handle or program pointer.
In other words it's obfuscated kernel 'struct bpf_prog *' given to
user space, so that user space can later convert this ID into FD.
The other patch (not shown) will take ID from user space and will
convert it to FD if prog->aux->user is the same or root.
We tried really hard to keep everything FD based. Unfortunately
netlink is not suitable to pass FDs, so to query TC and XDP
we either have to invent a way to install FD from netlink in recvmsg()
or pass something that can be converted to FD later.
That's what program ID is solving.
This set of patches look trivial with simple use of idr,
but it took us long time to get there.
We tried to use 64-bit ID to avoid wrap around issue, but association
between ID and bpf_prog needs to be kept somewhere. The obvious
answer is rhashtable, but it cannot be iterated easily.
Like we'd need to dump the whole thing through bpf syscall which
is not practical.
Then we tried to use 32-bit idr's id + 32-bit timestamp/random.
It works better, but then we hit the issue that bpf_prog_get_next_id
cannot be iterated in a stable way when programs are being deleted
while user space iterates over the whole list.
So at the end we scraped all the fancy things and went with
simple 32-bit ID allocated in _cyclic_ way via idr.
The reason for cyclic is to avoid prog delete/create races,
so ID seen by user space stays stable for 2B ids.
We were concerned that somebody might try to load/delete
a program 2B times to cause the counter to wrap around, but
it turned out not to be an issue. In that sense prog ID is similar
to PID.
So more complete picture of what we're trying to do:
- new bpf_get_fd_from_id syscall cmd will be used to convert
prog ID into prog FD
- tc/xdp/sockets/tracing attachment points will return prog ID
- existing bpf_map_lookup() cmd from prog_array will be returning
prog ID
- bpf_prog_next_id syscall cmd (this patch) is used to iterate
over all prog IDs
- new bpf_prog_get_info syscall cmd (based on prog FD) will be used
to get all or partial info about the program that kernel knows about
Example usage:
- if user space want to see instructions of all loaded programs
it can use a loop like:
while (!bpf_prog_get_next_id(next_id, &next_id)) {
int fd = bpf_prog_get_fd_from_id(next_id);
struct bpf_prog_info info;
bpf_prog_get_info(fd, &info, flags);
// look into info.insns[]
close(fd);
}
- if user space want to see prog_tag of xdp program attached to eth0
// netlink sendmsg() into ifindex of eth0 that returns prog ID
int fd = bpf_prog_get_fd_from_id(id_from_netlink);
struct bpf_prog_info info;
bpf_prog_get_info(fd, &info, flags);
// look into info.prog_tag
close(fd);
the 'flags' argument of bpf_prog_get_info() will be used
to tell kernel which info about the program needs to be dumped.
Otherwise if kernel always dumps everything about the program,
it will make the syscall too slow and too cumbersome.
Possible combinations:
- prog_type, prog_tag, license, prog ID
- array of prog instructions
- array of map IDs
Here we'll introduce similar IDs for maps and
bpf_map_get_info() syscall cmd that will return map_type, map_id, sizes.
If user wants to iterate over all elements of the map, they can
use map_fd = bpf_map_get_fd_from_id(map_id); command
and later use existing bpf_map_get_next_key+bpf_map_lookup_elem.
We believe this way the user space will be able to see _everything_
about bpf programs and maps and can pick and choose whether
it wants to see only programs or only maps or partial info
about progs (without instructions) and so on.
Once we have CTF (debug info) available for maps and progs,
we will extend bpf_prog_get_info() and bpf_map_get_info()
commands to optionally return that as well.
^ permalink raw reply
* [PATCH net-next] virtio-net: use netif_tx_napi_add for tx napi
From: Willem de Bruijn @ 2017-04-28 0:37 UTC (permalink / raw)
To: netdev; +Cc: mst, davem, Willem de Bruijn
From: Willem de Bruijn <willemb@google.com>
Avoid hashing the tx napi struct into napi_hash[], which is used for
busy polling receive queues.
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
drivers/net/virtio_net.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 82f1c3a73345..7877551fe4e0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2228,8 +2228,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
vi->rq[i].pages = NULL;
netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
napi_weight);
- netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
- napi_tx ? napi_weight : 0);
+ netif_tx_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
+ napi_tx ? napi_weight : 0);
sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len);
--
2.13.0.rc0.306.g87b477812d-goog
^ permalink raw reply related
* RE: [PATCH net] driver/net: Fix possible memleaks when fail to register_netdevice
From: Gao Feng @ 2017-04-28 0:27 UTC (permalink / raw)
To: 'Herbert Xu'
Cc: jiri, davem, kuznet, jmorris, yoshfuji, kaber, steffen.klassert,
netdev, 'Gao Feng'
In-Reply-To: <20170427081559.GA1058@gondor.apana.org.au>
> From: Gao Feng [mailto:gfree.wind@foxmail.com]
> Sent: Thursday, April 27, 2017 4:33 PM
> > From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
> > Sent: Thursday, April 27, 2017 4:16 PM On Tue, Apr 25, 2017 at
> > 08:01:50PM +0800, gfree.wind@foxmail.com wrote:
> > > From: Gao Feng <fgao@ikuai8.com>
> > >
> [...]
> >
[...]
I think I get one method which could be safe to free the mem in ndo_uninit.
Thanks, I would send the v2 patch later.
Best Regards
Feng
^ permalink raw reply
* Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages
From: Roopa Prabhu @ 2017-04-28 0:11 UTC (permalink / raw)
To: Vlad Yasevich
Cc: David Ahern, Vladislav Yasevich, netdev@vger.kernel.org,
Jiri Pirko
In-Reply-To: <8986b8c8-9bf1-21fc-49e5-e196630cd318@redhat.com>
On Thu, Apr 27, 2017 at 12:51 PM, Vlad Yasevich <vyasevic@redhat.com> wrote:
> On 04/24/2017 11:14 AM, Roopa Prabhu wrote:
>> On Sun, Apr 23, 2017 at 6:07 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>>>
>>> On 4/21/17 11:31 AM, Vladislav Yasevich wrote:
>>>> @@ -1276,9 +1277,40 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
>>>> return err;
>>>> }
>>>>
>>>> +static int rtnl_fill_link_event(struct sk_buff *skb, unsigned long event)
>>>> +{
>>>> + u32 rtnl_event;
>>>> +
>>>> + switch (event) {
>>>> + case NETDEV_REBOOT:
>>>> + rtnl_event = IFLA_EVENT_REBOOT;
>>>> + break;
>>>> + case NETDEV_FEAT_CHANGE:
>>>> + rtnl_event = IFLA_EVENT_FEAT_CHANGE;
>>>> + break;
>>>> + case NETDEV_BONDING_FAILOVER:
>>>> + rtnl_event = IFLA_EVENT_BONDING_FAILOVER;
>>>> + break;
>>>> + case NETDEV_NOTIFY_PEERS:
>>>> + rtnl_event = IFLA_EVENT_NOTIFY_PEERS;
>>>> + break;
>>>> + case NETDEV_RESEND_IGMP:
>>>> + rtnl_event = IFLA_EVENT_RESEND_IGMP;
>>>> + break;
>>>> + case NETDEV_CHANGEINFODATA:
>>>> + rtnl_event = IFLA_EVENT_CHANGE_INFO_DATA;
>>>> + break;
>>>> + default:
>>>> + return 0;
>>>> + }
>>>> +
>>>> + return nla_put_u32(skb, IFLA_EVENT, rtnl_event);
>>>> +}
>>>> +
>>>
>>> I still have doubts about encoding kernel events into a uapi.
>>
>> agree. I don't see why user-space will need NETDEV_CHANGEINFODATA and
>> others david listed.
>>
>
> Well, I am not sure about CHANGEINFODATA as well, but I can see use
> cases for others.
>
>> My other concerns are, once we have this exposed to user-space and
>> user-space starts relying on it, it will need accurate information and
>> will expect to have this event information all the time.
>> IIUC, we cannot cover multiple events in a single notification and not
>> all link notifications will contain an IFLA_EVENT attribute.
>
> Uhm... If the rtnetlink message was a result of an event, it will have
> an IFLA_EVENT. If a message is something else, then it will not have
> an event. That's the point. Not all netlink attributes are in every
> netlink message.
>
>> In other
>> words, we will be telling user-space to not expect that the kernel
>> will send IFLA_EVENT every time.
>>
>
> No, we are telling the user that if it is interested in a specific event
> (let's say NOTIFY_PEERS or RESEND_IGMP), then it now can monitor netlink
> traffic for those events.
> As things stand right now, that's not possible.
>
> I've done this specifically for all events for which we currently generate
> a netlink message.
>
> The only concern I have is that if in the future we remove a certain netdev
> event, it may impact applications. But we may be doing it right now as well,
> only silently, and the apps may have to find some ways to work around it.
>
ok, fair enough. it might be ok then....except for the specific
attributes that user-space may not be interested like CHANGEINFODATA.
^ permalink raw reply
* Re: [Patch net-next] ipv4: get rid of ip_ra_lock
From: Eric Dumazet @ 2017-04-27 23:54 UTC (permalink / raw)
To: Cong Wang; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAM_iQpXexP9OzzqyPJ-yHmyq-ZF=cNboaG8566yaxZKbn0+TTg@mail.gmail.com>
On Thu, 2017-04-27 at 16:46 -0700, Cong Wang wrote:
> On Thu, Apr 27, 2017 at 5:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2017-04-26 at 13:55 -0700, Cong Wang wrote:
> >> After commit 1215e51edad1 ("ipv4: fix a deadlock in ip_ra_control")
> >> we always take RTNL lock for ip_ra_control() which is the only place
> >> we update the list ip_ra_chain, so the ip_ra_lock is no longer needed,
> >> we just need to disable BH there.
> >>
> >> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> >> ---
> >
> > Looks great, but reading again this code, I believe we do not need to
> > disable BH at all ?
> >
>
> Hmm, if we don't disable BH here, a reader in BH could jump in and
> break this critical section? Or that is fine for RCU?
It should be fine for RCU.
The spinlock (or mutex if this is RTNL) is protecting writers among
themselves. Here it should run in process context, with no specific
rules to disable preemption, hard or soft irqs.
The reader(s) do not care of how writer(s) enforce their mutual
protection, and if writer(s) disable hard or soft irqs.
^ permalink raw reply
* Re: [PATCH iproute2] routel: fix infinite loop in line parser
From: Stephen Hemminger @ 2017-04-27 23:46 UTC (permalink / raw)
To: Michal Kubecek; +Cc: netdev
In-Reply-To: <20170427094347.3EB63A0F1C@unicorn.suse.cz>
On Thu, 27 Apr 2017 11:43:47 +0200 (CEST)
Michal Kubecek <mkubecek@suse.cz> wrote:
> As noticed by one of the few users of routel script, it ends up in an
> infinite loop when they pull out the cable from the NIC used for some
> route. This is caused by its parser expecting the line of "ip route show"
> output consists of "key value" pairs (except for the initial target range),
> together with an old trap of Bourne style shells that "shift 2" does
> nothing if there is only one argument left. Some keywords, e.g. "linkdown",
> are not followed by a value.
>
> Improve the parser to
>
> (1) only set variables for keywords we care about
> (2) recognize (currently) known keywords without value
>
> This is still far from perfect (and certainly not future proof) but to
> fully fix the script, one would probably have to rewrite the logic
> completely (and I'm not sure it's worth the effort).
>
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Appled, but this really needs to be done better.
Either as a simplified output of route command. See ip -br link
Or ip route should have a json output option and use python/perl/xss
to reformat.
^ permalink raw reply
* Re: [Patch net-next] ipv4: get rid of ip_ra_lock
From: Cong Wang @ 2017-04-27 23:46 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Linux Kernel Network Developers
In-Reply-To: <1493297179.6453.105.camel@edumazet-glaptop3.roam.corp.google.com>
On Thu, Apr 27, 2017 at 5:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-04-26 at 13:55 -0700, Cong Wang wrote:
>> After commit 1215e51edad1 ("ipv4: fix a deadlock in ip_ra_control")
>> we always take RTNL lock for ip_ra_control() which is the only place
>> we update the list ip_ra_chain, so the ip_ra_lock is no longer needed,
>> we just need to disable BH there.
>>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>
> Looks great, but reading again this code, I believe we do not need to
> disable BH at all ?
>
Hmm, if we don't disable BH here, a reader in BH could jump in and
break this critical section? Or that is fine for RCU?
^ permalink raw reply
* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
From: Alexei Starovoitov @ 2017-04-27 23:31 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Andy Gospodarek
Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
Daniel Borkmann, netdev@vger.kernel.org,
xdp-newbies@vger.kernel.org
In-Reply-To: <20170427104121.32df2178@redhat.com>
On 4/27/17 1:41 AM, Jesper Dangaard Brouer wrote:
> When registering/attaching a XDP/bpf program, we would just send the
> file-descriptor for this port-map along (like we do with the bpf_prog
> FD). Plus, it own ingress-port number this program is in the port-map.
>
> It is not clear to me, in-which-data-structure on the kernel-side we
> store this reference to the port-map and ingress-port. As today we only
> have the "raw" struct bpf_prog pointer. I see several options:
>
> 1. Create a new xdp_prog struct that contains existing bpf_prog,
> a port-map pointer and ingress-port. (IMHO easiest solution)
>
> 2. Just create a new pointer to port-map and store it in driver rx-ring
> struct (like existing bpf_prog), but this create a race-challenge
> replacing (cmpxchg) the program (or perhaps it's not a problem as it
> runs under rcu and RTNL-lock).
>
> 3. Extend bpf_prog to store this port-map and ingress-port, and have a
> fast-way to access it. I assume it will be accessible via
> bpf_prog->bpf_prog_aux->used_maps[X] but it will be too slow for XDP.
I'm not sure I completely follow the 3 proposals.
Are you suggesting to have only one netdev_array per program?
Why not to allow any number like we do for tailcall+prog_array, etc?
We can teach verifier to allow new helper
bpf_tx_port(netdev_array, port_num);
to only be used with netdev_array map type.
It will fetch netdevice pointer from netdev_array[port_num]
and will tx the packet into it.
We can make it similar to bpf_tail_call(), so that program will
finish on successful bpf_tx_port() or
make it into 'delayed' tx which will be executed when program finishes.
Not sure which approach is better.
We can also extend this netdev_array into broadcast/multicast. Like
bpf_tx_allports(&netdev_array);
call from the program will xmit the packet to all netdevices
in that 'netdev_array' map type.
The map-in-map support can be trivially extended to allow netdev_array,
then the program can create N multicast groups of netdevices.
Each multicast group == one netdev_array map.
The user space will populate a hashmap with these netdev_arrays and
bpf kernel side can select dynamically which multicast group to use
to send the packets to.
bpf kernel side may look like:
struct bpf_netdev_array *netdev_array = bpf_map_lookup_elem(&hash, key);
if (!netdev_array)
...
if (my_condition)
bpf_tx_allports(netdev_array); /* broadcast to all netdevices */
else
bpf_tx_port(netdev_array, port_num); /* tx into one netdevice */
that's an artificial example. Just trying to point out
that we shouldn't restrict the feature too soon.
^ permalink raw reply
* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Logan Gunthorpe @ 2017-04-27 23:29 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Boris Ostrovsky, linux-nvdimm, dri-devel, Stephen Bates, dm-devel,
target-devel, Christoph Hellwig, devel, James E.J. Bottomley,
linux-scsi, Matthew Wilcox, linux-rdma, Sumit Semwal,
Ross Zwisler, open-iscsi, linux-media, Juergen Gross,
Julien Grall, Konrad Rzeszutek Wilk, intel-gfx, sparmaintainer,
linux-raid, Dan Williams, megaraidlinux.pdl, Jens Axboe,
"Martin K. Petersen" <martin.p
In-Reply-To: <20170427232022.GA30398@obsidianresearch.com>
On 27/04/17 05:20 PM, Jason Gunthorpe wrote:
> It seems the most robust: test for iomem, and jump to a slow path
> copy, otherwise inline the kmap and memcpy
>
> Every place doing memcpy from sgl will need that pattern to be
> correct.
Ok, sounds like a good place to start to me. I'll see what I can do for
a v3 of this set. Though, I probably won't send anything until after the
merge window.
>>> sg_miter will still fail when the sg contains __iomem, however I would
>>> expect that the sg_copy will work with iomem, by using the __iomem
>>> memcpy variant.
>>
>> Yes, that's true. Any sg_miters that ever see iomem will need to be
>> converted to support it. This isn't much different than the other
>> kmap(sg_page()) users I was converting that will also fail if they see
>> iomem. Though, I suspect an sg_miter user would be easier to convert to
>> iomem than a random kmap user.
>
> How? sg_miter seems like the next nightmare down this path, what is
> sg_miter_next supposed to do when something hits an iomem sgl?
My proposal is roughly included in the draft I sent upthread. We add an
sg_miter flag indicating the iteratee supports iomem and if miter finds
iomem (with the support flag set) it sets ioaddr which is __iomem. The
iteratee then just needs to null check addr and ioaddr and perform the
appropriate action.
Logan
^ permalink raw reply
* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Jason Gunthorpe @ 2017-04-27 23:20 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Boris Ostrovsky, linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
target-devel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, James E.J. Bottomley,
linux-scsi-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sumit Semwal,
open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
linux-media-u79uwXL29TY76Z2rM5mHXA, Juergen Gross, Julien Grall,
intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA,
linux-raid-u79uwXL29TY76Z2rM5mHXA,
megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w, Jens Axboe,
Martin K. Petersen, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-mmc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-crypto-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <3a7c0d27-0744-4e91-b37f-3885c50455e8-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
On Thu, Apr 27, 2017 at 05:03:45PM -0600, Logan Gunthorpe wrote:
>
>
> On 27/04/17 04:11 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote:
> > Well, that is in the current form, with more users it would make sense
> > to optimize for the single page case, eg by providing the existing
> > call, providing a faster single-page-only variant of the copy, perhaps
> > even one that is inlined.
>
> Ok, does it make sense then to have an sg_copy_page_to_buffer (or some
> such... I'm having trouble thinking of a sane name that isn't too long).
> That just does k(un)map_atomic and memcpy? I could try that if it makes
> sense to people.
It seems the most robust: test for iomem, and jump to a slow path
copy, otherwise inline the kmap and memcpy
Every place doing memcpy from sgl will need that pattern to be
correct.
> > sg_miter will still fail when the sg contains __iomem, however I would
> > expect that the sg_copy will work with iomem, by using the __iomem
> > memcpy variant.
>
> Yes, that's true. Any sg_miters that ever see iomem will need to be
> converted to support it. This isn't much different than the other
> kmap(sg_page()) users I was converting that will also fail if they see
> iomem. Though, I suspect an sg_miter user would be easier to convert to
> iomem than a random kmap user.
How? sg_miter seems like the next nightmare down this path, what is
sg_miter_next supposed to do when something hits an iomem sgl?
miter.addr is supposed to be a kernel pointer that must not be
__iomem..
Jason
^ permalink raw reply
* Re: [PATCH net] bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal
From: Jay Vosburgh @ 2017-04-27 23:08 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller, Honggang LI,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <733d454d3c36e99b55de5374c7664364975b171d.1493313626.git.pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>On slave list updates, the bonding driver computes its hard_header_len
>as the maximum of all enslaved devices's hard_header_len.
>If the slave list is empty, e.g. on last enslaved device removal,
>ETH_HLEN is used.
>
>Since the bonding header_ops are set only when the first enslaved
>device is attached, the above can lead to header_ops->create()
>being called with the wrong skb headroom in place.
>
>If bond0 is configured on top of ipoib devices, with the
>following commands:
>
>ifup bond0
>for slave in $BOND_SLAVES_LIST; do
> ip link set dev $slave nomaster
>done
>ping -c 1 <ip on bond0 subnet>
>
>we will obtain a skb_under_panic() with a similar call trace:
> skb_push+0x3d/0x40
> push_pseudo_header+0x17/0x30 [ib_ipoib]
> ipoib_hard_header+0x4e/0x80 [ib_ipoib]
> arp_create+0x12f/0x220
> arp_send_dst.part.19+0x28/0x50
> arp_solicit+0x115/0x290
> neigh_probe+0x4d/0x70
> __neigh_event_send+0xa7/0x230
> neigh_resolve_output+0x12e/0x1c0
> ip_finish_output2+0x14b/0x390
> ip_finish_output+0x136/0x1e0
> ip_output+0x76/0xe0
> ip_local_out+0x35/0x40
> ip_send_skb+0x19/0x40
> ip_push_pending_frames+0x33/0x40
> raw_sendmsg+0x7d3/0xb50
> inet_sendmsg+0x31/0xb0
> sock_sendmsg+0x38/0x50
> SYSC_sendto+0x102/0x190
> SyS_sendto+0xe/0x10
> do_syscall_64+0x67/0x180
> entry_SYSCALL64_slow_path+0x25/0x25
>
>This change addresses the issue avoiding updating the bonding device
>hard_header_len when the slaves list become empty, forbidding to
>shrink it below the value used by header_ops->create().
>
>The bug is there since commit 54ef31371407 ("[PATCH] bonding: Handle large
>hard_header_len") but the panic can be triggered only since
>commit fc791b633515 ("IB/ipoib: move back IB LL address into the hard
>header").
>
>Reported-by: Norbert P <noe-PRwTpj6vllL463JZfw7VRw@public.gmane.org>
>Fixes: 54ef31371407 ("[PATCH] bonding: Handle large hard_header_len")
>Fixes: fc791b633515 ("IB/ipoib: move back IB LL address into the hard header")
>Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Jay Vosburgh <jay.vosburgh-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> drivers/net/bonding/bond_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 8a4ba8b..34481c9 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1104,11 +1104,11 @@ static void bond_compute_features(struct bonding *bond)
> gso_max_size = min(gso_max_size, slave->dev->gso_max_size);
> gso_max_segs = min(gso_max_segs, slave->dev->gso_max_segs);
> }
>+ bond_dev->hard_header_len = max_hard_header_len;
>
> done:
> bond_dev->vlan_features = vlan_features;
> bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
>- bond_dev->hard_header_len = max_hard_header_len;
> bond_dev->gso_max_segs = gso_max_segs;
> netif_set_gso_max_size(bond_dev, gso_max_size);
>
>--
>2.9.3
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Logan Gunthorpe @ 2017-04-27 23:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Boris Ostrovsky, linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
target-devel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, James E.J. Bottomley,
linux-scsi-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sumit Semwal,
open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
linux-media-u79uwXL29TY76Z2rM5mHXA, Juergen Gross, Julien Grall,
intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA,
linux-raid-u79uwXL29TY76Z2rM5mHXA,
megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w, Jens Axboe,
Martin K. Petersen, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-mmc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-crypto-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <20170427221132.GA30036-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On 27/04/17 04:11 PM, Jason Gunthorpe wrote:
> On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote:
> Well, that is in the current form, with more users it would make sense
> to optimize for the single page case, eg by providing the existing
> call, providing a faster single-page-only variant of the copy, perhaps
> even one that is inlined.
Ok, does it make sense then to have an sg_copy_page_to_buffer (or some
such... I'm having trouble thinking of a sane name that isn't too long).
That just does k(un)map_atomic and memcpy? I could try that if it makes
sense to people.
>> Switching the for_each_sg to sg_miter is probably the nicer solution as
>> it takes care of the mapping and the offset/length accounting for you
>> and will have similar performance.
>
> sg_miter will still fail when the sg contains __iomem, however I would
> expect that the sg_copy will work with iomem, by using the __iomem
> memcpy variant.
Yes, that's true. Any sg_miters that ever see iomem will need to be
converted to support it. This isn't much different than the other
kmap(sg_page()) users I was converting that will also fail if they see
iomem. Though, I suspect an sg_miter user would be easier to convert to
iomem than a random kmap user.
Logan
^ permalink raw reply
* Re: [PATCH net] bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal
From: Marcelo Ricardo Leitner @ 2017-04-27 22:54 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Jay Vosburgh, David S. Miller,
Honggang LI, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <733d454d3c36e99b55de5374c7664364975b171d.1493313626.git.pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
On Thu, Apr 27, 2017 at 07:29:34PM +0200, Paolo Abeni wrote:
> On slave list updates, the bonding driver computes its hard_header_len
> as the maximum of all enslaved devices's hard_header_len.
> If the slave list is empty, e.g. on last enslaved device removal,
> ETH_HLEN is used.
>
> Since the bonding header_ops are set only when the first enslaved
> device is attached, the above can lead to header_ops->create()
> being called with the wrong skb headroom in place.
>
> If bond0 is configured on top of ipoib devices, with the
> following commands:
>
> ifup bond0
> for slave in $BOND_SLAVES_LIST; do
> ip link set dev $slave nomaster
> done
> ping -c 1 <ip on bond0 subnet>
>
> we will obtain a skb_under_panic() with a similar call trace:
> skb_push+0x3d/0x40
> push_pseudo_header+0x17/0x30 [ib_ipoib]
> ipoib_hard_header+0x4e/0x80 [ib_ipoib]
> arp_create+0x12f/0x220
> arp_send_dst.part.19+0x28/0x50
> arp_solicit+0x115/0x290
> neigh_probe+0x4d/0x70
> __neigh_event_send+0xa7/0x230
> neigh_resolve_output+0x12e/0x1c0
> ip_finish_output2+0x14b/0x390
> ip_finish_output+0x136/0x1e0
> ip_output+0x76/0xe0
> ip_local_out+0x35/0x40
> ip_send_skb+0x19/0x40
> ip_push_pending_frames+0x33/0x40
> raw_sendmsg+0x7d3/0xb50
> inet_sendmsg+0x31/0xb0
> sock_sendmsg+0x38/0x50
> SYSC_sendto+0x102/0x190
> SyS_sendto+0xe/0x10
> do_syscall_64+0x67/0x180
> entry_SYSCALL64_slow_path+0x25/0x25
>
> This change addresses the issue avoiding updating the bonding device
> hard_header_len when the slaves list become empty, forbidding to
> shrink it below the value used by header_ops->create().
>
> The bug is there since commit 54ef31371407 ("[PATCH] bonding: Handle large
> hard_header_len") but the panic can be triggered only since
> commit fc791b633515 ("IB/ipoib: move back IB LL address into the hard
> header").
>
> Reported-by: Norbert P <noe-PRwTpj6vllL463JZfw7VRw@public.gmane.org>
> Fixes: 54ef31371407 ("[PATCH] bonding: Handle large hard_header_len")
> Fixes: fc791b633515 ("IB/ipoib: move back IB LL address into the hard header")
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
Thanks Paolo.
> drivers/net/bonding/bond_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 8a4ba8b..34481c9 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1104,11 +1104,11 @@ static void bond_compute_features(struct bonding *bond)
> gso_max_size = min(gso_max_size, slave->dev->gso_max_size);
> gso_max_segs = min(gso_max_segs, slave->dev->gso_max_segs);
> }
> + bond_dev->hard_header_len = max_hard_header_len;
>
> done:
> bond_dev->vlan_features = vlan_features;
> bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
> - bond_dev->hard_header_len = max_hard_header_len;
> bond_dev->gso_max_segs = gso_max_segs;
> netif_set_gso_max_size(bond_dev, gso_max_size);
>
> --
> 2.9.3
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH net-next] rhashtable: Make sure max_size is non zero
From: Florian Fainelli @ 2017-04-27 22:32 UTC (permalink / raw)
To: netdev; +Cc: davem, herbert, fw, tgraf
In-Reply-To: <20170427222824.31936-1-florian.fainelli@broadcom.com>
On 04/27/2017 03:28 PM, Florian Fainelli wrote:
> After commit 6d684e54690c ("rhashtable: Cap total number of
> entries to 2^31"), we would be hitting a panic() in net/core/rtnetlink.c
> during initialization. The call stack would look like this:
>
> register_pernet_subsys()
> ...
> ops->init()
> rtnetlink_net_init()
> netlink_kernel_create()
> netlink_insert()
> __netlink_insert()
> rhashtable_lookup_insert_key()
> __rhashtable_insert_fast()
> rht_grow_above_max()
>
> And here, we have rht_grow_above_max() return true, because ht->nelemts = 0 (legit)
> && ht->max_elems = 0 (looks bogus).
>
> Eventually, we would be return -E2BIG from __rhashtable_insert_fast()
> and propagate this all the way back to the caller.
>
> After commit 6d684e54690c what changed is that we would take the
> following condition:
>
> if (ht->p.max_size < ht->max_elems / 2)
> ht->max_elems = ht->p.max_size * 2;
>
> and since ht->p.max_size = 0, we would set ht->max_elems to 0 as well.
>
> Fix this by taking this branch only when ht->p.max_size is non-zero
>
> Fixes: Fixes: 6d684e54690c ("rhashtable: Cap total number of entries to 2^31")
> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Sent another version with the correct email address and marked this one
as superseded in patchwork, not that this email is not valid, but it's
all about consistency.
David pleas apply this one instead:
http://patchwork.ozlabs.org/patch/756172/
/me remembers to stop switching between machines.
--
Florian
^ permalink raw reply
* [PATCH net-next] rhashtable: Make sure max_size is non zero
From: Florian Fainelli @ 2017-04-27 22:30 UTC (permalink / raw)
To: netdev; +Cc: davem, herbert, fw, tgraf, Florian Fainelli
In-Reply-To: <56843a86-9a09-16e8-acec-05a80396f282@gmail.com>
After commit 6d684e54690c ("rhashtable: Cap total number of
entries to 2^31"), we would be hitting a panic() in net/core/rtnetlink.c
during initialization. The call stack would look like this:
register_pernet_subsys()
...
ops->init()
rtnetlink_net_init()
netlink_kernel_create()
netlink_insert()
__netlink_insert()
rhashtable_lookup_insert_key()
__rhashtable_insert_fast()
rht_grow_above_max()
And here, we have rht_grow_above_max() return true, because ht->nelemts = 0 (legit)
&& ht->max_elems = 0 (looks bogus).
Eventually, we would be return -E2BIG from __rhashtable_insert_fast()
and propagate this all the way back to the caller.
After commit 6d684e54690c what changed is that we would take the
following condition:
if (ht->p.max_size < ht->max_elems / 2)
ht->max_elems = ht->p.max_size * 2;
and since ht->p.max_size = 0, we would set ht->max_elems to 0 as well.
Fix this by taking this branch only when ht->p.max_size is non-zero
Fixes: Fixes: 6d684e54690c ("rhashtable: Cap total number of entries to 2^31")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
lib/rhashtable.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 751630bbe409..6b4f07760fec 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -963,7 +963,7 @@ int rhashtable_init(struct rhashtable *ht,
/* Cap total entries at 2^31 to avoid nelems overflow. */
ht->max_elems = 1u << 31;
- if (ht->p.max_size < ht->max_elems / 2)
+ if (ht->p.max_size && (ht->p.max_size < ht->max_elems / 2))
ht->max_elems = ht->p.max_size * 2;
ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE);
--
2.12.2
^ permalink raw reply related
* [PATCH net-next] rhashtable: Make sure max_size is non zero
From: Florian Fainelli @ 2017-04-27 22:28 UTC (permalink / raw)
To: netdev; +Cc: davem, herbert, fw, tgraf, Florian Fainelli
In-Reply-To: <56843a86-9a09-16e8-acec-05a80396f282@gmail.com>
After commit 6d684e54690c ("rhashtable: Cap total number of
entries to 2^31"), we would be hitting a panic() in net/core/rtnetlink.c
during initialization. The call stack would look like this:
register_pernet_subsys()
...
ops->init()
rtnetlink_net_init()
netlink_kernel_create()
netlink_insert()
__netlink_insert()
rhashtable_lookup_insert_key()
__rhashtable_insert_fast()
rht_grow_above_max()
And here, we have rht_grow_above_max() return true, because ht->nelemts = 0 (legit)
&& ht->max_elems = 0 (looks bogus).
Eventually, we would be return -E2BIG from __rhashtable_insert_fast()
and propagate this all the way back to the caller.
After commit 6d684e54690c what changed is that we would take the
following condition:
if (ht->p.max_size < ht->max_elems / 2)
ht->max_elems = ht->p.max_size * 2;
and since ht->p.max_size = 0, we would set ht->max_elems to 0 as well.
Fix this by taking this branch only when ht->p.max_size is non-zero
Fixes: Fixes: 6d684e54690c ("rhashtable: Cap total number of entries to 2^31")
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
---
lib/rhashtable.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 751630bbe409..6b4f07760fec 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -963,7 +963,7 @@ int rhashtable_init(struct rhashtable *ht,
/* Cap total entries at 2^31 to avoid nelems overflow. */
ht->max_elems = 1u << 31;
- if (ht->p.max_size < ht->max_elems / 2)
+ if (ht->p.max_size && (ht->p.max_size < ht->max_elems / 2))
ht->max_elems = ht->p.max_size * 2;
ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE);
--
2.12.2
^ permalink raw reply related
* Re: rhashtable - Cap total number of entries to 2^31
From: Florian Fainelli @ 2017-04-27 22:21 UTC (permalink / raw)
To: Herbert Xu, David Miller; +Cc: fw, netdev, Thomas Graf
In-Reply-To: <0cd0286d-b81d-7bf4-d345-7ef098b9a998@broadcom.com>
On 04/27/2017 02:16 PM, Florian Fainelli wrote:
> Hi Herbert,
>
> On 04/26/2017 10:44 PM, Herbert Xu wrote:
>> On Tue, Apr 25, 2017 at 10:48:22AM -0400, David Miller wrote:
>>> From: Florian Westphal <fw@strlen.de>
>>> Date: Tue, 25 Apr 2017 16:17:49 +0200
>>>
>>>> I'd have less of an issue with this if we'd be talking about
>>>> something computationally expensive, but this is about storing
>>>> an extra value inside a struct just to avoid one "shr" in insert path...
>>>
>>> Agreed, this shift is probably filling an available cpu cycle :-)
>>
>> OK, but we need to have an extra field for another reason anyway.
>> The problem is that we're not capping the total number of elements
>> in the hashtable when max_size is not set, this means that nelems
>> can overflow which will cause havoc with the automatic shrinking
>> when it tries to fit 2^32 entries into a minimum-sized table.
>>
>> So I'm taking that hole back for now :)
>>
>> ---8<---
>> When max_size is not set or if it set to a sufficiently large
>> value, the nelems counter can overflow. This would cause havoc
>> with the automatic shrinking as it would then attempt to fit a
>> huge number of entries into a tiny hash table.
>>
>> This patch fixes this by adding max_elems to struct rhashtable
>> to cap the number of elements. This is set to 2^31 as nelems is
>> not a precise count. This is sufficiently smaller than UINT_MAX
>> that it should be safe.
>>
>> When max_size is set max_elems will be lowered to at most twice
>> max_size as is the status quo.
>>
>> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> This commit:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=6d684e54690caef45cf14051ddeb7c71beeb681b
>
> makes my ARMv7 (32-bit) system panic on boot with the log below. I can
> test net-next (or net) and report back if you want me to test anything.
> Thanks!
And another on with a QEMU guest:
[ 0.389212] NET: Registered protocol family 16
[ 0.388807] Kernel panic - not syncing: rtnetlink_init: cannot
initialize rtnetlink
[ 0.388807]
[ 0.389445] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
4.11.0-rc8-02077-ge221c1f0fe25 #1
[ 0.389745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[ 0.390219] Call Trace:
[ 0.391406] dump_stack+0x51/0x78
[ 0.391585] panic+0xc7/0x20e
[ 0.391740] ? register_pernet_operations+0xa1/0xd0
[ 0.392031] rtnetlink_init+0x22/0x1a0
[ 0.392190] netlink_proto_init+0x168/0x184
[ 0.392359] ? ptp_classifier_init+0x26/0x30
[ 0.392528] ? netlink_net_init+0x2e/0x2e
[ 0.392692] do_one_initcall+0x54/0x190
[ 0.392852] ? parse_args+0x248/0x400
[ 0.393033] kernel_init_freeable+0x127/0x1b6
[ 0.393208] ? kernel_init_freeable+0x1b6/0x1b6
[ 0.393389] ? rest_init+0x70/0x70
[ 0.393533] kernel_init+0x9/0x100
[ 0.393676] ret_from_fork+0x29/0x40
[ 0.394555] ---[ end Kernel panic - not syncing: rtnetlink_init:
cannot initialize rtnetlink
[ 0.394555]
I traced this down to:
rtnetlink_net_init()
netlink_kernel_create()
netlink_insert()
__netlink_insert()
rhashtable_lookup_insert_key()
__rhashtable_insert_fast()
rht_grow_above_max()
And indeed we have:
ht->nelemts = 0
ht->max_elems = 0
such that rht_grow_above_max() returns true.
With your commit we actually take this branch:
if (ht->p.max_size < ht->max_elems / 2)
ht->max_elems = ht->p.max_size * 2;
since max_size = 0 we have max_elems = 0 as well.
Candidate fix #1:
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 45f89369c4c8..ad9020e1609c 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -329,7 +329,7 @@ static inline bool rht_grow_above_100(const struct
rhashtable *ht,
static inline bool rht_grow_above_max(const struct rhashtable *ht,
const struct bucket_table *tbl)
{
- return atomic_read(&ht->nelems) >= ht->max_elems;
+ return ht->p.max_size && atomic_read(&ht->nelems) >= ht->max_elems;
}
Candidate fix #2:
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 751630bbe409..6b4f07760fec 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -963,7 +963,7 @@ int rhashtable_init(struct rhashtable *ht,
/* Cap total entries at 2^31 to avoid nelems overflow. */
ht->max_elems = 1u << 31;
- if (ht->p.max_size < ht->max_elems / 2)
+ if (ht->p.max_size && (ht->p.max_size < ht->max_elems / 2))
ht->max_elems = ht->p.max_size * 2;
ht->p.min_size = max(ht->p.min_size, HASH_MIN_SIZE);
Number #2 does not introduce an additional conditional on the fastpath,
so I suppose that would be what we would prefer?
>
> [ 0.158619] futex hash table entries: 1024 (order: 4, 65536 bytes)
> [ 0.166386] NET: Registered protocol family 16
> [ 0.179596] Kernel panic - not syncing: rtnetlink_init: cannot
> initialize rtnetlink
> [ 0.179596]
> [ 0.189350] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
> 4.11.0-rc8-02028-g6d684e54690c #37
> [ 0.197908] Hardware name: Broadcom STB (Flattened Device Tree)
> [ 0.204254] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
> (show_stack+0x10/0x14)
> [ 0.212447] [<c020b294>] (show_stack) from [<c04bc454>]
> (dump_stack+0x90/0xa4)
> [ 0.220144] [<c04bc454>] (dump_stack) from [<c02ab684>]
> (panic+0xf0/0x270)
> [ 0.227460] [<c02ab684>] (panic) from [<c0c2705c>]
> (rtnetlink_init+0x24/0x1d4)
> [ 0.235145] [<c0c2705c>] (rtnetlink_init) from [<c0c27630>]
> (netlink_proto_init+0x124/0x148)
> [ 0.244124] [<c0c27630>] (netlink_proto_init) from [<c02017f8>]
> (do_one_initcall+0x40/0x168)
> [ 0.253072] [<c02017f8>] (do_one_initcall) from [<c0c00dfc>]
> (kernel_init_freeable+0x164/0x200)
> [ 0.262304] [<c0c00dfc>] (kernel_init_freeable) from [<c087bfd8>]
> (kernel_init+0x8/0x110)
> [ 0.270970] [<c087bfd8>] (kernel_init) from [<c0207fa8>]
> (ret_from_fork+0x14/0x2c)
> [ 0.279014] CPU1: stopping
> [ 0.281916] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
> 4.11.0-rc8-02028-g6d684e54690c #37
> [ 0.290499] Hardware name: Broadcom STB (Flattened Device Tree)
> [ 0.296796] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
> (show_stack+0x10/0x14)
> [ 0.305018] [<c020b294>] (show_stack) from [<c04bc454>]
> (dump_stack+0x90/0xa4)
> [ 0.312684] [<c04bc454>] (dump_stack) from [<c020e984>]
> (handle_IPI+0x170/0x190)
> [ 0.320531] [<c020e984>] (handle_IPI) from [<c020144c>]
> (gic_handle_irq+0x88/0x8c)
> [ 0.328586] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
> (__irq_svc+0x58/0x74)
> [ 0.336543] Exception stack(0xee055f68 to 0xee055fb0)
> [ 0.341938] 5f60: 00000001 00000000 ee055fc0
> c0219b60 ee054000 c1603cc8
> [ 0.350661] 5f80: c1603c6c 00000000 00000000 c1486188 ee055fc0
> c1603cd4 c1483408 ee055fb8
> [ 0.359323] 5fa0: c0208a40 c0208a44 60000013 ffffffff
> [ 0.364745] [<c020bd78>] (__irq_svc) from [<c0208a44>]
> (arch_cpu_idle+0x38/0x3c)
> [ 0.372613] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
> (do_idle+0x168/0x204)
> [ 0.380479] [<c0255e98>] (do_idle) from [<c02561ac>]
> (cpu_startup_entry+0x18/0x1c)
> [ 0.388493] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
> [ 0.395687] CPU3: stopping
> [ 0.398606] CPU: 3 PID: 0 Comm: swapper/3 Not tainted
> 4.11.0-rc8-02028-g6d684e54690c #37
> [ 0.407242] Hardware name: Broadcom STB (Flattened Device Tree)
> [ 0.413564] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
> (show_stack+0x10/0x14)
> [ 0.421795] [<c020b294>] (show_stack) from [<c04bc454>]
> (dump_stack+0x90/0xa4)
> [ 0.429495] [<c04bc454>] (dump_stack) from [<c020e984>]
> (handle_IPI+0x170/0x190)
> [ 0.437394] [<c020e984>] (handle_IPI) from [<c020144c>]
> (gic_handle_irq+0x88/0x8c)
> [ 0.445475] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
> (__irq_svc+0x58/0x74)
> [ 0.453406] Exception stack(0xee059f68 to 0xee059fb0)
> [ 0.458792] 9f60: 00000001 00000000 ee059fc0
> c0219b60 ee058000 c1603cc8
> [ 0.467489] 9f80: c1603c6c 00000000 00000000 c1486188 ee059fc0
> c1603cd4 c1483408 ee059fb8
> [ 0.476177] 9fa0: c0208a40 c0208a44 60000013 ffffffff
> [ 0.481581] [<c020bd78>] (__irq_svc) from [<c0208a44>]
> (arch_cpu_idle+0x38/0x3c)
> [ 0.489474] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
> (do_idle+0x168/0x204)
> [ 0.497331] [<c0255e98>] (do_idle) from [<c02561ac>]
> (cpu_startup_entry+0x18/0x1c)
> [ 0.505369] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
> [ 0.512562] CPU2: stopping
> [ 0.515463] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
> 4.11.0-rc8-02028-g6d684e54690c #37
> [ 0.524047] Hardware name: Broadcom STB (Flattened Device Tree)
> [ 0.530368] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
> (show_stack+0x10/0x14)
> [ 0.538573] [<c020b294>] (show_stack) from [<c04bc454>]
> (dump_stack+0x90/0xa4)
> [ 0.546195] [<c04bc454>] (dump_stack) from [<c020e984>]
> (handle_IPI+0x170/0x190)
> [ 0.554050] [<c020e984>] (handle_IPI) from [<c020144c>]
> (gic_handle_irq+0x88/0x8c)
> [ 0.562096] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
> (__irq_svc+0x58/0x74)
> [ 0.570044] Exception stack(0xee057f68 to 0xee057fb0)
> [ 0.575465] 7f60: 00000001 00000000 ee057fc0
> c0219b60 ee056000 c1603cc8
> [ 0.584145] 7f80: c1603c6c 00000000 00000000 c1486188 ee057fc0
> c1603cd4 c1483408 ee057fb8
> [ 0.592806] 7fa0: c0208a40 c0208a44 60000013 ffffffff
> [ 0.598220] [<c020bd78>] (__irq_svc) from [<c0208a44>]
> (arch_cpu_idle+0x38/0x3c)
> [ 0.606103] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
> (do_idle+0x168/0x204)
> [ 0.613960] [<c0255e98>] (do_idle) from [<c02561ac>]
> (cpu_startup_entry+0x18/0x1c)
> [ 0.621990] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
> [ 0.629201] ---[ end Kernel panic - not syncing: rtnetlink_init:
> cannot initialize rtnetlink
> [ 0.629201]
>
--
Florian
^ permalink raw reply related
* Re: ipsec doesn't route TCP with 4.11 kernel
From: Don Bowman @ 2017-04-27 22:15 UTC (permalink / raw)
To: Steffen Klassert
Cc: Cong Wang, linux-kernel@vger.kernel.org, Herbert Xu,
Linux Kernel Network Developers
In-Reply-To: <20170427084238.GX2649@secunet.com>
On 27 April 2017 at 04:42, Steffen Klassert
<steffen.klassert@secunet.com> wrote:
> On Wed, Apr 26, 2017 at 10:01:34PM -0700, Cong Wang wrote:
>> (Cc'ing netdev and IPSec maintainers)
>>
>> On Tue, Apr 25, 2017 at 6:08 PM, Don Bowman <db@donbowman.ca> wrote:
for 'esp' question, i have ' esp = aes256-sha256-modp1536!' is that
what you mean?
its nat-aware tunnel [from my desktop pc to my office]
root@office:~# ip -s x s
src 172.16.0.8 dst 64.7.137.180
proto esp spi 0x0d588366(223904614) reqid 1(0x00000001) mode tunnel
replay-window 0 seq 0x00000000 flag af-unspec (0x00100000)
auth-trunc hmac(sha256)
0x046cafdf19c5d78d1c29165d96a0b9fce1c500029d77be0fe956dce1bf80a86a
(256 bits) 128
enc cbc(aes)
0x79ff2fbc2178eb468de6ff16612f0603b514a1d1d5f375c67222294463ec7c62
(256 bits)
encap type espinudp sport 4500 dport 4500 addr 0.0.0.0
anti-replay context: seq 0x0, oseq 0x28, bitmap 0x00000000
lifetime config:
limit: soft (INF)(bytes), hard (INF)(bytes)
limit: soft (INF)(packets), hard (INF)(packets)
expire add: soft 42843(sec), hard 43200(sec)
expire use: soft 0(sec), hard 0(sec)
lifetime current:
2986(bytes), 40(packets)
add 2017-04-27 18:08:12 use 2017-04-27 18:08:12
stats:
replay-window 0 replay 0 failed 0
src 64.7.137.180 dst 172.16.0.8
proto esp spi 0xcd366c03(3442895875) reqid 1(0x00000001) mode tunnel
replay-window 32 seq 0x00000000 flag af-unspec (0x00100000)
auth-trunc hmac(sha256)
0x4158741cc971c49417d60165f19ed02249385c7bba808927d4a9e7b45fb30c5b
(256 bits) 128
enc cbc(aes)
0x77592c79c964787bca5012214b85172b06deb7b3f06aac02e3934dd9ead67c15
(256 bits)
encap type espinudp sport 4500 dport 4500 addr 0.0.0.0
anti-replay context: seq 0x27, oseq 0x0, bitmap 0xffffffff
lifetime config:
limit: soft (INF)(bytes), hard (INF)(bytes)
limit: soft (INF)(packets), hard (INF)(packets)
expire add: soft 42873(sec), hard 43200(sec)
expire use: soft 0(sec), hard 0(sec)
lifetime current:
4501(bytes), 38(packets)
add 2017-04-27 18:08:12 use 2017-04-27 18:08:12
stats:
replay-window 0 replay 0 failed 0
>> >
>> > My ipsec tunnel comes up ok.
>
> When talking about IPsec, I guess you use ESP, right?
...
>
> If it is a GRO issue, then it is on the receive side, could you do
> tcpdump on the receiving interface to see what you get there?
I'm not sure what you mean the receiving interface, you mean the
outer, the native interface?
listening on eno1, link-type EN10MB (Ethernet), capture size 262144 bytes
18:11:32.061501 IP 172.16.0.8.3416 > 64.7.137.180.33638: truncated-udplength 0
18:11:32.788091 IP 64.7.137.180.4500 > 172.16.0.8.4500: NONESP-encap:
isakmp: child_sa inf2
18:11:32.788354 IP 172.16.0.8.4500 > 64.7.137.180.4500: NONESP-encap:
isakmp: child_sa inf2[IR]
18:11:33.066830 IP 172.16.0.8.3416 > 64.7.137.180.33638: truncated-udplength 0
18:11:35.082839 IP 172.16.0.8.3416 > 64.7.137.180.33638: truncated-udplength 0
18:11:37.807945 IP 64.7.137.180.4500 > 172.16.0.8.4500: NONESP-encap:
isakmp: child_sa inf2
18:11:37.808300 IP 172.16.0.8.4500 > 64.7.137.180.4500: NONESP-encap:
isakmp: child_sa inf2[IR]
is what i see there for the 'curl' command that doesn't complete.
>
> What shows /proc/net/xfrm_stat?
root@office:~# cat /proc/net/xfrm_stat
XfrmInError 0
XfrmInBufferError 0
XfrmInHdrError 0
XfrmInNoStates 0
XfrmInStateProtoError 0
XfrmInStateModeError 0
XfrmInStateSeqError 0
XfrmInStateExpired 0
XfrmInStateMismatch 0
XfrmInStateInvalid 0
XfrmInTmplMismatch 0
XfrmInNoPols 0
XfrmInPolBlock 0
XfrmInPolError 0
XfrmOutError 0
XfrmOutBundleGenError 0
XfrmOutBundleCheckError 0
XfrmOutNoStates 0
XfrmOutStateProtoError 0
XfrmOutStateModeError 0
XfrmOutStateSeqError 0
XfrmOutStateExpired 0
XfrmOutPolBlock 0
XfrmOutPolDead 0
XfrmOutPolError 0
XfrmFwdHdrError 0
XfrmOutStateInvalid 0
XfrmAcquireError 0
>
> Can you do 'ip -s x s' to see if the SAs are used?
>
> Do you have INET_ESP_OFFLOAD enabled?
>
CONFIG_INET_ESP=m
CONFIG_INET_ESP_OFFLOAD=m
CONFIG_INET6_ESP=m
CONFIG_INET6_ESP_OFFLOAD=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
# lsmod |grep esp
esp4 20480 2
xfrm_algo 16384 5 xfrm_user,esp4,ah4,af_key,xfrm_ipcomp
^ permalink raw reply
* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Jason Gunthorpe @ 2017-04-27 22:11 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Boris Ostrovsky, linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
target-devel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, James E.J. Bottomley,
linux-scsi-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sumit Semwal,
open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
linux-media-u79uwXL29TY76Z2rM5mHXA, Juergen Gross, Julien Grall,
intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA,
linux-raid-u79uwXL29TY76Z2rM5mHXA,
megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w, Jens Axboe,
Martin K. Petersen, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-mmc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-crypto-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <02ba3c7b-5fab-a06c-fbbf-c3be1c0fae1b-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote:
> On 27/04/17 02:53 PM, Jason Gunthorpe wrote:
> > blkfront is one of the drivers I looked at, and it appears to only be
> > memcpying with the bvec_data pointer, so I wonder why it does not use
> > sg_copy_X_buffer instead..
>
> But you'd potentially end up calling sg_copy_to_buffer multiple times
> per page within the sg (given that gnttab_foreach_grant_in_range might
> call blkif_copy_from_grant/blkif_setup_rw_req_grant multiple times).
> Even calling sg_copy_to_buffer once per page seems rather inefficient as
> it uses sg_miter internally.
Well, that is in the current form, with more users it would make sense
to optimize for the single page case, eg by providing the existing
call, providing a faster single-page-only variant of the copy, perhaps
even one that is inlined.
> Switching the for_each_sg to sg_miter is probably the nicer solution as
> it takes care of the mapping and the offset/length accounting for you
> and will have similar performance.
sg_miter will still fail when the sg contains __iomem, however I would
expect that the sg_copy will work with iomem, by using the __iomem
memcpy variant.
So, sg_copy should always be preferred in this new world with mixed
__iomem since it is the only primitive that can transparently handle
it.
Jason
^ permalink raw reply
* Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
From: Logan Gunthorpe @ 2017-04-27 21:53 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Boris Ostrovsky, linux-nvdimm-y27Ovi1pjclAfugRpC6u6w,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
target-devel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, James E.J. Bottomley,
linux-scsi-u79uwXL29TY76Z2rM5mHXA, Matthew Wilcox,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, Sumit Semwal,
open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
linux-media-u79uwXL29TY76Z2rM5mHXA, Juergen Gross, Julien Grall,
intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA,
linux-raid-u79uwXL29TY76Z2rM5mHXA,
megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w, Jens Axboe,
Martin K. Petersen, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-mmc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-crypto-u79uwXL29TY76Z2rM5mHXA, Greg Kroah-Hartman
In-Reply-To: <20170427205339.GB26330-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On 27/04/17 02:53 PM, Jason Gunthorpe wrote:
> blkfront is one of the drivers I looked at, and it appears to only be
> memcpying with the bvec_data pointer, so I wonder why it does not use
> sg_copy_X_buffer instead..
Yes, sort of...
But you'd potentially end up calling sg_copy_to_buffer multiple times
per page within the sg (given that gnttab_foreach_grant_in_range might
call blkif_copy_from_grant/blkif_setup_rw_req_grant multiple times).
Even calling sg_copy_to_buffer once per page seems rather inefficient as
it uses sg_miter internally.
Switching the for_each_sg to sg_miter is probably the nicer solution as
it takes care of the mapping and the offset/length accounting for you
and will have similar performance.
But, yes, if performance is not an issue, switching it to
sg_copy_to_buffer would be a less invasive change than sg_miter. Which
the same might be said about a lot of these cases.
Unfortunately, changing from kmap_atomic (which is a null operation in a
lot of cases) to sg_copy_X_buffer is a pretty big performance hit.
Logan
^ permalink raw reply
* [PATCH net-next] net: Initialise init_net.count to 1
From: David Howells @ 2017-04-27 21:40 UTC (permalink / raw)
To: netdev; +Cc: dhowells, linux-afs, linux-kernel
Initialise init_net.count to 1 for its pointer from init_nsproxy lest
someone tries to do a get_net() and a put_net() in a process in which
current->ns_proxy->net_ns points to the initial network namespace.
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/core/net_namespace.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 652468ff65b7..adb97ca141b7 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -35,7 +35,8 @@ LIST_HEAD(net_namespace_list);
EXPORT_SYMBOL_GPL(net_namespace_list);
struct net init_net = {
- .dev_base_head = LIST_HEAD_INIT(init_net.dev_base_head),
+ .count = ATOMIC_INIT(1),
+ .dev_base_head = LIST_HEAD_INIT(init_net.dev_base_head),
};
EXPORT_SYMBOL(init_net);
^ permalink raw reply related
* [PATCH net-next] geneve: fix incorrect setting of UDP checksum flag
From: Girish Moodalbail @ 2017-04-27 21:11 UTC (permalink / raw)
To: davem; +Cc: netdev, pshelar
Creating a geneve link with 'udpcsum' set results in a creation of link
for which UDP checksum will NOT be computed on outbound packets, as can
be seen below.
11: gen0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
link/ether c2:85:27:b6:b4:15 brd ff:ff:ff:ff:ff:ff promiscuity 0
geneve id 200 remote 192.168.13.1 dstport 6081 noudpcsum
Similarly, creating a link with 'noudpcsum' set results in a creation
of link for which UDP checksum will be computed on outbound packets.
Fixes: 9b4437a5b870 ("geneve: Unify LWT and netdev handling.")
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
---
drivers/net/geneve.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 7074b40..dec5d56 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1244,7 +1244,7 @@ static int geneve_newlink(struct net *net, struct net_device *dev,
metadata = true;
if (data[IFLA_GENEVE_UDP_CSUM] &&
- !nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
+ nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
info.key.tun_flags |= TUNNEL_CSUM;
if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] &&
--
1.8.3.1
^ permalink raw reply related
* Re: pull-request: wireless-drivers-next 2017-04-27
From: David Miller @ 2017-04-27 21:16 UTC (permalink / raw)
To: kvalo; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <87a872xk4u.fsf@kamboji.qca.qualcomm.com>
From: Kalle Valo <kvalo@codeaurora.org>
Date: Thu, 27 Apr 2017 12:41:37 +0300
> here's a pull request for net-next, more info in the tag below. This
> should be the last pull request to net-next for 4.12. Please let me know
> if there are any problems.
Pulled, thanks Kalle.
^ permalink raw reply
* Re: rhashtable - Cap total number of entries to 2^31
From: Florian Fainelli @ 2017-04-27 21:16 UTC (permalink / raw)
To: Herbert Xu, David Miller; +Cc: fw, netdev, Thomas Graf
In-Reply-To: <20170427054451.GA529@gondor.apana.org.au>
Hi Herbert,
On 04/26/2017 10:44 PM, Herbert Xu wrote:
> On Tue, Apr 25, 2017 at 10:48:22AM -0400, David Miller wrote:
>> From: Florian Westphal <fw@strlen.de>
>> Date: Tue, 25 Apr 2017 16:17:49 +0200
>>
>>> I'd have less of an issue with this if we'd be talking about
>>> something computationally expensive, but this is about storing
>>> an extra value inside a struct just to avoid one "shr" in insert path...
>>
>> Agreed, this shift is probably filling an available cpu cycle :-)
>
> OK, but we need to have an extra field for another reason anyway.
> The problem is that we're not capping the total number of elements
> in the hashtable when max_size is not set, this means that nelems
> can overflow which will cause havoc with the automatic shrinking
> when it tries to fit 2^32 entries into a minimum-sized table.
>
> So I'm taking that hole back for now :)
>
> ---8<---
> When max_size is not set or if it set to a sufficiently large
> value, the nelems counter can overflow. This would cause havoc
> with the automatic shrinking as it would then attempt to fit a
> huge number of entries into a tiny hash table.
>
> This patch fixes this by adding max_elems to struct rhashtable
> to cap the number of elements. This is set to 2^31 as nelems is
> not a precise count. This is sufficiently smaller than UINT_MAX
> that it should be safe.
>
> When max_size is set max_elems will be lowered to at most twice
> max_size as is the status quo.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This commit:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=6d684e54690caef45cf14051ddeb7c71beeb681b
makes my ARMv7 (32-bit) system panic on boot with the log below. I can
test net-next (or net) and report back if you want me to test anything.
Thanks!
[ 0.158619] futex hash table entries: 1024 (order: 4, 65536 bytes)
[ 0.166386] NET: Registered protocol family 16
[ 0.179596] Kernel panic - not syncing: rtnetlink_init: cannot
initialize rtnetlink
[ 0.179596]
[ 0.189350] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
4.11.0-rc8-02028-g6d684e54690c #37
[ 0.197908] Hardware name: Broadcom STB (Flattened Device Tree)
[ 0.204254] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
(show_stack+0x10/0x14)
[ 0.212447] [<c020b294>] (show_stack) from [<c04bc454>]
(dump_stack+0x90/0xa4)
[ 0.220144] [<c04bc454>] (dump_stack) from [<c02ab684>]
(panic+0xf0/0x270)
[ 0.227460] [<c02ab684>] (panic) from [<c0c2705c>]
(rtnetlink_init+0x24/0x1d4)
[ 0.235145] [<c0c2705c>] (rtnetlink_init) from [<c0c27630>]
(netlink_proto_init+0x124/0x148)
[ 0.244124] [<c0c27630>] (netlink_proto_init) from [<c02017f8>]
(do_one_initcall+0x40/0x168)
[ 0.253072] [<c02017f8>] (do_one_initcall) from [<c0c00dfc>]
(kernel_init_freeable+0x164/0x200)
[ 0.262304] [<c0c00dfc>] (kernel_init_freeable) from [<c087bfd8>]
(kernel_init+0x8/0x110)
[ 0.270970] [<c087bfd8>] (kernel_init) from [<c0207fa8>]
(ret_from_fork+0x14/0x2c)
[ 0.279014] CPU1: stopping
[ 0.281916] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
4.11.0-rc8-02028-g6d684e54690c #37
[ 0.290499] Hardware name: Broadcom STB (Flattened Device Tree)
[ 0.296796] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
(show_stack+0x10/0x14)
[ 0.305018] [<c020b294>] (show_stack) from [<c04bc454>]
(dump_stack+0x90/0xa4)
[ 0.312684] [<c04bc454>] (dump_stack) from [<c020e984>]
(handle_IPI+0x170/0x190)
[ 0.320531] [<c020e984>] (handle_IPI) from [<c020144c>]
(gic_handle_irq+0x88/0x8c)
[ 0.328586] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
(__irq_svc+0x58/0x74)
[ 0.336543] Exception stack(0xee055f68 to 0xee055fb0)
[ 0.341938] 5f60: 00000001 00000000 ee055fc0
c0219b60 ee054000 c1603cc8
[ 0.350661] 5f80: c1603c6c 00000000 00000000 c1486188 ee055fc0
c1603cd4 c1483408 ee055fb8
[ 0.359323] 5fa0: c0208a40 c0208a44 60000013 ffffffff
[ 0.364745] [<c020bd78>] (__irq_svc) from [<c0208a44>]
(arch_cpu_idle+0x38/0x3c)
[ 0.372613] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
(do_idle+0x168/0x204)
[ 0.380479] [<c0255e98>] (do_idle) from [<c02561ac>]
(cpu_startup_entry+0x18/0x1c)
[ 0.388493] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
[ 0.395687] CPU3: stopping
[ 0.398606] CPU: 3 PID: 0 Comm: swapper/3 Not tainted
4.11.0-rc8-02028-g6d684e54690c #37
[ 0.407242] Hardware name: Broadcom STB (Flattened Device Tree)
[ 0.413564] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
(show_stack+0x10/0x14)
[ 0.421795] [<c020b294>] (show_stack) from [<c04bc454>]
(dump_stack+0x90/0xa4)
[ 0.429495] [<c04bc454>] (dump_stack) from [<c020e984>]
(handle_IPI+0x170/0x190)
[ 0.437394] [<c020e984>] (handle_IPI) from [<c020144c>]
(gic_handle_irq+0x88/0x8c)
[ 0.445475] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
(__irq_svc+0x58/0x74)
[ 0.453406] Exception stack(0xee059f68 to 0xee059fb0)
[ 0.458792] 9f60: 00000001 00000000 ee059fc0
c0219b60 ee058000 c1603cc8
[ 0.467489] 9f80: c1603c6c 00000000 00000000 c1486188 ee059fc0
c1603cd4 c1483408 ee059fb8
[ 0.476177] 9fa0: c0208a40 c0208a44 60000013 ffffffff
[ 0.481581] [<c020bd78>] (__irq_svc) from [<c0208a44>]
(arch_cpu_idle+0x38/0x3c)
[ 0.489474] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
(do_idle+0x168/0x204)
[ 0.497331] [<c0255e98>] (do_idle) from [<c02561ac>]
(cpu_startup_entry+0x18/0x1c)
[ 0.505369] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
[ 0.512562] CPU2: stopping
[ 0.515463] CPU: 2 PID: 0 Comm: swapper/2 Not tainted
4.11.0-rc8-02028-g6d684e54690c #37
[ 0.524047] Hardware name: Broadcom STB (Flattened Device Tree)
[ 0.530368] [<c020fa18>] (unwind_backtrace) from [<c020b294>]
(show_stack+0x10/0x14)
[ 0.538573] [<c020b294>] (show_stack) from [<c04bc454>]
(dump_stack+0x90/0xa4)
[ 0.546195] [<c04bc454>] (dump_stack) from [<c020e984>]
(handle_IPI+0x170/0x190)
[ 0.554050] [<c020e984>] (handle_IPI) from [<c020144c>]
(gic_handle_irq+0x88/0x8c)
[ 0.562096] [<c020144c>] (gic_handle_irq) from [<c020bd78>]
(__irq_svc+0x58/0x74)
[ 0.570044] Exception stack(0xee057f68 to 0xee057fb0)
[ 0.575465] 7f60: 00000001 00000000 ee057fc0
c0219b60 ee056000 c1603cc8
[ 0.584145] 7f80: c1603c6c 00000000 00000000 c1486188 ee057fc0
c1603cd4 c1483408 ee057fb8
[ 0.592806] 7fa0: c0208a40 c0208a44 60000013 ffffffff
[ 0.598220] [<c020bd78>] (__irq_svc) from [<c0208a44>]
(arch_cpu_idle+0x38/0x3c)
[ 0.606103] [<c0208a44>] (arch_cpu_idle) from [<c0255e98>]
(do_idle+0x168/0x204)
[ 0.613960] [<c0255e98>] (do_idle) from [<c02561ac>]
(cpu_startup_entry+0x18/0x1c)
[ 0.621990] [<c02561ac>] (cpu_startup_entry) from [<002014ec>] (0x2014ec)
[ 0.629201] ---[ end Kernel panic - not syncing: rtnetlink_init:
cannot initialize rtnetlink
[ 0.629201]
--
Florian
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox