* Re: [PATCH 0/3] Use refcount_t for refcount
From: Leon Romanovsky @ 2019-08-05 5:57 UTC (permalink / raw)
To: Chuhong Yuan
Cc: Saeed Mahameed, David S . Miller, Doug Ledford, Jason Gunthorpe,
linux-rdma, Netdev, LKML
In-Reply-To: <CANhBUQ0rMKHmh4ibktwRmVN6NU=HAjs-Q7PrF9yX5x5yOyOB2A@mail.gmail.com>
On Sun, Aug 04, 2019 at 10:58:19PM +0800, Chuhong Yuan wrote:
> On Sun, Aug 4, 2019 at 8:48 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Fri, Aug 02, 2019 at 08:10:35PM +0800, Chuhong Yuan wrote:
> > > Reference counters are preferred to use refcount_t instead of
> > > atomic_t.
> > > This is because the implementation of refcount_t can prevent
> > > overflows and detect possible use-after-free.
> > >
> > > First convert the refcount field to refcount_t in mlx5/driver.h.
> > > Then convert the uses to refcount_() APIs.
> >
> > You can't do it, because you need to ensure that driver compiles and
> > works between patches. By converting driver.h alone to refcount_t, you
> > simply broke mlx5 driver.
> >
>
> It is my fault... I am not clear how to send patches which cross
> several subsystems, so I sent them in series.
> Maybe I should merge these patches together?
In case of conversion patches, yes, you need to perform such change
in one shot.
>
>
> > NAK, to be clear.
> >
> > And please don't sent series of patches as standalone patches.
> >
>
> Due to the reason mentioned above, I sent them seperately.
Create patch, run ./scripts/get_maintainer.pl on it and send according
to generated output. You are not doing kernel core changes and there is
no need to worry about cross subsystem complexity as long as you will
put relevant maintainers in TO: field.
Thanks
>
> > Thanks,
^ permalink raw reply
* Re: [PATCH net-next 3/6] cls_api: add flow_indr_block_call function
From: Jiri Pirko @ 2019-08-05 6:02 UTC (permalink / raw)
To: wenxu; +Cc: pablo, fw, jakub.kicinski, netfilter-devel, netdev
In-Reply-To: <1564628627-10021-4-git-send-email-wenxu@ucloud.cn>
Re subject. You don't have "v5" in this patch. I don't understand how
that happened. Do you use --subject-prefix in git-format-patch?
^ permalink raw reply
* Re: [PATCH v2] net/mlx5e: Use refcount_t for refcount
From: Leon Romanovsky @ 2019-08-05 6:13 UTC (permalink / raw)
To: Chuhong Yuan; +Cc: Saeed Mahameed, David S . Miller, Netdev, linux-rdma, LKML
In-Reply-To: <CANhBUQ2H5MU0m2xM0AkJGPf7+MJBZ3Eq5rR0kgeOoKRi4q1j6Q@mail.gmail.com>
On Sun, Aug 04, 2019 at 10:44:47PM +0800, Chuhong Yuan wrote:
> On Sun, Aug 4, 2019 at 8:59 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Sat, Aug 03, 2019 at 12:48:28AM +0800, Chuhong Yuan wrote:
> > > refcount_t is better for reference counters since its
> > > implementation can prevent overflows.
> > > So convert atomic_t ref counters to refcount_t.
> >
> > I'm not thrilled to see those automatic conversion patches, especially
> > for flows which can't overflow. There is nothing wrong in using atomic_t
> > type of variable, do you have in mind flow which will cause to overflow?
> >
> > Thanks
>
> I have to say that these patches are not done automatically...
> Only the detection of problems is done by a script.
> All conversions are done manually.
Even worse, you need to audit usage of atomic_t and replace there
it can overflow.
>
> I am not sure whether the flow can cause an overflow.
It can't.
> But I think it is hard to ensure that a data path is impossible
> to have problems in any cases including being attacked.
It is not data path, and I doubt that such conversion will be allowed
in data paths without proving that no performance regression is introduced.
>
> So I think it is better to do this minor revision to prevent
> potential risk, just like we have done in mlx5/core/cq.c.
mlx5/core/cq.c is a different beast, refcount there means actual users
of CQ which are limited in SW, so in theory, they have potential
to be overflown.
It is not the case here, there your are adding new port.
There is nothing wrong with atomic_t.
Thanks
>
> Regards,
> Chuhong
>
> > >
> > > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > > ---
> > > Changes in v2:
> > > - Add #include.
> > >
> > > drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c | 9 +++++----
> > > 1 file changed, 5 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > index b9d4f4e19ff9..148b55c3db7a 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > @@ -32,6 +32,7 @@
> > >
> > > #include <linux/kernel.h>
> > > #include <linux/module.h>
> > > +#include <linux/refcount.h>
> > > #include <linux/mlx5/driver.h>
> > > #include <net/vxlan.h>
> > > #include "mlx5_core.h"
> > > @@ -48,7 +49,7 @@ struct mlx5_vxlan {
> > >
> > > struct mlx5_vxlan_port {
> > > struct hlist_node hlist;
> > > - atomic_t refcount;
> > > + refcount_t refcount;
> > > u16 udp_port;
> > > };
> > >
> > > @@ -113,7 +114,7 @@ int mlx5_vxlan_add_port(struct mlx5_vxlan *vxlan, u16 port)
> > >
> > > vxlanp = mlx5_vxlan_lookup_port(vxlan, port);
> > > if (vxlanp) {
> > > - atomic_inc(&vxlanp->refcount);
> > > + refcount_inc(&vxlanp->refcount);
> > > return 0;
> > > }
> > >
> > > @@ -137,7 +138,7 @@ int mlx5_vxlan_add_port(struct mlx5_vxlan *vxlan, u16 port)
> > > }
> > >
> > > vxlanp->udp_port = port;
> > > - atomic_set(&vxlanp->refcount, 1);
> > > + refcount_set(&vxlanp->refcount, 1);
> > >
> > > spin_lock_bh(&vxlan->lock);
> > > hash_add(vxlan->htable, &vxlanp->hlist, port);
> > > @@ -170,7 +171,7 @@ int mlx5_vxlan_del_port(struct mlx5_vxlan *vxlan, u16 port)
> > > goto out_unlock;
> > > }
> > >
> > > - if (atomic_dec_and_test(&vxlanp->refcount)) {
> > > + if (refcount_dec_and_test(&vxlanp->refcount)) {
> > > hash_del(&vxlanp->hlist);
> > > remove = true;
> > > }
> > > --
> > > 2.20.1
> > >
^ permalink raw reply
* Re: [RFC PATCH 1/2] dt-bindings: net: macb: Add new property for PS SGMII only
From: Harini Katakam @ 2019-08-05 6:15 UTC (permalink / raw)
To: Andrew Lunn
Cc: Harini Katakam, Nicolas Ferre, David Miller, Claudiu Beznea,
Rob Herring, Mark Rutland, netdev, linux-kernel, Michal Simek,
devicetree
In-Reply-To: <20190804145633.GB6800@lunn.ch>
Hi Andrew,
On Sun, Aug 4, 2019 at 8:26 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Wed, Jul 31, 2019 at 03:10:32PM +0530, Harini Katakam wrote:
> > Add a new property to indicate when PS SGMII is used with NO
> > external PHY on board.
>
> Hi Harini
>
> What exactly is you use case? Are you connecting to a Ethernet switch?
> To an SFP cage with a copper module?
Yes, an SFP cage is the common HW target for this patch. Essentially, there
is no external PHY driver that the macb can "connect" to; if there was an
external PHY on board, its link status would also indicate when the GEM(PCS)
to external PHY link was down. But in the absence of an external PHY on
HW, PCS link status needs to be monitored and reported to users.
Regards,
Harini
^ permalink raw reply
* Re: [PATCH net-next 3/6] cls_api: add flow_indr_block_call function
From: wenxu @ 2019-08-05 6:26 UTC (permalink / raw)
To: Jiri Pirko; +Cc: pablo, fw, jakub.kicinski, netfilter-devel, netdev
In-Reply-To: <20190805060238.GB2349@nanopsycho.orion>
v5 contain this patch but with non-version tag,
I used --subject-prefix in git-format-patch. I am sorry to make a mistake when modify the
commit log. So should I repost the v6?
On 8/5/2019 2:02 PM, Jiri Pirko wrote:
> Re subject. You don't have "v5" in this patch. I don't understand how
> that happened. Do you use --subject-prefix in git-format-patch?
>
^ permalink raw reply
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-05 6:28 UTC (permalink / raw)
To: Jason Wang
Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
linux-mm
In-Reply-To: <6c3a0a1c-ce87-907b-7bc8-ec41bf9056d8@redhat.com>
On Mon, Aug 05, 2019 at 12:33:45PM +0800, Jason Wang wrote:
>
> On 2019/8/2 下午10:03, Michael S. Tsirkin wrote:
> > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > Btw, I come up another idea, that is to disable preemption when vhost thread
> > > need to access the memory. Then register preempt notifier and if vhost
> > > thread is preempted, we're sure no one will access the memory and can do the
> > > cleanup.
> > Great, more notifiers :(
> >
> > Maybe can live with
> > 1- disable preemption while using the cached pointer
> > 2- teach vhost to recover from memory access failures,
> > by switching to regular from/to user path
>
>
> I don't get this, I believe we want to recover from regular from/to user
> path, isn't it?
That (disable copy to/from user completely) would be a nice to have
since it would reduce the attack surface of the driver, but e.g. your
code already doesn't do that.
>
> >
> > So if you want to try that, fine since it's a step in
> > the right direction.
> >
> > But I think fundamentally it's not what we want to do long term.
>
>
> Yes.
>
>
> >
> > It's always been a fundamental problem with this patch series that only
> > metadata is accessed through a direct pointer.
> >
> > The difference in ways you handle metadata and data is what is
> > now coming and messing everything up.
>
>
> I do propose soemthing like this in the past:
> https://www.spinics.net/lists/linux-virtualization/msg36824.html. But looks
> like you have some concern about its locality.
Right and it doesn't go away. You'll need to come up
with a test that messes it up and triggers a worst-case
scenario, so we can measure how bad is that worst-case.
> But the problem still there, GUP can do page fault, so still need to
> synchronize it with MMU notifiers.
I think the idea was, if GUP would need a pagefault, don't
do a GUP and do to/from user instead. Hopefully that
will fault the page in and the next access will go through.
> The solution might be something like
> moving GUP to a dedicated kind of vhost work.
Right, generally GUP.
>
> >
> > So if continuing the direct map approach,
> > what is needed is a cache of mapped VM memory, then on a cache miss
> > we'd queue work along the lines of 1-2 above.
> >
> > That's one direction to take. Another one is to give up on that and
> > write our own version of uaccess macros. Add a "high security" flag to
> > the vhost module and if not active use these for userspace memory
> > access.
>
>
> Or using SET_BACKEND_FEATURES?
No, I don't think it's considered best practice to allow unpriveledged
userspace control over whether kernel enables security features.
> But do you mean permanent GUP as I did in
> original RFC https://lkml.org/lkml/2018/12/13/218?
>
> Thanks
Permanent GUP breaks THP and NUMA.
> >
> >
^ permalink raw reply
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-05 6:30 UTC (permalink / raw)
To: Jason Wang
Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
linux-mm
In-Reply-To: <e8ecb811-6653-cff4-bc11-81f4ccb0dbbf@redhat.com>
On Mon, Aug 05, 2019 at 12:36:40PM +0800, Jason Wang wrote:
>
> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
> > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > > synchronize_rcu.
> > > >
> > > > I start with synchronize_rcu() but both you and Michael raise some
> > > > concern.
> > > I've also idly wondered if calling synchronize_rcu() under the various
> > > mm locks is a deadlock situation.
> > >
> > > > Then I try spinlock and mutex:
> > > >
> > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> > > > improvement.
> > > I think the topic here is correctness not performance improvement
> > The topic is whether we should revert
> > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
> >
> > or keep it in. The only reason to keep it is performance.
>
>
> Maybe it's time to introduce the config option?
Depending on CONFIG_BROKEN? I'm not sure it's a good idea.
>
> >
> > Now as long as all this code is disabled anyway, we can experiment a
> > bit.
> >
> > I personally feel we would be best served by having two code paths:
> >
> > - Access to VM memory directly mapped into kernel
> > - Access to userspace
> >
> >
> > Having it all cleanly split will allow a bunch of optimizations, for
> > example for years now we planned to be able to process an incoming short
> > packet directly on softirq path, or an outgoing on directly within
> > eventfd.
>
>
> It's not hard consider we've already had our own accssors. But the question
> is (as asked in another thread), do you want permanent GUP or still use MMU
> notifiers.
>
> Thanks
We want THP and NUMA to work. Both are important for performance.
--
MST
^ permalink raw reply
* Re: [PATCH net-next v3] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-08-05 6:38 UTC (permalink / raw)
To: Andrew Lunn
Cc: Vladimir Oltean, Florian Fainelli, Heiner Kallweit,
David S . Miller, Arun Parameswaran, Justin Chen, netdev, lkml,
openbmc@lists.ozlabs.org
In-Reply-To: <20190804145152.GA6800@lunn.ch>
Hi Andrew,
On 8/4/19 7:51 AM, Andrew Lunn wrote:
>>> The patchset looks better now. But is it ok, I wonder, to keep
>>> PHY_BCM_FLAGS_MODE_1000BX in phydev->dev_flags, considering that
>>> phy_attach_direct is overwriting it?
>>
>
>> I checked ftgmac100 driver (used on my machine) and it calls
>> phy_connect_direct which passes phydev->dev_flags when calling
>> phy_attach_direct: that explains why the flag is not cleared in my
>> case.
>
> Yes, that is the way it is intended to be used. The MAC driver can
> pass flags to the PHY. It is a fragile API, since the MAC needs to
> know what PHY is being used, since the flags are driver specific.
>
> One option would be to modify the assignment in phy_attach_direct() to
> OR in the flags passed to it with flags which are already in
> phydev->dev_flags.
It sounds like a reasonable fix/enhancement to replace overriding with OR, no matter which direction we are going to (either adding 1000bx aneg in genphy or providing phy-specific aneg callback).
Do you want me to send out the patch (I feel it's better to be in a separate patch?) or someone else will take care of it?
Thanks,
Tao
^ permalink raw reply
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-05 6:40 UTC (permalink / raw)
To: Jason Wang
Cc: Jason Gunthorpe, linux-mm, netdev, linux-kernel, kvm,
virtualization
In-Reply-To: <494ac30d-b750-52c8-b927-16cd4b9414c4@redhat.com>
On Mon, Aug 05, 2019 at 12:41:45PM +0800, Jason Wang wrote:
>
> On 2019/8/5 下午12:36, Jason Wang wrote:
> >
> > On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
> > > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > > > synchronize_rcu.
> > > > >
> > > > > I start with synchronize_rcu() but both you and Michael raise some
> > > > > concern.
> > > > I've also idly wondered if calling synchronize_rcu() under the various
> > > > mm locks is a deadlock situation.
> > > >
> > > > > Then I try spinlock and mutex:
> > > > >
> > > > > 1) spinlock: add lots of overhead on datapath, this leads 0
> > > > > performance
> > > > > improvement.
> > > > I think the topic here is correctness not performance improvement
> > > The topic is whether we should revert
> > > commit 7f466032dc9 ("vhost: access vq metadata through kernel
> > > virtual address")
> > >
> > > or keep it in. The only reason to keep it is performance.
> >
> >
> > Maybe it's time to introduce the config option?
>
>
> Or does it make sense if I post a V3 with:
>
> - introduce config option and disable the optimization by default
>
> - switch from synchronize_rcu() to vhost_flush_work(), but the rest are the
> same
>
> This can give us some breath to decide which way should go for next release?
>
> Thanks
As is, with preempt enabled? Nope I don't think blocking an invalidator
on swap IO is ok, so I don't believe this stuff is going into this
release at this point.
So it's more a question of whether it's better to revert and apply a clean
patch on top, or just keep the code around but disabled with an ifdef as is.
I'm open to both options, and would like your opinion on this.
>
> >
> >
> > >
> > > Now as long as all this code is disabled anyway, we can experiment a
> > > bit.
> > >
> > > I personally feel we would be best served by having two code paths:
> > >
> > > - Access to VM memory directly mapped into kernel
> > > - Access to userspace
> > >
> > >
> > > Having it all cleanly split will allow a bunch of optimizations, for
> > > example for years now we planned to be able to process an incoming short
> > > packet directly on softirq path, or an outgoing on directly within
> > > eventfd.
> >
> >
> > It's not hard consider we've already had our own accssors. But the
> > question is (as asked in another thread), do you want permanent GUP or
> > still use MMU notifiers.
> >
> > Thanks
> >
> > _______________________________________________
> > Virtualization mailing list
> > Virtualization@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2] net/mlx5e: Use refcount_t for refcount
From: Chuhong Yuan @ 2019-08-05 6:55 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Saeed Mahameed, David S . Miller, Netdev, linux-rdma, LKML
In-Reply-To: <20190805061320.GN4832@mtr-leonro.mtl.com>
On Mon, Aug 5, 2019 at 2:13 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sun, Aug 04, 2019 at 10:44:47PM +0800, Chuhong Yuan wrote:
> > On Sun, Aug 4, 2019 at 8:59 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Sat, Aug 03, 2019 at 12:48:28AM +0800, Chuhong Yuan wrote:
> > > > refcount_t is better for reference counters since its
> > > > implementation can prevent overflows.
> > > > So convert atomic_t ref counters to refcount_t.
> > >
> > > I'm not thrilled to see those automatic conversion patches, especially
> > > for flows which can't overflow. There is nothing wrong in using atomic_t
> > > type of variable, do you have in mind flow which will cause to overflow?
> > >
> > > Thanks
> >
> > I have to say that these patches are not done automatically...
> > Only the detection of problems is done by a script.
> > All conversions are done manually.
>
> Even worse, you need to audit usage of atomic_t and replace there
> it can overflow.
>
> >
> > I am not sure whether the flow can cause an overflow.
>
> It can't.
>
> > But I think it is hard to ensure that a data path is impossible
> > to have problems in any cases including being attacked.
>
> It is not data path, and I doubt that such conversion will be allowed
> in data paths without proving that no performance regression is introduced.
>>
>
> > So I think it is better to do this minor revision to prevent
> > potential risk, just like we have done in mlx5/core/cq.c.
>
> mlx5/core/cq.c is a different beast, refcount there means actual users
> of CQ which are limited in SW, so in theory, they have potential
> to be overflown.
>
> It is not the case here, there your are adding new port.
> There is nothing wrong with atomic_t.
>
Thanks for your explanation!
I will pay attention to this point in similar cases.
But it seems that the semantic of refcount is not always as clear as here...
Regards,
Chuhong
> Thanks
>
> >
> > Regards,
> > Chuhong
> >
> > > >
> > > > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > > > ---
> > > > Changes in v2:
> > > > - Add #include.
> > > >
> > > > drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c | 9 +++++----
> > > > 1 file changed, 5 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > > index b9d4f4e19ff9..148b55c3db7a 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > > @@ -32,6 +32,7 @@
> > > >
> > > > #include <linux/kernel.h>
> > > > #include <linux/module.h>
> > > > +#include <linux/refcount.h>
> > > > #include <linux/mlx5/driver.h>
> > > > #include <net/vxlan.h>
> > > > #include "mlx5_core.h"
> > > > @@ -48,7 +49,7 @@ struct mlx5_vxlan {
> > > >
> > > > struct mlx5_vxlan_port {
> > > > struct hlist_node hlist;
> > > > - atomic_t refcount;
> > > > + refcount_t refcount;
> > > > u16 udp_port;
> > > > };
> > > >
> > > > @@ -113,7 +114,7 @@ int mlx5_vxlan_add_port(struct mlx5_vxlan *vxlan, u16 port)
> > > >
> > > > vxlanp = mlx5_vxlan_lookup_port(vxlan, port);
> > > > if (vxlanp) {
> > > > - atomic_inc(&vxlanp->refcount);
> > > > + refcount_inc(&vxlanp->refcount);
> > > > return 0;
> > > > }
> > > >
> > > > @@ -137,7 +138,7 @@ int mlx5_vxlan_add_port(struct mlx5_vxlan *vxlan, u16 port)
> > > > }
> > > >
> > > > vxlanp->udp_port = port;
> > > > - atomic_set(&vxlanp->refcount, 1);
> > > > + refcount_set(&vxlanp->refcount, 1);
> > > >
> > > > spin_lock_bh(&vxlan->lock);
> > > > hash_add(vxlan->htable, &vxlanp->hlist, port);
> > > > @@ -170,7 +171,7 @@ int mlx5_vxlan_del_port(struct mlx5_vxlan *vxlan, u16 port)
> > > > goto out_unlock;
> > > > }
> > > >
> > > > - if (atomic_dec_and_test(&vxlanp->refcount)) {
> > > > + if (refcount_dec_and_test(&vxlanp->refcount)) {
> > > > hash_del(&vxlanp->hlist);
> > > > remove = true;
> > > > }
> > > > --
> > > > 2.20.1
> > > >
^ permalink raw reply
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-08-05 7:36 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
LSM List
In-Reply-To: <CALCETrWtE2U4EvZVYeq8pSmQjBzF2PHH+KxYW8FSeF+W=1FYjw@mail.gmail.com>
Hi Andy,
> On Aug 4, 2019, at 10:47 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sun, Aug 4, 2019 at 5:08 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
>>>>
>>>> Hi Andy,
>>>>
>>>>> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
>>>>>> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
>>>>>
>>>>> It's been done before -- it's not that hard. IMO the main tricky bit
>>>>> would be try be somewhat careful about defining exactly what
>>>>> CAP_BPF_ADMIN does.
>>>>
>>>> Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
>>>> Plumbers conference.
>>>>
>>>> OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
>>>> like systemd to do sys_bpf() without root.
>>>
>>> I don't understand the use case here. Are you talking about systemd
>>> --user? As far as I know, a user is expected to be able to fully
>>> control their systemd --user process, so giving it unrestricted bpf
>>> access is very close to giving it superuser access, and this doesn't
>>> sound like a good idea. I think that, if systemd --user needs bpf(),
>>> it either needs real unprivileged bpf() or it needs a privileged
>>> helper (SUID or a daemon) to intermediate this access.
>>>
>>>>
>>>>>
>>>>>>> I don't see why you need to invent a whole new mechanism for this.
>>>>>>> The entire cgroup ecosystem outside bpf() does just fine using the
>>>>>>> write permission on files in cgroupfs to control access. Why can't
>>>>>>> bpf() do the same thing?
>>>>>>
>>>>>> It is easier to use write permission for BPF_PROG_ATTACH. But it is
>>>>>> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
>>>>>> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
>>>>>> we should have target concept for all these commands. But that is a
>>>>>> much bigger project. OTOH, "all or nothing" model allows all these
>>>>>> commands at once.
>>>>>
>>>>> For BPF_PROG_LOAD, I admit I've never understood why permission is
>>>>> required at all. I think that CAP_SYS_ADMIN or similar should be
>>>>> needed to get is_priv in the verifier, but I think that should mainly
>>>>> be useful for tracing, and that requires lots of privilege anyway.
>>>>> BPF_MAP_* is probably the trickiest part. One solution would be some
>>>>> kind of bpffs, but I'm sure other solutions are possible.
>>>>
>>>> Improving permission management of cgroup_bpf is another good topic to
>>>> discuss. However, it is also an overkill for current use case.
>>>>
>>>
>>> I looked at the code some more, and I don't think this is so hard
>>> after all. As I understand it, all of the map..by_id stuff is, to
>>> some extent, deprecated in favor of persistent maps. As I see it, the
>>> map..by_id calls should require privilege forever, although I can
>>> imagine ways to scope that privilege to a namespace if the maps
>>> themselves were to be scoped to a namespace.
>>>
>>> Instead, unprivileged tools would use the persistent map interface
>>> roughly like this:
>>>
>>> $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
>>> 8 entries 64 name mapname
>>>
>>> This would require that the caller have either CAP_DAC_OVERRIDE or
>>> that the caller have permission to create files in /sys/fs/bpf/my_dir
>>> (using the same rules as for any filesystem), and the resulting map
>>> would end up owned by the creating user and have mode 0600 (or maybe
>>> 0666, or maybe a new bpf_attr parameter) modified by umask. Then all
>>> the various capable() checks that are currently involved in accessing
>>> a persistent map would instead check FMODE_READ or FMODE_WRITE on the
>>> map file as appropriate.
>>>
>>> Half of this stuff already works. I just set my system up like this:
>>>
>>> $ ls -l /sys/fs/bpf
>>> total 0
>>> drwxr-xr-x. 3 luto luto 0 Aug 4 15:10 luto
>>>
>>> $ mkdir /sys/fs/bpf/luto/test
>>>
>>> $ ls -l /sys/fs/bpf/luto
>>> total 0
>>> drwxrwxr-x. 2 luto luto 0 Aug 4 15:10 test
>>>
>>> I bet that making the bpf() syscalls work appropriately in this
>>> context without privilege would only be a couple of hours of work.
>>> The hard work, creating bpffs and making it function, is already done
>>> :)
>>>
>>> P.S. The docs for bpftool create are less than fantastic. The
>>> complete lack of any error message at all when the syscall returns
>>> -EACCES is also not fantastic.
>>
>> This isn't remotely finished, but I spent a bit of time fiddling with this:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms
>>
>> What do you think? (It's obviously not done. It doesn't compile, and
>> I haven't gotten to the permissions needed to do map operations. I
>> also haven't touched the capable() checks.)
>
> I updated the branch. It compiles, and basic map functionality works!
Thanks a lot for trying this out. This is a very interesting direction
that we will explore.
>
> # mount -t bpf bpf /sys/fs/bpf
> # cd /sys/fs/bpf
> # mkdir luto
> # chown luto: luto
> # setpriv --euid=1000 --ruid=1000 bash
> $ pwd
> /sys/fs/bpf
> bash-5.0$ ls -l
> total 0
> drwxr-xr-x 2 luto luto 0 Aug 4 22:41 luto
> bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8
> value 8 entries 64 name mapname
> bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Found 0 elements
>
> # chown root: /sys/fs/bpf/luto/filename
>
> $ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Error: bpf obj get (/sys/fs/bpf/luto): Permission denied
>
> So I think it's possible to get a respectable subset of bpf()
> functionality working without privilege in short order :)
I think we have two key questions to answer:
1. What subset of bpf() functionality will the users need?
2. Who are the users?
Different answers to these two questions lead to different directions.
In our use case, the answers are
1) almost all bpf() functionality
2) highly trusted users (sudoers)
So our initial approach of /dev/bpf allows all bpf() functionality
in one bit in task_struct. (Yes, we can just sudo. But, we would
rather not use sudo when possible.)
"cgroup management" use case may have answers like:
1) cgroup_bpf only
2) users in their own containers
For this case, getting cgroup_bpf related features (cgroup_bpf progs;
some map types, etc.) work with unprivileged users would be the right
direction.
"USDT tracing" use case may have answers like:
1) uprobe, stockmap, histogram, etc.
2) unprivileged user, w/ or w/o containers
For this case, the first step is likely hacking sys_perf_event_open().
I guess we will need more discussions to decide how to make bpf()
work better for all these (and more) use cases.
Thanks,
Song
^ permalink raw reply
* Re: [PATCH ethtool] ethtool: dump nested registers
From: Michal Kubecek @ 2019-08-05 8:04 UTC (permalink / raw)
To: netdev; +Cc: Vivien Didelot, f.fainelli, andrew, davem, linville, cphealy
In-Reply-To: <20190802193455.17126-1-vivien.didelot@gmail.com>
On Fri, Aug 02, 2019 at 03:34:54PM -0400, Vivien Didelot wrote:
> Usually kernel drivers set the regs->len value to the same length as
> info->regdump_len, which was used for the allocation. In case where
> regs->len is smaller than the allocated info->regdump_len length,
> we may assume that the dump contains a nested set of registers.
>
> This becomes handy for kernel drivers to expose registers of an
> underlying network conduit unfortunately not exposed to userspace,
> as found in network switching equipment for example.
>
> This patch adds support for recursing into the dump operation if there
> is enough room for a nested ethtool_drvinfo structure containing a
> valid driver name, followed by a ethtool_regs structure like this:
>
> 0 regs->len info->regdump_len
> v v v
> +--------------+-----------------+--------------+-- - --+
> | ethtool_regs | ethtool_drvinfo | ethtool_regs | |
> +--------------+-----------------+--------------+-- - --+
>
> Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
> ---
I'm not sure about this approach. If these additional objects with their
own registers are represented by a network device, we can query their
registers directly. If they are not (which, IIUC, is the case in your
use case), we should use an appropriate interface. AFAIK the CPU ports
are already represented in devlink, shouldn't devlink be also used to
query their registers?
> ethtool.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/ethtool.c b/ethtool.c
> index 05fe05a08..c0e2903c5 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -1245,7 +1245,7 @@ static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
>
> if (gregs_dump_raw) {
> fwrite(regs->data, regs->len, 1, stdout);
> - return 0;
> + goto nested;
> }
>
> if (!gregs_dump_hex)
> @@ -1253,7 +1253,7 @@ static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
> if (!strncmp(driver_list[i].name, info->driver,
> ETHTOOL_BUSINFO_LEN)) {
> if (driver_list[i].func(info, regs) == 0)
> - return 0;
> + goto nested;
> /* This version (or some other
> * variation in the dump format) is
> * not handled; fall back to hex
> @@ -1263,6 +1263,15 @@ static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
>
> dump_hex(stdout, regs->data, regs->len, 0);
>
> +nested:
> + /* Recurse dump if some drvinfo and regs structures are nested */
> + if (info->regdump_len > regs->len + sizeof(*info) + sizeof(*regs)) {
> + info = (struct ethtool_drvinfo *)(®s->data[0] + regs->len);
> + regs = (struct ethtool_regs *)(®s->data[0] + regs->len + sizeof(*info));
> +
> + return dump_regs(gregs_dump_raw, gregs_dump_hex, info, regs);
> + }
> +
> return 0;
> }
>
For raw and hex dumps, this will dump only the payloads without any
metadata allowing to identify what are the additional blocks for the
other related objects, i.e. where they start, how long they are and what
they belong to. That doesn't seem very useful.
Michal Kubecek
^ permalink raw reply
* Re: [PATCH v3] net: sched: Fix a possible null-pointer dereference in dequeue_func()
From: Eric Dumazet @ 2019-08-05 8:05 UTC (permalink / raw)
To: Cong Wang, Jia-Ju Bai
Cc: Jamal Hadi Salim, Jiri Pirko, David Miller,
Linux Kernel Network Developers, LKML
In-Reply-To: <CAM_iQpU0L6hgzg1N9Ay=72Ee-2Ni-ovPJX8SyJaRDv7dbhZs_Q@mail.gmail.com>
On 7/29/19 7:23 PM, Cong Wang wrote:
> On Mon, Jul 29, 2019 at 1:24 AM Jia-Ju Bai <baijiaju1990@gmail.com> wrote:
>>
>> In dequeue_func(), there is an if statement on line 74 to check whether
>> skb is NULL:
>> if (skb)
>>
>> When skb is NULL, it is used on line 77:
>> prefetch(&skb->end);
>>
>> Thus, a possible null-pointer dereference may occur.
>>
>> To fix this bug, skb->end is used when skb is not NULL.
>>
>> This bug is found by a static analysis tool STCheck written by us.
>
> It doesn't dereference the pointer, it simply calculates the address
> and passes it to gcc builtin prefetch. Both are fine when it is NULL,
> as prefetching a NULL address should be okay for kernel.
>
> So your changelog is misleading and it doesn't fix any bug,
> although it does very slightly make the code better.
>
Original code was intentional.
A prefetch() need to be done as early as possible.
At the time of commit 76e3cc126bb223013a6b9a0e2a51238d1ef2e409
this was pretty clear :
+static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
+{
+ struct sk_buff *skb = __skb_dequeue(&sch->q);
+
+ prefetch(&skb->end); /* we'll need skb_shinfo() */
+ return skb;
+}
Really static analysis should learn about prefetch(X) being legal for _any_ value of X
^ permalink raw reply
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05 8:18 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg,
Paul E. McKenney
In-Reply-To: <20190803173825-mutt-send-email-mst@kernel.org>
On 2019/8/4 上午5:54, Michael S. Tsirkin wrote:
> On Thu, Aug 01, 2019 at 04:06:13AM -0400, Jason Wang wrote:
>> On 2019/8/1 上午2:29, Michael S. Tsirkin wrote:
>>> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
>>>> We used to use RCU to synchronize MMU notifier with worker. This leads
>>>> calling synchronize_rcu() in invalidate_range_start(). But on a busy
>>>> system, there would be many factors that may slow down the
>>>> synchronize_rcu() which makes it unsuitable to be called in MMU
>>>> notifier.
>>>>
>>>> A solution is SRCU but its overhead is obvious with the expensive full
>>>> memory barrier. Another choice is to use seqlock, but it doesn't
>>>> provide a synchronization method between readers and writers. The last
>>>> choice is to use vq mutex, but it need to deal with the worst case
>>>> that MMU notifier must be blocked and wait for the finish of swap in.
>>>>
>>>> So this patch switches use a counter to track whether or not the map
>>>> was used. The counter was increased when vq try to start or finish
>>>> uses the map. This means, when it was even, we're sure there's no
>>>> readers and MMU notifier is synchronized. When it was odd, it means
>>>> there's a reader we need to wait it to be even again then we are
>>>> synchronized. To avoid full memory barrier, store_release +
>>>> load_acquire on the counter is used.
>>> Unfortunately this needs a lot of review and testing, so this can't make
>>> rc2, and I don't think this is the kind of patch I can merge after rc3.
>>> Subtle memory barrier tricks like this can introduce new bugs while they
>>> are fixing old ones.
>> I admit the patch is tricky. Some questions:
>>
>> - Do we must address the case of e.g swap in? If not, a simple
>> vhost_work_flush() instead of synchronize_rcu() may work.
>> - Having some hard thought, I think we can use seqlock, it looks
>> to me smp_wmb() is in write_segcount_begin() is sufficient, we don't
>> care vq->map read before smp_wmb(), and for the other we all have
>> good data devendency so smp_wmb() in the write_seqbegin_end() is
>> sufficient.
> If we need an mb in the begin() we can switch to
> dependent_ptr_mb. if you need me to fix it up
> and repost, let me know.
Yes, but please let me figure out whether mb is really necessary here.
>
> Why isn't it a problem if the map is
> accessed outside the lock?
Correct me if I was wrong. E.g for vhost_put_avail_event()
vhost_vq_access_map_begin(vq);
map = vq->maps[VHOST_ADDR_USED];
if (likely(map)) {
used = map->addr;
*((__virtio16 *)&used->ring[vq->num]) =
cpu_to_vhost16(vq, vq->avail_idx);
vhost_vq_access_map_end(vq);
return 0;
}
vhost_vq_access_map_end(vq);
We dont' care whether map is accessed before vhost_vq_access_map_begin()
since MMU notifier can only change map from non-NULL to NULL. If we read
it too early, we will only get NULL and won't use the map at all. And
smp_wmb() in vhost_vq_access_map_begin() can make sure the real access
to map->addr is done after we increasing the counter.
>
>
>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index db2c81cb1e90..6d9501303258 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>>
>> static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>> {
>> - int ref = READ_ONCE(vq->ref);
>> -
>> - smp_store_release(&vq->ref, ref + 1);
>> - /* Make sure ref counter is visible before accessing the map */
>> - smp_load_acquire(&vq->ref);
>> + write_seqcount_begin(&vq->seq);
>> }
>>
>> static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>> {
>> - int ref = READ_ONCE(vq->ref);
>> -
>> - /* Make sure vq access is done before increasing ref counter */
>> - smp_store_release(&vq->ref, ref + 1);
>> + write_seqcount_end(&vq->seq);
>> }
>>
>> static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>> {
>> - int ref;
>> + unsigned int ret;
>>
>> /* Make sure map change was done before checking ref counter */
>> smp_mb();
>> -
>> - ref = READ_ONCE(vq->ref);
>> - if (ref & 0x1) {
>> - /* When ref change, we are sure no reader can see
>> + ret = raw_read_seqcount(&vq->seq);
>> + if (ret & 0x1) {
>> + /* When seq changes, we are sure no reader can see
>> * previous map */
>> - while (READ_ONCE(vq->ref) == ref) {
>> - set_current_state(TASK_RUNNING);
>> + while (raw_read_seqcount(&vq->seq) == ret)
>> schedule();
>
> So why do we set state here?
No need, just a artifact of previous patch.
> And should not we
> check need_sched?
We need use need_sched().
>
>
>> - }
>> }
>> - /* Make sure ref counter was checked before any other
>> - * operations that was dene on map. */
>> + /* Make sure seq was checked before any other operations that
>> + * was dene on map. */
>> smp_mb();
>> }
>>
>> @@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>> vq->indirect = NULL;
>> vq->heads = NULL;
>> vq->dev = dev;
>> - vq->ref = 0;
>> + seqcount_init(&vq->seq);
>> mutex_init(&vq->mutex);
>> spin_lock_init(&vq->mmu_lock);
>> vhost_vq_reset(dev, vq);
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 3d10da0ae511..1a705e181a84 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -125,7 +125,7 @@ struct vhost_virtqueue {
>> */
>> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>> #endif
>> - int ref;
>> + seqcount_t seq;
>> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>>
>> struct file *kick;
>> --
>> 2.18.1
>>
>>>
>>>
>>>
>>>
>>>> Consider the read critical section is pretty small the synchronization
>>>> should be done very fast.
>>>>
>>>> Note the patch lead about 3% PPS dropping.
>>> Sorry what do you mean by this last sentence? This degrades performance
>>> compared to what?
>> Compare to without this patch.
> OK is the feature still a performance win? or should we drop it for now?
Still a win, just a drop from 23% improvement to 20% improvement.
>>>> Reported-by: Michael S. Tsirkin <mst@redhat.com>
>>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> ---
>>>> drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
>>>> drivers/vhost/vhost.h | 7 +-
>>>> 2 files changed, 94 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>> index cfc11f9ed9c9..db2c81cb1e90 100644
>>>> --- a/drivers/vhost/vhost.c
>>>> +++ b/drivers/vhost/vhost.c
>>>> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>>>>
>>>> spin_lock(&vq->mmu_lock);
>>>> for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>>>> - map[i] = rcu_dereference_protected(vq->maps[i],
>>>> - lockdep_is_held(&vq->mmu_lock));
>>>> + map[i] = vq->maps[i];
>>>> if (map[i]) {
>>>> vhost_set_map_dirty(vq, map[i], i);
>>>> - rcu_assign_pointer(vq->maps[i], NULL);
>>>> + vq->maps[i] = NULL;
>>>> }
>>>> }
>>>> spin_unlock(&vq->mmu_lock);
>>>>
>>>> - /* No need for synchronize_rcu() or kfree_rcu() since we are
>>>> - * serialized with memory accessors (e.g vq mutex held).
>>>> + /* No need for synchronization since we are serialized with
>>>> + * memory accessors (e.g vq mutex held).
>>>> */
>>>>
>>>> for (i = 0; i < VHOST_NUM_ADDRS; i++)
>>>> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>>>> return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>>>> }
>>>>
>>>> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>>>> +{
>>>> + int ref = READ_ONCE(vq->ref);
>>>> +
>>>> + smp_store_release(&vq->ref, ref + 1);
>>>> + /* Make sure ref counter is visible before accessing the map */
>>>> + smp_load_acquire(&vq->ref);
>>> The map access is after this sequence, correct?
>> Yes.
>>
>>> Just going by the rules in Documentation/memory-barriers.txt,
>>> I think that this pair will not order following accesses with ref store.
>>>
>>> Documentation/memory-barriers.txt says:
>>>
>>>
>>> + In addition, a RELEASE+ACQUIRE
>>> + pair is -not- guaranteed to act as a full memory barrier.
>>>
>>>
>>>
>>> The guarantee that is made is this:
>>> after
>>> an ACQUIRE on a given variable, all memory accesses preceding any prior
>>> RELEASE on that same variable are guaranteed to be visible.
>> Yes, but it's not clear about the order of ACQUIRE the same location
>> of previous RELEASE. And it only has a example like:
>>
>> "
>> *A = a;
>> RELEASE M
>> ACQUIRE N
>> *B = b;
>>
>> could occur as:
>>
>> ACQUIRE N, STORE *B, STORE *A, RELEASE M
>> "
>>
>> But it doesn't explain what happen when
>>
>> *A = a
>> RELEASE M
>> ACQUIRE M
>> *B = b;
>>
>> And tools/memory-model/Documentation said
>>
>> "
>> First, when a lock-acquire reads from a lock-release, the LKMM
>> requires that every instruction po-before the lock-release must
>> execute before any instruction po-after the lock-acquire.
>> "
>>
>> Is this a hint that I was correct?
> I don't think it's correct since by this logic
> memory barriers can be nops on x86.
It not a nop, instead, it goes to a write and then read to one same
location of memory.
>
>>>
>>> And if we also had the reverse rule we'd end up with a full barrier,
>>> won't we?
>>>
>>> Cc Paul in case I missed something here. And if I'm right,
>>> maybe we should call this out, adding
>>>
>>> "The opposite is not true: a prior RELEASE is not
>>> guaranteed to be visible before memory accesses following
>>> the subsequent ACQUIRE".
>> That kinds of violates the RELEASE?
>>
>> "
>> This also acts as a one-way permeable barrier. It guarantees that all
>> memory operations before the RELEASE operation will appear to happen
>> before the RELEASE operation with respect to the other components of the
>> "
>
> yes but we are talking about RELEASE itself versus stuff
> that comes after it.
Unless RELEASE and ACQUIRE on the same address can be reordered (at
least doesn't happen x86). The following ACQUIRE can make sure stuff
after ACQUIRE id done after RELEASE.
>
>>>
>>>
>>>> +}
>>>> +
>>>> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>>>> +{
>>>> + int ref = READ_ONCE(vq->ref);
>>>> +
>>>> + /* Make sure vq access is done before increasing ref counter */
>>>> + smp_store_release(&vq->ref, ref + 1);
>>>> +}
>>>> +
>>>> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>>>> +{
>>>> + int ref;
>>>> +
>>>> + /* Make sure map change was done before checking ref counter */
>>>> + smp_mb();
>>>> +
>>>> + ref = READ_ONCE(vq->ref);
>>>> + if (ref & 0x1) {
>>> Please document the even/odd trick here too, not just in the commit log.
>>>
>> Ok.
>>
>>>> + /* When ref change,
>>> changes
>>>
>>>> we are sure no reader can see
>>>> + * previous map */
>>>> + while (READ_ONCE(vq->ref) == ref) {
>>>
>>> what is the below line in aid of?
>>>
>>>> + set_current_state(TASK_RUNNING);
> any answers here?
It's unecessary.
>
>>>> + schedule();
>>> if (need_resched())
>>> schedule();
>>>
>>> ?
>> Yes, better.
>>
>>>> + }
>>> On an interruptible kernel, there's a risk here is that
>>> a task got preempted with an odd ref.
>>> So I suspect we'll have to disable preemption when we
>>> make ref odd.
>> I'm not sure I get, if the odd is not the original value we read,
>> we're sure it won't read the new map here I believe.
> But we will spin for a very long time in this case.
Yes, but do we disable preemption in MMU notifier callback. If not it
should be the same as MMU notifier was preempted for long time.
We can disable the preempt count here, but it needs an extra cacheline.
Thanks
>
>>>
>>>> + }
>>>> + /* Make sure ref counter was checked before any other
>>>> + * operations that was dene on map. */
>>> was dene -> were done?
>>>
>> Yes.
>>
>>>> + smp_mb();
>>>> +}
>>>> +
>>>> static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>>>> int index,
>>>> unsigned long start,
>>>> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>>>> spin_lock(&vq->mmu_lock);
>>>> ++vq->invalidate_count;
>>>>
>>>> - map = rcu_dereference_protected(vq->maps[index],
>>>> - lockdep_is_held(&vq->mmu_lock));
>>>> + map = vq->maps[index];
>>>> if (map) {
>>>> vhost_set_map_dirty(vq, map, index);
>>>> - rcu_assign_pointer(vq->maps[index], NULL);
>>>> + vq->maps[index] = NULL;
>>>> }
>>>> spin_unlock(&vq->mmu_lock);
>>>>
>>>> if (map) {
>>>> - synchronize_rcu();
>>>> + vhost_vq_sync_access(vq);
>>>> vhost_map_unprefetch(map);
>>>> }
>>>> }
>>>> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
>>>> for (i = 0; i < dev->nvqs; ++i) {
>>>> vq = dev->vqs[i];
>>>> for (j = 0; j < VHOST_NUM_ADDRS; j++)
>>>> - RCU_INIT_POINTER(vq->maps[j], NULL);
>>>> + vq->maps[j] = NULL;
>>>> }
>>>> }
>>>> #endif
>>>> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>> vq->indirect = NULL;
>>>> vq->heads = NULL;
>>>> vq->dev = dev;
>>>> + vq->ref = 0;
>>>> mutex_init(&vq->mutex);
>>>> spin_lock_init(&vq->mmu_lock);
>>>> vhost_vq_reset(dev, vq);
>>>> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
>>>> map->npages = npages;
>>>> map->pages = pages;
>>>>
>>>> - rcu_assign_pointer(vq->maps[index], map);
>>>> + vq->maps[index] = map;
>>>> /* No need for a synchronize_rcu(). This function should be
>>>> * called by dev->worker so we are serialized with all
>>>> * readers.
>>>> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>>>> struct vring_used *used;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> + vhost_vq_access_map_begin(vq);
>>>>
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> + map = vq->maps[VHOST_ADDR_USED];
>>>> if (likely(map)) {
>>>> used = map->addr;
>>>> *((__virtio16 *)&used->ring[vq->num]) =
>>>> cpu_to_vhost16(vq, vq->avail_idx);
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>>
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>>>> size_t size;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> + vhost_vq_access_map_begin(vq);
>>>>
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> + map = vq->maps[VHOST_ADDR_USED];
>>>> if (likely(map)) {
>>>> used = map->addr;
>>>> size = count * sizeof(*head);
>>>> memcpy(used->ring + idx, head, size);
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>>
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>>>> struct vring_used *used;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> + vhost_vq_access_map_begin(vq);
>>>>
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> + map = vq->maps[VHOST_ADDR_USED];
>>>> if (likely(map)) {
>>>> used = map->addr;
>>>> used->flags = cpu_to_vhost16(vq, vq->used_flags);
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>>
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>>>> struct vring_used *used;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> + vhost_vq_access_map_begin(vq);
>>>>
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> + map = vq->maps[VHOST_ADDR_USED];
>>>> if (likely(map)) {
>>>> used = map->addr;
>>>> used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>>
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>>>> struct vring_avail *avail;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> + vhost_vq_access_map_begin(vq);
>>>>
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>>>> + map = vq->maps[VHOST_ADDR_AVAIL];
>>>> if (likely(map)) {
>>>> avail = map->addr;
>>>> *idx = avail->idx;
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>>
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>>>> struct vring_avail *avail;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> + vhost_vq_access_map_begin(vq);
>>>>
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>>>> + map = vq->maps[VHOST_ADDR_AVAIL];
>>>> if (likely(map)) {
>>>> avail = map->addr;
>>>> *head = avail->ring[idx & (vq->num - 1)];
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>>
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>>>> struct vring_avail *avail;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> + vhost_vq_access_map_begin(vq);
>>>>
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>>>> + map = vq->maps[VHOST_ADDR_AVAIL];
>>>> if (likely(map)) {
>>>> avail = map->addr;
>>>> *flags = avail->flags;
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>>
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>>>> struct vring_avail *avail;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>>>> + vhost_vq_access_map_begin(vq);
>>>> + map = vq->maps[VHOST_ADDR_AVAIL];
>>>> if (likely(map)) {
>>>> avail = map->addr;
>>>> *event = (__virtio16)avail->ring[vq->num];
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>>>> struct vring_used *used;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> + vhost_vq_access_map_begin(vq);
>>>>
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> + map = vq->maps[VHOST_ADDR_USED];
>>>> if (likely(map)) {
>>>> used = map->addr;
>>>> *idx = used->idx;
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>>
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>>>> struct vring_desc *d;
>>>>
>>>> if (!vq->iotlb) {
>>>> - rcu_read_lock();
>>>> + vhost_vq_access_map_begin(vq);
>>>>
>>>> - map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
>>>> + map = vq->maps[VHOST_ADDR_DESC];
>>>> if (likely(map)) {
>>>> d = map->addr;
>>>> *desc = *(d + idx);
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> return 0;
>>>> }
>>>>
>>>> - rcu_read_unlock();
>>>> + vhost_vq_access_map_end(vq);
>>>> }
>>>> #endif
>>>>
>>>> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
>>>> #if VHOST_ARCH_CAN_ACCEL_UACCESS
>>>> static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
>>>> {
>>>> - struct vhost_map __rcu *map;
>>>> + struct vhost_map *map;
>>>> int i;
>>>>
>>>> for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>>>> - rcu_read_lock();
>>>> - map = rcu_dereference(vq->maps[i]);
>>>> - rcu_read_unlock();
>>>> + map = vq->maps[i];
>>>> if (unlikely(!map))
>>>> vhost_map_prefetch(vq, i);
>>>> }
>>>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>>>> index a9a2a93857d2..f9e9558a529d 100644
>>>> --- a/drivers/vhost/vhost.h
>>>> +++ b/drivers/vhost/vhost.h
>>>> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
>>>> #if VHOST_ARCH_CAN_ACCEL_UACCESS
>>>> /* Read by memory accessors, modified by meta data
>>>> * prefetching, MMU notifier and vring ioctl().
>>>> - * Synchonrized through mmu_lock (writers) and RCU (writers
>>>> - * and readers).
>>>> + * Synchonrized through mmu_lock (writers) and ref counters,
>>>> + * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
>>>> */
>>>> - struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
>>>> + struct vhost_map *maps[VHOST_NUM_ADDRS];
>>>> /* Read by MMU notifier, modified by vring ioctl(),
>>>> * synchronized through MMU notifier
>>>> * registering/unregistering.
>>>> */
>>>> struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>>>> #endif
>>>> + int ref;
>>> Is it important that this is signed? If not I'd do unsigned here:
>>> even though kernel does compile with 2s complement sign overflow,
>>> it seems cleaner not to depend on that.
>> Not a must, let me fix.
>>
>> Thanks
>>
>>>> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>>>>
>>>> struct file *kick;
>>>> --
>>>> 2.18.1
^ permalink raw reply
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05 8:21 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
linux-mm
In-Reply-To: <20190805020752-mutt-send-email-mst@kernel.org>
On 2019/8/5 下午2:28, Michael S. Tsirkin wrote:
> On Mon, Aug 05, 2019 at 12:33:45PM +0800, Jason Wang wrote:
>> On 2019/8/2 下午10:03, Michael S. Tsirkin wrote:
>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>> Btw, I come up another idea, that is to disable preemption when vhost thread
>>>> need to access the memory. Then register preempt notifier and if vhost
>>>> thread is preempted, we're sure no one will access the memory and can do the
>>>> cleanup.
>>> Great, more notifiers :(
>>>
>>> Maybe can live with
>>> 1- disable preemption while using the cached pointer
>>> 2- teach vhost to recover from memory access failures,
>>> by switching to regular from/to user path
>>
>> I don't get this, I believe we want to recover from regular from/to user
>> path, isn't it?
> That (disable copy to/from user completely) would be a nice to have
> since it would reduce the attack surface of the driver, but e.g. your
> code already doesn't do that.
>
Yes since it requires a lot of changes.
>
>>> So if you want to try that, fine since it's a step in
>>> the right direction.
>>>
>>> But I think fundamentally it's not what we want to do long term.
>>
>> Yes.
>>
>>
>>> It's always been a fundamental problem with this patch series that only
>>> metadata is accessed through a direct pointer.
>>>
>>> The difference in ways you handle metadata and data is what is
>>> now coming and messing everything up.
>>
>> I do propose soemthing like this in the past:
>> https://www.spinics.net/lists/linux-virtualization/msg36824.html. But looks
>> like you have some concern about its locality.
> Right and it doesn't go away. You'll need to come up
> with a test that messes it up and triggers a worst-case
> scenario, so we can measure how bad is that worst-case.
>
>> But the problem still there, GUP can do page fault, so still need to
>> synchronize it with MMU notifiers.
> I think the idea was, if GUP would need a pagefault, don't
> do a GUP and do to/from user instead.
But this still need to be synchronized with MMU notifiers (or using
dedicated work for GUP).
> Hopefully that
> will fault the page in and the next access will go through.
>
>> The solution might be something like
>> moving GUP to a dedicated kind of vhost work.
> Right, generally GUP.
>
>>> So if continuing the direct map approach,
>>> what is needed is a cache of mapped VM memory, then on a cache miss
>>> we'd queue work along the lines of 1-2 above.
>>>
>>> That's one direction to take. Another one is to give up on that and
>>> write our own version of uaccess macros. Add a "high security" flag to
>>> the vhost module and if not active use these for userspace memory
>>> access.
>>
>> Or using SET_BACKEND_FEATURES?
> No, I don't think it's considered best practice to allow unpriveledged
> userspace control over whether kernel enables security features.
Get this.
>
>> But do you mean permanent GUP as I did in
>> original RFC https://lkml.org/lkml/2018/12/13/218?
>>
>> Thanks
> Permanent GUP breaks THP and NUMA.
Yes.
Thanks
>
>>>
^ permalink raw reply
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05 8:22 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
linux-mm
In-Reply-To: <20190805022833-mutt-send-email-mst@kernel.org>
On 2019/8/5 下午2:30, Michael S. Tsirkin wrote:
> On Mon, Aug 05, 2019 at 12:36:40PM +0800, Jason Wang wrote:
>> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
>>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>>>> synchronize_rcu.
>>>>> I start with synchronize_rcu() but both you and Michael raise some
>>>>> concern.
>>>> I've also idly wondered if calling synchronize_rcu() under the various
>>>> mm locks is a deadlock situation.
>>>>
>>>>> Then I try spinlock and mutex:
>>>>>
>>>>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
>>>>> improvement.
>>>> I think the topic here is correctness not performance improvement
>>> The topic is whether we should revert
>>> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
>>>
>>> or keep it in. The only reason to keep it is performance.
>>
>> Maybe it's time to introduce the config option?
> Depending on CONFIG_BROKEN? I'm not sure it's a good idea.
Ok.
>>> Now as long as all this code is disabled anyway, we can experiment a
>>> bit.
>>>
>>> I personally feel we would be best served by having two code paths:
>>>
>>> - Access to VM memory directly mapped into kernel
>>> - Access to userspace
>>>
>>>
>>> Having it all cleanly split will allow a bunch of optimizations, for
>>> example for years now we planned to be able to process an incoming short
>>> packet directly on softirq path, or an outgoing on directly within
>>> eventfd.
>>
>> It's not hard consider we've already had our own accssors. But the question
>> is (as asked in another thread), do you want permanent GUP or still use MMU
>> notifiers.
>>
>> Thanks
> We want THP and NUMA to work. Both are important for performance.
>
Yes.
Thanks
^ permalink raw reply
* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05 8:24 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Gunthorpe, linux-mm, netdev, linux-kernel, kvm,
virtualization
In-Reply-To: <20190805023106-mutt-send-email-mst@kernel.org>
On 2019/8/5 下午2:40, Michael S. Tsirkin wrote:
> On Mon, Aug 05, 2019 at 12:41:45PM +0800, Jason Wang wrote:
>> On 2019/8/5 下午12:36, Jason Wang wrote:
>>> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
>>>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>>>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>>>>> synchronize_rcu.
>>>>>> I start with synchronize_rcu() but both you and Michael raise some
>>>>>> concern.
>>>>> I've also idly wondered if calling synchronize_rcu() under the various
>>>>> mm locks is a deadlock situation.
>>>>>
>>>>>> Then I try spinlock and mutex:
>>>>>>
>>>>>> 1) spinlock: add lots of overhead on datapath, this leads 0
>>>>>> performance
>>>>>> improvement.
>>>>> I think the topic here is correctness not performance improvement
>>>> The topic is whether we should revert
>>>> commit 7f466032dc9 ("vhost: access vq metadata through kernel
>>>> virtual address")
>>>>
>>>> or keep it in. The only reason to keep it is performance.
>>>
>>> Maybe it's time to introduce the config option?
>>
>> Or does it make sense if I post a V3 with:
>>
>> - introduce config option and disable the optimization by default
>>
>> - switch from synchronize_rcu() to vhost_flush_work(), but the rest are the
>> same
>>
>> This can give us some breath to decide which way should go for next release?
>>
>> Thanks
> As is, with preempt enabled? Nope I don't think blocking an invalidator
> on swap IO is ok, so I don't believe this stuff is going into this
> release at this point.
>
> So it's more a question of whether it's better to revert and apply a clean
> patch on top, or just keep the code around but disabled with an ifdef as is.
> I'm open to both options, and would like your opinion on this.
Then I prefer to leave current code (VHOST_ARCH_CAN_ACCEL to 0) as is.
This can also save efforts on rebasing packed virtqueues.
Thanks
>
>>>
>>>> Now as long as all this code is disabled anyway, we can experiment a
>>>> bit.
>>>>
>>>> I personally feel we would be best served by having two code paths:
>>>>
>>>> - Access to VM memory directly mapped into kernel
>>>> - Access to userspace
>>>>
>>>>
>>>> Having it all cleanly split will allow a bunch of optimizations, for
>>>> example for years now we planned to be able to process an incoming short
>>>> packet directly on softirq path, or an outgoing on directly within
>>>> eventfd.
>>>
>>> It's not hard consider we've already had our own accssors. But the
>>> question is (as asked in another thread), do you want permanent GUP or
>>> still use MMU notifiers.
>>>
>>> Thanks
>>>
>>> _______________________________________________
>>> Virtualization mailing list
>>> Virtualization@lists.linux-foundation.org
>>> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH net-next v2] net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
From: Hubert Feurstein @ 2019-08-05 8:26 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: Hubert Feurstein, Richard Cochran, Andrew Lunn, Vivien Didelot,
Florian Fainelli, David S. Miller
From: Hubert Feurstein <h.feurstein@gmail.com>
This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.
Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
drivers/net/dsa/mv88e6xxx/ptp.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
index 073cbd0bb91b..2ebc7db0fd4a 100644
--- a/drivers/net/dsa/mv88e6xxx/ptp.c
+++ b/drivers/net/dsa/mv88e6xxx/ptp.c
@@ -235,14 +235,17 @@ static int mv88e6xxx_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
return 0;
}
-static int mv88e6xxx_ptp_gettime(struct ptp_clock_info *ptp,
- struct timespec64 *ts)
+static int mv88e6xxx_ptp_gettimex(struct ptp_clock_info *ptp,
+ struct timespec64 *ts,
+ struct ptp_system_timestamp *sts)
{
struct mv88e6xxx_chip *chip = ptp_to_chip(ptp);
u64 ns;
mv88e6xxx_reg_lock(chip);
+ ptp_read_system_prets(sts);
ns = timecounter_read(&chip->tstamp_tc);
+ ptp_read_system_postts(sts);
mv88e6xxx_reg_unlock(chip);
*ts = ns_to_timespec64(ns);
@@ -426,7 +429,7 @@ static void mv88e6xxx_ptp_overflow_check(struct work_struct *work)
struct mv88e6xxx_chip *chip = dw_overflow_to_chip(dw);
struct timespec64 ts;
- mv88e6xxx_ptp_gettime(&chip->ptp_clock_info, &ts);
+ mv88e6xxx_ptp_gettimex(&chip->ptp_clock_info, &ts, NULL);
schedule_delayed_work(&chip->overflow_work,
MV88E6XXX_TAI_OVERFLOW_PERIOD);
@@ -472,7 +475,7 @@ int mv88e6xxx_ptp_setup(struct mv88e6xxx_chip *chip)
chip->ptp_clock_info.max_adj = MV88E6XXX_MAX_ADJ_PPB;
chip->ptp_clock_info.adjfine = mv88e6xxx_ptp_adjfine;
chip->ptp_clock_info.adjtime = mv88e6xxx_ptp_adjtime;
- chip->ptp_clock_info.gettime64 = mv88e6xxx_ptp_gettime;
+ chip->ptp_clock_info.gettimex64 = mv88e6xxx_ptp_gettimex;
chip->ptp_clock_info.settime64 = mv88e6xxx_ptp_settime;
chip->ptp_clock_info.enable = ptp_ops->ptp_enable;
chip->ptp_clock_info.verify = ptp_ops->ptp_verify;
--
2.22.0
^ permalink raw reply related
* Re: [PATCH 05/14] gpio: lpc32xx: allow building on non-lpc32xx targets
From: Bartosz Golaszewski @ 2019-08-05 8:27 UTC (permalink / raw)
To: Arnd Bergmann
Cc: soc, arm-soc, Vladimir Zapolskiy, Sylvain Lemieux, Russell King,
Gregory Clement, Linus Walleij, Jason Cooper, Andrew Lunn,
Sebastian Hesselbarth, David S. Miller, Greg Kroah-Hartman,
Alan Stern, Guenter Roeck, linux-gpio, netdev, linux-serial,
USB list, LINUXWATCHDOG, Lee Jones, LKML
In-Reply-To: <CAK8P3a3KpKvRKXY72toE_5eAp4ER_Mre0GX3guwGeQgsY2HX+g@mail.gmail.com>
pt., 2 sie 2019 o 13:20 Arnd Bergmann <arnd@arndb.de> napisał(a):
>
> On Fri, Aug 2, 2019 at 9:10 AM Bartosz Golaszewski
> <bgolaszewski@baylibre.com> wrote:
> > > -#include <mach/hardware.h>
> > > -#include <mach/platform.h>
> > > +#define _GPREG(x) (x)
> >
> > What purpose does this macro serve?
> >
> > >
> > > #define LPC32XX_GPIO_P3_INP_STATE _GPREG(0x000)
> > > #define LPC32XX_GPIO_P3_OUTP_SET _GPREG(0x004)
>
> In the existing code base, this macro converts a register offset to
> an __iomem pointer for a gpio register. I changed the definition of the
> macro here to keep the number of changes down, but I it's just
> as easy to remove it if you prefer.
Could you just add a comment so that it's clear at first glance?
>
> > > @@ -167,14 +166,26 @@ struct lpc32xx_gpio_chip {
> > > struct gpio_regs *gpio_grp;
> > > };
> > >
> > > +void __iomem *gpio_reg_base;
> >
> > Any reason why this can't be made part of struct lpc32xx_gpio_chip?
>
> It could be, but it's the same for each instance, and not known until
> probe() time, so the same pointer would need to be copied into each
> instance that is otherwise read-only.
>
> Let me know if you'd prefer me to rework these two things or leave
> them as they are.
I would prefer not to have global state in the driver, let's just
store the pointer in the data passed to gpiochip_add_data().
Bart
>
> > > +static inline u32 gpreg_read(unsigned long offset)
> >
> > Here and elsewhere: could you please keep the lpc32xx_gpio prefix for
> > all symbols?
>
> Sure.
>
> Arnd
^ permalink raw reply
* Re: [PATCH net-next 1/3,v2] net: sched: use major priority number as hardware priority
From: Jiri Pirko @ 2019-08-05 8:36 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: netfilter-devel, davem, netdev, jakub.kicinski, marcelo.leitner,
saeedm, wenxu, gerlitz.or, paulb
In-Reply-To: <20190802132846.3067-2-pablo@netfilter.org>
Fri, Aug 02, 2019 at 03:28:44PM CEST, pablo@netfilter.org wrote:
>tc transparently maps the software priority number to hardware. Update
>it to pass the major priority which is what most drivers expect. Update
>drivers too so they do not need to lshift the priority field of the
>flow_cls_common_offload object. The stmmac driver is an exception, since
>this code assumes the tc software priority is fine, therefore, lshift it
>just to be conservative.
>
>Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
^ permalink raw reply
* Re: [PATCH net] net: dsa: mv88e6xxx: drop adjust_link to enabled phylink
From: Hubert Feurstein @ 2019-08-05 8:49 UTC (permalink / raw)
To: Andrew Lunn
Cc: netdev, linux-kernel, Vivien Didelot, Florian Fainelli,
David S. Miller
In-Reply-To: <20190804151013.GD6800@lunn.ch>
Hi Andrew,
It looks like some work is still needed in b53_phylink_mac_config to
take over the
functionality of the current adjust_link implementation.
Hubert
Am So., 4. Aug. 2019 um 17:10 Uhr schrieb Andrew Lunn <andrew@lunn.ch>:
>
> On Wed, Jul 31, 2019 at 05:42:39PM +0200, Hubert Feurstein wrote:
> > We have to drop the adjust_link callback in order to finally migrate to
> > phylink.
> >
> > Otherwise we get the following warning during startup:
> > "mv88e6xxx 2188000.ethernet-1:10: Using legacy PHYLIB callbacks. Please
> > migrate to PHYLINK!"
>
> Hi Hubert
>
> Do we need the same patch for the b53 driver?
>
> Andrew
^ permalink raw reply
* Re: [RFC] net: dsa: mv88e6xxx: ptp: improve phc2sys precision for mv88e6xxx switch in combination with imx6-fec
From: Vladimir Oltean @ 2019-08-05 9:54 UTC (permalink / raw)
To: Hubert Feurstein, mlichvar, Richard Cochran
Cc: Andrew Lunn, netdev, lkml, Vivien Didelot, Florian Fainelli,
David S. Miller
In-Reply-To: <20190802163248.11152-1-h.feurstein@gmail.com>
Hi Hubert,
On Fri, 2 Aug 2019 at 19:33, Hubert Feurstein <h.feurstein@gmail.com> wrote:
>
> With this patch the phc2sys synchronisation precision improved to +/-500ns.
>
> Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
> ---
>
> This patch should only be the base for a discussion about improving precision of
> phc2sys (from the linuxptp package) in combination with a mv88e6xxx switch and
> imx6-fec.
>
> When I started my work on PTP at the beginning of this week, I was positively
> supprised about the sync precision of ptp4l. After adding support for the
> MV88E6220 I was able to synchronize two of our boards within +/- 10 nanoseconds.
> Remebering that the PTP system in the MV88E6220 is clocked with 100MHz, this is
> I think the best what can be expected. Big thanks to Richard and the other
> developers who made this possible.
>
> But then I tried to synchornize the PTP clock with the system clock by using
> phc2sys (phc2sys -rr -amq -l6) and I quickly was very disapointed about the
> precision.
>
> Min: -17829 ns
> Max: 21694 ns
> StdDev: 8520 ns
> Count: 127
>
> So I tried to find the reason for this. And as you probably already know, the
> reason for that is the MDIO latency, which is terrible (up to 800 usecs).
>
> As a next step, I tried to to implement the gettimex64 callback (see: "[PATCH]
> net: dsa: mv88e6xxx: extend PTP gettime function to read system clock"). With
> this in place (and a patched linuxptp-master version which really uses the
> PTP_SYS_OFFSET_EXTENDED-ioctl), I got the following results:
>
> Min: -12144 ns
> Max: 10891 ns
> StdDev: 4046,71 ns
> Count: 112
>
> So, things improved, but this is still unacceptable. It was still not possible
> to compensate the MDIO latency issue.
>
> According to my understanding, the timestamps (by using
> ptp_read_system_{pre|post}ts) have to be captured at a place where we have an
> constant offset related to the PHC in the switch. The only point where these
> timestamps can be captured is the mdio_write callback in the imx_fec. Because,
> reading the PHC timestamp will result in the follwing MDIO accesses:
>
> (several) reads of the AVB_CMD register (to poll for the busy-flag)
> write AVB_CMD (AVBOp=110b Read with post-incerement of PHC timestamp)
> read AVB_DATA (PTP Global Time [15:0])
> read AVB_DATA (PTP Global Time [31:16])
>
> With this sequence in mind, the Marvell switch has to snapshot the PHC
> timestamp at the write-AVB_CMD in order to be able to get sane values later by
> reading AVB_DATA. So the best place to capture the system timestamps is this
> one and only write operation directly in the imx_fec. By using the patch below
> (without the changes to the system clock resolution) I got the following
> results:
>
> Min: -464 ns
> Max: 525 ns
> StdDev: 210,31 ns
> Count: 401
>
> I would say that is a huge improvement.
>
> I realized, that the system clock (at least on the imx6) has a resolution of
> 333ns. So I tried to speed up this clock by using the PER-clock instead of
> OSC_PER. This gave me 15ns resolution. The results were:
>
> Min: -476 ns
> Max: 439 ns
> StdDev: 176,52 ns
> Count: 630
>
> So, things got improved again a little bit (at least the StdDev).
>
> According to my understanding, this is almost the best which is possible,
> because there is one more clock which influences the results. This is the MDIO
> bus clock, which is just 2.5MHz (or 400ns). So, I would say that +/- 400ns
> jitter is caused only by the MDIO bus clock, since the changes in imx_fec should
> not introduce any unpredictable latencies.
>
> My question to the experienced kernel developers is, how can this be implemented
> in a more generic way? Because this hack only works under these circumstances,
> and of course can never be part of the mainline kernel.
>
> I am not 100% sure that all my statements or assumptions are correct, so feel
> free to correct me.
>
> Hubert
>
You guessed correctly (since you copied me) that I'm battling much of
the same issues with the sja1105 and its spi-fsl-dspi controller
driver.
In fact I will refrain from saying anything about your
PTP_SYS_OFFSET_EXTENDED solution/hack combo, but will ask some
questions instead:
- You said you patched linuxptp master. Could you share the patch? Is
there anything else that's needed except compiling against the board's
real kernel headers (aka point KBUILD_OUTPUT to the extracted location
of /sys/kernel/kheaders.tar.xz)? I've done that and I do see phc2sys
probing and using the new ioctl, but I don't see a big improvement in
my case (that's probably because my SPI interface has real jitter
caused by peripheral clock instability, but I really need to put a
scope on it to be sure, so that's a discussion for another time).
- I really wonder what your jitter is caused by. Maybe it is just
contention on the mii_bus->mdio_lock? If that's the case, maybe you
don't need to go all the way down to the driver level, and taking the
ptp_sts at the subsystem (MDIO, SPI, I2C, etc) level is "good enough".
- Along the lines of the above, I wonder how badly would the
maintainers shout at the proposal of adding a struct
ptp_system_timestamp pointer and its associated lock in struct device.
That way at least you don't have to change the API, like you did with
mdiobus_write_nested_ptp. Relatively speaking, this is the least
amount of intrusion (although, again, far from beautiful).
- The software timestamps help you in the particular case of phc2sys,
but are they enough to cover all your needs? I haven't spent even 1
second looking at Marvell switch datasheets, but is its free-running
timer only used for PTP timestamping? At least the sja1105 does also
support time-based egress scheduling and ingress policing, so for that
scenario, the timecounter/cyclecounter implementation will no longer
cut it (you do need to synchronize the hardware clock). If your
hardware supports these PTP-based features as well, I can only assume
the reason why mv88e6xxx went for a timecounter is the same as the
reason why I did: the MDIO/SPI jitter while accessing the frequency
and offset correction registers is bad enough that the servo loop goes
haywire. And implementing gettimex64 will not solve that.
I also added Miroslav Lichvar, who originally created the
PTP_SYS_OFFSET_EXTENDED ioctl, maybe he can share some ideas on how it
is best served.
> drivers/clocksource/timer-imx-gpt.c | 9 ++++++++-
> drivers/net/dsa/mv88e6xxx/chip.h | 2 ++
> drivers/net/dsa/mv88e6xxx/ptp.c | 11 +++++++----
> drivers/net/dsa/mv88e6xxx/smi.c | 2 +-
> drivers/net/ethernet/freescale/fec_main.c | 24 ++++++++++++++++++-----
> drivers/net/phy/mdio_bus.c | 16 +++++++++++++++
> include/linux/mdio.h | 2 ++
> include/linux/phy.h | 1 +
> 8 files changed, 56 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/clocksource/timer-imx-gpt.c b/drivers/clocksource/timer-imx-gpt.c
> index 706c0d0ff56c..84695a2d8ff7 100644
> --- a/drivers/clocksource/timer-imx-gpt.c
> +++ b/drivers/clocksource/timer-imx-gpt.c
> @@ -471,8 +471,15 @@ static int __init mxc_timer_init_dt(struct device_node *np, enum imx_gpt_type t
>
> /* Try osc_per first, and fall back to per otherwise */
> imxtm->clk_per = of_clk_get_by_name(np, "osc_per");
> - if (IS_ERR(imxtm->clk_per))
> +
> + /* Force PER clock to be used, so we get 15ns instead of 333ns */
> + //if (IS_ERR(imxtm->clk_per)) {
> + if (1) {
> imxtm->clk_per = of_clk_get_by_name(np, "per");
> + pr_info("==> Using PER clock\n");
> + } else {
> + pr_info("==> Using OSC_PER clock\n");
> + }
>
> imxtm->type = type;
>
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
> index 01963ee94c50..9e14dc406415 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.h
> +++ b/drivers/net/dsa/mv88e6xxx/chip.h
> @@ -277,6 +277,8 @@ struct mv88e6xxx_chip {
> struct ptp_clock_info ptp_clock_info;
> struct delayed_work tai_event_work;
> struct ptp_pin_desc pin_config[MV88E6XXX_MAX_GPIO];
> + struct ptp_system_timestamp *ptp_sts;
> +
> u16 trig_config;
> u16 evcap_config;
> u16 enable_count;
> diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
> index 073cbd0bb91b..cf6e52ee9e0a 100644
> --- a/drivers/net/dsa/mv88e6xxx/ptp.c
> +++ b/drivers/net/dsa/mv88e6xxx/ptp.c
> @@ -235,14 +235,17 @@ static int mv88e6xxx_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
> return 0;
> }
>
> -static int mv88e6xxx_ptp_gettime(struct ptp_clock_info *ptp,
> - struct timespec64 *ts)
> +static int mv88e6xxx_ptp_gettimex(struct ptp_clock_info *ptp,
> + struct timespec64 *ts,
> + struct ptp_system_timestamp *sts)
> {
> struct mv88e6xxx_chip *chip = ptp_to_chip(ptp);
> u64 ns;
>
> mv88e6xxx_reg_lock(chip);
> + chip->ptp_sts = sts;
> ns = timecounter_read(&chip->tstamp_tc);
> + chip->ptp_sts = NULL;
> mv88e6xxx_reg_unlock(chip);
>
> *ts = ns_to_timespec64(ns);
> @@ -426,7 +429,7 @@ static void mv88e6xxx_ptp_overflow_check(struct work_struct *work)
> struct mv88e6xxx_chip *chip = dw_overflow_to_chip(dw);
> struct timespec64 ts;
>
> - mv88e6xxx_ptp_gettime(&chip->ptp_clock_info, &ts);
> + mv88e6xxx_ptp_gettimex(&chip->ptp_clock_info, &ts, NULL);
>
> schedule_delayed_work(&chip->overflow_work,
> MV88E6XXX_TAI_OVERFLOW_PERIOD);
> @@ -472,7 +475,7 @@ int mv88e6xxx_ptp_setup(struct mv88e6xxx_chip *chip)
> chip->ptp_clock_info.max_adj = MV88E6XXX_MAX_ADJ_PPB;
> chip->ptp_clock_info.adjfine = mv88e6xxx_ptp_adjfine;
> chip->ptp_clock_info.adjtime = mv88e6xxx_ptp_adjtime;
> - chip->ptp_clock_info.gettime64 = mv88e6xxx_ptp_gettime;
> + chip->ptp_clock_info.gettimex64 = mv88e6xxx_ptp_gettimex;
> chip->ptp_clock_info.settime64 = mv88e6xxx_ptp_settime;
> chip->ptp_clock_info.enable = ptp_ops->ptp_enable;
> chip->ptp_clock_info.verify = ptp_ops->ptp_verify;
> diff --git a/drivers/net/dsa/mv88e6xxx/smi.c b/drivers/net/dsa/mv88e6xxx/smi.c
> index 5fc78a063843..801fd4abba5a 100644
> --- a/drivers/net/dsa/mv88e6xxx/smi.c
> +++ b/drivers/net/dsa/mv88e6xxx/smi.c
> @@ -45,7 +45,7 @@ static int mv88e6xxx_smi_direct_write(struct mv88e6xxx_chip *chip,
> {
> int ret;
>
> - ret = mdiobus_write_nested(chip->bus, dev, reg, data);
> + ret = mdiobus_write_nested_ptp(chip->bus, dev, reg, data, chip->ptp_sts);
> if (ret < 0)
> return ret;
>
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 2f6057e7335d..20f589dc5b8b 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1814,11 +1814,25 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
>
> reinit_completion(&fep->mdio_done);
>
> - /* start a write op */
> - writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> - FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> - FEC_MMFR_TA | FEC_MMFR_DATA(value),
> - fep->hwp + FEC_MII_DATA);
> + if (bus->ptp_sts) {
> + unsigned long flags = 0;
> + local_irq_save(flags);
> + __iowmb();
> + /* >Take the timestamp *after* the memory barrier */
> + ptp_read_system_prets(bus->ptp_sts);
> + writel_relaxed(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> + FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> + FEC_MMFR_TA | FEC_MMFR_DATA(value),
> + fep->hwp + FEC_MII_DATA);
> + ptp_read_system_postts(bus->ptp_sts);
> + local_irq_restore(flags);
> + } else {
> + /* start a write op */
> + writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> + FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> + FEC_MMFR_TA | FEC_MMFR_DATA(value),
> + fep->hwp + FEC_MII_DATA);
> + }
>
> /* wait for end of transfer */
> time_left = wait_for_completion_timeout(&fep->mdio_done,
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index bd04fe762056..f076487db11f 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -672,6 +672,22 @@ int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val)
> }
> EXPORT_SYMBOL(mdiobus_write_nested);
>
> +int mdiobus_write_nested_ptp(struct mii_bus *bus, int addr, u32 regnum, u16 val, struct ptp_system_timestamp *ptp_sts)
> +{
> + int err;
> +
> + BUG_ON(in_interrupt());
> +
> + mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);
> + bus->ptp_sts = ptp_sts;
> + err = __mdiobus_write(bus, addr, regnum, val);
> + bus->ptp_sts = NULL;
> + mutex_unlock(&bus->mdio_lock);
> +
> + return err;
> +}
> +EXPORT_SYMBOL(mdiobus_write_nested_ptp);
> +
> /**
> * mdiobus_write - Convenience function for writing a given MII mgmt register
> * @bus: the mii_bus struct
> diff --git a/include/linux/mdio.h b/include/linux/mdio.h
> index e8242ad88c81..7f9767babdc3 100644
> --- a/include/linux/mdio.h
> +++ b/include/linux/mdio.h
> @@ -9,6 +9,7 @@
> #include <uapi/linux/mdio.h>
> #include <linux/mod_devicetable.h>
>
> +struct ptp_system_timestamp;
> struct gpio_desc;
> struct mii_bus;
>
> @@ -310,6 +311,7 @@ int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum);
> int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum);
> int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val);
> int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val);
> +int mdiobus_write_nested_ptp(struct mii_bus *bus, int addr, u32 regnum, u16 val, struct ptp_system_timestamp *ptp_sts);
>
> int mdiobus_register_device(struct mdio_device *mdiodev);
> int mdiobus_unregister_device(struct mdio_device *mdiodev);
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 462b90b73f93..fd4e33654863 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -252,6 +252,7 @@ struct mii_bus {
> int reset_delay_us;
> /* RESET GPIO descriptor pointer */
> struct gpio_desc *reset_gpiod;
> + struct ptp_system_timestamp *ptp_sts;
> };
> #define to_mii_bus(d) container_of(d, struct mii_bus, dev)
>
> --
> 2.22.0
>
Regards,
-Vladimir
^ permalink raw reply
* [patch iproute2] devlink: finish queue.h to list.h transition
From: Jiri Pirko @ 2019-08-05 9:56 UTC (permalink / raw)
To: netdev; +Cc: stephen, dsahern, slyfox, ayal, mlxsw
From: Jiri Pirko <jiri@mellanox.com>
Loose the "q" from the names and name the structure fields in the same
way rest of the code does. Also, fix list_add arg order which leads
to segfault.
Fixes: 33267017faf1 ("iproute2: devlink: port from sys/queue.h to list.h")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
devlink/devlink.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/devlink/devlink.c b/devlink/devlink.c
index 0ea401ae432a..91c85dc1de73 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -5978,35 +5978,36 @@ static int fmsg_value_show(struct dl *dl, int type, struct nlattr *nl_data)
return MNL_CB_OK;
}
-struct nest_qentry {
+struct nest_entry {
int attr_type;
- struct list_head nest_entries;
+ struct list_head list;
};
struct fmsg_cb_data {
struct dl *dl;
uint8_t value_type;
- struct list_head qhead;
+ struct list_head entry_list;
};
static int cmd_fmsg_nest_queue(struct fmsg_cb_data *fmsg_data,
uint8_t *attr_value, bool insert)
{
- struct nest_qentry *entry = NULL;
+ struct nest_entry *entry;
if (insert) {
- entry = malloc(sizeof(struct nest_qentry));
+ entry = malloc(sizeof(struct nest_entry));
if (!entry)
return -ENOMEM;
entry->attr_type = *attr_value;
- list_add(&fmsg_data->qhead, &entry->nest_entries);
+ list_add(&entry->list, &fmsg_data->entry_list);
} else {
- if (list_empty(&fmsg_data->qhead))
+ if (list_empty(&fmsg_data->entry_list))
return MNL_CB_ERROR;
- entry = list_first_entry(&fmsg_data->qhead, struct nest_qentry, nest_entries);
+ entry = list_first_entry(&fmsg_data->entry_list,
+ struct nest_entry, list);
*attr_value = entry->attr_type;
- list_del(&entry->nest_entries);
+ list_del(&entry->list);
free(entry);
}
return MNL_CB_OK;
@@ -6115,7 +6116,7 @@ static int cmd_health_object_common(struct dl *dl, uint8_t cmd, uint16_t flags)
return err;
data.dl = dl;
- INIT_LIST_HEAD(&data.qhead);
+ INIT_LIST_HEAD(&data.entry_list);
err = _mnlg_socket_sndrcv(dl->nlg, nlh, cmd_fmsg_object_cb, &data);
return err;
}
--
2.21.0
^ permalink raw reply related
* [PATCH] NFC: nfcmrvl: fix gpio-handling regression
From: Johan Hovold @ 2019-08-05 10:00 UTC (permalink / raw)
To: netdev
Cc: Linus Walleij, Vincent Cuissard, Andrey Konovalov, Dmitry Vyukov,
linux-usb, linux-kernel, Johan Hovold, stable,
syzbot+cf35b76f35e068a1107f
Fix two reset-gpio sanity checks which were never converted to use
gpio_is_valid(), and make sure to use -EINVAL to indicate a missing
reset line also for the UART-driver module parameter and for the USB
driver.
This specifically prevents the UART and USB drivers from incidentally
trying to request and use gpio 0, and also avoids triggering a WARN() in
gpio_to_desc() during probe when no valid reset line has been specified.
Fixes: e33a3f84f88f ("NFC: nfcmrvl: allow gpio 0 for reset signalling")
Cc: stable <stable@vger.kernel.org> # 4.13
Reported-by: syzbot+cf35b76f35e068a1107f@syzkaller.appspotmail.com
Tested-by: syzbot+cf35b76f35e068a1107f@syzkaller.appspotmail.com
Signed-off-by: Johan Hovold <johan@kernel.org>
---
drivers/nfc/nfcmrvl/main.c | 4 ++--
drivers/nfc/nfcmrvl/uart.c | 4 ++--
drivers/nfc/nfcmrvl/usb.c | 1 +
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/nfc/nfcmrvl/main.c b/drivers/nfc/nfcmrvl/main.c
index e65d027b91fa..529be35ac178 100644
--- a/drivers/nfc/nfcmrvl/main.c
+++ b/drivers/nfc/nfcmrvl/main.c
@@ -244,7 +244,7 @@ void nfcmrvl_chip_reset(struct nfcmrvl_private *priv)
/* Reset possible fault of previous session */
clear_bit(NFCMRVL_PHY_ERROR, &priv->flags);
- if (priv->config.reset_n_io) {
+ if (gpio_is_valid(priv->config.reset_n_io)) {
nfc_info(priv->dev, "reset the chip\n");
gpio_set_value(priv->config.reset_n_io, 0);
usleep_range(5000, 10000);
@@ -255,7 +255,7 @@ void nfcmrvl_chip_reset(struct nfcmrvl_private *priv)
void nfcmrvl_chip_halt(struct nfcmrvl_private *priv)
{
- if (priv->config.reset_n_io)
+ if (gpio_is_valid(priv->config.reset_n_io))
gpio_set_value(priv->config.reset_n_io, 0);
}
diff --git a/drivers/nfc/nfcmrvl/uart.c b/drivers/nfc/nfcmrvl/uart.c
index 9a22056e8d9e..e5a622ce4b95 100644
--- a/drivers/nfc/nfcmrvl/uart.c
+++ b/drivers/nfc/nfcmrvl/uart.c
@@ -26,7 +26,7 @@
static unsigned int hci_muxed;
static unsigned int flow_control;
static unsigned int break_control;
-static unsigned int reset_n_io;
+static int reset_n_io = -EINVAL;
/*
** NFCMRVL NCI OPS
@@ -231,5 +231,5 @@ MODULE_PARM_DESC(break_control, "Tell if UART driver must drive break signal.");
module_param(hci_muxed, uint, 0);
MODULE_PARM_DESC(hci_muxed, "Tell if transport is muxed in HCI one.");
-module_param(reset_n_io, uint, 0);
+module_param(reset_n_io, int, 0);
MODULE_PARM_DESC(reset_n_io, "GPIO that is wired to RESET_N signal.");
diff --git a/drivers/nfc/nfcmrvl/usb.c b/drivers/nfc/nfcmrvl/usb.c
index 945cc903d8f1..888e298f610b 100644
--- a/drivers/nfc/nfcmrvl/usb.c
+++ b/drivers/nfc/nfcmrvl/usb.c
@@ -305,6 +305,7 @@ static int nfcmrvl_probe(struct usb_interface *intf,
/* No configuration for USB */
memset(&config, 0, sizeof(config));
+ config.reset_n_io = -EINVAL;
nfc_info(&udev->dev, "intf %p id %p\n", intf, id);
--
2.22.0
^ permalink raw reply related
* Re: [RFC] net: dsa: mv88e6xxx: ptp: improve phc2sys precision for mv88e6xxx switch in combination with imx6-fec
From: Miroslav Lichvar @ 2019-08-05 10:36 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Hubert Feurstein, Richard Cochran, Andrew Lunn, netdev, lkml,
Vivien Didelot, Florian Fainelli, David S. Miller
In-Reply-To: <CA+h21hr835sdvtgVOA2M9SWeCXDOrDG1S3FnNgJd_9NP2X_TaQ@mail.gmail.com>
On Mon, Aug 05, 2019 at 12:54:49PM +0300, Vladimir Oltean wrote:
> - Along the lines of the above, I wonder how badly would the
> maintainers shout at the proposal of adding a struct
> ptp_system_timestamp pointer and its associated lock in struct device.
> That way at least you don't have to change the API, like you did with
> mdiobus_write_nested_ptp. Relatively speaking, this is the least
> amount of intrusion (although, again, far from beautiful).
That would make sense to me, if there are other drivers that could use
it.
> I also added Miroslav Lichvar, who originally created the
> PTP_SYS_OFFSET_EXTENDED ioctl, maybe he can share some ideas on how it
> is best served.
The idea behind that ioctl was to allow drivers to take the system
timestamps as close to the reading of the HW clock as possible,
excluding delays (and jitter) due to other operations that are
necessary to get that timestamp. In the ethernet drivers that was a
single PCI read. If in this case it's a "write" operation that
triggers the reading of the HW clock, then I think the patch is
using is the ptp_read_system_*ts() calls correctly.
> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -1814,11 +1814,25 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
> >
> > reinit_completion(&fep->mdio_done);
> >
> > - /* start a write op */
> > - writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> > - FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> > - FEC_MMFR_TA | FEC_MMFR_DATA(value),
> > - fep->hwp + FEC_MII_DATA);
> > + if (bus->ptp_sts) {
> > + unsigned long flags = 0;
> > + local_irq_save(flags);
> > + __iowmb();
> > + /* >Take the timestamp *after* the memory barrier */
> > + ptp_read_system_prets(bus->ptp_sts);
> > + writel_relaxed(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> > + FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> > + FEC_MMFR_TA | FEC_MMFR_DATA(value),
> > + fep->hwp + FEC_MII_DATA);
> > + ptp_read_system_postts(bus->ptp_sts);
> > + local_irq_restore(flags);
> > + } else {
> > + /* start a write op */
> > + writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> > + FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> > + FEC_MMFR_TA | FEC_MMFR_DATA(value),
> > + fep->hwp + FEC_MII_DATA);
> > + }
--
Miroslav Lichvar
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox