* Re: [RFC v2 00/10] HFI Virtual Network Interface Controller (VNIC)
From: Jason Gunthorpe @ 2016-12-15 16:56 UTC (permalink / raw)
To: Vishwanathapura, Niranjana
Cc: dledford, linux-rdma, netdev, dennis.dalessandro, ira.weiny
In-Reply-To: <1481788782-89964-1-git-send-email-niranjana.vishwanathapura@intel.com>
On Wed, Dec 14, 2016 at 11:59:32PM -0800, Vishwanathapura, Niranjana wrote:
> create mode 100644 drivers/infiniband/sw/intel/hfi_vnic/Kconfig
> create mode 100644 drivers/infiniband/sw/intel/hfi_vnic/Makefile
Stil NAK on these paths, I already explained why 'sw' is totally
unsuitable. Put it in drivers/net or drivers/infiniband/ulp
Jason
^ permalink raw reply
* Re: [PATCH net] sctp: sctp_transport_lookup_process should rcu_read_unlock when transport is null
From: Marcelo Ricardo Leitner @ 2016-12-15 16:48 UTC (permalink / raw)
To: Xin Long; +Cc: network dev, linux-sctp, davem, Neil Horman
In-Reply-To: <3d4f0205d5b5c64d372dffb37602f48e4a173612.1481814352.git.lucien.xin@gmail.com>
On Thu, Dec 15, 2016 at 11:05:52PM +0800, Xin Long wrote:
> Prior to this patch, sctp_transport_lookup_process didn't rcu_read_unlock
> when it failed to find a transport by sctp_addrs_lookup_transport.
>
> This patch is to fix it by moving up rcu_read_unlock right before checking
> transport and also to remove the out path.
>
> Fixes: 1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock")
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
> net/sctp/socket.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index d5f4b4a..318c678 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4472,18 +4472,17 @@ int sctp_transport_lookup_process(int (*cb)(struct sctp_transport *, void *),
> const union sctp_addr *paddr, void *p)
> {
> struct sctp_transport *transport;
> - int err = -ENOENT;
> + int err;
>
> rcu_read_lock();
> transport = sctp_addrs_lookup_transport(net, laddr, paddr);
> + rcu_read_unlock();
> if (!transport)
> - goto out;
> + return -ENOENT;
>
> - rcu_read_unlock();
> err = cb(transport, p);
> sctp_transport_put(transport);
>
> -out:
> return err;
> }
> EXPORT_SYMBOL_GPL(sctp_transport_lookup_process);
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* Re: [PATCH] net: wan: Use dma_pool_zalloc
From: Joe Perches @ 2016-12-15 16:48 UTC (permalink / raw)
To: Souptick Joarder, Krzysztof Hałasa, netdev; +Cc: Rameshwar Sahu
In-Reply-To: <CAFqt6zaHJ_YP-inLm8_XRDrX8FtcU86NqYG=+DRn9-Y3zR4cyg@mail.gmail.com>
On Thu, 2016-12-15 at 10:41 +0530, Souptick Joarder wrote:
> On Mon, Dec 12, 2016 at 10:12 AM, Souptick Joarder <jrdr.linux@gmail.com> wrote:
> > On Fri, Dec 9, 2016 at 6:33 PM, Krzysztof Hałasa <khalasa@piap.pl> wrote:
> > > Souptick Joarder <jrdr.linux@gmail.com> writes:
> > >
> > > > We should use dma_pool_zalloc instead of dma_pool_alloc/memset
[]
> > > > diff --git a/drivers/net/wan/ixp4xx_hss.c b/drivers/net/wan/ixp4xx_hss.c
[]
> > > > @@ -976,10 +976,9 @@ static int init_hdlc_queues(struct port *port)
> > > > return -ENOMEM;
> > > > }
> > > >
> > > > - if (!(port->desc_tab = dma_pool_alloc(dma_pool, GFP_KERNEL,
> > > > - &port->desc_tab_phys)))
> > > > + if (!(port->desc_tab = dma_pool_zalloc(dma_pool, GFP_KERNEL,
> > > > + &port->desc_tab_phys)))
> > > > return -ENOMEM;
> > > > - memset(port->desc_tab, 0, POOL_ALLOC_SIZE);
> > > > memset(port->rx_buff_tab, 0, sizeof(port->rx_buff_tab)); /* tables */
> > > > memset(port->tx_buff_tab, 0, sizeof(port->tx_buff_tab));
> > >
> > > This look fine, feel free to send it to the netdev mailing list for
> > > inclusion.
> >
> > Including netdev mailing list based as requested.
> > > Acked-by: Krzysztof Halasa <khalasa@piap.pl>
[]
> Any comment on this patch ?
Shouldn't the one in drivers/net/ethernet/xscale/ixp4xx_eth.c
also be changed?
^ permalink raw reply
* Re: [PATCH net] sctp: sctp_epaddr_lookup_transport should be protected by rcu_read_lock
From: Marcelo Ricardo Leitner @ 2016-12-15 16:43 UTC (permalink / raw)
To: Xin Long; +Cc: network dev, linux-sctp, davem, Neil Horman, Dmitry Vyukov
In-Reply-To: <eb70a1f54c102c728a5572f1aa2842940fc3f5a5.1481814055.git.lucien.xin@gmail.com>
On Thu, Dec 15, 2016 at 11:00:55PM +0800, Xin Long wrote:
> Since commit 7fda702f9315 ("sctp: use new rhlist interface on sctp transport
> rhashtable"), sctp has changed to use rhlist_lookup to look up transport, but
> rhlist_lookup doesn't call rcu_read_lock inside, unlike rhashtable_lookup_fast.
>
> It is called in sctp_epaddr_lookup_transport and sctp_addrs_lookup_transport.
> sctp_addrs_lookup_transport is always in the protection of rcu_read_lock(),
> as __sctp_lookup_association is called in rx path or sctp_lookup_association
> which are in the protection of rcu_read_lock() already.
>
> But sctp_epaddr_lookup_transport is called by sctp_endpoint_lookup_assoc, it
> doesn't call rcu_read_lock, which may cause "suspicious rcu_dereference_check
> usage' in __rhashtable_lookup.
>
> This patch is to fix it by adding rcu_read_lock in sctp_endpoint_lookup_assoc
> before calling sctp_epaddr_lookup_transport.
>
> Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
> net/sctp/endpointola.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
> index 1f03065..410ddc1 100644
> --- a/net/sctp/endpointola.c
> +++ b/net/sctp/endpointola.c
> @@ -331,7 +331,9 @@ struct sctp_association *sctp_endpoint_lookup_assoc(
> * on this endpoint.
> */
> if (!ep->base.bind_addr.port)
> - goto out;
> + return NULL;
> +
> + rcu_read_lock();
> t = sctp_epaddr_lookup_transport(ep, paddr);
> if (!t)
> goto out;
> @@ -339,6 +341,7 @@ struct sctp_association *sctp_endpoint_lookup_assoc(
> *transport = t;
> asoc = t->asoc;
> out:
> + rcu_read_unlock();
> return asoc;
> }
>
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Christoph Lameter @ 2016-12-15 16:38 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Alexander Duyck, John Fastabend, David Miller, rppt, Netdev,
linux-mm, willemdebruijn.kernel, Björn Töpel,
magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
Vlad Yasevich
In-Reply-To: <20161215092841.2f7065b5@redhat.com>
On Thu, 15 Dec 2016, Jesper Dangaard Brouer wrote:
> > It sounds like Christoph's RDMA approach might be the way to go.
>
> I'm getting more and more fond of Christoph's RDMA approach. I do
> think we will end-up with something close to that approach. I just
> wanted to get review on my idea first.
>
> IMHO the major blocker for the RDMA approach is not HW filters
> themselves, but a common API that applications can call to register
> what goes into the HW queues in the driver. I suspect it will be a
> long project agreeing between vendors. And agreeing on semantics.
Some of the methods from the RDMA subsystem (like queue pairs, the various
queues etc) could be extracted and used here. Multiple vendors already
support these features and some devices operate both in an RDMA and a
network stack mode. Having that all supported by the networks stack would
reduce overhead for those vendors.
Multiple new vendors are coming up in the RDMA subsystem because the
regular network stack does not have the right performance for high speed
networking. I would rather see them have a way to get that functionality
from the regular network stack. Please add some extensions so that the
RDMA style I/O can be made to work. Even the hardware of the new NICs is
already prepared to work with the data structures of the RDMA subsystem.
That provides an area of standardization where we could hook into but do
that properly and in a nice way in the context of main stream network
support.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [net-next PATCH v5 1/6] net: virtio dynamically disable/enable LRO
From: Michael S. Tsirkin @ 2016-12-15 16:35 UTC (permalink / raw)
To: John Fastabend
Cc: daniel, shm, davem, tgraf, alexei.starovoitov, john.r.fastabend,
netdev, brouer
In-Reply-To: <58517AE7.5090104@gmail.com>
On Wed, Dec 14, 2016 at 09:01:27AM -0800, John Fastabend wrote:
> On 16-12-14 05:31 AM, Michael S. Tsirkin wrote:
> > On Thu, Dec 08, 2016 at 04:04:58PM -0800, John Fastabend wrote:
> >> On 16-12-08 01:36 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Dec 07, 2016 at 12:11:11PM -0800, John Fastabend wrote:
> >>>> This adds support for dynamically setting the LRO feature flag. The
> >>>> message to control guest features in the backend uses the
> >>>> CTRL_GUEST_OFFLOADS msg type.
> >>>>
> >>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> >>>> ---
>
> [...]
>
> >>>>
> >>>> static void virtnet_config_changed_work(struct work_struct *work)
> >>>> @@ -1815,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>>> if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
> >>>> dev->features |= NETIF_F_RXCSUM;
> >>>>
> >>>> + if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
> >>>> + virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
> >>>> + dev->features |= NETIF_F_LRO;
> >>>> + dev->hw_features |= NETIF_F_LRO;
> >>>
> >>> So the issue is I think that the virtio "LRO" isn't really
> >>> LRO, it's typically just GRO forwarded to guests.
> >>> So these are easily re-split along MTU boundaries,
> >>> which makes it ok to forward these across bridges.
> >>>
> >>> It's not nice that we don't document this in the spec,
> >>> but it's the reality and people rely on this.
> >>>
> >>> For now, how about doing a custom thing and just disable/enable
> >>> it as XDP is attached/detached?
> >>
> >> The annoying part about doing this is ethtool will say that it is fixed
> >> yet it will be changed by seemingly unrelated operation. I'm not sure I
> >> like the idea to start automatically configuring the link via xdp_set.
> >
> > I really don't like the idea of dropping performance
> > by a factor of 3 for people bridging two virtio net
> > interfaces.
> >
> > So how about a simple approach for now, just disable
> > XDP if GUEST_TSO is enabled?
> >
> > We can discuss better approaches in next version.
> >
>
> So the proposal is to add a check in XDP setup so that
>
> if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO{4|6})
> return -ENOPSUPP;
>
> Or whatever is the most appropriate return code? Then we can
> disable TSO via qemu-system with guest_tso4=off,guest_tso6=off for
> XDP use cases.
Right. It's a start.
> Sounds like a reasonable start to me. I'll make the change should this
> go through DaveMs net-next tree or do you want it on virtio tree? Either
> is fine with me.
>
> Thanks,
> John
I think I'll merge it because I'm tweaking RX processing too,
and this will likely conflict.
--
MST
^ permalink raw reply
* Re: [RFC v2 00/10] HFI Virtual Network Interface Controller (VNIC)
From: Doug Ledford @ 2016-12-15 16:28 UTC (permalink / raw)
To: ira.weiny, Leon Romanovsky, Jeff Kirsher, David S. Miller
Cc: Vishwanathapura, Niranjana, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <20161215145212.GA29116-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
[-- Attachment #1.1: Type: text/plain, Size: 5315 bytes --]
On 12/15/2016 9:52 AM, ira.weiny wrote:
> On Thu, Dec 15, 2016 at 11:12:26AM +0200, Leon Romanovsky wrote:
>> On Wed, Dec 14, 2016 at 11:59:32PM -0800, Vishwanathapura, Niranjana wrote:
>>> Thanks Jason for the valuable feedback.
>>> Here is the revised HFI VNIC patch series.
>>>
>>> ChangeLog:
>>> =========
>>> v1 => v2:
>>> a) Removed hfi_vnic bus, instead make hfi_vnic driver an 'ib client',
>>> as per feedback from Jason Gunthorpe.
>>> b) Interface changes, data structure changes and variable name changes
>>> associated with (a).
>>> c) Add hfi_ibdev abstraction to provide VNIC control operations to
>>> hfi_vnic client.
>>> d) Minor fixes
>>> e) Moved hfi_vnic driver from .../sw/intel/vnic/hfi_vnic to
>>> .../sw/intel/hfi_vnic.
>>
>> To put it into proportion, Jason asked you to do different thing.
>> http://marc.info/?l=linux-rdma&m=147977108302151&w=2
>> http://marc.info/?l=linux-rdma&m=148000415401842&w=2
>>
>> And Christoph,
>> http://marc.info/?l=linux-rdma&m=147985587425861&w=2
>
> Understood. However, we never heard back from Niranjanas analysis of the code
> which stated that > 60% of the code was dealing with the OPA MADs used to
> configure this device.
>
> https://www.spinics.net/lists/linux-rdma/msg43579.html
>
> Furthermore, neither Dave nor Doug has had time to weigh in on what we should
> do.
>
> So before we make that change we wanted to get consensus on using the
> hfi1_ibdev abstraction rather than the bus. This was the _real_ technical
> change.
>
> Beyond that it is really just which maintainer wants this driver. To that end
> I've also cc'ed Jeff Kirsher who maintains drivers/net/ethernet/intel. Perhaps
> Dave would like the driver to go through that tree?
>
>
> I think there are pros and cons to both subtrees and in the end we will do
> whatever is decided.
>
> For maintainer review:
>
> 1) The driver encapsulates ethernet packets with OPA headers
>
> 2) VNIC uses OPA management packets (MADs) for its configuration
>
> 3) A significant portion (> 60% +) of the code is specific to OPA
>
> https://www.spinics.net/lists/linux-rdma/msg43579.html
>
> 4) The driver is from Intel and we expect Intel to be the primary
> contributor to the code.
>
> 5) The driver, like hfi1, is dual licensed (GPL/BSD)
>
> 6) Based on Christophs feedback we will be adding device capability
> bits to the IB core to indicate HFI VNIC support.
>
> https://www.spinics.net/lists/linux-rdma/msg44113.html
>
>
> Doug, Dave, Jeff any thoughts?
>
> Ira
>
Sorry for my late reply. The series is relatively large, and also
tagged with RFC, so it got shuffled to the back burner while I worked on
the stuff for this pull request.
I just read through the comments in the V1 series between Jason et. al.,
and my take on things is like this:
1) Since your intent is to make this work with multiple versions of the
hfi drivers, I disagree with Jason that just because there is only one
driver today that we should keep it simple. Design it right from the
beginning of multi driver is your intent is, IMO, a better way to go.
You'll work out the bugs in the initial implementation and when it comes
time to add the second driver, things will go much more smoothly.
2) With more than 60% of the code being MAD related, and another
significant chunk being hfi related, and only a minor bit (20% maybe?)
being net related, I disagree that this belongs in the drivers/net or
net/ directories. Part of the purpose of putting code like this in any
given directory is to group it with what it is most tightly tied too.
That way people doing sub-tree wide changes know the rough scope of
their work as the code that needs changed is grouped together. Putting
this or IPoIB in one of the net trees would make it obvious to the
casual coder that these need changed for net changes, but would totally
hide the fact that once you tear into these drivers, there is a lot more
IB to them than there is net. What's more, when 60+% of driver is
non-net, then you end up having many more of my patches crossing over
into Dave's tree than the opposite if you put the code under my tree.
If nothing else, locality of code churn would say both this and IPoIB
belong here despite them being net drivers.
3) I would like some hard reasons why this driver deserves to exist?
I'm struggling very hard right now with why we would add an entirely new
"encapsulate IP over RDMA" driver. Even if you use regular Ethernet
MACs instead of IPoIB's 20byte MAC, I'm struggling for why IPoIB
couldn't be modified to know it supports two MAC sizes and provide
different net devices based on those different types? I'm struggling to
see why IPoIB couldn't be modified to essentially have two transport
layers underneath? I haven't done a thorough code review yet, but if I
get into the net driver portion of this and it has very much similarity
to the IPoIB net portion, I'm probably going to want answers about why
this can't be a modification of IPoIB to support multiple
transport/encapsulation types instead of a separate driver even more.
--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
GPG Key ID: 0E572FDD
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]
^ permalink raw reply
* Re: [PATCH net-next] ixgbevf: fix 'Etherleak' in ixgbevf
From: Alexander Duyck @ 2016-12-15 16:13 UTC (permalink / raw)
To: Weilong Chen
Cc: Jeff Kirsher, intel-wired-lan, Netdev,
linux-kernel@vger.kernel.org
In-Reply-To: <1481802034-77729-1-git-send-email-chenweilong@huawei.com>
On Thu, Dec 15, 2016 at 3:40 AM, Weilong Chen <chenweilong@huawei.com> wrote:
> Nessus report the vf appears to leak memory in network packets.
> Fix this by padding all small packets manually.
>
> And the CVE-2003-0001.
> https://ofirarkin.files.wordpress.com/2008/11/atstake_etherleak_report.pdf
>
> Signed-off-by: Weilong Chen <chenweilong@huawei.com>
> ---
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 6d4bef5..137a154 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -3654,6 +3654,13 @@ static int ixgbevf_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
> return NETDEV_TX_OK;
> }
>
> + /* On PCI/PCI-X HW, if packet size is less than ETH_ZLEN,
> + * packets may get corrupted during padding by HW.
> + * To WA this issue, pad all small packets manually.
> + */
> + if (eth_skb_pad(skb))
> + return NETDEV_TX_OK;
> +
So the patch description for this probably isn't correct. It looks
like the problem isn't leaking data it is the fact that the frames
aren't being padded to prevent malicious events. The only issue is
the patch is padding by a bit too much. I would recommend replacing
this with the following from ixgbe:
/*
* The minimum packet size for olinfo paylen is 17 so pad the skb
* in order to meet this minimum size requirement.
*/
if (skb_put_padto(skb, 17))
return NETDEV_TX_OK;
> tx_ring = adapter->tx_ring[skb->queue_mapping];
>
> /* need: 1 descriptor per page * PAGE_SIZE/IXGBE_MAX_DATA_PER_TXD,
> --
> 1.7.12
>
^ permalink raw reply
* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Alexander Duyck @ 2016-12-15 15:59 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: John Fastabend, David Miller, Christoph Lameter, rppt, Netdev,
linux-mm, willemdebruijn.kernel, Björn Töpel,
magnus.karlsson, Mel Gorman, Tom Herbert, Brenden Blanco,
Tariq Toukan, Saeed Mahameed, Brandeburg, Jesse, METH,
Vlad Yasevich
In-Reply-To: <20161215092841.2f7065b5@redhat.com>
On Thu, Dec 15, 2016 at 12:28 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 14 Dec 2016 14:45:00 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> > On Wed, 14 Dec 2016 08:45:08 -0800
>> > Alexander Duyck <alexander.duyck@gmail.com> wrote:
>> >
>> >> I agree. This is a no-go from the performance perspective as well.
>> >> At a minimum you would have to be zeroing out the page between uses to
>> >> avoid leaking data, and that assumes that the program we are sending
>> >> the pages to is slightly well behaved. If we think zeroing out an
>> >> sk_buff is expensive wait until we are trying to do an entire 4K page.
>> >
>> > Again, yes the page will be zero'ed out, but only when entering the
>> > page_pool. Because they are recycled they are not cleared on every use.
>> > Thus, performance does not suffer.
>>
>> So you are talking about recycling, but not clearing the page when it
>> is recycled. That right there is my problem with this. It is fine if
>> you assume the pages are used by the application only, but you are
>> talking about using them for both the application and for the regular
>> network path. You can't do that. If you are recycling you will have
>> to clear the page every time you put it back onto the Rx ring,
>> otherwise you can leak the recycled memory into user space and end up
>> with a user space program being able to snoop data out of the skb.
>>
>> > Besides clearing large mem area is not as bad as clearing small.
>> > Clearing an entire page does cost something, as mentioned before 143
>> > cycles, which is 28 bytes-per-cycle (4096/143). And clearing 256 bytes
>> > cost 36 cycles which is only 7 bytes-per-cycle (256/36).
>>
>> What I am saying is that you are going to be clearing the 4K blocks
>> each time they are recycled. You can't have the pages shared between
>> user-space and the network stack unless you have true isolation. If
>> you are allowing network stack pages to be recycled back into the
>> user-space application you open up all sorts of leaks where the
>> application can snoop into data it shouldn't have access to.
>
> See later, the "Read-only packet page" mode should provide a mode where
> the netstack doesn't write into the page, and thus cannot leak kernel
> data. (CAP_NET_ADMIN already give it access to other applications data.)
I think you are kind of missing the point. The device is writing to
the page on the kernel's behalf. Therefore the page isn't "Read-only"
and you have an issue since you are talking about sharing a ring
between kernel and userspace.
>> >> I think we are stuck with having to use a HW filter to split off
>> >> application traffic to a specific ring, and then having to share the
>> >> memory between the application and the kernel on that ring only. Any
>> >> other approach just opens us up to all sorts of security concerns
>> >> since it would be possible for the application to try to read and
>> >> possibly write any data it wants into the buffers.
>> >
>> > This is why I wrote a document[1], trying to outline how this is possible,
>> > going through all the combinations, and asking the community to find
>> > faults in my idea. Inlining it again, as nobody really replied on the
>> > content of the doc.
>> >
>> > -
>> > Best regards,
>> > Jesper Dangaard Brouer
>> > MSc.CS, Principal Kernel Engineer at Red Hat
>> > LinkedIn: http://www.linkedin.com/in/brouer
>> >
>> > [1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>> >
>> > ===========================
>> > Memory Model for Networking
>> > ===========================
>> >
>> > This design describes how the page_pool change the memory model for
>> > networking in the NIC (Network Interface Card) drivers.
>> >
>> > .. Note:: The catch for driver developers is that, once an application
>> > request zero-copy RX, then the driver must use a specific
>> > SKB allocation mode and might have to reconfigure the
>> > RX-ring.
>> >
>> >
>> > Design target
>> > =============
>> >
>> > Allow the NIC to function as a normal Linux NIC and be shared in a
>> > safe manor, between the kernel network stack and an accelerated
>> > userspace application using RX zero-copy delivery.
>> >
>> > Target is to provide the basis for building RX zero-copy solutions in
>> > a memory safe manor. An efficient communication channel for userspace
>> > delivery is out of scope for this document, but OOM considerations are
>> > discussed below (`Userspace delivery and OOM`_).
>> >
>> > Background
>> > ==========
>> >
>> > The SKB or ``struct sk_buff`` is the fundamental meta-data structure
>> > for network packets in the Linux Kernel network stack. It is a fairly
>> > complex object and can be constructed in several ways.
>> >
>> > From a memory perspective there are two ways depending on
>> > RX-buffer/page state:
>> >
>> > 1) Writable packet page
>> > 2) Read-only packet page
>> >
>> > To take full potential of the page_pool, the drivers must actually
>> > support handling both options depending on the configuration state of
>> > the page_pool.
>> >
>> > Writable packet page
>> > --------------------
>> >
>> > When the RX packet page is writable, the SKB setup is fairly straight
>> > forward. The SKB->data (and skb->head) can point directly to the page
>> > data, adjusting the offset according to drivers headroom (for adding
>> > headers) and setting the length according to the DMA descriptor info.
>> >
>> > The page/data need to be writable, because the network stack need to
>> > adjust headers (like TimeToLive and checksum) or even add or remove
>> > headers for encapsulation purposes.
>> >
>> > A subtle catch, which also requires a writable page, is that the SKB
>> > also have an accompanying "shared info" data-structure ``struct
>> > skb_shared_info``. This "skb_shared_info" is written into the
>> > skb->data memory area at the end (skb->end) of the (header) data. The
>> > skb_shared_info contains semi-sensitive information, like kernel
>> > memory pointers to other pages (which might be pointers to more packet
>> > data). This would be bad from a zero-copy point of view to leak this
>> > kind of information.
>>
>> This should be the default once we get things moved over to using the
>> DMA_ATTR_SKIP_CPU_SYNC DMA attribute. It will be a little while more
>> before it gets fully into Linus's tree. It looks like the swiotlb
>> bits have been accepted, just waiting on the ability to map a page w/
>> attributes and the remainder of the patches that are floating around
>> in mmotm and linux-next.
>
> I'm very happy that you are working on this.
Well it looks like the rest just got accepted into Linus's tree
yesterday. There are still some documentation and rename patches
outstanding but I will probably start submitting driver updates for
enabling build_skb and the like in net-next in the next several weeks.
>> BTW, any ETA on when we might expect to start seeing code related to
>> the page_pool? It is much easier to review code versus these kind of
>> blueprints.
>
> I've implemented a prove-of-concept of page_pool, but only the first
> stage, which is the ability to replace driver specific page-caches. It
> works, but is not upstream ready, as e.g. it assumes it can get a page
> flag and cleanup-on-driver-unload code is missing. Mel Gorman have
> reviewed it, but with the changes he requested I lost quite some
> performance, I'm still trying to figure out a way to regain that
> performance lost. The zero-copy part is not implemented.
Well RFCs are always welcome. It is just really hard to review things
when all you have is documentation that may or may not match up with
what ends up being implemented.
>
>> > Read-only packet page
>> > ---------------------
>> >
>> > When the RX packet page is read-only, the construction of the SKB is
>> > significantly more complicated and even involves one more memory
>> > allocation.
>> >
>> > 1) Allocate a new separate writable memory area, and point skb->data
>> > here. This is needed due to (above described) skb_shared_info.
>> >
>> > 2) Memcpy packet headers into this (skb->data) area.
>> >
>> > 3) Clear part of skb_shared_info struct in writable-area.
>> >
>> > 4) Setup pointer to packet-data in the page (in skb_shared_info->frags)
>> > and adjust the page_offset to be past the headers just copied.
>> >
>> > It is useful (later) that the network stack have this notion that part
>> > of the packet and a page can be read-only. This implies that the
>> > kernel will not "pollute" this memory with any sensitive information.
>> > This is good from a zero-copy point of view, but bad from a
>> > performance perspective.
>>
>> This will hopefully become a legacy approach.
>
> Hopefully, but this mode will have to be supported forever, and is the
> current default.
Maybe you need to rename this approach since it is clear there is some
confusion about what is going on here. The page is not Read-only. It
is left device mapped and is not writable by the CPU. That doesn't
mean it isn't written to though.
>
>> > NIC RX Zero-Copy
>> > ================
>> >
>> > Doing NIC RX zero-copy involves mapping RX pages into userspace. This
>> > involves costly mapping and unmapping operations in the address space
>> > of the userspace process. Plus for doing this safely, the page memory
>> > need to be cleared before using it, to avoid leaking kernel
>> > information to userspace, also a costly operation. The page_pool base
>> > "class" of optimization is moving these kind of operations out of the
>> > fastpath, by recycling and lifetime control.
>> >
>> > Once a NIC RX-queue's page_pool have been configured for zero-copy
>> > into userspace, then can packets still be allowed to travel the normal
>> > stack?
>> >
>> > Yes, this should be possible, because the driver can use the
>> > SKB-read-only mode, which avoids polluting the page data with
>> > kernel-side sensitive data. This implies, when a driver RX-queue
>> > switch page_pool to RX-zero-copy mode it MUST also switch to
>> > SKB-read-only mode (for normal stack delivery for this RXq).
>>
>> This is the part that is wrong. Once userspace has access to the
>> pages in an Rx ring that ring cannot be used for regular kernel-side
>> networking. If it is, then sensitive kernel data may be leaked
>> because the application has full access to any page on the ring so it
>> could read the data at any time regardless of where the data is meant
>> to be delivered.
>
> Are you sure. Can you give me an example of kernel code that writes
> into the page when it is attached as a read-only page to the SKB?
You are completely overlooking the writes by the device. The device
is writing to the page. Therefore it is not a true "read-only" page.
> That would violate how we/drivers use the DMA API today (calling DMA
> unmap when packets are in-flight).
What I am talking about is the DMA so it doesn't violate things.
>> > XDP can be used for controlling which pages that gets RX zero-copied
>> > to userspace. The page is still writable for the XDP program, but
>> > read-only for normal stack delivery.
>>
>> Making the page read-only doesn't get you anything. You still have a
>> conflict since user-space can read any packet directly out of the
>> page.
>
> Giving the application CAP_NAT_ADMIN already gave it "tcpdump" read access
> to all other applications packet content from that NIC.
Now we are getting somewhere. So in this scenario we are okay with
the application being able to read anything that is written to the
kernel. That is actually the data I was concerned about. So as long
as we are fine with the application reading any data that is going by
in the packets then we should be fine sharing the data this way.
It does lead to questions though on why there was the 1 page per
packet requirement. As long as we are using pages in this "read-only"
format you could share as many pages as you wanted and have either
multiple packets per page or multiple pages per packet as long as you
honor the read-only aspect of things.
>> > Kernel safety
>> > -------------
>> >
>> > For the paranoid, how do we protect the kernel from a malicious
>> > userspace program. Sure there will be a communication interface
>> > between kernel and userspace, that synchronize ownership of pages.
>> > But a userspace program can violate this interface, given pages are
>> > kept VMA mapped, the program can in principle access all the memory
>> > pages in the given page_pool. This opens up for a malicious (or
>> > defect) program modifying memory pages concurrently with the kernel
>> > and DMA engine using them.
>> >
>> > An easy way to get around userspace modifying page data contents is
>> > simply to map pages read-only into userspace.
>> >
>> > .. Note:: The first implementation target is read-only zero-copy RX
>> > page to userspace and require driver to use SKB-read-only
>> > mode.
>>
>> This allows for Rx but what do we do about Tx?
>
> True, I've not covered Tx. But I believe Tx is easier from a sharing
> PoV, as we don't have the early demux sharing problem, because an
> application/socket will be the starting point, and simply have
> associated a page_pool for TX, solving the VMA mapping overhead.
> Using the skb-read-only-page mode, this would in principle allow normal
> socket zero-copy TX and packet steering.
>
> For performance reasons, when you already know what NIC you want to TX
> on, you could extend this to allocate a separate queue for TX. Which
> makes it look a lot like RDMA.
>
>
>> It sounds like Christoph's RDMA approach might be the way to go.
>
> I'm getting more and more fond of Christoph's RDMA approach. I do
> think we will end-up with something close to that approach. I just
> wanted to get review on my idea first.
>
> IMHO the major blocker for the RDMA approach is not HW filters
> themselves, but a common API that applications can call to register
> what goes into the HW queues in the driver. I suspect it will be a
> long project agreeing between vendors. And agreeing on semantics.
We really should end up doing a HW filtering approach for any
application most likely anyway. I know the Intel parts have their
Flow Director which should allow for directing a flow to the correct
queue. Really it sort of makes sense to look at going that route as
you can focus your software efforts on a queue that should mostly
contain the traffic you are looking for rather than one that will be
processing unrelated traffic.
- Alex
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH net-next 2/2] inet: Fix get port to handle zero port number with soreuseport set
From: Craig Gallek @ 2016-12-15 15:58 UTC (permalink / raw)
To: Tom Herbert; +Cc: David Miller, netdev, kernel-team, Josef Bacik, Eric Dumazet
On Wed, Dec 14, 2016 at 7:54 PM, Tom Herbert <tom@herbertland.com> wrote:
> A user may call listen with binding an explicit port with the intent
> that the kernel will assign an available port to the socket. In this
> case inet_csk_get_port does a port scan. For such sockets, the user may
> also set soreuseport with the intent a creating more sockets for the
> port that is selected. The problem is that the initial socket being
> opened could inadvertently choose an existing and unreleated port
> number that was already created with soreuseport.
Good catch! I think this problem may also exist in the UDP path?
(udp_lib_get_port -> udp_lib_lport_inuse[2])
^ permalink raw reply
* Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
From: Hannes Frederic Sowa @ 2016-12-15 15:53 UTC (permalink / raw)
To: David Laight, Jason A. Donenfeld
Cc: Netdev, kernel-hardening@lists.openwall.com,
Jean-Philippe Aumasson, LKML, Linux Crypto Mailing List,
Daniel J . Bernstein, Linus Torvalds, Eric Biggers
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0240529@AcuExch.aculab.com>
On 15.12.2016 16:41, David Laight wrote:
> Try (retyped):
>
> echo 'struct { long a; long long b; } s; int bar { return sizeof s; }' >foo.c
> gcc [-m32] -O2 -S foo.c; cat foo.s
>
> And look at what is generated.
I used __alignof__(unsigned long long) with -m32.
>> Right now ipv6 addresses have an alignment of 4. So we couldn't even
>> naturally pass them to siphash but would need to copy them around, which
>> I feel like a source of bugs.
>
> That is more of a problem on systems that don't support misaligned accesses.
> Reading the 64bit values with two explicit 32bit reads would work.
> I think you can get gcc to do that by adding an aligned(4) attribute to the
> structure member.
Yes, and that is actually my fear, because we support those
architectures. I can't comment on that as I don't understand enough of this.
If someone finds a way to cause misaligned reads on a small box this
seems (maybe depending on sysctls they get fixed up or panic) to be a
much bigger issue than having a hash DoS.
Thanks,
Hannes
^ permalink raw reply
* Re: [PATCH perf/core REBASE 2/5] samples/bpf: Switch over to libbpf
From: Arnaldo Carvalho de Melo @ 2016-12-15 15:50 UTC (permalink / raw)
To: Joe Stringer
Cc: linux-kernel, netdev, wangnan0, ast, daniel,
Arnaldo Carvalho de Melo
In-Reply-To: <20161214224342.12858-3-joe@ovn.org>
Em Wed, Dec 14, 2016 at 02:43:39PM -0800, Joe Stringer escreveu:
> Now that libbpf under tools/lib/bpf/* is synced with the version from
> samples/bpf, we can get rid most of the libbpf library here.
>
> Signed-off-by: Joe Stringer <joe@ovn.org>
> Cc: Alexei Starovoitov <ast@fb.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Wang Nan <wangnan0@huawei.com>
> Link: http://lkml.kernel.org/r/20161209024620.31660-6-joe@ovn.org
> [ Use -I$(srctree)/tools/lib/ to support out of source code tree builds, as noticed by Wang Nan ]
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
So, right before this patch building samples/bpf works, then, after, it fails,
investigating:
[root@1e797fdfbf4f linux]# make -j4 O=/tmp/build/linux/ headers_install
make[1]: Entering directory '/tmp/build/linux'
CHK include/generated/uapi/linux/version.h
make[1]: Leaving directory '/tmp/build/linux'
[root@1e797fdfbf4f linux]# make -j4 O=/tmp/build/linux/ samples/bpf/
make[1]: Entering directory '/tmp/build/linux'
CHK include/config/kernel.release
GEN ./Makefile
CHK include/generated/uapi/linux/version.h
Using /git/linux as source for kernel
CHK include/generated/utsrelease.h
CHK include/generated/timeconst.h
CHK include/generated/bounds.h
CHK include/generated/asm-offsets.h
CALL /git/linux/scripts/checksyscalls.sh
HOSTCC samples/bpf/test_lru_dist.o
HOSTCC samples/bpf/libbpf.o
HOSTCC samples/bpf/sock_example.o
HOSTCC samples/bpf/bpf_load.o
In file included from /git/linux/samples/bpf/libbpf.c:12:0:
/git/linux/samples/bpf/libbpf.h:5:21: fatal error: bpf/bpf.h: No such file or directory
#include <bpf/bpf.h>
^
compilation terminated.
In file included from /git/linux/samples/bpf/test_lru_dist.c:24:0:
/git/linux/samples/bpf/libbpf.h:5:21: fatal error: bpf/bpf.h: No such file or directory
#include <bpf/bpf.h>
^
compilation terminated.
make[2]: *** [scripts/Makefile.host:124: samples/bpf/test_lru_dist.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [scripts/Makefile.host:124: samples/bpf/libbpf.o] Error 1
In file included from /git/linux/samples/bpf/bpf_load.c:24:0:
/git/linux/samples/bpf/libbpf.h:5:21: fatal error: bpf/bpf.h: No such file or directory
#include <bpf/bpf.h>
^
compilation terminated.
make[2]: *** [scripts/Makefile.host:124: samples/bpf/bpf_load.o] Error 1
In file included from /git/linux/samples/bpf/sock_example.c:29:0:
/git/linux/samples/bpf/libbpf.h:5:21: fatal error: bpf/bpf.h: No such file or directory
#include <bpf/bpf.h>
^
compilation terminated.
make[2]: *** [scripts/Makefile.host:124: samples/bpf/sock_example.o] Error 1
make[1]: *** [/git/linux/Makefile:1659: samples/bpf/] Error 2
make[1]: Leaving directory '/tmp/build/linux'
make: *** [Makefile:150: sub-make] Error 2
[root@1e797fdfbf4f linux]#
^ permalink raw reply
* Re: [PATCH net 2/2] net/sched: cls_flower: Use masked key when calling HW offloads
From: Simon Horman @ 2016-12-15 15:43 UTC (permalink / raw)
To: Or Gerlitz, Jiri Pirko
Cc: Paul Blakey, David S. Miller, netdev, Jiri Pirko, Roi Dayan,
Shahar Klein, Hadar Hen Zion
In-Reply-To: <ec30629a-063d-3ff2-d2d7-2e710f7d6779@mellanox.com>
On Thu, Dec 15, 2016 at 04:12:05PM +0200, Or Gerlitz wrote:
> On 12/15/2016 3:50 PM, Simon Horman wrote:
> >>Zero bits on the mask signify a "don't care" on the corresponding bits
> >>in key. Some HWs require those bits on the key to be zero. Since these
> >>bits are masked anyway, it's okay to provide the masked key to all
> >>drivers.
> >>
> >>Fixes: 5b33f48842fa ('net/flower: Introduce hardware offload support')
> >>
> >While I don't have a specific use case in mind that this change would break
> >it seems to me that it would be better to handle hardware requirements
> >at the driver level.
>
> Simon, again, since these bits are masked anyway, it would be correct to
> provide the masked key to the hw device.
>
> E.g no matter if the flow key/mask provided to the HW device is is
> 1.1.1.10/24 or 1.1.1.0/24, the user expects to the same matching, so
> nothing can't happen if we provide the latter to the driver.
> >While I don't have a specific use case in mind that this change would break
> >it seems to me that it would be better to handle hardware requirements
> >at the driver level.
>
> Even though, makes no sense to pass unmasked key down. Is is only
> confusing. This patch fixes it.
It seems somewhat arbitrary to me to allow such filters in software
but not pass then down to the driver layer. But I don't feel strongly
about this and I am happy for the patch to progress as-is.
^ permalink raw reply
* RE: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
From: David Laight @ 2016-12-15 15:41 UTC (permalink / raw)
To: 'Hannes Frederic Sowa', Jason A. Donenfeld
Cc: Netdev, kernel-hardening@lists.openwall.com,
Jean-Philippe Aumasson, LKML, Linux Crypto Mailing List,
Daniel J . Bernstein, Linus Torvalds, Eric Biggers
In-Reply-To: <f6813084-6f46-0b22-e9c3-97b2e266f8f6@stressinduktion.org>
From: Hannes Frederic Sowa
> Sent: 15 December 2016 14:57
> On 15.12.2016 14:56, David Laight wrote:
> > From: Hannes Frederic Sowa
> >> Sent: 15 December 2016 12:50
> >> On 15.12.2016 13:28, David Laight wrote:
> >>> From: Hannes Frederic Sowa
> >>>> Sent: 15 December 2016 12:23
> >>> ...
> >>>> Hmm? Even the Intel ABI expects alignment of unsigned long long to be 8
> >>>> bytes on 32 bit. Do you question that?
> >>>
> >>> Yes.
> >>>
> >>> The linux ABI for x86 (32 bit) only requires 32bit alignment for u64 (etc).
> >>
> >> Hmm, u64 on 32 bit is unsigned long long and not unsigned long. Thus I
> >> am actually not sure if the ABI would say anything about that (sorry
> >> also for my wrong statement above).
> >>
> >> Alignment requirement of unsigned long long on gcc with -m32 actually
> >> seem to be 8.
> >
> > It depends on the architecture.
> > For x86 it is definitely 4.
>
> May I ask for a reference?
Ask anyone who has had to do compatibility layers to support 32bit
binaries on 64bit systems.
> I couldn't see unsigned long long being
> mentioned in the ia32 abi spec that I found. I agree that those accesses
> might be synthetically assembled by gcc and for me the alignment of 4
> would have seemed natural. But my gcc at least in 32 bit mode disagrees
> with that.
Try (retyped):
echo 'struct { long a; long long b; } s; int bar { return sizeof s; }' >foo.c
gcc [-m32] -O2 -S foo.c; cat foo.s
And look at what is generated.
> Right now ipv6 addresses have an alignment of 4. So we couldn't even
> naturally pass them to siphash but would need to copy them around, which
> I feel like a source of bugs.
That is more of a problem on systems that don't support misaligned accesses.
Reading the 64bit values with two explicit 32bit reads would work.
I think you can get gcc to do that by adding an aligned(4) attribute to the
structure member.
David
^ permalink raw reply
* Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf
From: Arnaldo Carvalho de Melo @ 2016-12-15 15:35 UTC (permalink / raw)
To: Joe Stringer; +Cc: Daniel Borkmann, LKML, netdev, Wang Nan, ast
In-Reply-To: <20161215143329.GB6866@kernel.org>
Em Thu, Dec 15, 2016 at 11:33:29AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, Dec 14, 2016 at 02:46:23PM -0800, Joe Stringer escreveu:
> > On 14 December 2016 at 06:55, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > > So, Joe, can you try refreshing this work, starting from what I have in
> > > perf/core? It has the changes coming from net-next that Daniel warned us about
> > > and some more.
>
> > I've just respun this series based on the version you previously
> > applied to perf/core. Since bpf_prog_{attach,detach}() were added to
> > samples/libbpf, a new patch will shift these over to tools/lib/bpf.
> > Other than that, I folded "samples/bpf: Drop unnecessary build
> > targets." back into "samples/bpf: Switch over to libbpf", and I
> > noticed that there were a couple of unnecessary log buffers with the
> > latest changes. For any new sample programs, those were fixed up to
> > use libbpf as well.
>
> > Don't forget to do a "make headers_install" before attempting to build
> > the samples, access to the latest headers is required (as per the
> > readme in samples/bpf).
>
> Ah, README, I should read that ;-)
>
> I got used to how tools/perf/ work, i.e. it is self sufficient wrt
> in-flux stuff in the kernel, i.e. headers that are related to features
> it supports and that are under constant improvements, such as eBPF, kvm,
> syscall tables, etc.
>
> Anyway, will do the headers_install step inside a container, to avoid
> polluting my workstation.
heh: should've read that file, now I did:
<quote>
There are usually dependencies to header files of the current kernel.
To avoid installing devel kernel headers system wide, as a normal
user, simply call::
make headers_install
This will creates a local "usr/include" directory in the git/build top
level directory, that the make system automatically pickup first.
</quote>
> Thanks for doing the respin and for the clarifications about building
> samples/bpf/.
>
> - Arnaldo
^ permalink raw reply
* Re: wl1251 & mac address & calibration data
From: Pali Rohár @ 2016-12-15 15:33 UTC (permalink / raw)
To: Kalle Valo
Cc: Sebastian Reichel, Pavel Machek, Michal Kazior, Ivaylo Dimitrov,
Aaro Koskinen, Tony Lindgren, linux-wireless, Network Development,
linux-kernel, Luis R. Rodriguez
In-Reply-To: <87d1gtli57.fsf@purkki.adurom.net>
On Thu Dec 15 09:18:44 2016 Kalle Valo <kvalo@codeaurora.org> wrote:
> (Adding Luis because he has been working on request_firmware() lately)
>
> Pali Rohár <pali.rohar@gmail.com> writes:
>
> > > > So no, there is no argument against... request_firmware() in
> > > > fallback mode with userspace helper is by design blocking and
> > > > waiting for userspace. But waiting for some change in DTS in
> > > > kernel is just nonsense.
> > >
> > > I would just mark the wlan device with status = "disabled" and
> > > enable it in the overlay together with adding the NVS & MAC info.
> >
> > So if you think that this solution make sense, we can wait what net
> > wireless maintainers say about it...
> >
> > For me it looks like that solution can be:
> >
> > extending request_firmware() to use only userspace helper
>
> I haven't followed the discussion very closely but this is my preference
> what drivers should do:
>
> 1) First the driver should do try to get the calibration data and mac
> address from the device tree.
>
Ok, but there is no (dynamic, device specific) data in DTS for N900. So 1) is noop.
> 2) If they are not in DT the driver should retrieve the calibration data
> with request_firmware(). BUT with an option for user space to
> implement that with a helper script so that the data can be created
> dynamically, which I believe openwrt does with ath10k calibration
> data right now.
Currently there is flag for request_firmware() that it should fallback to user helper if direct VFS access not find needed firmware.
But this flag is not suitable as /lib/firmware already provides default (not device specific) calibration data.
So I would suggest to add another flag/function which will primary use user helper.
> > and load mac address also via request_firmware() either by appending
> > it into NVS data or via separate call
>
> I'm not really fan of the idea providing permanent mac address through
> request_firmware(). For example, how to handle multiple devices on the
> same host, would there be a need for some kind of bus ids encoded to the
> filename? And what about devices with multiple mac addresses?
For N900 there is only one wl1251 device. And... wl12xx is already using appended MAC address in calibration data read by request firmware. So reason why I prefer similar usage also for wl1251.
> I wish there would be a better way than request_firmware() to provide
> the permanent mac addresses from user space (if device tree is not
> available), I just don't know what that could be :) But if we would
> start to use request_firmware() for this at least there should be a
> wider concensus about that and it should be properly documented, just
> like the device tree bindings.
>
> --
> Kalle Valo
I do not know about any other, so reason why I'm asking :-) and there are my proposed solutions. If you (or any other) came up with better we can discuss about it :-)
--
Pali Rohár
pali.rohar@gmail.com
^ permalink raw reply
* [PATCH net] sctp: sctp_transport_lookup_process should rcu_read_unlock when transport is null
From: Xin Long @ 2016-12-15 15:05 UTC (permalink / raw)
To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman
Prior to this patch, sctp_transport_lookup_process didn't rcu_read_unlock
when it failed to find a transport by sctp_addrs_lookup_transport.
This patch is to fix it by moving up rcu_read_unlock right before checking
transport and also to remove the out path.
Fixes: 1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/sctp/socket.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index d5f4b4a..318c678 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4472,18 +4472,17 @@ int sctp_transport_lookup_process(int (*cb)(struct sctp_transport *, void *),
const union sctp_addr *paddr, void *p)
{
struct sctp_transport *transport;
- int err = -ENOENT;
+ int err;
rcu_read_lock();
transport = sctp_addrs_lookup_transport(net, laddr, paddr);
+ rcu_read_unlock();
if (!transport)
- goto out;
+ return -ENOENT;
- rcu_read_unlock();
err = cb(transport, p);
sctp_transport_put(transport);
-out:
return err;
}
EXPORT_SYMBOL_GPL(sctp_transport_lookup_process);
--
2.1.0
^ permalink raw reply related
* [PATCH net] sctp: sctp_epaddr_lookup_transport should be protected by rcu_read_lock
From: Xin Long @ 2016-12-15 15:00 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, Marcelo Ricardo Leitner, Neil Horman, Dmitry Vyukov
Since commit 7fda702f9315 ("sctp: use new rhlist interface on sctp transport
rhashtable"), sctp has changed to use rhlist_lookup to look up transport, but
rhlist_lookup doesn't call rcu_read_lock inside, unlike rhashtable_lookup_fast.
It is called in sctp_epaddr_lookup_transport and sctp_addrs_lookup_transport.
sctp_addrs_lookup_transport is always in the protection of rcu_read_lock(),
as __sctp_lookup_association is called in rx path or sctp_lookup_association
which are in the protection of rcu_read_lock() already.
But sctp_epaddr_lookup_transport is called by sctp_endpoint_lookup_assoc, it
doesn't call rcu_read_lock, which may cause "suspicious rcu_dereference_check
usage' in __rhashtable_lookup.
This patch is to fix it by adding rcu_read_lock in sctp_endpoint_lookup_assoc
before calling sctp_epaddr_lookup_transport.
Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/sctp/endpointola.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
index 1f03065..410ddc1 100644
--- a/net/sctp/endpointola.c
+++ b/net/sctp/endpointola.c
@@ -331,7 +331,9 @@ struct sctp_association *sctp_endpoint_lookup_assoc(
* on this endpoint.
*/
if (!ep->base.bind_addr.port)
- goto out;
+ return NULL;
+
+ rcu_read_lock();
t = sctp_epaddr_lookup_transport(ep, paddr);
if (!t)
goto out;
@@ -339,6 +341,7 @@ struct sctp_association *sctp_endpoint_lookup_assoc(
*transport = t;
asoc = t->asoc;
out:
+ rcu_read_unlock();
return asoc;
}
--
2.1.0
^ permalink raw reply related
* Re: [v3] net: ethernet: cavium: octeon: octeon_mgmt: Handle return NULL error from devm_ioremap
From: arvind Yadav @ 2016-12-15 15:00 UTC (permalink / raw)
To: David Daney, peter.chen, fw, david.daney, f.fainelli; +Cc: netdev, linux-kernel
In-Reply-To: <41599714-309c-8590-b2b0-e468cffb57b5@caviumnetworks.com>
Hi David,
I did not tested this feature. I have build it and flashed on hardware.
You can check below commit id. Which has similar check for ioremap.
1- Commit id - de9e397e40f56b9f34af4bf6a5bd7a75ea02456c
In 'drivers/net/phy/mdio-octeon.c'
2- Commit id - 592569de4c247fe4f25db8369dc0c63860f9560b
In 'drivers/gpio/gpio-octeon.c'
Thanks
Arvind
On Thursday 15 December 2016 12:58 AM, David Daney wrote:
> On 12/14/2016 11:03 AM, Arvind Yadav wrote:
>> Here, If devm_ioremap will fail. It will return NULL.
>> Kernel can run into a NULL-pointer dereference.
>> This error check will avoid NULL pointer dereference. t
>
> I have asked you twice already this question, but could not determine
> from your response what the answer is:
>
> Q: Have you tested the patch on OCTEON based hardware that contains
> the "octeon_mgmt" Ethernet ports? Please answer either "yes" or "no".
>
>
> Thanks,
> David Daney
>
>
>>
>> Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
>> ---
>> drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>> b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>> index 4ab404f..33c2fec 100644
>> --- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>> +++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
>> @@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct
>> platform_device *pdev)
>> p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
>> p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
>> p->agl_prt_ctl_size);
>> + if (!p->mix || !p->agl || !p->agl_prt_ctl) {
>> + dev_err(&pdev->dev, "failed to map I/O memory\n");
>> + result = -ENOMEM;
>> + goto err;
>> + }
>> +
>> spin_lock_init(&p->lock);
>>
>> skb_queue_head_init(&p->tx_list);
>>
>
^ permalink raw reply
* Re: [PATCH net 2/2] net/sched: cls_flower: Use masked key when calling HW offloads
From: Jiri Pirko @ 2016-12-15 14:59 UTC (permalink / raw)
To: Simon Horman
Cc: Paul Blakey, David S. Miller, netdev, Jiri Pirko, Or Gerlitz,
Roi Dayan, Shahar Klein, Hadar Hen Zion
In-Reply-To: <20161215135043.GA7104@penelope.horms.nl>
Thu, Dec 15, 2016 at 02:50:44PM CET, simon.horman@netronome.com wrote:
>Hi Paul,
>
>On Wed, Dec 14, 2016 at 07:00:58PM +0200, Paul Blakey wrote:
>> Zero bits on the mask signify a "don't care" on the corresponding bits
>> in key. Some HWs require those bits on the key to be zero. Since these
>> bits are masked anyway, it's okay to provide the masked key to all
>> drivers.
>>
>> Fixes: 5b33f48842fa ('net/flower: Introduce hardware offload support')
>> Signed-off-by: Paul Blakey <paulb@mellanox.com>
>> Reviewed-by: Roi Dayan <roid@mellanox.com>
>> Acked-by: Jiri Pirko <jiri@mellanox.com>
>
>While I don't have a specific use case in mind that this change would break
>it seems to me that it would be better to handle hardware requirements
>at the driver level.
Even though, makes no sense to pass unmasked key down. Is is only
confusing. This patch fixes it.
>
>> ---
>> net/sched/cls_flower.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>> index 9758f5a..35ac28d 100644
>> --- a/net/sched/cls_flower.c
>> +++ b/net/sched/cls_flower.c
>> @@ -252,7 +252,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
>> offload.cookie = (unsigned long)f;
>> offload.dissector = dissector;
>> offload.mask = mask;
>> - offload.key = &f->key;
>> + offload.key = &f->mkey;
>> offload.exts = &f->exts;
>>
>> tc->type = TC_SETUP_CLSFLOWER;
>> --
>> 1.8.3.1
>>
^ permalink raw reply
* Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
From: Hannes Frederic Sowa @ 2016-12-15 14:56 UTC (permalink / raw)
To: David Laight, Jason A. Donenfeld
Cc: Netdev, kernel-hardening@lists.openwall.com,
Jean-Philippe Aumasson, LKML, Linux Crypto Mailing List,
Daniel J . Bernstein, Linus Torvalds, Eric Biggers
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0240437@AcuExch.aculab.com>
On 15.12.2016 14:56, David Laight wrote:
> From: Hannes Frederic Sowa
>> Sent: 15 December 2016 12:50
>> On 15.12.2016 13:28, David Laight wrote:
>>> From: Hannes Frederic Sowa
>>>> Sent: 15 December 2016 12:23
>>> ...
>>>> Hmm? Even the Intel ABI expects alignment of unsigned long long to be 8
>>>> bytes on 32 bit. Do you question that?
>>>
>>> Yes.
>>>
>>> The linux ABI for x86 (32 bit) only requires 32bit alignment for u64 (etc).
>>
>> Hmm, u64 on 32 bit is unsigned long long and not unsigned long. Thus I
>> am actually not sure if the ABI would say anything about that (sorry
>> also for my wrong statement above).
>>
>> Alignment requirement of unsigned long long on gcc with -m32 actually
>> seem to be 8.
>
> It depends on the architecture.
> For x86 it is definitely 4.
May I ask for a reference? I couldn't see unsigned long long being
mentioned in the ia32 abi spec that I found. I agree that those accesses
might be synthetically assembled by gcc and for me the alignment of 4
would have seemed natural. But my gcc at least in 32 bit mode disagrees
with that.
> It might be 8 for sparc, ppc and/or alpha.
This is something to find out...
Right now ipv6 addresses have an alignment of 4. So we couldn't even
naturally pass them to siphash but would need to copy them around, which
I feel like a source of bugs.
Bye,
Hannes
^ permalink raw reply
* Re: [RFC v2 00/10] HFI Virtual Network Interface Controller (VNIC)
From: ira.weiny @ 2016-12-15 14:52 UTC (permalink / raw)
To: Leon Romanovsky, Doug Ledford, Jeff Kirsher, David S. Miller
Cc: Vishwanathapura, Niranjana, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <20161215091226.GC811-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
On Thu, Dec 15, 2016 at 11:12:26AM +0200, Leon Romanovsky wrote:
> On Wed, Dec 14, 2016 at 11:59:32PM -0800, Vishwanathapura, Niranjana wrote:
> > Thanks Jason for the valuable feedback.
> > Here is the revised HFI VNIC patch series.
> >
> > ChangeLog:
> > =========
> > v1 => v2:
> > a) Removed hfi_vnic bus, instead make hfi_vnic driver an 'ib client',
> > as per feedback from Jason Gunthorpe.
> > b) Interface changes, data structure changes and variable name changes
> > associated with (a).
> > c) Add hfi_ibdev abstraction to provide VNIC control operations to
> > hfi_vnic client.
> > d) Minor fixes
> > e) Moved hfi_vnic driver from .../sw/intel/vnic/hfi_vnic to
> > .../sw/intel/hfi_vnic.
>
> To put it into proportion, Jason asked you to do different thing.
> http://marc.info/?l=linux-rdma&m=147977108302151&w=2
> http://marc.info/?l=linux-rdma&m=148000415401842&w=2
>
> And Christoph,
> http://marc.info/?l=linux-rdma&m=147985587425861&w=2
Understood. However, we never heard back from Niranjanas analysis of the code
which stated that > 60% of the code was dealing with the OPA MADs used to
configure this device.
https://www.spinics.net/lists/linux-rdma/msg43579.html
Furthermore, neither Dave nor Doug has had time to weigh in on what we should
do.
So before we make that change we wanted to get consensus on using the
hfi1_ibdev abstraction rather than the bus. This was the _real_ technical
change.
Beyond that it is really just which maintainer wants this driver. To that end
I've also cc'ed Jeff Kirsher who maintains drivers/net/ethernet/intel. Perhaps
Dave would like the driver to go through that tree?
I think there are pros and cons to both subtrees and in the end we will do
whatever is decided.
For maintainer review:
1) The driver encapsulates ethernet packets with OPA headers
2) VNIC uses OPA management packets (MADs) for its configuration
3) A significant portion (> 60% +) of the code is specific to OPA
https://www.spinics.net/lists/linux-rdma/msg43579.html
4) The driver is from Intel and we expect Intel to be the primary
contributor to the code.
5) The driver, like hfi1, is dual licensed (GPL/BSD)
6) Based on Christophs feedback we will be adding device capability
bits to the IB core to indicate HFI VNIC support.
https://www.spinics.net/lists/linux-rdma/msg44113.html
Doug, Dave, Jeff any thoughts?
Ira
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf
From: Arnaldo Carvalho de Melo @ 2016-12-15 14:33 UTC (permalink / raw)
To: Joe Stringer; +Cc: Daniel Borkmann, LKML, netdev, Wang Nan, ast
In-Reply-To: <CAPWQB7E-i+m0Ukz6O39AV-pHMB=GsoUTTa1862QUjGWM3RQJyQ@mail.gmail.com>
Em Wed, Dec 14, 2016 at 02:46:23PM -0800, Joe Stringer escreveu:
> On 14 December 2016 at 06:55, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > So, Joe, can you try refreshing this work, starting from what I have in
> > perf/core? It has the changes coming from net-next that Daniel warned us about
> > and some more.
> I've just respun this series based on the version you previously
> applied to perf/core. Since bpf_prog_{attach,detach}() were added to
> samples/libbpf, a new patch will shift these over to tools/lib/bpf.
> Other than that, I folded "samples/bpf: Drop unnecessary build
> targets." back into "samples/bpf: Switch over to libbpf", and I
> noticed that there were a couple of unnecessary log buffers with the
> latest changes. For any new sample programs, those were fixed up to
> use libbpf as well.
> Don't forget to do a "make headers_install" before attempting to build
> the samples, access to the latest headers is required (as per the
> readme in samples/bpf).
Ah, README, I should read that ;-)
I got used to how tools/perf/ work, i.e. it is self sufficient wrt
in-flux stuff in the kernel, i.e. headers that are related to features
it supports and that are under constant improvements, such as eBPF, kvm,
syscall tables, etc.
Anyway, will do the headers_install step inside a container, to avoid
polluting my workstation.
Thanks for doing the respin and for the clarifications about building
samples/bpf/.
- Arnaldo
^ permalink raw reply
* [PATCH net] ixgbe: update the rss key on h/w, when ethtool ask for it.
From: Paolo Abeni @ 2016-12-15 14:20 UTC (permalink / raw)
To: netdev; +Cc: Jeff Kirsher, intel-wired-lan
Currently ixgbe_set_rxfh() updates the rss_key copy in the driver
memory, but does not push the new value into the h/w. This commit
add a new helper for the latter operation and call it in
ixgbe_set_rxfh(), so that the h/w rss key value can be really
updated via ethtool.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe.h | 1 +
drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 4 +++-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 19 ++++++++++++++++---
3 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index ef81c3d..8fb9fbf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -1026,6 +1026,7 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
struct ixgbe_adapter *adapter,
struct ixgbe_ring *tx_ring);
u32 ixgbe_rss_indir_tbl_entries(struct ixgbe_adapter *adapter);
+void ixgbe_store_key(struct ixgbe_adapter *adapter);
void ixgbe_store_reta(struct ixgbe_adapter *adapter);
s32 ixgbe_negotiate_fc(struct ixgbe_hw *hw, u32 adv_reg, u32 lp_reg,
u32 adv_sym, u32 adv_asm, u32 lp_sym, u32 lp_asm);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index fd192bf..e40f9ce 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -3003,8 +3003,10 @@ static int ixgbe_set_rxfh(struct net_device *netdev, const u32 *indir,
}
/* Fill out the rss hash key */
- if (key)
+ if (key) {
memcpy(adapter->rss_key, key, ixgbe_get_rxfh_key_size(netdev));
+ ixgbe_store_key(adapter);
+ }
ixgbe_store_reta(adapter);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1e2f39e..0c23ab8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3411,6 +3411,21 @@ u32 ixgbe_rss_indir_tbl_entries(struct ixgbe_adapter *adapter)
}
/**
+ * ixgbe_store_key - Write the RSS key to HW
+ * @adapter: device handle
+ *
+ * Write the RSS key stored in adapter.rss_key to HW.
+ */
+void ixgbe_store_key(struct ixgbe_adapter *adapter)
+{
+ struct ixgbe_hw *hw = &adapter->hw;
+ int i;
+
+ for (i = 0; i < 10; i++)
+ IXGBE_WRITE_REG(hw, IXGBE_RSSRK(i), adapter->rss_key[i]);
+}
+
+/**
* ixgbe_store_reta - Write the RETA table to HW
* @adapter: device handle
*
@@ -3475,7 +3490,6 @@ static void ixgbe_store_vfreta(struct ixgbe_adapter *adapter)
static void ixgbe_setup_reta(struct ixgbe_adapter *adapter)
{
- struct ixgbe_hw *hw = &adapter->hw;
u32 i, j;
u32 reta_entries = ixgbe_rss_indir_tbl_entries(adapter);
u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
@@ -3488,8 +3502,7 @@ static void ixgbe_setup_reta(struct ixgbe_adapter *adapter)
rss_i = 4;
/* Fill out hash function seeds */
- for (i = 0; i < 10; i++)
- IXGBE_WRITE_REG(hw, IXGBE_RSSRK(i), adapter->rss_key[i]);
+ ixgbe_store_key(adapter);
/* Fill out redirection table */
memset(adapter->rss_indir_tbl, 0, sizeof(adapter->rss_indir_tbl));
--
1.8.3.1
^ permalink raw reply related
* RE: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
From: David Laight @ 2016-12-15 13:56 UTC (permalink / raw)
To: 'Hannes Frederic Sowa', Jason A. Donenfeld
Cc: Netdev, kernel-hardening@lists.openwall.com,
Jean-Philippe Aumasson, LKML, Linux Crypto Mailing List,
Daniel J . Bernstein, Linus Torvalds, Eric Biggers
In-Reply-To: <0f3c3694-c00b-aae2-5b08-25bc64bf6372@stressinduktion.org>
From: Hannes Frederic Sowa
> Sent: 15 December 2016 12:50
> On 15.12.2016 13:28, David Laight wrote:
> > From: Hannes Frederic Sowa
> >> Sent: 15 December 2016 12:23
> > ...
> >> Hmm? Even the Intel ABI expects alignment of unsigned long long to be 8
> >> bytes on 32 bit. Do you question that?
> >
> > Yes.
> >
> > The linux ABI for x86 (32 bit) only requires 32bit alignment for u64 (etc).
>
> Hmm, u64 on 32 bit is unsigned long long and not unsigned long. Thus I
> am actually not sure if the ABI would say anything about that (sorry
> also for my wrong statement above).
>
> Alignment requirement of unsigned long long on gcc with -m32 actually
> seem to be 8.
It depends on the architecture.
For x86 it is definitely 4.
It might be 8 for sparc, ppc and/or alpha.
David
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox