* [PATCH net-next v2 2/2] i40e: fix setting debug parameter early
From: Stefan Assmann @ 2016-09-23 13:30 UTC (permalink / raw)
To: intel-wired-lan
Cc: netdev, davem, jeffrey.t.kirsher, carolyn.wyborny, sassmann
In-Reply-To: <1474637458-5255-1-git-send-email-sassmann@kpanic.de>
pf->msg_enable is a bitmask, therefore assigning the value of the
"debug" parameter is wrong. It is initialized again later in
i40e_sw_init() so it didn't cause any problem, except that we missed
early debug messages. Moved the initialization and assigned
pf->hw.debug_mask the bitmask as that's what the driver actually uses
in i40e_debug(). Otherwise the debug parameter is just a noop.
Fixes: 5b5faa4 ("i40e: enable debug earlier")
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 56369761..f972f0d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -8498,11 +8498,6 @@ static int i40e_sw_init(struct i40e_pf *pf)
int err = 0;
int size;
- pf->msg_enable = netif_msg_init(debug,
- NETIF_MSG_DRV |
- NETIF_MSG_PROBE |
- NETIF_MSG_LINK);
-
/* Set default capability flags */
pf->flags = I40E_FLAG_RX_CSUM_ENABLED |
I40E_FLAG_MSI_ENABLED |
@@ -10812,10 +10807,13 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
mutex_init(&hw->aq.asq_mutex);
mutex_init(&hw->aq.arq_mutex);
- if (debug != -1) {
- pf->msg_enable = pf->hw.debug_mask;
- pf->msg_enable = debug;
- }
+ /* enable debug prints if requested */
+ pf->msg_enable = netif_msg_init(debug,
+ NETIF_MSG_DRV |
+ NETIF_MSG_PROBE |
+ NETIF_MSG_LINK);
+ if (debug != -1)
+ pf->hw.debug_mask = pf->msg_enable;
/* do a special CORER for clearing PXE mode once at init */
if (hw->revision_id == 0 &&
--
2.7.4
^ permalink raw reply related
* RE: [PATCH v2] fs/select: add vmalloc fallback for select(2)
From: David Laight @ 2016-09-23 13:35 UTC (permalink / raw)
To: 'Vlastimil Babka', Eric Dumazet
Cc: Alexander Viro, Andrew Morton, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko,
netdev@vger.kernel.org, Linux API, linux-man@vger.kernel.org
In-Reply-To: <3bbcc269-ec8b-12dd-e0ae-190c18bc3f47@suse.cz>
From: Vlastimil Babka
> Sent: 23 September 2016 10:59
...
> > I suspect that fdt->max_fds is an upper bound for the highest fd the
> > process has open - not the RLIMIT_NOFILE value.
>
> I gathered that the highest fd effectively limits the number of files,
> so it's the same. I might be wrong.
An application can reduce RLIMIT_NOFILE below that of an open file.
David
^ permalink raw reply
* Re: [PATCH net-next] tcp: add tcp_add_backlog()
From: Eric Dumazet @ 2016-09-23 13:42 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20160923124517.GB17222@localhost.localdomain>
On Fri, 2016-09-23 at 09:45 -0300, Marcelo Ricardo Leitner wrote:
> Aye. In that case, what about using tail instead of end?
What do you mean exactly ?
> Because
> accounting for something that we have to tweak the limits to accept is
> like adding a constant to both sides of the equation.
> But perhaps that would cut out too much of the fat which could be used
> later by the stack.
Are you facing a particular problem with current code ?
I am working to reduce the SACK at their source (the receiver), instead
of trying to filter them when they had to travel all the way back to TCP
sender.
^ permalink raw reply
* [PATCH] cxgb4: fix -ve error check on a signed iq
From: Colin King @ 2016-09-23 13:45 UTC (permalink / raw)
To: Hariprasad S, netdev, linux-kernel
From: Colin Ian King <colin.king@canonical.com>
iq is unsigned, so the error check for iq < 0 has no effect so errors
can slip past this check. Fix this by making iq signed and also
get_filter_steerq return a signed int so a -ve error can be returned.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
index 2a61617..9c3644d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
@@ -117,11 +117,11 @@ static int validate_filter(struct net_device *dev,
return 0;
}
-static unsigned int get_filter_steerq(struct net_device *dev,
+static int get_filter_steerq(struct net_device *dev,
struct ch_filter_specification *fs)
{
struct adapter *adapter = netdev2adap(dev);
- unsigned int iq;
+ int iq;
/* If the user has requested steering matching Ingress Packets
* to a specific Queue Set, we need to make sure it's in range
@@ -443,10 +443,10 @@ int __cxgb4_set_filter(struct net_device *dev, int filter_id,
struct filter_ctx *ctx)
{
struct adapter *adapter = netdev2adap(dev);
- unsigned int max_fidx, fidx, iq;
+ unsigned int max_fidx, fidx;
struct filter_entry *f;
u32 iconf;
- int ret;
+ int iq, ret;
max_fidx = adapter->tids.nftids;
if (filter_id != (max_fidx + adapter->tids.nsftids - 1) &&
--
2.9.3
^ permalink raw reply related
* Re: [PATCH] net: VRF: Fix receiving multicast traffic
From: David Ahern @ 2016-09-23 13:49 UTC (permalink / raw)
To: Mark Tomlinson, netdev@vger.kernel.org
In-Reply-To: <f1ead8ba-5404-36f3-40a9-35105006017d@alliedtelesis.co.nz>
On 9/22/16 9:06 PM, Mark Tomlinson wrote:
>
> On 09/23/2016 10:41 AM, David Ahern wrote:
>> On 9/22/16 4:10 PM, Mark Tomlinson wrote:
>>> On 09/23/2016 03:14 AM, David Ahern wrote:
>>>> l3mdev devices do not support IPv4 multicast so checking mcast against that device should not be working at all. For that reason I was fine with the change in the previous patch. ie., you want the real ingress device there not the vrf device.
>>>>
>>>> What test are you running that says your previous patch broke something?
>>> Although we do not expect any multicast routing to work in an l3mdev,
>>> (IGMP snooping or PIM), we still want to have multicast packets
>>> delivered for protocols such as RIP. This was working before my previous
>>> patch, but these multicast packets are now dropped. This current patch
>>> fixes that again, hopefully still with the benefits of my first patch.
>>>
>> can you discern which check is making that happen?
>>
>> It does not make sense to look at the in_device of a vrf device for mcast addresses. For IPv6 linklocal and mcast is specifically blocked. IPv4 should do the same. So, how is RIP getting the packet at all?
> This might be due to some other changes we've made for VRF and multicast
> but haven't sent upstream. In particular, a change to do_ip_setsockopt()
> and its handling of IP_MULTICAST_IF as well as IP_ADD/DROP_MEMBERSHIP. I
> am guessing that without these changes, we wouldn't be able to receive
> multicast packets in RIP. With our changes, the in_dev->mc_list does
> contain the RIP MC address (224.0.0.9) in the master interface, and so
> the function ip_check_mc_rcu() returns success with the master only.
>
> Our RIP daemon is VRF-aware. So it does use setsockopt(SO_BINDTODEVICE,
> "vrf-master") when running in a VRF. Without following it all the way
> down, I believe that it is this that allows the multicast lookup at the
> top of ip_check_mc_rcu() to succeed on the vrf-master, but not the
> ingress interface. That is, in_dev->mc_list does contain 224.0.0.9 only
> on the vrf-master. Provided the lookup in ip_check_mc_rcu() succeeds (im
> != NULL), this function can return success.
>
> Are you interested in the other patches at the moment?
>
Yes, but with the context of the bigger IPv4 multicast solution. I am on PTO today - about to get on a plane. How about we leave the upstream kernel as is - i.e., drop this patch. You can carry it locally with the others. We can take a look at the bigger mcast picture for 4.10. Agree?
^ permalink raw reply
* Re: [net-next 5/5] PCI: disable FLR for 82579 device
From: Bjorn Helgaas @ 2016-09-23 14:01 UTC (permalink / raw)
To: Jeff Kirsher
Cc: davem, bhelgaas, Sasha Neftin, netdev, nhorman, sassmann,
jogreene, guru.anbalagane, linux-pci
In-Reply-To: <1474612741-75681-6-git-send-email-jeffrey.t.kirsher@intel.com>
On Thu, Sep 22, 2016 at 11:39:01PM -0700, Jeff Kirsher wrote:
> From: Sasha Neftin <sasha.neftin@intel.com>
>
> 82579 has a problem reattaching itself after the device is detached.
> The bug was reported by Redhat. The suggested fix is to disable
> FLR capability in PCIe configuration space.
>
> Reproduction:
> Attach the device to a VM, then detach and try to attach again.
>
> Fix:
> Disable FLR capability to prevent the 82579 from hanging.
Is there a bugzilla or other reference URL to include here? Should
this be marked for stable?
> Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
> Tested-by: Aaron Brown <aaron.f.brown@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> drivers/pci/quirks.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44e0ff3..59fba6e 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4431,3 +4431,24 @@ static void quirk_intel_qat_vf_cap(struct pci_dev *pdev)
> }
> }
> DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, quirk_intel_qat_vf_cap);
> +/*
> + * Workaround FLR issues for 82579
> + * This code disables the FLR (Function Level Reset) via PCIe, in order
> + * to workaround a bug found while using device passthrough, where the
> + * interface would become non-responsive.
> + * NOTE: the FLR bit is Read/Write Once (RWO) in config space, so if
> + * the BIOS or kernel writes this register * then this workaround will
> + * not work.
This doesn't sound like a root cause. Is the issue a hardware
erratum? Linux PCI core bug? VFIO bug? Device firmware bug?
The changelog suggests that the problem only affects passthrough,
which suggests some sort of kernel bug related to how passthrough is
implemented.
> + */
> +static void quirk_intel_flr_cap_dis(struct pci_dev *dev)
> +{
> + int pos = pci_find_capability(dev, PCI_CAP_ID_AF);
> + if (pos) {
> + u8 cap;
> + pci_read_config_byte(dev, pos + PCI_AF_CAP, &cap);
> + cap = cap & (~PCI_AF_CAP_FLR);
> + pci_write_config_byte(dev, pos + PCI_AF_CAP, cap);
> + }
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_intel_flr_cap_dis);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_intel_flr_cap_dis);
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] net: bcmgenet: Fix EPHY reset in power up
From: Andrew Lunn @ 2016-09-23 14:06 UTC (permalink / raw)
To: Jaedon Shin; +Cc: Florian Fainelli, David S . Miller, Philippe Reynes, netdev
In-Reply-To: <20160923132004.5734-1-jaedon.shin@gmail.com>
On Fri, Sep 23, 2016 at 10:20:04PM +0900, Jaedon Shin wrote:
> The bcmgenet_mii_reset() is always not running in power up sequence
> after 'commit 62469c76007e ("net: ethernet: bcmgenet: use phydev from
> struct net_device")'. This'll show extremely high latency and duplicate
> packets while interface down and up repeatedly.
>
> For now, adds again a private phydev for mii reset when runs power up to
> open interface.
Hi Jaedon
How does this fix the issue? It sounds like you are papering over the
crack, not truly fixing it.
Andrew
^ permalink raw reply
* Re: [PATCH net-next] tcp: add tcp_add_backlog()
From: Marcelo Ricardo Leitner @ 2016-09-23 14:09 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng
In-Reply-To: <1474638171.28155.18.camel@edumazet-glaptop3.roam.corp.google.com>
On Fri, Sep 23, 2016 at 06:42:51AM -0700, Eric Dumazet wrote:
> On Fri, 2016-09-23 at 09:45 -0300, Marcelo Ricardo Leitner wrote:
>
> > Aye. In that case, what about using tail instead of end?
>
>
> What do you mean exactly ?
Something like:
-skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
+skb->truesize = SKB_TRUESIZE(skb_tail_offset(skb));
And define skb_tail_offset() to something similar skb_end_offset(), so
that it would account only for the data and not unused space in the
buffer.
>
> > Because
> > accounting for something that we have to tweak the limits to accept is
> > like adding a constant to both sides of the equation.
> > But perhaps that would cut out too much of the fat which could be used
> > later by the stack.
>
> Are you facing a particular problem with current code ?
>
For TCP, no, just wondering. :-)
I'm having similar issues with SCTP: if the socket gets backlogged, the
buffer accounting gets pretty messy. SCTP calculates the rwnd to be just
rcvbuf/2, but this ratio is often different in backlog and it causes the
advertized window to be too big, resulting in packet drops in the
backlog.
SCTP has some way to identify and compensate this "extra" rwnd, via
rwnd_press, and will shrink it if it detects that the window is bigger
than the buffer available. But as the socket is backlogged, it's doesn't
kick in soon enough to prevent such drops.
It's not just a matter of adjusting the overhead ratio (rcvbuf/2)
because with SCTP the packets may have different sizes, so a packet with
a chunk of 100 bytes will have a ratio and another with 1000 bytes will
have another, within the same association.
So I'm leaning towards on updating truesize before adding to the
backlog, but to account just for the actual packet, regardless of the
buffer that was used for it. It still has the overhead ratio issue with
different packet sizes, though, but smaller.
Note that SCTP doesn't have buffer auto-tuning yet.
> I am working to reduce the SACK at their source (the receiver), instead
> of trying to filter them when they had to travel all the way back to TCP
> sender.
>
Cool.
^ permalink raw reply
* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Tom Herbert @ 2016-09-23 14:14 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
Tariq Toukan, Brenden Blanco, Alexei Starovoitov, Eric Dumazet,
Jesper Dangaard Brouer
In-Reply-To: <06280af8-5451-c423-d295-a5f3f51e63cf@mojatatu.com>
On Fri, Sep 23, 2016 at 4:13 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 16-09-20 06:00 PM, Tom Herbert wrote:
>>
>> This patch creates an infrastructure for registering and running code at
>> XDP hooks in drivers. This is based on the orignal XDP?BPF and borrows
>> heavily from the techniques used by netfilter to make generic nfhooks.
>>
>> An XDP hook is defined by the xdp_hook_ops. This structure contains the
>> ops of an XDP hook. A pointer to this structure is passed into the XDP
>> register function to set up a hook. The XDP register function mallocs
>> its own xdp_hook_ops structure and copies the values from the
>> xdp_hook_ops passed in. The register function also stores the pointer
>> value of the xdp_hook_ops argument; this pointer is used in subsequently
>> calls to XDP to identify the registered hook.
>>
>> The interface is defined in net/xdp.h. This includes the definition of
>> xdp_hook_ops, functions to register and unregister hook ops on a device
>> or individual instances of napi, and xdp_hook_run that is called by
>> drivers to run the hooks.
>>
>
> Tom,
> perused the thread and it seems you are serious ;->
> Are we heading towards Frankenstein Avenue?
I'm surprised. I thought you'd like democratizing an interface. :-)
> The whole point behind letting in XDP is so that _small programs_
> can be written to do some quick thing. eBPF 4K program limit was
> touted as the check and bound. eBPF sounded fine.
> This sounds like a huge contradiction.
Jamal, thanks for the feedback.
IMO, XDP != BPF. XDP is an interface in drivers that allow raw packet
processing in the receive path,. This is akin to DPDK in the kernel.
BPF is critical use case of the interface.
As for the 4K limit that still allows for a lot of damage and abuse
that the user can do with a loadable program and, yes, code in the
kernel can do significant damage as well. However, unlike BPF programs
being set from userspace, kernel code has a well established process
for acceptance, any suggested code gets review, and if it's going
"Frankenstein Avenue" I'd expect it to be rejected. I get a little
worried that we are going to start artificially limiting what we can
do in the kernel on the basis that we don't trust kernel programmers
to be competent enough not to abuse kernel interfaces-- the fact that
we are enabling userspace to arbitrarily program the kernel but
purposely limiting what the kernel itself can do would be a huge irony
to me.
Thanks,
Tom
>
> cheers,
> jamal
>
>
^ permalink raw reply
* Re: [PATCH net-next] net/vxlan: Avoid unaligned access in vxlan_build_skb()
From: Alexander Duyck @ 2016-09-23 14:17 UTC (permalink / raw)
To: David Miller
Cc: Sowmini Varadhan, Jiri Benc, Netdev, Hannes Frederic Sowa,
Alexander Duyck, Daniel Borkmann, Paolo Abeni
In-Reply-To: <20160923.080618.678103827885223586.davem@davemloft.net>
On Fri, Sep 23, 2016 at 5:06 AM, David Miller <davem@davemloft.net> wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Thu, 22 Sep 2016 17:30:10 -0400
>
>> On (09/22/16 01:52), David Miller wrote:
>>> Alternatively we can do Alexander Duyck's trick, by pushing
>>> the headers into the frag list, forcing a pull and realignment
>>> by the next protocol layer.
>>
>> What is the "Alexander Duyck trick" (hints about module or commit id,
>> where this can be found, please)?
>>
>> Is this basically about, e.g., putting the vxlanhdr in its own
>> skb_frag_t, or something else?
>
> Yes, and this way skb_header_pointer() is forced to do a memcpy.
It is something I had proposed on the Rx side of things back in
February when the issue came up in relation to skb parsing. On
architectures that require NET_IP_ALIGN to be set I had proposed that
we should only pull the outer headers from the page frag, and then
when the time is right we drop the outer headers and pull the inner
headers from the page frag. That way we can keep all the headers
aligned.
For Tx it all becomes a bit trickier since it would likely require us
to shift the frags up by 1 when we go from outer headers to inner
headers. One thought I had on that is that we could possibly avoid
having to do any allocation and could just take a reference on the
head_frag if that is what we are using. Then we just add a 2 byte pad
and resume writing headers in place and the pointer offsets for the
inner headers would remain valid, though they would be past the point
of skb->tail.
- Alex
^ permalink raw reply
* Re: [RFC] net: store port/representative id in metadata_dst
From: John Fastabend @ 2016-09-23 14:23 UTC (permalink / raw)
To: Jakub Kicinski, Jiri Benc, Jiri Pirko
Cc: netdev, Thomas Graf, Roopa Prabhu, ogerlitz, sridhar.samudrala,
ast, daniel, simon.horman, Paolo Abeni, Pravin B Shelar, hannes,
kubakici
In-Reply-To: <20160923135512.64e5d5e9@jkicinski-Precision-T1700>
On 16-09-23 05:55 AM, Jakub Kicinski wrote:
> On Fri, 23 Sep 2016 11:06:09 +0200, Jiri Benc wrote:
>> On Fri, 23 Sep 2016 08:34:29 +0200, Jiri Pirko wrote:
>>> So if I understand that correctly, this would need some "shared netdev"
>>> which would effectively serve only as a sink for all port netdevices to
>>> tx packets to. On RX, this would be completely avoided. This lower
>>> device looks like half zombie to me.
>>
>> Looks more like a quarter zombie. Even tx would not be allowed unless
>> going through one of the ports, as all skbs without
>> METADATA_HW_PORT_MUX metadata_dst would be dropped. But it would be
>> possible to attach qdisc to the "lower" netdevice and it would actually
>> have an effect. On rx this netdevice would be ignored completely. This
>> is very weird behavior.
>>
>>> I don't like it :( I wonder if the
>>> solution would not be possible without this lower netdev.
>>
>> I agree. This approach doesn't sound correct. The skbs should not be
>> requeued.
>
> Thanks for the responses!
Nice timing we were just thinking about this.
>
> I think SR-IOV NICs are coming at this problem from a different angle,
> we already have a big, feature-full per-port netdevs and now we want to
> spawn representators for VFs to handle fallback traffic. This patch
> would help us mux VFR traffic on all the queues of the physical port
> netdevs (the ones which were already present in legacy mode, that's the
> lower device).
Yep, I like the idea in general. I had a slightly different approach in
mind though. If you look at __dev_queue_xmit() there is a void
accel_priv pointer (gather you found this based on your commit note).
My take was we could extend this a bit so it can be used by the VFR
devices and they could do a dev_queue_xmit_accel(). In this way there is
no need to touch /net/core/{filter, dst, ip_tunnel}.c etc. Maybe the
accel logic needs to be extended to push the priv pointer all the way
through the xmit routine of the target netdev though. This should look
a lot like the macvlan accelerated xmit device path without the
switching logic.
Of course maybe the name would be extended to dev_queue_xmit_extended()
or something.
So the flow on ingress would be,
1. pkt_received_by_PF_netdev
2. PF_netdev reads some tag off packet/descriptor and sets correct
skb->dev field. This is needed so stack "sees" packets from
correct VF ports.
3. packet passed up to stack.
I guess it is a bit "zombie" like on the receive path because the packet
is never actually handled by VF netdev code per se and on egress can
traverse both the VFR and PF netdevs qdiscs. But on the other hand the
VFR netdevs and PF netdevs are all in the same driver. Plus using a
queue per VFR is a bit of a waste as its not needed and also hardware
may not have any mechanism to push VF traffic onto a rx queue.
On egress,
1. VFR xmit is called
2. VFR xmit calls dev_queue_xmit_accel() with some meta-data if needed
for the lower netdev
3. lower netdev sends out the packet.
Again we don't need to waste any queues for each VFR and the VFR can be
a LLTX device. In this scheme I think you avoid much of the changes in
your patch and keep it all contained in the driver. Any thoughts?
To address 'I wonder if the solution can be done without this lower
netdev' I think it can be but it creates two issues which I'm not sure
have a good solution.
Without a lowerdev we either (a) give each VFR its own queue which I
don't like because it complicates mgmt and uses resources or (b) we
implicitly share queues. The later could be fine it just looks a bit
cleaner IMO to make it explicit.
With regard to VF-PF flow rules if we allow matching on ingress port
then can all your flow rules be pushed through the PF netdevices or
if you want any of the VFR netdevs? After all I expsect the flow rule
table is actually a shared resource between all attached ports.
Thanks,
.John
>
> I read the mlxsw code when I was thinking about this and I wasn't
> 100% comfortable with returning NETDEV_TX_BUSY, I thought this
> behaviour should be generally avoided. (BTW a very lame question - does
> mlxsw ever stop the queues? AFAICS it only returns BUSY, isn't that
> confusing to the stack?)
>
> FWIW the switchdev SR-IOV model we have now seems to be to treat the
> existing netdevs as "MAC ports" and spawn representatives for VFs but
> not represent PFs in any way. This makes it impossible to install
> VF-PF flow rules. I worry this can bite us later but that's slightly
> different discussion :) For the purpose of this patch please assume
> the lower dev is the MAC/physical/external port.
>
^ permalink raw reply
* Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP
From: Alexei Starovoitov @ 2016-09-23 14:26 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Jamal Hadi Salim
Cc: Tom Herbert, davem, netdev, kernel-team, tariqt, bblanco,
alexei.starovoitov, eric.dumazet, Thomas Graf
In-Reply-To: <20160923150040.0e3dc7d5@redhat.com>
On 9/23/16 6:00 AM, Jesper Dangaard Brouer wrote:
> To support Tom's case, with eBPF, I think we would have to implement
> specific eBPF helper functions that can do the ILA table lookups. And
> then need a userspace component to load the eBPF program. Why add all
> this, when we could contain all this in the kernel, and simply call
> this as native C-code via the XDP hook?
well, ILA router was written in BPF already :) Once for tc+clsact
and once for XDP. In both cases no extra bpf helpers were needed.
Due to this use case we added an optimization to be able to do
a map lookup with packet data directly (instead of copying packet
bytes to stack and pointing map_lookup helper to stack).
Looks like those ila router patches never made it to the list.
I'll dig them out and forward.
^ permalink raw reply
* [PATCH 1/2] net: qcom/emac: do not use devm on internal phy pdev
From: Timur Tabi @ 2016-09-23 14:27 UTC (permalink / raw)
To: netdev, linaro-acpi, linux-arm-kernel, shankerd, vikrams
The platform_device returned by of_find_device_by_node() is not
automatically released when the driver unprobes. Therefore,
managed calls like devm_ioremap_resource() should not be used.
Instead, we manually allocate the resources and then free them
on driver release.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
---
drivers/net/ethernet/qualcomm/emac/emac-sgmii.c | 42 +++++++++++++++++++------
drivers/net/ethernet/qualcomm/emac/emac.c | 4 +++
2 files changed, 37 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
index 9a9a963..2a095fb 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
@@ -681,6 +681,7 @@ int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
struct resource *res;
const struct of_device_id *match;
struct device_node *np;
+ int ret;
np = of_parse_phandle(pdev->dev.of_node, "internal-phy", 0);
if (!np) {
@@ -697,26 +698,49 @@ int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
match = of_match_device(emac_sgmii_dt_match, &sgmii_pdev->dev);
if (!match) {
dev_err(&pdev->dev, "unrecognized internal phy node\n");
- return -ENODEV;
+ ret = -ENODEV;
+ goto error_put_device;
}
phy->initialize = (emac_sgmii_initialize)match->data;
/* Base address is the first address */
res = platform_get_resource(sgmii_pdev, IORESOURCE_MEM, 0);
- phy->base = devm_ioremap_resource(&sgmii_pdev->dev, res);
- if (IS_ERR(phy->base))
- return PTR_ERR(phy->base);
+ phy->base = ioremap(res->start, resource_size(res));
+ if (IS_ERR(phy->base)) {
+ ret = PTR_ERR(phy->base);
+ goto error_put_device;
+ }
/* v2 SGMII has a per-lane digital digital, so parse it if it exists */
res = platform_get_resource(sgmii_pdev, IORESOURCE_MEM, 1);
if (res) {
- phy->digital = devm_ioremap_resource(&sgmii_pdev->dev, res);
- if (IS_ERR(phy->base))
- return PTR_ERR(phy->base);
-
+ phy->digital = ioremap(res->start, resource_size(res));
+ if (IS_ERR(phy->digital)) {
+ ret = PTR_ERR(phy->digital);
+ goto error_unmap_base;
+ }
}
- return phy->initialize(adpt);
+ ret = phy->initialize(adpt);
+ if (ret)
+ goto error;
+
+ /* We've remapped the addresses, so we don't need the device any
+ * more. of_find_device_by_node() says we should release it.
+ */
+ put_device(&sgmii_pdev->dev);
+
+ return 0;
+
+error:
+ if (phy->digital)
+ iounmap(phy->digital);
+error_unmap_base:
+ iounmap(phy->base);
+error_put_device:
+ put_device(&sgmii_pdev->dev);
+
+ return ret;
}
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c b/drivers/net/ethernet/qualcomm/emac/emac.c
index e47d387..429b4cb 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -723,6 +723,10 @@ static int emac_remove(struct platform_device *pdev)
mdiobus_unregister(adpt->mii_bus);
free_netdev(netdev);
+ if (adpt->phy.digital)
+ iounmap(adpt->phy.digital);
+ iounmap(adpt->phy.base);
+
return 0;
}
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
^ permalink raw reply related
* [PATCH 2/2] net: qcom/emac: initial ACPI support
From: Timur Tabi @ 2016-09-23 14:27 UTC (permalink / raw)
To: netdev, linaro-acpi, linux-arm-kernel, shankerd, vikrams
In-Reply-To: <1474640854-1698-1-git-send-email-timur@codeaurora.org>
Add support for reading addresses, interrupts, and _DSD properties
from ACPI tables, just like with device tree. The HID for the
EMAC device itself is QCOM8070. The internal PHY is represented
by a child node with a HID of QCOM8071.
The EMAC also has some complex clock initialization requirements
that are not represented by this patch. This will be addressed
in a future patch.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
---
drivers/net/ethernet/qualcomm/emac/emac-phy.c | 39 +++++++++++---
drivers/net/ethernet/qualcomm/emac/emac-sgmii.c | 72 ++++++++++++++++++-------
drivers/net/ethernet/qualcomm/emac/emac.c | 60 ++++++++++++++++++++-
3 files changed, 143 insertions(+), 28 deletions(-)
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-phy.c b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
index c412ba9..7c7c3bb 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-phy.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
@@ -19,6 +19,7 @@
#include <linux/of_mdio.h>
#include <linux/phy.h>
#include <linux/iopoll.h>
+#include <linux/acpi.h>
#include "emac.h"
#include "emac-mac.h"
#include "emac-phy.h"
@@ -163,11 +164,10 @@ static int emac_mdio_write(struct mii_bus *bus, int addr, int regnum, u16 val)
return ret;
}
-/* Configure the MDIO bus and connect the external PHY */
+/* Read phy configuration and initialize it */
int emac_phy_config(struct platform_device *pdev, struct emac_adapter *adpt)
{
struct device_node *np = pdev->dev.of_node;
- struct device_node *phy_np;
struct mii_bus *mii_bus;
int ret;
@@ -183,14 +183,37 @@ int emac_phy_config(struct platform_device *pdev, struct emac_adapter *adpt)
mii_bus->parent = &pdev->dev;
mii_bus->priv = adpt;
- ret = of_mdiobus_register(mii_bus, np);
- if (ret) {
- dev_err(&pdev->dev, "could not register mdio bus\n");
- return ret;
+ if (has_acpi_companion(&pdev->dev)) {
+ u32 phy_addr;
+
+ ret = mdiobus_register(mii_bus);
+ if (ret) {
+ dev_err(&pdev->dev, "could not register mdio bus\n");
+ return ret;
+ }
+ ret = device_property_read_u32(&pdev->dev, "phy-channel",
+ &phy_addr);
+ if (ret)
+ /* If we can't read a valid phy address, then assume
+ * that there is only one phy on this mdio bus.
+ */
+ adpt->phydev = phy_find_first(mii_bus);
+ else
+ adpt->phydev = mdiobus_get_phy(mii_bus, phy_addr);
+
+ } else {
+ struct device_node *phy_np;
+
+ ret = of_mdiobus_register(mii_bus, np);
+ if (ret) {
+ dev_err(&pdev->dev, "could not register mdio bus\n");
+ return ret;
+ }
+
+ phy_np = of_parse_phandle(np, "phy-handle", 0);
+ adpt->phydev = of_phy_find_device(phy_np);
}
- phy_np = of_parse_phandle(np, "phy-handle", 0);
- adpt->phydev = of_phy_find_device(phy_np);
if (!adpt->phydev) {
dev_err(&pdev->dev, "could not find external phy\n");
mdiobus_unregister(mii_bus);
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
index 2a095fb..8b3e6c3 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
@@ -14,6 +14,7 @@
*/
#include <linux/iopoll.h>
+#include <linux/acpi.h>
#include <linux/of_device.h>
#include "emac.h"
#include "emac-mac.h"
@@ -662,6 +663,24 @@ void emac_sgmii_reset(struct emac_adapter *adpt)
clk_set_rate(adpt->clk[EMAC_CLK_HIGH_SPEED], 125000000);
}
+static int emac_sgmii_acpi_match(struct device *dev, void *data)
+{
+ static const struct acpi_device_id match_table[] = {
+ {
+ .id = "QCOM8071",
+ .driver_data = (kernel_ulong_t)emac_sgmii_init_v2,
+ },
+ {}
+ };
+ const struct acpi_device_id *id = acpi_match_device(match_table, dev);
+ emac_sgmii_initialize *initialize = data;
+
+ if (id)
+ *initialize = (emac_sgmii_initialize)id->driver_data;
+
+ return !!id;
+}
+
static const struct of_device_id emac_sgmii_dt_match[] = {
{
.compatible = "qcom,fsm9900-emac-sgmii",
@@ -679,30 +698,45 @@ int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
struct platform_device *sgmii_pdev = NULL;
struct emac_phy *phy = &adpt->phy;
struct resource *res;
- const struct of_device_id *match;
- struct device_node *np;
int ret;
- np = of_parse_phandle(pdev->dev.of_node, "internal-phy", 0);
- if (!np) {
- dev_err(&pdev->dev, "missing internal-phy property\n");
- return -ENODEV;
- }
+ if (has_acpi_companion(&pdev->dev)) {
+ struct device *dev;
- sgmii_pdev = of_find_device_by_node(np);
- if (!sgmii_pdev) {
- dev_err(&pdev->dev, "invalid internal-phy property\n");
- return -ENODEV;
- }
+ dev = device_find_child(&pdev->dev, &phy->initialize,
+ emac_sgmii_acpi_match);
- match = of_match_device(emac_sgmii_dt_match, &sgmii_pdev->dev);
- if (!match) {
- dev_err(&pdev->dev, "unrecognized internal phy node\n");
- ret = -ENODEV;
- goto error_put_device;
- }
+ if (!dev) {
+ dev_err(&pdev->dev, "cannot find internal phy node\n");
+ return -ENODEV;
+ }
- phy->initialize = (emac_sgmii_initialize)match->data;
+ sgmii_pdev = to_platform_device(dev);
+ } else {
+ const struct of_device_id *match;
+ struct device_node *np;
+
+ np = of_parse_phandle(pdev->dev.of_node, "internal-phy", 0);
+ if (!np) {
+ dev_err(&pdev->dev, "missing internal-phy property\n");
+ return -ENODEV;
+ }
+
+ sgmii_pdev = of_find_device_by_node(np);
+ if (!sgmii_pdev) {
+ dev_err(&pdev->dev, "invalid internal-phy property\n");
+ return -ENODEV;
+ }
+
+ match = of_match_device(emac_sgmii_dt_match, &sgmii_pdev->dev);
+ if (!match) {
+ dev_err(&pdev->dev, "unrecognized internal phy node\n");
+ ret = -ENODEV;
+ goto error_put_device;
+ }
+
+ phy->initialize = (emac_sgmii_initialize)match->data;
+ }
/* Base address is the first address */
res = platform_get_resource(sgmii_pdev, IORESOURCE_MEM, 0);
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c b/drivers/net/ethernet/qualcomm/emac/emac.c
index 429b4cb..4cde237 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -22,6 +22,7 @@
#include <linux/of_device.h>
#include <linux/phy.h>
#include <linux/platform_device.h>
+#include <linux/acpi.h>
#include "emac.h"
#include "emac-mac.h"
#include "emac-phy.h"
@@ -570,6 +571,49 @@ static int emac_probe_resources(struct platform_device *pdev,
return 0;
}
+/* Get the resources */
+static int emac_probe_acpi_resources(struct platform_device *pdev,
+ struct emac_adapter *adpt)
+{
+ struct net_device *netdev = adpt->netdev;
+ struct resource *res;
+ char maddr[ETH_ALEN];
+ int ret = 0;
+
+ /* get mac address */
+
+ ret = device_property_read_u8_array(&pdev->dev, "mac-address",
+ maddr, ETH_ALEN);
+ if (ret)
+ eth_hw_addr_random(netdev);
+ else
+ ether_addr_copy(netdev->dev_addr, maddr);
+
+ ret = platform_get_irq(pdev, 0);
+ if (ret < 0) {
+ dev_err(&pdev->dev,
+ "error: missing irq resource (error=%i)\n", ret);
+ return ret;
+ }
+ adpt->irq.irq = ret;
+
+ /* get base register addresses */
+ res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+ adpt->base = devm_ioremap_resource(&pdev->dev, res);
+ if (IS_ERR(adpt->base))
+ return PTR_ERR(adpt->base);
+
+ /* get CSR register addresses */
+ res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
+ adpt->csr = devm_ioremap_resource(&pdev->dev, res);
+ if (IS_ERR(adpt->csr))
+ return PTR_ERR(adpt->csr);
+
+ netdev->base_addr = (unsigned long)adpt->base;
+
+ return 0;
+}
+
static const struct of_device_id emac_dt_match[] = {
{
.compatible = "qcom,fsm9900-emac",
@@ -577,6 +621,16 @@ static const struct of_device_id emac_dt_match[] = {
{}
};
+#if IS_ENABLED(CONFIG_ACPI)
+static const struct acpi_device_id emac_acpi_match[] = {
+ {
+ .id = "QCOM8070",
+ },
+ {}
+};
+MODULE_DEVICE_TABLE(acpi, emac_acpi_match);
+#endif
+
static int emac_probe(struct platform_device *pdev)
{
struct net_device *netdev;
@@ -620,7 +674,10 @@ static int emac_probe(struct platform_device *pdev)
adpt->irq.mask = RX_PKT_INT0 | IMR_NORMAL_MASK;
- ret = emac_probe_resources(pdev, adpt);
+ if (has_acpi_companion(&pdev->dev))
+ ret = emac_probe_acpi_resources(pdev, adpt);
+ else
+ ret = emac_probe_resources(pdev, adpt);
if (ret)
goto err_undo_netdev;
@@ -736,6 +793,7 @@ static struct platform_driver emac_platform_driver = {
.driver = {
.name = "qcom-emac",
.of_match_table = emac_dt_match,
+ .acpi_match_table = ACPI_PTR(emac_acpi_match),
},
};
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
^ permalink raw reply related
* Re: [PATCH net-next] tcp: add tcp_add_backlog()
From: Eric Dumazet @ 2016-09-23 14:36 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20160923140932.GA9307@localhost.localdomain>
On Fri, 2016-09-23 at 11:09 -0300, Marcelo Ricardo Leitner wrote:
> On Fri, Sep 23, 2016 at 06:42:51AM -0700, Eric Dumazet wrote:
> > On Fri, 2016-09-23 at 09:45 -0300, Marcelo Ricardo Leitner wrote:
> >
> > > Aye. In that case, what about using tail instead of end?
> >
> >
> > What do you mean exactly ?
>
> Something like:
> -skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> +skb->truesize = SKB_TRUESIZE(skb_tail_offset(skb));
Certainly not ;)
This would be lying.
We really want a precise memory accounting to avoid OOM.
Some USB drivers use 8KB for their skb->head, you do not want to pretend
its 66+NET_SKB_PAF=F bytes just because there is no payload in the
packet.
>
> And define skb_tail_offset() to something similar skb_end_offset(), so
> that it would account only for the data and not unused space in the
> buffer.
>
> >
> > > Because
> > > accounting for something that we have to tweak the limits to accept is
> > > like adding a constant to both sides of the equation.
> > > But perhaps that would cut out too much of the fat which could be used
> > > later by the stack.
> >
> > Are you facing a particular problem with current code ?
> >
>
> For TCP, no, just wondering. :-)
>
> I'm having similar issues with SCTP: if the socket gets backlogged, the
> buffer accounting gets pretty messy. SCTP calculates the rwnd to be just
> rcvbuf/2, but this ratio is often different in backlog and it causes the
> advertized window to be too big, resulting in packet drops in the
> backlog.
>
> SCTP has some way to identify and compensate this "extra" rwnd, via
> rwnd_press, and will shrink it if it detects that the window is bigger
> than the buffer available. But as the socket is backlogged, it's doesn't
> kick in soon enough to prevent such drops.
>
> It's not just a matter of adjusting the overhead ratio (rcvbuf/2)
> because with SCTP the packets may have different sizes, so a packet with
> a chunk of 100 bytes will have a ratio and another with 1000 bytes will
> have another, within the same association.
>
> So I'm leaning towards on updating truesize before adding to the
> backlog, but to account just for the actual packet, regardless of the
> buffer that was used for it. It still has the overhead ratio issue with
> different packet sizes, though, but smaller.
>
> Note that SCTP doesn't have buffer auto-tuning yet.
Also for TCP, we might need to use sk->sk_wmem_queued instead of
sk->sk_sndbuf
This is because SACK processing can suddenly split skbs in 1-MSS pieces.
Problem is that for very large BDP, we can end up with thousands of skb
in backlog. So I am also considering to try to coalesce stupid ACK sent
by non GRO receivers or simply the verbose SACK blocks...
eg if backlog is under pressure and its tail is :
ACK 1 <sack 4000:5000>
and the incoming packet is :
ACK 1 <sack 4000:6000>
Then we could replace the tail by the incoming packet with minimal
impact.
Since we might receive hundred of 'sequential' SACK blocks, this would
help to reduce time taken by the application to process the (now
smaller) backlog
^ permalink raw reply
* RE: [PATCH net-next] tcp: add tcp_add_backlog()
From: David Laight @ 2016-09-23 14:43 UTC (permalink / raw)
To: 'Eric Dumazet', Marcelo Ricardo Leitner
Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng
In-Reply-To: <1474641392.28155.27.camel@edumazet-glaptop3.roam.corp.google.com>
From: Eric Dumazet
> Sent: 23 September 2016 15:37
> On Fri, 2016-09-23 at 11:09 -0300, Marcelo Ricardo Leitner wrote:
> > On Fri, Sep 23, 2016 at 06:42:51AM -0700, Eric Dumazet wrote:
> > > On Fri, 2016-09-23 at 09:45 -0300, Marcelo Ricardo Leitner wrote:
> > >
> > > > Aye. In that case, what about using tail instead of end?
> > >
> > >
> > > What do you mean exactly ?
> >
> > Something like:
> > -skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> > +skb->truesize = SKB_TRUESIZE(skb_tail_offset(skb));
>
> Certainly not ;)
>
> This would be lying.
> We really want a precise memory accounting to avoid OOM.
>
> Some USB drivers use 8KB for their skb->head, you do not want to pretend
> its 66+NET_SKB_PAF=F bytes just because there is no payload in the
> packet.
Last I look some of the USBnet drivers used 64k+ skb (and then
lied about truesize).
That whole usbnet stuff needs a rework for usb3 (xhci) so that once
the rings (etc) are all setup packet transfer looks much more like
a normal ethernet device.
I think it needs to stop trying to receive into skb, instead just
receive into usb sized buffers - and then generate skb that contain
the ethernet frame data.
Not that I'm volunteering :-)
David
^ permalink raw reply
* Re: [PATCH] net: bcmgenet: Fix EPHY reset in power up
From: Jaedon Shin @ 2016-09-23 15:04 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Florian Fainelli, David Miller, Philippe Reynes, netdev
In-Reply-To: <20160923140609.GA22965@lunn.ch>
Hi Andrew,
On 23 Sep 2016, at 11:06 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Fri, Sep 23, 2016 at 10:20:04PM +0900, Jaedon Shin wrote:
>> The bcmgenet_mii_reset() is always not running in power up sequence
>> after 'commit 62469c76007e ("net: ethernet: bcmgenet: use phydev from
>> struct net_device")'. This'll show extremely high latency and duplicate
>> packets while interface down and up repeatedly.
>>
>> For now, adds again a private phydev for mii reset when runs power up to
>> open interface.
>
> Hi Jaedon
>
> How does this fix the issue? It sounds like you are papering over the
> crack, not truly fixing it.
>
> Andrew
Yes, It feel like a workaround, but I think it must need v4.8 stable
version. If we find better way that fixes internal PHY to initialize
after re-open interface, this patch will be dropped.
Additionally,
http://www.spinics.net/lists/netdev/msg350506.html
Thanks,
Jaedon
^ permalink raw reply
* Re: [PATCH 18/19] stmmac: dwmac-sti: Remove obsolete STi platforms
From: Rob Herring @ 2016-09-23 15:11 UTC (permalink / raw)
To: Peter Griffin
Cc: linux-arm-kernel, linux-kernel, kernel, patrice.chotard,
devicetree, lee.jones, peppe.cavallaro, alexandre.torgue, netdev
In-Reply-To: <1473859677-9231-19-git-send-email-peter.griffin@linaro.org>
On Wed, Sep 14, 2016 at 02:27:56PM +0100, Peter Griffin wrote:
> This patch removes support for STiH415/6 SoC's from the
> dwmac-sti driver and dt binding doc, as support for these
> platforms is being removed from the kernel. It also removes
> STiD127 related code, which has never actually been supported
> upstream.
>
> Signed-off-by: Peter Griffin <peter.griffin@linaro.org>
> Cc: <peppe.cavallaro@st.com>
> Cc: <alexandre.torgue@st.com>
> Cc: <netdev@vger.kernel.org>
> ---
> .../devicetree/bindings/net/sti-dwmac.txt | 3 +-
Acked-by: Rob Herring <robh@kernel.org>
> drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c | 37 ----------------------
> 2 files changed, 1 insertion(+), 39 deletions(-)
^ permalink raw reply
* Re: [PATCH net-next] tcp: add tcp_add_backlog()
From: Marcelo Ricardo Leitner @ 2016-09-23 15:12 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng
In-Reply-To: <1474641392.28155.27.camel@edumazet-glaptop3.roam.corp.google.com>
On Fri, Sep 23, 2016 at 07:36:32AM -0700, Eric Dumazet wrote:
> On Fri, 2016-09-23 at 11:09 -0300, Marcelo Ricardo Leitner wrote:
> > On Fri, Sep 23, 2016 at 06:42:51AM -0700, Eric Dumazet wrote:
> > > On Fri, 2016-09-23 at 09:45 -0300, Marcelo Ricardo Leitner wrote:
> > >
> > > > Aye. In that case, what about using tail instead of end?
> > >
> > >
> > > What do you mean exactly ?
> >
> > Something like:
> > -skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> > +skb->truesize = SKB_TRUESIZE(skb_tail_offset(skb));
>
> Certainly not ;)
>
> This would be lying.
Yep, but also so is adding txbuf to the equation to account for that,
right? :-) Unless you're considering that acks should/can be sort of
accounted by the txbuf instead, then it makes sense to sum the buffer
sizes in there.
> We really want a precise memory accounting to avoid OOM.
Indeed
>
> Some USB drivers use 8KB for their skb->head, you do not want to pretend
> its 66+NET_SKB_PAF=F bytes just because there is no payload in the
> packet.
Oh.
>
> >
> > And define skb_tail_offset() to something similar skb_end_offset(), so
> > that it would account only for the data and not unused space in the
> > buffer.
> >
> > >
> > > > Because
> > > > accounting for something that we have to tweak the limits to accept is
> > > > like adding a constant to both sides of the equation.
> > > > But perhaps that would cut out too much of the fat which could be used
> > > > later by the stack.
> > >
> > > Are you facing a particular problem with current code ?
> > >
> >
> > For TCP, no, just wondering. :-)
> >
> > I'm having similar issues with SCTP: if the socket gets backlogged, the
> > buffer accounting gets pretty messy. SCTP calculates the rwnd to be just
> > rcvbuf/2, but this ratio is often different in backlog and it causes the
> > advertized window to be too big, resulting in packet drops in the
> > backlog.
> >
> > SCTP has some way to identify and compensate this "extra" rwnd, via
> > rwnd_press, and will shrink it if it detects that the window is bigger
> > than the buffer available. But as the socket is backlogged, it's doesn't
> > kick in soon enough to prevent such drops.
> >
> > It's not just a matter of adjusting the overhead ratio (rcvbuf/2)
> > because with SCTP the packets may have different sizes, so a packet with
> > a chunk of 100 bytes will have a ratio and another with 1000 bytes will
> > have another, within the same association.
> >
> > So I'm leaning towards on updating truesize before adding to the
> > backlog, but to account just for the actual packet, regardless of the
> > buffer that was used for it. It still has the overhead ratio issue with
> > different packet sizes, though, but smaller.
> >
> > Note that SCTP doesn't have buffer auto-tuning yet.
>
> Also for TCP, we might need to use sk->sk_wmem_queued instead of
> sk->sk_sndbuf
This is interesting. Then it would only stretch the backlog limit if
there is a heavy tx going on.
>
> This is because SACK processing can suddenly split skbs in 1-MSS pieces.
>
> Problem is that for very large BDP, we can end up with thousands of skb
> in backlog. So I am also considering to try to coalesce stupid ACK sent
> by non GRO receivers or simply the verbose SACK blocks...
>
> eg if backlog is under pressure and its tail is :
>
> ACK 1 <sack 4000:5000>
>
> and the incoming packet is :
>
> ACK 1 <sack 4000:6000>
>
> Then we could replace the tail by the incoming packet with minimal
> impact.
>
> Since we might receive hundred of 'sequential' SACK blocks, this would
> help to reduce time taken by the application to process the (now
> smaller) backlog
>
That would be cool.
Thanks,
Marcelo
^ permalink raw reply
* [PATCH net-next 00/15] rxrpc: Bug fixes and tracepoints
From: David Howells @ 2016-09-23 15:15 UTC (permalink / raw)
To: netdev; +Cc: dhowells, linux-afs, linux-kernel
Here are a bunch of bug fixes:
(1) Need to set the timestamp on a Tx packet before queueing it to avoid
trouble with the retransmission function.
(2) Don't send an ACK at the end of the service reply transmission; it's
the responsibility of the client to send an ACK to close the call.
The service can resend the last DATA packet or send a PING ACK.
(3) Wake sendmsg() on abnormal call termination.
(4) Use ktime_add_ms() not ktime_add_ns() to add millisecond offsets.
(5) Use before_eq() & co. to compare serial numbers (which may wrap).
(6) Start the resend timer on DATA packet transmission.
(7) Don't accidentally cancel a retransmission upon receiving a NACK.
(8) Fix the call timer setting function to deal with timeouts that are now
or past.
(9) Don't use a flag to communicate the presence of the last packet in the
Tx buffer from sendmsg to the input routines where ACK and DATA
reception is handled. The problem is that there's a window between
queueing the last packet for transmission and setting the flag in
which ACKs or reply DATA packets can arrive, causing apparent state
machine violation issues.
Instead use the annotation buffer to mark the last packet and pick up
and set the flag in the input routines.
(10) Don't call the tx_ack tracepoint and don't allocate a serial number if
someone else nicked the ACK we were about to transmit.
There are also new tracepoints and one altered tracepoint used to track
down the above bugs:
(11) Call timer tracepoint.
(12) Data Tx tracepoint (and adjustments to ACK tracepoint).
(13) Injected Rx packet loss tracepoint.
(14) Ack proposal tracepoint.
(15) Retransmission selection tracepoint.
The patches can be found here also:
http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite
Tagged thusly:
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160923
David
---
David Howells (15):
rxrpc: Preset timestamp on Tx sk_buffs
rxrpc: Don't send an ACK at the end of service call response transmission
rxrpc: Make sure sendmsg() is woken on call completion
rxrpc: Should be using ktime_add_ms() not ktime_add_ns()
rxrpc: Use before_eq() and friends to compare serial numbers
rxrpc: Need to start the resend timer on initial transmission
rxrpc: Fix accidental cancellation of scheduled resend by ACK parser
rxrpc: Fix call timer
rxrpc: Pass the last Tx packet marker in the annotation buffer
rxrpc: Don't call the tx_ack tracepoint if don't generate an ACK
rxrpc: Add a tracepoint for the call timer
rxrpc: Add data Tx tracepoint and adjust Tx ACK tracepoint
rxrpc: Add a tracepoint to log injected Rx packet loss
rxrpc: Add tracepoint for ACK proposal
rxrpc: Add a tracepoint to log which packets will be retransmitted
include/rxrpc/packet.h | 1
include/trace/events/rxrpc.h | 174 ++++++++++++++++++++++++++++++++++++++++--
net/rxrpc/ar-internal.h | 45 ++++++++++-
net/rxrpc/call_event.c | 57 ++++++++------
net/rxrpc/call_object.c | 8 +-
net/rxrpc/conn_event.c | 5 -
net/rxrpc/input.c | 136 +++++++++++++++++++++------------
net/rxrpc/misc.c | 41 +++++++---
net/rxrpc/output.c | 32 ++++----
net/rxrpc/recvmsg.c | 5 -
net/rxrpc/sendmsg.c | 28 +++++--
11 files changed, 405 insertions(+), 127 deletions(-)
^ permalink raw reply
* Re: [PATCH RFC 05/11] skbuff: Extend gso_type to unsigned int.
From: Alexander Duyck @ 2016-09-23 15:15 UTC (permalink / raw)
To: David Laight
Cc: Steffen Klassert, netdev@vger.kernel.org, Sowmini Varadhan,
Ilan Tayari, Boris Pismenny, Ariel Levkovich, Hay, Joshua A
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0107FC2@AcuExch.aculab.com>
On Fri, Sep 23, 2016 at 6:19 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Steffen Klassert
>> Sent: 23 September 2016 08:54
>> All available gso_type flags are currently in use,
>> so extend gso_type to be able to add further flags.
>>
>> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
>> ---
>> include/linux/skbuff.h | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index f21da42..c1fd854 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -417,7 +417,7 @@ struct skb_shared_info {
>> unsigned short gso_size;
>> /* Warning: this field is not always filled in (UFO)! */
>> unsigned short gso_segs;
>> - unsigned short gso_type;
>> + unsigned int gso_type;
>> struct sk_buff *frag_list;
>> struct skb_shared_hwtstamps hwtstamps;
>> u32 tskey;
>
> That add a lot of padding.
> I'm not even sure DM will like this structure being extended.
> If ktime_t is 64 bit I think there is already some padding later on.
+1. Just increasing the size is a bad idea as it will add a 6 byte
hole and push frag 0 outside the first cache line.
You may want to take a look at rearranging the structure a bit. That
way you could eat into the 4 byte hole that is available on x86_64 so
that instead of shifting everything up by 8 bytes they can mostly stay
put with only a bit of rearranging. This is one of those times a tool
like pahole comes in handy.
Also it occurs to me that one other thing you could look at is if you
were to move gso_type into the slot after dataref we might be able to
make use of the hole while also saving us some initialization time as
the change in size would push us from 32 to 36 which would likely
impact the memset operation by at least a 1 cycle increase. If we
could make it so that we don't have to initialize the memory for
gso_type unless we are planning to set gso_size we can avoid the extra
overhead for that. It would just be a matter of validating that
nothing reads gso_type unless gso_size is set and that gso_type is
always written to prior to setting gso_size to a non-zero value.
- Alex
^ permalink raw reply
* [PATCH net-next 01/15] rxrpc: Preset timestamp on Tx sk_buffs
From: David Howells @ 2016-09-23 15:15 UTC (permalink / raw)
To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147464371753.5090.1634919599283321856.stgit@warthog.procyon.org.uk>
Set the timestamp on sk_buffs holding packets to be transmitted before
queueing them because the moment the packet is on the queue it can be seen
by the retransmission algorithm - which may see a completely random
timestamp.
If the retransmission algorithm sees such a timestamp, it may retransmit
the packet and, in future, tell the congestion management algorithm that
the retransmit timer expired.
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/rxrpc/sendmsg.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index ca7c3be60ad2..ca3811bfbd17 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -99,6 +99,11 @@ static void rxrpc_queue_packet(struct rxrpc_call *call, struct sk_buff *skb,
ASSERTCMP(seq, ==, call->tx_top + 1);
+ /* We have to set the timestamp before queueing as the retransmit
+ * algorithm can see the packet as soon as we queue it.
+ */
+ skb->tstamp = ktime_get_real();
+
ix = seq & RXRPC_RXTX_BUFF_MASK;
rxrpc_get_skb(skb, rxrpc_skb_tx_got);
call->rxtx_annotations[ix] = RXRPC_TX_ANNO_UNACK;
^ permalink raw reply related
* [PATCH net-next 02/15] rxrpc: Don't send an ACK at the end of service call response transmission
From: David Howells @ 2016-09-23 15:15 UTC (permalink / raw)
To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147464371753.5090.1634919599283321856.stgit@warthog.procyon.org.uk>
Don't send an IDLE ACK at the end of the transmission of the response to a
service call. The service end resends DATA packets until the client sends an
ACK that hard-acks all the send data. At that point, the call is complete.
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/rxrpc/recvmsg.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 6ba4af5a8d95..99e4c0ae30f1 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -143,8 +143,6 @@ static void rxrpc_end_rx_phase(struct rxrpc_call *call)
if (call->state == RXRPC_CALL_CLIENT_RECV_REPLY) {
rxrpc_propose_ACK(call, RXRPC_ACK_IDLE, 0, 0, true, false);
rxrpc_send_call_packet(call, RXRPC_PACKET_TYPE_ACK);
- } else {
- rxrpc_propose_ACK(call, RXRPC_ACK_IDLE, 0, 0, false, false);
}
write_lock_bh(&call->state_lock);
^ permalink raw reply related
* [PATCH net-next 03/15] rxrpc: Make sure sendmsg() is woken on call completion
From: David Howells @ 2016-09-23 15:15 UTC (permalink / raw)
To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147464371753.5090.1634919599283321856.stgit@warthog.procyon.org.uk>
Make sure that sendmsg() gets woken up if the call it is waiting for
completes abnormally.
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/rxrpc/ar-internal.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index b13754a6dd7a..808ab750dc6b 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -758,6 +758,7 @@ static inline bool __rxrpc_set_call_completion(struct rxrpc_call *call,
call->error = error;
call->completion = compl,
call->state = RXRPC_CALL_COMPLETE;
+ wake_up(&call->waitq);
return true;
}
return false;
^ permalink raw reply related
* [PATCH net-next 04/15] rxrpc: Should be using ktime_add_ms() not ktime_add_ns()
From: David Howells @ 2016-09-23 15:15 UTC (permalink / raw)
To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147464371753.5090.1634919599283321856.stgit@warthog.procyon.org.uk>
ktime_add_ms() should be used to add the resend time (in ms) rather than
ktime_add_ns().
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/rxrpc/call_event.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index 6e2ea8f4ae75..a2909da5d581 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -187,7 +187,7 @@ static void rxrpc_resend(struct rxrpc_call *call)
call->rxtx_annotations[ix] = RXRPC_TX_ANNO_RETRANS | annotation;
}
- resend_at = ktime_sub(ktime_add_ns(oldest, rxrpc_resend_timeout), now);
+ resend_at = ktime_sub(ktime_add_ms(oldest, rxrpc_resend_timeout), now);
call->resend_at = jiffies + nsecs_to_jiffies(ktime_to_ns(resend_at));
/* Now go through the Tx window and perform the retransmissions. We
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox