Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [pci PATCH v8 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
From: Gregory Rose @ 2018-04-20 22:06 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, linux-pci
  Cc: virtio-dev, kvm, netdev, dan.daly, linux-kernel, linux-nvme,
	keith.busch, netanel, ddutile, mheyne, liang-min.wang,
	mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420162814.46077.11939.stgit@ahduyck-green-test.jf.intel.com>


On 4/20/2018 9:28 AM, Alexander Duyck wrote:
> This patch adds a common configuration function called
> pci_sriov_configure_simple that will allow for managing VFs on devices
> where the PF is not capable of managing VF resources.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Mark Rustad <mark.d.rustad@intel.com>
> ---
>
> v5: New patch replacing pci_sriov_configure_unmanaged with
>        pci_sriov_configure_simple
>      Dropped bits related to autoprobe changes
> v6: Defined pci_sriov_configure_simple as NULL if IOV is disabled
> v7: Updated pci_sriov_configure_simple to drop need for err value
>      Fixed comment explaining why pci_sriov_configure_simple is NULL
>
>   drivers/pci/iov.c   |   31 +++++++++++++++++++++++++++++++
>   include/linux/pci.h |    3 +++
>   2 files changed, 34 insertions(+)
>
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 677924a..3e0a7fd 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -807,3 +807,34 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
>   	return dev->sriov->total_VFs;
>   }
>   EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> +
> +/**
> + * pci_sriov_configure_simple - helper to configure unmanaged SR-IOV
> + * @dev: the PCI device
> + * @nr_virtfn: number of virtual functions to enable, 0 to disable
> + *
> + * Used to provide generic enable/disable SR-IOV option for devices
> + * that do not manage the VFs generated by their driver. Return value
> + * is negative on error, or number of VFs allocated on success.
> + */
> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn)
> +{
> +	might_sleep();
> +
> +	if (!dev->is_physfn)
> +		return -ENODEV;
> +
> +	if (pci_vfs_assigned(dev)) {
> +		pci_warn(dev,
> +			 "Cannot modify SR-IOV while VFs are assigned\n");
> +		return -EPERM;
> +	}
> +
> +	if (!nr_virtfn) {
> +		sriov_disable(dev);
> +		return 0;
> +	}
> +
> +	return sriov_enable(dev, nr_virtfn) ? : nr_virtfn;
> +}
> +EXPORT_SYMBOL_GPL(pci_sriov_configure_simple);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index ae42289..7d36e39 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1955,6 +1955,7 @@ static inline void pci_mmcfg_late_init(void) { }
>   int pci_vfs_assigned(struct pci_dev *dev);
>   int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>   int pci_sriov_get_totalvfs(struct pci_dev *dev);
> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
>   resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>   void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
>   #else
> @@ -1982,6 +1983,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
>   { return 0; }
>   static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
>   { return 0; }
> +/* this is expected to be used as a function pointer, just define as NULL */
> +#define pci_sriov_configure_simple NULL
>   static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
>   { return 0; }
>   static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
>

Thanks Alex!

Reviewed-by: Greg Rose <gvrose8192@gmail.com>

^ permalink raw reply

* Re: Q: force netif ON even when there is no real link ?
From: Andrew Lunn @ 2018-04-20 22:04 UTC (permalink / raw)
  To: Ran Shalit; +Cc: netdev
In-Reply-To: <CAJ2oMhLPKfzi=-ZvXiHbG8rcEZc39nmU3qocvZc6NrmHwHQadg@mail.gmail.com>

> By saying "mac driver", I mean Ethernet driver with fixed phy.

The fixed-phy implements the usual PHY API. So the MAC driver just
sees a PHY which has a fixed speed, and is up.

     Andrew

^ permalink raw reply

* Re: [PATCH net-next] net: phy: mdio-boardinfo: Allow recursive mdiobus_register()
From: Andrew Lunn @ 2018-04-20 21:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, f.fainelli, vivien.didelot
In-Reply-To: <20180420.103430.1692883931904134627.davem@davemloft.net>

On Fri, Apr 20, 2018 at 10:34:30AM -0400, David Miller wrote:
> From: Andrew Lunn <andrew@lunn.ch>
> Date: Thu, 19 Apr 2018 02:00:47 +0200
> 
> > mdiobus_register will search for any mdiobus board info registered for
> > the bus being registered. If found, it will probe devices on the bus.
> > That device, if for example it is an ethernet switch, may then try to
> > register an mdio bus. Thus we need to allow recursive calls to
> > mdiobus_register.
> > 
> > Holding the mdio_board_lock will cause a deadlock during this
> > recursion. Release the lock and use list_for_each_entry_safe.
> > 
> > Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> 
> Applied.
> 
> While looking over this code I see that we currently never unregister
> mdio boardinfo objects.
> 
> If we have drivers that can be unloaded, as it seems the one you plan
> to add that needs this change should be, the situation could get more
> tricky here.

Hi David

That is in fact normal, if you look at the i2c and spi versions of
this. This is/was generally used for ARM platforms, from before the
times of DT. There was a board file setting up platform devices for
the different bits of the hardware. The hardware never goes away, so
there is no need to remove the description of the hardware, what
devices are on which bus, etc. The drivers for that hardware can
however be removed.

The platform i'm working on is however x86. So i don't have a board
file as such, just a driver which gets loaded because of matches with
ACPI DMI strings. It populates this mdio boardinfo, as well as i2c
boardinfo, causing other drivers to be loaded. And i don't implement a
remove function, so the driver can never be unloaded. I'm happy with
that.

	Andrew

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Willem de Bruijn @ 2018-04-20 21:58 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Sowmini Varadhan, Samudrala, Sridhar, Netdev,
	Willem de Bruijn
In-Reply-To: <CAKgT0UcaapGYAwdQ7udr0qN=r+8d_ii6MZw2uQy6Ct+tMtvfpA@mail.gmail.com>

>>> Also any plans for HW offload support for this? I vaguely recall that
>>> the igb and ixgbe parts had support for something like this in
>>> hardware. I would have to double check to see what exactly is
>>> supported.
>>
>> I hadn't given that much thought until the request yesterday to
>> expose the NETIF_F_GSO_UDP_L4 flag through ethtool. By
>> virtue of having only a single fixed segmentation length, it
>> appears reasonably straightforward to offload.
>
> Actually I just got a chance to start on a review of things. Do we
> need to have to use both GSO_UDP and and GSO_UDP_L4? It might be
> better if we could split these up and specifically call out GSO_UDP as
> UFO and GSO_UDP_L4 as being UDP segmentation.

Thanks for taking a look, Alex.

Agreed, I'll revise that. My initial thought was that both gso skbs need
to take the same udp gso special case branches in places like act_csum
and ovs. But on rereading that seems an unsafe approach, as some
branches are fragmentation specific. I'll review them all and add
separate SKB_GSO_UDP_L4 cases where needed, instead.

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-20 21:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, David Miller, Andrew Morton, linux-mm, eric.dumazet,
	edumazet, bhutchings, netdev, linux-kernel, mst, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180420210200.GH10788@bombadil.infradead.org>



On Fri, 20 Apr 2018, Matthew Wilcox wrote:

> On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > No way. This is just wrong! First of all, you will explode most likely
> > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > enabled quite often.
> > 
> > You're an evil person who doesn't want to fix bugs.
> 
> Steady on.  There's no need for that.  Michal isn't evil.  Please
> apologise.

I see this attitude from Michal again and again.

He didn't want to fix vmalloc(GFP_NOIO), he didn't want to fix alloc_pages 
sleeping when __GFP_NORETRY is used. So what should I say? Fix them and 
you won't be evil :-)

(he could also fix the oom killer, so that it is triggered when 
free_memory+cache+free_swap goes beyond a threshold and not when you loop 
too long in the allocator)

> > You refused to fix vmalloc(GFP_NOIO) misbehavior a year ago (did you make 
> > some progress with it since that time?) and you refuse to fix kvmalloc 
> > misuses.
> 
> I understand you're frustrated, but this is not the way to get the problems
> fixed.
> 
> > I tried this patch on text-only virtual machine and /proc/vmallocinfo 
> > shows 614kB more memory. I tried it on a desktop machine with the chrome 
> > browser open and /proc/vmallocinfo space is increased by 7MB. So no - this 
> > won't exhaust memory and kill the machine.
> 
> This is good data, thank you for providing it.
> 
> > Arguing that this increases memory consumption is as bogus as arguing that 
> > CONFIG_LOCKDEP increses memory consumption. No one is forcing you to 
> > enable CONFIG_LOCKDEP and no one is forcing you to enable this kvmalloc 
> > test too.
> 
> I think there's a real problem which is that CONFIG_DEBUG_VM is too broad.
> It inserts code in a *lot* of places, some of which is quite expensive.
> We would do better to split it into more granular pieces ... although
> an explosion of configuration options isn't great either.  Maybe just
> CONFIG_DEBUG_VM and CONFIG_DEBUG_VM_EXPENSIVE.
> 
> Michal may be wrong, but he's not evil.

I already said that we can change it from CONFIG_DEBUG_VM to 
CONFIG_DEBUG_SG - or to whatever other option you may want, just to make 
sure that it is enabled in distro debug kernels by default.

Mikulas

^ permalink raw reply

* Re: [PATCH RFC net-next] igb: adjust SYSTIM register using TIMADJ register
From: Richard Cochran @ 2018-04-20 21:12 UTC (permalink / raw)
  To: Kshitiz Gupta; +Cc: jeffrey.t.kirsher, intel-wired-lan, netdev, linux-kernel
In-Reply-To: <1524254196-21705-1-git-send-email-kshitiz.gupta@ni.com>

On Fri, Apr 20, 2018 at 02:56:36PM -0500, Kshitiz Gupta wrote:
> Currently the driver adjusts time by reading the current time and then
> modifying it before writing to SYSTIM register. This can introduce
> inaccuracies in SYSTIM. With a PREEMPT_RT kernel, spinlocks may be
> interrupted, which in the existing implementation may lead to increased
> time between the read and the write.
> 
> Alternatively as per section 7.8.3.2 in the i210 data sheet, this
> operation can be done more accurately by using the TIMADJ registers,
> but this should only be used for adjustments less than one 8th of the
> sync interval. Once this register is written, the software can poll on
> TSICR.TADJ to make sure that adjustment operation is completed.

I doubt the utility of this.  The first jump is typically to correct a
large offset of seconds, minutes, or even months.  After that, the
servo corrects any remaining error.

It would help if you would show us a clearly improved servo response
with this change applied.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Matthew Wilcox @ 2018-04-20 21:02 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Michal Hocko, David Miller, Andrew Morton, linux-mm, eric.dumazet,
	edumazet, bhutchings, netdev, linux-kernel, mst, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804201635180.25408@file01.intranet.prod.int.rdu2.redhat.com>

On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> On Fri, 20 Apr 2018, Michal Hocko wrote:
> > No way. This is just wrong! First of all, you will explode most likely
> > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > enabled quite often.
> 
> You're an evil person who doesn't want to fix bugs.

Steady on.  There's no need for that.  Michal isn't evil.  Please
apologise.

> You refused to fix vmalloc(GFP_NOIO) misbehavior a year ago (did you make 
> some progress with it since that time?) and you refuse to fix kvmalloc 
> misuses.

I understand you're frustrated, but this is not the way to get the problems
fixed.

> I tried this patch on text-only virtual machine and /proc/vmallocinfo 
> shows 614kB more memory. I tried it on a desktop machine with the chrome 
> browser open and /proc/vmallocinfo space is increased by 7MB. So no - this 
> won't exhaust memory and kill the machine.

This is good data, thank you for providing it.

> Arguing that this increases memory consumption is as bogus as arguing that 
> CONFIG_LOCKDEP increses memory consumption. No one is forcing you to 
> enable CONFIG_LOCKDEP and no one is forcing you to enable this kvmalloc 
> test too.

I think there's a real problem which is that CONFIG_DEBUG_VM is too broad.
It inserts code in a *lot* of places, some of which is quite expensive.
We would do better to split it into more granular pieces ... although
an explosion of configuration options isn't great either.  Maybe just
CONFIG_DEBUG_VM and CONFIG_DEBUG_VM_EXPENSIVE.

Michal may be wrong, but he's not evil.

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-20 20:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, bhutchings, netdev, linux-kernel, mst,
	jasowang, virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180420134901.GB17484@dhcp22.suse.cz>

On Fri, 20 Apr 2018, Michal Hocko wrote:

> On Fri 20-04-18 06:41:36, Matthew Wilcox wrote:
> > On Fri, Apr 20, 2018 at 03:08:52PM +0200, Michal Hocko wrote:
> > > > In order to detect these bugs reliably I submit this patch that changes
> > > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > > 
> > > No way. This is just wrong! First of all, you will explode most likely
> > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > enabled quite often.
> > 
> > I think it'll still suit Mikulas' debugging needs if we always use
> > vmalloc for sizes above PAGE_SIZE?
> 
> Even if that was the case then this doesn't sounds like CONFIG_DEBUG_VM
> material. We do not want a completely different behavior when the config
> 
> -- 
> Michal Hocko
> SUSE Labs

I'm not arguing that it must be turned on exactly by CONFIG_DEBUG_VM. It 
may be turned on some other option that is enabled in debug kernels 
(CONFIG_DEBUG_SG may be a better option, because you'll get meaningful 
stracktraces from DMA API then).

> is enabled. If we really need some better fallback testing coverage
> then the fault injection, as suggested by Vlastimil, sounds much more
> reasonable to me

People who test kernels will install the kernel-debug package, reboot to 
the debug kernel and run their testsuites. They won't turn on magic 
options in debugfs or use some hideous kernel commandline arguments. If 
the kvmalloc test isn't in the debug kernel, then the testing crew won't 
test it - that's it.

Mikulas

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-20 20:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, Andrew Morton, linux-mm, eric.dumazet, edumazet,
	bhutchings, netdev, linux-kernel, mst, jasowang, virtualization,
	dm-devel, Vlastimil Babka
In-Reply-To: <20180420130852.GC16083@dhcp22.suse.cz>

On Fri, 20 Apr 2018, Michal Hocko wrote:

> On Thu 19-04-18 12:12:38, Mikulas Patocka wrote:
> [...]
> > From: Mikulas Patocka <mpatocka@redhat.com>
> > Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
> > 
> > The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> > kmalloc fails.
> > 
> > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > code.
> > 
> > These bugs are hard to reproduce because vmalloc falls back to kmalloc
> > only if memory is fragmented.
> > 
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> 
> No way. This is just wrong! First of all, you will explode most likely
> on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> enabled quite often.

You're an evil person who doesn't want to fix bugs.

You refused to fix vmalloc(GFP_NOIO) misbehavior a year ago (did you make 
some progress with it since that time?) and you refuse to fix kvmalloc 
misuses.

I tried this patch on text-only virtual machine and /proc/vmallocinfo 
shows 614kB more memory. I tried it on a desktop machine with the chrome 
browser open and /proc/vmallocinfo space is increased by 7MB. So no - this 
won't exhaust memory and kill the machine.

Arguing that this increases memory consumption is as bogus as arguing that 
CONFIG_LOCKDEP increses memory consumption. No one is forcing you to 
enable CONFIG_LOCKDEP and no one is forcing you to enable this kvmalloc 
test too.

Mikulas

^ permalink raw reply

* Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV
From: Randy Dunlap @ 2018-04-20 20:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Bjorn Helgaas, linux-pci, virtio-dev, kvm,
	Netdev, Daly, Dan, LKML, linux-nvme, Keith Busch, netanel,
	Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <CAKgT0UewuO7LWc8wkHPTv2cksF39-8+5nFfEP_DmXVRQtgOWdg@mail.gmail.com>

On 04/20/18 13:01, Alexander Duyck wrote:
> On Fri, Apr 20, 2018 at 10:23 AM, Randy Dunlap <rdunlap@infradead.org> wrote:
>> On 04/20/18 09:28, Alexander Duyck wrote:
>>> This series is meant to add support for SR-IOV on devices when the VFs are
>>> not managed by the kernel. Examples of recent patches attempting to do this
>>> include:
>>> virto - https://patchwork.kernel.org/patch/10241225/
>>> pci-stub - https://patchwork.kernel.org/patch/10109935/
>>> vfio - https://patchwork.kernel.org/patch/10103353/
>>> uio - https://patchwork.kernel.org/patch/9974031/
>>
>> Hi,
>>
>> Somewhere in this patch series it would be nice to tell us what the heck
>> a "PF" is.  :)
>>
>> Thanks.
> 
> Sorry, I was kind of operating on the assumption of everyone
> understanding SR-IOV nomenclature.

Yes, I understood that. :)

> A "PF" is a PCIe Physical Function. When you bring up a PCIe device
> that supports SR-IOV it is the device that is there to begin with.
> 
> A "VF" is a PCIe Virtual Function. You could think of as a logical
> device that is spawned from the physical function using a combination
> of hardware configuration via the SR-IOV block in the PCIe extended
> configuration space and kernel/driver features.
> 
> There are also a number of online resources you could use to research
> SR-IOV further. Hope that helps to clarify some of this.
> 
> Thanks.

Thank you.


-- 
~Randy

^ permalink raw reply

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Julian Anastasov @ 2018-04-20 20:43 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team
In-Reply-To: <20180419212324.1542504-1-kafai@fb.com>


	Hello,

On Thu, 19 Apr 2018, Martin KaFai Lau wrote:

> This patch is not a proper fix and mainly serves for discussion purpose.
> It is based on net-next which I have been using to debug the issue.
> 
> The change that works around the issue is in ensure_mtu_is_adequate().
> Other changes are the rippling effect in function arg.
> 
> This bug was uncovered by one of our legacy service that
> are still using ipvs for load balancing.  In that setup,
> the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> before tx it out to eth0.
> 
> The problem is the kernel stack could pass a skb (which was
> originated from a sys_write(tcp_fd)) to the driver with skb->len
> bigger than the device MTU.  In one NIC setup (with gso and tso off)
> that we are using, it upset the NIC/driver and caused the tx queue
> stalled for tens of seconds which is how it got uncovered.
> (On the NIC side, the NIC firmware and driver have been fixed
> to avoid this tx queue stall after seeing this skb).
> 
> On the kernel side, based on the commit log, this bug should have
> been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").

	Before I go to deeply analyze the GSO code, here
are some preliminary observations:

- IPVS started to use iptunnel_handle_offloads() around 2014,
commit ea1d5d7755a3

- later (2016) the "ip6ip6: Support for GSO/GRO" commits started
to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
I'm not sure if such call can help. At least, I don't see what
different happens in IPVS compared to ip6ip6_tnl_xmit(), for example.

> Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> Due to this error out, it avoid passing it to the driver.  The TCP
> stack then timeout and the TCP mtu probing eventually kicked in to
> lower the skb->len enough to avoid gso_segment.
> 
> After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> -> tcp6_gso_segment() which segment the packet based on a mss
> that does not account for the extra IPv6 hdr.
> 
> Here is a stack from the WARN_ON() that we added to the driver to
> capture the issue:
> [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> ...
> [ 1129.016536] Call Trace:
> [ 1129.021412]  ? skb_release_data+0xfc/0x120
> [ 1129.029587]  ? kfree_skbmem+0x64/0x70
> [ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
> [ 1129.045262]  sch_direct_xmit+0x10f/0x280
> [ 1129.053111]  __qdisc_run+0x223/0x5a0
> [ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
> [ 1129.068268]  dev_queue_xmit+0x10/0x20
> [ 1129.075573]  ? dev_queue_xmit+0x10/0x20
> [ 1129.083218]  ip6_finish_output2+0x2db/0x490
> [ 1129.091573]  ip6_finish_output+0x125/0x190
> [ 1129.099754]  ip6_output+0x5f/0x100
> [ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
> [ 1129.114212]  ip6_local_out+0x35/0x40
> [ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> [ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
> [ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> [ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> [ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> [ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
> [ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
> [ 1129.188849]  nf_hook_slow+0x3e/0xc0
> [ 1129.195800]  ip6_xmit+0x30b/0x540
> [ 1129.202421]  ? ac6_proc_exit+0x20/0x20
> [ 1129.209909]  inet6_csk_xmit+0x82/0xd0
> [ 1129.217207]  ? lock_timer_base+0x76/0xa0
> [ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
> [ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
> [ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
> [ 1129.250106]  tcp_push+0xde/0x100
> [ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
> [ 1129.264910]  tcp_sendmsg+0x2c/0x50
> [ 1129.271703]  inet_sendmsg+0x31/0xb0
> [ 1129.278672]  sock_write_iter+0xf8/0x110
> [ 1129.286335]  new_sync_write+0xd9/0x120
> [ 1129.293823]  vfs_write+0x18d/0x1e0
> [ 1129.300614]  SyS_write+0x48/0xa0
> [ 1129.307045]  do_syscall_64+0x69/0x1e0
> [ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> ...
> [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> [ 1129.657407] skb->len:1554 MTU:1522
> 
> The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> 
> [host-a] > ip -6 r show table local
> local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> 
> [host-a] > ip -6 a
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
>     inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
>        valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
>     inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
>        valid_lft forever preferred_lft forever
> 
> [host-a] > cat /proc/net/ip_vs
> TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
>   -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
>   -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
>   -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7
> 
> [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> send-something-long-here-to-trigger-the-bug
> 
> Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> can also side step the issue.  Like this:

	Yes, reducing the MTU was a solution recommended from
long time ago in the IPVS HOWTO.

> 
> > ip -6 r show table local
> local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
>  1 file changed, 33 insertions(+), 16 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 11c416f3d6e3..88cc0d53ebce 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
>  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
>  }
>  
> -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
>  					  int rt_mode,
>  					  struct ip_vs_iphdr *ipvsh,
>  					  struct sk_buff *skb, int mtu)
>  {
> +	struct netns_ipvs *ipvs = cp->ipvs;
> +
>  #ifdef CONFIG_IP_VS_IPV6
> -	if (skb_af == AF_INET6) {
> +	if (cp->af == AF_INET6) {
>  		struct net *net = ipvs->net;
>  
>  		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  		}
>  	}
>  
> +	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> +		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> +		unsigned short hdr_len = (th->doff << 2) +
> +			skb_network_header_len(skb);
> +
> +		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> +			skb_decrease_gso_size(skb_shinfo(skb),
> +					      skb_shinfo(skb)->gso_size -
> +					      (mtu - hdr_len));

	So, software segmentation happens and we want the
tunnel header to be accounted immediately and not after PMTU
probing period? Is this a problem only for IPVS tunnels?
Do we see such delays with other tunnels? May be this should
be solved for all protocols (not just TCP) and for all tunnels?
Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
local sender, so we should really alter the gso_size for proper
segmentation?

Regards

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 11/13] audit: add support for containerid to network namespaces
From: Richard Guy Briggs @ 2018-04-20 20:42 UTC (permalink / raw)
  To: Paul Moore
  Cc: simo, jlayton, carlos, linux-api, containers, LKML, Eric Paris,
	dhowells, Linux-Audit Mailing List, ebiederm, luto, netdev,
	linux-fsdevel, cgroups, serge, viro
In-Reply-To: <CAHC9VhTOYUAyCJidm99som6FVmjouQUGsEHarQ4h_NhwJxQQfw@mail.gmail.com>

On 2018-04-20 16:22, Paul Moore wrote:
> On Fri, Apr 20, 2018 at 4:02 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2018-04-18 21:46, Paul Moore wrote:
> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> >> > Audit events could happen in a network namespace outside of a task
> >> > context due to packets received from the net that trigger an auditing
> >> > rule prior to being associated with a running task.  The network
> >> > namespace could in use by multiple containers by association to the
> >> > tasks in that network namespace.  We still want a way to attribute
> >> > these events to any potential containers.  Keep a list per network
> >> > namespace to track these container identifiiers.
> >> >
> >> > Add/increment the container identifier on:
> >> > - initial setting of the container id via /proc
> >> > - clone/fork call that inherits a container identifier
> >> > - unshare call that inherits a container identifier
> >> > - setns call that inherits a container identifier
> >> > Delete/decrement the container identifier on:
> >> > - an inherited container id dropped when child set
> >> > - process exit
> >> > - unshare call that drops a net namespace
> >> > - setns call that drops a net namespace
> >> >
> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
> >> > See: https://github.com/linux-audit/audit-testsuite/issues/64
> >> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> >> > ---
> >> >  include/linux/audit.h       |  7 +++++++
> >> >  include/net/net_namespace.h | 12 ++++++++++++
> >> >  kernel/auditsc.c            |  9 ++++++---
> >> >  kernel/nsproxy.c            |  6 ++++++
> >> >  net/core/net_namespace.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >> >  5 files changed, 76 insertions(+), 3 deletions(-)
> 
> ...
> 
> >> > diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> >> > index f6c5d33..d9f1090 100644
> >> > --- a/kernel/nsproxy.c
> >> > +++ b/kernel/nsproxy.c
> >> > @@ -140,6 +140,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> >> >         struct nsproxy *old_ns = tsk->nsproxy;
> >> >         struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
> >> >         struct nsproxy *new_ns;
> >> > +       u64 containerid = audit_get_containerid(tsk);
> >> >
> >> >         if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> >> >                               CLONE_NEWPID | CLONE_NEWNET |
> >> > @@ -167,6 +168,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> >> >                 return  PTR_ERR(new_ns);
> >> >
> >> >         tsk->nsproxy = new_ns;
> >> > +       net_add_audit_containerid(new_ns->net_ns, containerid);
> >> >         return 0;
> >> >  }
> >>
> >> Hopefully we can handle this in audit_net_init(), we just need to
> >> figure out where we can get the correct task_struct for the audit
> >> container ID (some backpointer in the net struct?).
> >
> > I don't follow.  This needs to happen on every task startup.
> > audit_net_init() is only called when a new network namespace starts up.
> 
> Yep, sorry, my mistake.  I must have confused myself when I was
> looking at the code.
> 
> I'm thinking out loud here, bear with me ...
> 
> Assuming we move the netns/audit-container-ID tracking to audit_net,
> and considering we already have an audit hook in copy_process() (it
> calls audit_alloc()), would this be better handled by the
> copy_process() hook?  This ignores naming, audit_alloc() reuse, etc.;
> those can be easily fixed.  I'm just thinking of ways to limit our
> impact on the core kernel and leverage our existing interaction
> points.

The new namespace hasn't been cloned yet and this is the only function
where we have access to both namespaces, so I don't see how that could
work...

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 11/13] audit: add support for containerid to network namespaces
From: Paul Moore @ 2018-04-20 20:22 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <20180420200226.7tyxzuovdbgclw3m@madcap2.tricolour.ca>

On Fri, Apr 20, 2018 at 4:02 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-04-18 21:46, Paul Moore wrote:
>> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> > Audit events could happen in a network namespace outside of a task
>> > context due to packets received from the net that trigger an auditing
>> > rule prior to being associated with a running task.  The network
>> > namespace could in use by multiple containers by association to the
>> > tasks in that network namespace.  We still want a way to attribute
>> > these events to any potential containers.  Keep a list per network
>> > namespace to track these container identifiiers.
>> >
>> > Add/increment the container identifier on:
>> > - initial setting of the container id via /proc
>> > - clone/fork call that inherits a container identifier
>> > - unshare call that inherits a container identifier
>> > - setns call that inherits a container identifier
>> > Delete/decrement the container identifier on:
>> > - an inherited container id dropped when child set
>> > - process exit
>> > - unshare call that drops a net namespace
>> > - setns call that drops a net namespace
>> >
>> > See: https://github.com/linux-audit/audit-kernel/issues/32
>> > See: https://github.com/linux-audit/audit-testsuite/issues/64
>> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
>> > ---
>> >  include/linux/audit.h       |  7 +++++++
>> >  include/net/net_namespace.h | 12 ++++++++++++
>> >  kernel/auditsc.c            |  9 ++++++---
>> >  kernel/nsproxy.c            |  6 ++++++
>> >  net/core/net_namespace.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
>> >  5 files changed, 76 insertions(+), 3 deletions(-)

...

>> > diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> > index f6c5d33..d9f1090 100644
>> > --- a/kernel/nsproxy.c
>> > +++ b/kernel/nsproxy.c
>> > @@ -140,6 +140,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>> >         struct nsproxy *old_ns = tsk->nsproxy;
>> >         struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
>> >         struct nsproxy *new_ns;
>> > +       u64 containerid = audit_get_containerid(tsk);
>> >
>> >         if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> >                               CLONE_NEWPID | CLONE_NEWNET |
>> > @@ -167,6 +168,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>> >                 return  PTR_ERR(new_ns);
>> >
>> >         tsk->nsproxy = new_ns;
>> > +       net_add_audit_containerid(new_ns->net_ns, containerid);
>> >         return 0;
>> >  }
>>
>> Hopefully we can handle this in audit_net_init(), we just need to
>> figure out where we can get the correct task_struct for the audit
>> container ID (some backpointer in the net struct?).
>
> I don't follow.  This needs to happen on every task startup.
> audit_net_init() is only called when a new network namespace starts up.

Yep, sorry, my mistake.  I must have confused myself when I was
looking at the code.

I'm thinking out loud here, bear with me ...

Assuming we move the netns/audit-container-ID tracking to audit_net,
and considering we already have an audit hook in copy_process() (it
calls audit_alloc()), would this be better handled by the
copy_process() hook?  This ignores naming, audit_alloc() reuse, etc.;
those can be easily fixed.  I'm just thinking of ways to limit our
impact on the core kernel and leverage our existing interaction
points.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH net] bpf: sockmap remove dead check
From: Daniel Borkmann @ 2018-04-20 20:13 UTC (permalink / raw)
  To: Jann Horn, ast, netdev, linux-kernel, john.fastabend
In-Reply-To: <20180420161630.233178-1-jannh@google.com>

On 04/20/2018 06:16 PM, Jann Horn wrote:
> Remove dead code that bails on `attr->value_size > KMALLOC_MAX_SIZE` - the
> previous check already bails on `attr->value_size != 4`.
> 
> Signed-off-by: Jann Horn <jannh@google.com>

Applied to bpf tree, thanks Jann!

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Tushar Dave @ 2018-04-21  3:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Sowmini Varadhan, Willem de Bruijn,
	Samudrala, Sridhar, Netdev, Willem de Bruijn
In-Reply-To: <CAKgT0UddxeAR2Sf4gtHOyN6o7+9p02xFnwYB0ceYxBCV4iTi-w@mail.gmail.com>



On 04/20/2018 01:08 PM, Alexander Duyck wrote:
> On Fri, Apr 20, 2018 at 11:27 AM, Tushar Dave <tushar.n.dave@oracle.com> wrote:
>>
>>
>> On 04/18/2018 11:12 AM, Alexander Duyck wrote:
>>>
>>> On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net>
>>> wrote:
>>>>
>>>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>>>> Date: Wed, 18 Apr 2018 08:31:03 -0400
>>>>
>>>>> However, I share Sridhar's concerns about the very fundamental change
>>>>> to UDP message boundary semantics here.  There is actually no such thing
>>>>> as a "segment" in udp, so in general this feature makes me a little
>>>>> uneasy.  Well behaved udp applications should already be sending mtu
>>>>> sized datagrams. And the not-so-well-behaved ones are probably relying
>>>>> on IP fragmentation/reassembly to take care of datagram boundary
>>>>> semantics
>>>>> for them?
>>>>>
>>>>> As Sridhar points out, the feature is not really "negotiated" - one side
>>>>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>>>>> implementation, it will have no way of knowing that message boundaries
>>>>> have been re-adjusted at the sender.
>>>>
>>>>
>>>> There are no "semantics".
>>>>
>>>> What ends up on the wire is the same before the kernel/app changes as
>>>> afterwards.
>>>>
>>>> The only difference is that instead of the application doing N - 1
>>>> sendmsg() calls with mtu sized writes, it's giving everything all at
>>>> once and asking the kernel to segment.
>>>>
>>>> It even gives the application control over the size of the packets,
>>>> which I think is completely prudent in this situation.
>>>
>>>
>>> My only concern with the patch set is verifying what mitigations are
>>> in case so that we aren't trying to set an MSS size that results in a
>>> frame larger than MTU. I'm still digging through the code and trying
>>> to grok it, but I figured I might just put the question out there to
>>> may my reviewing easier.
>>>
>>> Also any plans for HW offload support for this? I vaguely recall that
>>> the igb and ixgbe parts had support for something like this in
>>> hardware. I would have to double check to see what exactly is
>>> supported.
>>
>>
>> Alex,
>>
>> If by HW support you meant UFO (UDP Fragmentation Offload), then I have
>> dig into that last year using ixgbe. And I found that Intel 10G HW does
>> break large UDP packets into MTU size however it does not generate
>> *true* IP fragments. Instead, when large (> MTU) size UDP packet is
>> given to NIC, HW generates unique UDP packets with distinct IP
>> fragments. This makes it impossible for receiving station to reassemble
>> them into one UDP packet.
>>
>> I am not sure about igb!
>>
>> -Tushar
> 
> Tushar,
> 
> I am not sure you have been following this thread, but this is about
> adding UDP GSO support, not fragmentation offload. With GSO support
> the UDP frames are not expected to be reassembled they are meant to be
> handled as individual frames.
> 
> What you have described is why I am interested. This patch set adds
> support for GSO segmentation, not fragmentation.
I see. Never mind.
Thanks.

-Tushar
> 
> Thanks.
> 
> - Alex
> 

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Alexander Duyck @ 2018-04-20 20:08 UTC (permalink / raw)
  To: Tushar Dave
  Cc: David Miller, Sowmini Varadhan, Willem de Bruijn,
	Samudrala, Sridhar, Netdev, Willem de Bruijn
In-Reply-To: <5195a7d9-7b7d-044b-0031-6fa13c0fe48a@oracle.com>

On Fri, Apr 20, 2018 at 11:27 AM, Tushar Dave <tushar.n.dave@oracle.com> wrote:
>
>
> On 04/18/2018 11:12 AM, Alexander Duyck wrote:
>>
>> On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net>
>> wrote:
>>>
>>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>>> Date: Wed, 18 Apr 2018 08:31:03 -0400
>>>
>>>> However, I share Sridhar's concerns about the very fundamental change
>>>> to UDP message boundary semantics here.  There is actually no such thing
>>>> as a "segment" in udp, so in general this feature makes me a little
>>>> uneasy.  Well behaved udp applications should already be sending mtu
>>>> sized datagrams. And the not-so-well-behaved ones are probably relying
>>>> on IP fragmentation/reassembly to take care of datagram boundary
>>>> semantics
>>>> for them?
>>>>
>>>> As Sridhar points out, the feature is not really "negotiated" - one side
>>>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>>>> implementation, it will have no way of knowing that message boundaries
>>>> have been re-adjusted at the sender.
>>>
>>>
>>> There are no "semantics".
>>>
>>> What ends up on the wire is the same before the kernel/app changes as
>>> afterwards.
>>>
>>> The only difference is that instead of the application doing N - 1
>>> sendmsg() calls with mtu sized writes, it's giving everything all at
>>> once and asking the kernel to segment.
>>>
>>> It even gives the application control over the size of the packets,
>>> which I think is completely prudent in this situation.
>>
>>
>> My only concern with the patch set is verifying what mitigations are
>> in case so that we aren't trying to set an MSS size that results in a
>> frame larger than MTU. I'm still digging through the code and trying
>> to grok it, but I figured I might just put the question out there to
>> may my reviewing easier.
>>
>> Also any plans for HW offload support for this? I vaguely recall that
>> the igb and ixgbe parts had support for something like this in
>> hardware. I would have to double check to see what exactly is
>> supported.
>
>
> Alex,
>
> If by HW support you meant UFO (UDP Fragmentation Offload), then I have
> dig into that last year using ixgbe. And I found that Intel 10G HW does
> break large UDP packets into MTU size however it does not generate
> *true* IP fragments. Instead, when large (> MTU) size UDP packet is
> given to NIC, HW generates unique UDP packets with distinct IP
> fragments. This makes it impossible for receiving station to reassemble
> them into one UDP packet.
>
> I am not sure about igb!
>
> -Tushar

Tushar,

I am not sure you have been following this thread, but this is about
adding UDP GSO support, not fragmentation offload. With GSO support
the UDP frames are not expected to be reassembled they are meant to be
handled as individual frames.

What you have described is why I am interested. This patch set adds
support for GSO segmentation, not fragmentation.

Thanks.

- Alex

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 11/13] audit: add support for containerid to network namespaces
From: Richard Guy Briggs @ 2018-04-20 20:02 UTC (permalink / raw)
  To: Paul Moore
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <CAHC9VhRkstDMjd5T3w+iOUDjzDAs1AOm0xd3p6v_xn6fNGYQhA@mail.gmail.com>

On 2018-04-18 21:46, Paul Moore wrote:
> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > Audit events could happen in a network namespace outside of a task
> > context due to packets received from the net that trigger an auditing
> > rule prior to being associated with a running task.  The network
> > namespace could in use by multiple containers by association to the
> > tasks in that network namespace.  We still want a way to attribute
> > these events to any potential containers.  Keep a list per network
> > namespace to track these container identifiiers.
> >
> > Add/increment the container identifier on:
> > - initial setting of the container id via /proc
> > - clone/fork call that inherits a container identifier
> > - unshare call that inherits a container identifier
> > - setns call that inherits a container identifier
> > Delete/decrement the container identifier on:
> > - an inherited container id dropped when child set
> > - process exit
> > - unshare call that drops a net namespace
> > - setns call that drops a net namespace
> >
> > See: https://github.com/linux-audit/audit-kernel/issues/32
> > See: https://github.com/linux-audit/audit-testsuite/issues/64
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> >  include/linux/audit.h       |  7 +++++++
> >  include/net/net_namespace.h | 12 ++++++++++++
> >  kernel/auditsc.c            |  9 ++++++---
> >  kernel/nsproxy.c            |  6 ++++++
> >  net/core/net_namespace.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >  5 files changed, 76 insertions(+), 3 deletions(-)
> 
> ...
> 
> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > index 0490084..343a428 100644
> > --- a/include/net/net_namespace.h
> > +++ b/include/net/net_namespace.h
> > @@ -33,6 +33,7 @@
> >  #include <linux/ns_common.h>
> >  #include <linux/idr.h>
> >  #include <linux/skbuff.h>
> > +#include <linux/audit.h>
> >
> >  struct user_namespace;
> >  struct proc_dir_entry;
> > @@ -150,6 +151,7 @@ struct net {
> >  #endif
> >         struct sock             *diag_nlsk;
> >         atomic_t                fnhe_genid;
> > +       struct list_head        audit_containerid;
> >  } __randomize_layout;
> 
> We talked about this briefly off-list, you should be using audit_net
> and the net_generic mechanism instead of this.
> 
> >  #include <linux/seq_file_net.h>
> > @@ -301,6 +303,16 @@ static inline struct net *read_pnet(const possible_net_t *pnet)
> >  #define __net_initconst        __initconst
> >  #endif
> >
> > +#ifdef CONFIG_NET_NS
> > +void net_add_audit_containerid(struct net *net, u64 containerid);
> > +void net_del_audit_containerid(struct net *net, u64 containerid);
> > +#else
> > +static inline void net_add_audit_containerid(struct net *, u64)
> > +{ }
> > +static inline void net_del_audit_containerid(struct net *, u64)
> > +{ }
> > +#endif
> > +
> >  int peernet2id_alloc(struct net *net, struct net *peer);
> >  int peernet2id(struct net *net, struct net *peer);
> >  bool peernet_has_id(struct net *net, struct net *peer);
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index 2f02ed9..208da962 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -75,6 +75,7 @@
> >  #include <linux/uaccess.h>
> >  #include <linux/fsnotify_backend.h>
> >  #include <uapi/linux/limits.h>
> > +#include <net/net_namespace.h>
> >
> >  #include "audit.h"
> >
> > @@ -2175,16 +2176,18 @@ static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainer
> >   */
> >  int audit_set_containerid(struct task_struct *task, u64 containerid)
> >  {
> > -       u64 oldcontainerid;
> > +       u64 oldcontainerid = audit_get_containerid(task);
> >         int rc;
> > -
> > -       oldcontainerid = audit_get_containerid(task);
> > +       struct net *net = task->nsproxy->net_ns;
> >
> >         rc = audit_set_containerid_perm(task, containerid);
> >         if (!rc) {
> > +               if (cid_valid(oldcontainerid))
> > +                       net_del_audit_containerid(net, oldcontainerid);
> 
> Using audit_net we can handle this internal to audit, which is a Good Thing.

No problem, done.

> >                 task_lock(task);
> >                 task->containerid = containerid;
> >                 task_unlock(task);
> > +               net_add_audit_containerid(net, containerid);
> 
> Same.
> 
> >         }
> >
> >         audit_log_set_containerid(task, oldcontainerid, containerid, rc);
> > diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> > index f6c5d33..d9f1090 100644
> > --- a/kernel/nsproxy.c
> > +++ b/kernel/nsproxy.c
> > @@ -140,6 +140,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> >         struct nsproxy *old_ns = tsk->nsproxy;
> >         struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
> >         struct nsproxy *new_ns;
> > +       u64 containerid = audit_get_containerid(tsk);
> >
> >         if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> >                               CLONE_NEWPID | CLONE_NEWNET |
> > @@ -167,6 +168,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> >                 return  PTR_ERR(new_ns);
> >
> >         tsk->nsproxy = new_ns;
> > +       net_add_audit_containerid(new_ns->net_ns, containerid);
> >         return 0;
> >  }
> 
> Hopefully we can handle this in audit_net_init(), we just need to
> figure out where we can get the correct task_struct for the audit
> container ID (some backpointer in the net struct?).

I don't follow.  This needs to happen on every task startup.
audit_net_init() is only called when a new network namespace starts up.

> > @@ -217,6 +219,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
> >  void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
> >  {
> >         struct nsproxy *ns;
> > +       u64 containerid = audit_get_containerid(p);
> >
> >         might_sleep();
> >
> > @@ -224,6 +227,9 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
> >         ns = p->nsproxy;
> >         p->nsproxy = new;
> >         task_unlock(p);
> > +       net_del_audit_containerid(ns->net_ns, containerid);
> > +       if (new)
> > +               net_add_audit_containerid(new->net_ns, containerid);
> 
> Okay, we might need a hook here for switching namespaces, but I would
> much rather it be a generic audit hook that calls directly into audit.

Trivial, done.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV
From: Alexander Duyck @ 2018-04-20 20:01 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Alexander Duyck, Bjorn Helgaas, linux-pci, virtio-dev, kvm,
	Netdev, Daly, Dan, LKML, linux-nvme, Keith Busch, netanel,
	Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <5fc62d10-790f-2356-c2f6-fd49f542dd3c@infradead.org>

On Fri, Apr 20, 2018 at 10:23 AM, Randy Dunlap <rdunlap@infradead.org> wrote:
> On 04/20/18 09:28, Alexander Duyck wrote:
>> This series is meant to add support for SR-IOV on devices when the VFs are
>> not managed by the kernel. Examples of recent patches attempting to do this
>> include:
>> virto - https://patchwork.kernel.org/patch/10241225/
>> pci-stub - https://patchwork.kernel.org/patch/10109935/
>> vfio - https://patchwork.kernel.org/patch/10103353/
>> uio - https://patchwork.kernel.org/patch/9974031/
>
> Hi,
>
> Somewhere in this patch series it would be nice to tell us what the heck
> a "PF" is.  :)
>
> Thanks.

Sorry, I was kind of operating on the assumption of everyone
understanding SR-IOV nomenclature.

A "PF" is a PCIe Physical Function. When you bring up a PCIe device
that supports SR-IOV it is the device that is there to begin with.

A "VF" is a PCIe Virtual Function. You could think of as a logical
device that is spawned from the physical function using a combination
of hardware configuration via the SR-IOV block in the PCIe extended
configuration space and kernel/driver features.

There are also a number of online resources you could use to research
SR-IOV further. Hope that helps to clarify some of this.

Thanks.

Alex

^ permalink raw reply

* [PATCH RFC net-next] igb: adjust SYSTIM register using TIMADJ register
From: Kshitiz Gupta @ 2018-04-20 19:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher, richardcochran
  Cc: intel-wired-lan, netdev, linux-kernel, Kshitiz Gupta

Currently the driver adjusts time by reading the current time and then
modifying it before writing to SYSTIM register. This can introduce
inaccuracies in SYSTIM. With a PREEMPT_RT kernel, spinlocks may be
interrupted, which in the existing implementation may lead to increased
time between the read and the write.

Alternatively as per section 7.8.3.2 in the i210 data sheet, this
operation can be done more accurately by using the TIMADJ registers,
but this should only be used for adjustments less than one 8th of the
sync interval. Once this register is written, the software can poll on
TSICR.TADJ to make sure that adjustment operation is completed.

This change implements using TIMADJ registers for adjTime call for
deltas less a configurable threshold. This threshold is exposed as
configurable module parameter. Once the delta is written to the
register, driver polls on TSICR.TADJ register to make sure the
adjustment is finished. For deltas greater than this threshold the
driver reverts back to using the old Read-Modify-Write approach.

Signed-off-by: Kshitiz Gupta <kshitiz.gupta@ni.com>
---

This change does create an oddity in the way the adjument takes place.
For any adjustments greater than the threshold the operation is
"immediate", but for the case where this new approach of TIMADJ is
used it can lead to the driver blocking for 8ns * delta.
There is another approach that could be taken. The driver could write
the TIMADJ and return immediately. Any subsequent could block until the
operation is done or return EINPROGRESS to let the caller know to
retry.
---
 drivers/net/ethernet/intel/igb/e1000_regs.h |  1 +
 drivers/net/ethernet/intel/igb/igb_ptp.c    | 64 ++++++++++++++++++++++++++---
 2 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_regs.h b/drivers/net/ethernet/intel/igb/e1000_regs.h
index d84afdd..607e0ee 100644
--- a/drivers/net/ethernet/intel/igb/e1000_regs.h
+++ b/drivers/net/ethernet/intel/igb/e1000_regs.h
@@ -100,6 +100,7 @@
 #define E1000_SYSTIML    0x0B600 /* System time register Low - RO */
 #define E1000_SYSTIMH    0x0B604 /* System time register High - RO */
 #define E1000_TIMINCA    0x0B608 /* Increment attributes register - RW */
+#define E1000_TIMADJ     0x0B60C /* Time Adjustment Offset Register - RW */
 #define E1000_TSAUXC     0x0B640 /* Timesync Auxiliary Control register */
 #define E1000_TRGTTIML0  0x0B644 /* Target Time Register 0 Low  - RW */
 #define E1000_TRGTTIMH0  0x0B648 /* Target Time Register 0 High - RW */
diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c
index 9714b7f..6e9e0ac 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -76,6 +76,16 @@
 
 static void igb_ptp_tx_hwtstamp(struct igb_adapter *adapter);
 
+/* If adjTime is called with delta less that timadj_threshold the driver uses
+ * the TIMADJ register to update the SYSTIM register. This is used only for
+ * deltas which are less than 1/8th of the SYNC interval. Using default value
+ * of 15.6ms, which happens to be 1/8th of the default SYNC interval for
+ * IEEE 802.1AS-2011.
+ */
+static int timadj_threshold = 15625000;
+module_param(timadj_threshold, int, 0644);
+MODULE_PARM_DESC(timadj_threshold, "Threshold under which TIMADJ will be used to update SYSTIM");
+
 /* SYSTIM read access for the 82576 */
 static cycle_t igb_ptp_read_82576(const struct cyclecounter *cc)
 {
@@ -269,18 +279,60 @@ static int igb_ptp_adjtime_i210(struct ptp_clock_info *ptp, s64 delta)
 {
 	struct igb_adapter *igb = container_of(ptp, struct igb_adapter,
 					       ptp_caps);
-	unsigned long flags;
+	unsigned long flags, timeout_us, i, wait_us = 1000;
+	unsigned long clock_interval_ns = 8;
 	struct timespec64 now, then = ns_to_timespec64(delta);
+	struct e1000_hw *hw = &igb->hw;
+	u32 adjust_offset = 0, tsicr = 0;
+	int ret = 0;
 
 	spin_lock_irqsave(&igb->tmreg_lock, flags);
 
-	igb_ptp_read_i210(igb, &now);
-	now = timespec64_add(now, then);
-	igb_ptp_write_i210(igb, (const struct timespec64 *)&now);
+	if (abs(delta) > timadj_threshold) {
+		igb_ptp_read_i210(igb, &now);
+		now = timespec64_add(now, then);
+		igb_ptp_write_i210(igb, (const struct timespec64 *)&now);
+	} else {
+		/* For smaller values based on the threshold set in the module
+		 * parameter adjust SYSTIM registers using TIMADJ registers.
+		 * Following the write access to the TIMADJ register, the
+		 * hardware at every 8ns clock repeats the following two steps:
+		 *
+		 * SYSTIM = SYSTIM + INC_TIME +/- 1 nsec. Add or subtract 1 nsec
+		 * is defined by TIMADJ.Sign ( 0 means Add and 1 means Subtract)
+		 *
+		 * Tadjust = Tadjust - 1 nsec
+		 */
 
-	spin_unlock_irqrestore(&igb->tmreg_lock, flags);
+		/* TIMADJ is one's complement, set the magnitude and sign */
+		adjust_offset = abs(delta);
+		if (delta < 0)
+			adjust_offset |= ISGN;
 
-	return 0;
+		wr32(E1000_TIMADJ, adjust_offset);
+
+		/* Poll on TSICR.TADJ flag. This is set when hardware completes
+		 * the adjustment based on TIMADJ register.
+		 */
+		timeout_us = timadj_threshold * clock_interval_ns;
+		for (i = 0; i < timeout_us; i += wait_us) {
+			tsicr = rd32(E1000_TSICR);
+
+			if (tsicr & TSINTR_TADJ)
+				break;
+
+			udelay(wait_us);
+		}
+
+		if (i >= timeout_us) {
+			tsicr = rd32(E1000_TSICR);
+			if (!(tsicr & TSINTR_TADJ))
+				ret = -ETIMEDOUT;
+		}
+	}
+
+	spin_unlock_irqrestore(&igb->tmreg_lock, flags);
+	return ret;
 }
 
 static int igb_ptp_gettime_82576(struct ptp_clock_info *ptp,
-- 
1.9.1

^ permalink raw reply related

* [PATCH net] ibmvnic: Clean actual number of RX or TX pools
From: Thomas Falcon @ 2018-04-20 19:25 UTC (permalink / raw)
  To: netdev; +Cc: nfont, jallen, linuxppc-dev, Thomas Falcon

Avoid using value stored in the login response buffer when
cleaning TX and RX buffer pools since these could be inconsistent
depending on the device state. Instead use the field in the driver's
private data that tracks the number of active pools.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 2df01ad..6e8d6a6 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1128,7 +1128,7 @@ static void clean_rx_pools(struct ibmvnic_adapter *adapter)
 	if (!adapter->rx_pool)
 		return;
 
-	rx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_rxadd_subcrqs);
+	rx_scrqs = adapter->num_active_rx_pools;
 	rx_entries = adapter->req_rx_add_entries_per_subcrq;
 
 	/* Free any remaining skbs in the rx buffer pools */
@@ -1177,7 +1177,7 @@ static void clean_tx_pools(struct ibmvnic_adapter *adapter)
 	if (!adapter->tx_pool || !adapter->tso_pool)
 		return;
 
-	tx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_txsubm_subcrqs);
+	tx_scrqs = adapter->num_active_tx_pools;
 
 	/* Free any remaining skbs in the tx buffer pools */
 	for (i = 0; i < tx_scrqs; i++) {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCHv4 net 3/3] net: sched: ife: check on metadata length
From: Alexander Aring @ 2018-04-20 19:15 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180420191505.27633-1-aring@mojatatu.com>

This patch checks if sk buffer is available to dererence ife header. If
not then NULL will returned to signal an malformed ife packet. This
avoids to crashing the kernel from outside.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/ife/ife.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7fbe70a0af4b..13bbf8cb6a39 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -69,6 +69,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
 	int total_pull;
 	u16 ifehdrln;
 
+	if (!pskb_may_pull(skb, skb->dev->hard_header_len + IFE_METAHDRLEN))
+		return NULL;
+
 	ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
 	ifehdrln = ntohs(ifehdr->metalen);
 	total_pull = skb->dev->hard_header_len + ifehdrln;
-- 
2.11.0

^ permalink raw reply related

* [PATCHv4 net 2/3] net: sched: ife: handle malformed tlv length
From: Alexander Aring @ 2018-04-20 19:15 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180420191505.27633-1-aring@mojatatu.com>

There is currently no handling to check on a invalid tlv length. This
patch adds such handling to avoid killing the kernel with a malformed
ife packet.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 35 +++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  7 ++++++-
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/include/net/ife.h b/include/net/ife.h
index 44b9c00f7223..e117617e3c34 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -12,7 +12,8 @@
 void *ife_encode(struct sk_buff *skb, u16 metalen);
 void *ife_decode(struct sk_buff *skb, u16 *metalen);
 
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen);
 int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
 			const void *dval);
 
diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7d1ec76e7f43..7fbe70a0af4b 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -92,12 +92,43 @@ struct meta_tlvhdr {
 	__be16 len;
 };
 
+static bool __ife_tlv_meta_valid(const unsigned char *skbdata,
+				 const unsigned char *ifehdr_end)
+{
+	const struct meta_tlvhdr *tlv;
+	u16 tlvlen;
+
+	if (unlikely(skbdata + sizeof(*tlv) > ifehdr_end))
+		return false;
+
+	tlv = (const struct meta_tlvhdr *)skbdata;
+	tlvlen = ntohs(tlv->len);
+
+	/* tlv length field is inc header, check on minimum */
+	if (tlvlen < NLA_HDRLEN)
+		return false;
+
+	/* overflow by NLA_ALIGN check */
+	if (NLA_ALIGN(tlvlen) < tlvlen)
+		return false;
+
+	if (unlikely(skbdata + NLA_ALIGN(tlvlen) > ifehdr_end))
+		return false;
+
+	return true;
+}
+
 /* Caller takes care of presenting data in network order
  */
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen)
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen)
 {
-	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+	struct meta_tlvhdr *tlv;
+
+	if (!__ife_tlv_meta_valid(skbdata, ifehdr_end))
+		return NULL;
 
+	tlv = (struct meta_tlvhdr *)skbdata;
 	*dlen = ntohs(tlv->len) - NLA_HDRLEN;
 	*attrtype = ntohs(tlv->type);
 
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 49b8ab551fbe..8527cfdc446d 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -682,7 +682,12 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
 		u16 mtype;
 		u16 dlen;
 
-		curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
+		curr_data = ife_tlv_meta_decode(tlv_data, ifehdr_end, &mtype,
+						&dlen, NULL);
+		if (!curr_data) {
+			qstats_drop_inc(this_cpu_ptr(ife->common.cpu_qstats));
+			return TC_ACT_SHOT;
+		}
 
 		if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
 			/* abuse overlimits to count when we receive metadata
-- 
2.11.0

^ permalink raw reply related

* [PATCHv4 net 1/3] net: sched: ife: signal not finding metaid
From: Alexander Aring @ 2018-04-20 19:15 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180420191505.27633-1-aring@mojatatu.com>

We need to record stats for received metadata that we dont know how
to process. Have find_decode_metaid() return -ENOENT to capture this.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/act_ife.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index a5994cf0512b..49b8ab551fbe 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -652,7 +652,7 @@ static int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
 		}
 	}
 
-	return 0;
+	return -ENOENT;
 }
 
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
-- 
2.11.0

^ permalink raw reply related

* [PATCHv4 net 0/3] net: sched: ife: malformed ife packet fixes
From: Alexander Aring @ 2018-04-20 19:15 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring

As promised at netdev 2.2 tc workshop I am working on adding scapy support for
tdc testing. It is still work in progress. I will submit the patches to tdc
later (they are not in good shape yet). The good news is I have been able to
find bugs which normal packet testing would not be able to find.
With fuzzy testing I was able to craft certain malformed packets that IFE
action was not able to deal with. This patch set fixes those bugs.

changes since v4:
 - use pskb_may_pull before pointer assign

changes since v3:
 - use pskb_may_pull

changes since v2:
 - remove inline from __ife_tlv_meta_valid
 - add const to cast to meta_tlvhdr
 - add acked and reviewed tags

Alexander Aring (3):
  net: sched: ife: signal not finding metaid
  net: sched: ife: handle malformed tlv length
  net: sched: ife: check on metadata length

 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 38 ++++++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  9 +++++++--
 3 files changed, 45 insertions(+), 5 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net] strparser: Do not call mod_delayed_work with a timeout of LONG_MAX
From: Doron Roberts-Kedes @ 2018-04-20 19:11 UTC (permalink / raw)
  To: David Miller; +Cc: Tejun Heo, netdev, Doron Roberts-Kedes

struct sock's sk_rcvtimeo is initialized to
LONG_MAX/MAX_SCHEDULE_TIMEOUT in sock_init_data. Calling
mod_delayed_work with a timeout of LONG_MAX causes spurious execution of
the work function. timer->expires is set equal to jiffies + LONG_MAX.
When timer_base->clk falls behind the current value of jiffies,
the delta between timer_base->clk and jiffies + LONG_MAX causes the
expiration to be in the past. Returning early from strp_start_timer if
timeo == LONG_MAX solves this problem.

Found while testing net/tls_sw recv path.

Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages")
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
---
 net/strparser/strparser.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 805b139..092bebc 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -67,7 +67,7 @@ static void strp_abort_strp(struct strparser *strp, int err)

 static void strp_start_timer(struct strparser *strp, long timeo)
 {
-	if (timeo)
+	if (timeo && timeo != LONG_MAX)
 		mod_delayed_work(strp_wq, &strp->msg_timer_work, timeo);
 }

-- 
2.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox