Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] macmace: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-10  1:25 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: David S. Miller, linux-m68k, netdev, Linux Kernel Mailing List,
	Christoph Hellwig
In-Reply-To: <CAMuHMdU1XBqt7hwEW6JTas64ZNGCGCMr5HMZwuLo0O-ZBCOWyA@mail.gmail.com>

On Thu, 3 May 2018, Geert Uytterhoeven wrote:

> 
> Perhaps you can add a new helper 
> (platform_device_register_simple_dma()?) that takes the DMA mask, too?

Would there be enough potential callers in future to justify that API?
It seems that there haven't been many in the past. I found four users of 
platform_device_register_simple() which might benefit. Mostly these call 
dma_set_coherent_mask() in the platform driver probe routine.

drivers/gpu/drm/etnaviv/etnaviv_drv.c
drivers/gpu/drm/exynos/exynos_drm_drv.c
drivers/gpu/drm/omapdrm/omap_drv.c
drivers/parport/parport_pc.c

(Am I missing any others?)

To actually hoist the dma mask setup out of existing platform drivers 
would have implications for every device that matches with those drivers. 

That's a bit risky since I can't test those devices -- that's assuming I 
could identify them all; sometimes platform device matching is not well 
defined at build time (see loongson_sysconf.ecname).

So far, it looks like macmace and macsonic would be the only callers of 
this new API call.

What's worse, if you do pass a dma_mask in struct platform_device_info, 
you end up with this problem in platform_device_register_full():

        if (pdevinfo->dma_mask) {
                /*
                 * This memory isn't freed when the device is put,
                 * I don't have a nice idea for that though.  Conceptually
                 * dma_mask in struct device should not be a pointer.
                 * See http://thread.gmane.org/gmane.linux.kernel.pci/9081
                 */
                pdev->dev.dma_mask =
                        kmalloc(sizeof(*pdev->dev.dma_mask), GFP_KERNEL);

Most of the platform drivers that call dma_coerce_mask_and_coherent() are 
using pdev->of_match_table, not platform_device_register_simple(). Many of 
them have a comment like this:

        /*
         * Right now device-tree probed devices don't get dma_mask set.
         * Since shared usb code relies on it, set it here for now.
         * Once we have dma capability bindings this can go away.
         */

> With people setting the mask to kill the WARNING splat, this may become 
> more common.

Since the commit which introduced the WARNING, only commits f61e64310b75 
("m68k: set dma and coherent masks for platform FEC ethernets") and 
7bcfab202ca7 ("powerpc/macio: set a proper dma_coherent_mask") seem to be 
aimed at squelching that WARNING.

(Am I missing any others?)

So far, this is not looking like a common problem, and I'm having trouble 
finding some way to improve on my original patches.

-- 

^ permalink raw reply

* Re: [PATCH net] macmace: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-10  1:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Geert Uytterhoeven, David S. Miller, linux-m68k, netdev,
	Linux Kernel Mailing List
In-Reply-To: <20180503085120.GA14574@lst.de>

On Thu, 3 May 2018, Christoph Hellwig wrote:

> On Thu, May 03, 2018 at 10:46:56AM +0200, Geert Uytterhoeven wrote:
> > Perhaps you can add a new helper 
> > (platform_device_register_simple_dma()?) that takes the DMA mask, too? 
> > With people setting the mask to kill the WARNING splat, this may 
> > become more common.
> > 
> > struct platform_device_info already has a dma_mask field, but 
> > platform_device_register_resndata() explicitly sets it to zero.
> 
> Yes, that would be useful.  The other assumption could be that platform 
> devices always allow an all-0xff dma mask.

Could that have unintended side effects? The mask is presently unset by 
default, and my worry would be that changing it may cause some drivers to 
behave differently.

A quick grep turns up this in drivers/spi/spi-au1550.c for example,

	if (pdev->dev.dma_mask == NULL)
		dev_warn(&pdev->dev, "no dma mask\n");
	else
		hw->usedma = 1;

Also, if pdev.dev.dma_mask is to get a default value, shouldn't it use the 
same default as dma_get_mask, to avoid unintended side effects?

static inline u64 dma_get_mask(struct device *dev)
{
        if (dev && dev->dma_mask && *dev->dma_mask)
                return *dev->dma_mask;
        return DMA_BIT_MASK(32);
}

-- 

^ permalink raw reply

* Re: [PATCH net-next RFC 1/3] net: Add support to configure SR-IOV VF minimum and maximum queues.
From: Jakub Kicinski @ 2018-05-10  1:10 UTC (permalink / raw)
  To: Michael Chan; +Cc: Or Gerlitz, David Miller, Netdev
In-Reply-To: <CACKFLik6Mcd3YL8hUGG=uFV6QkBbnj3rskg08GezQrWKK5TBMA@mail.gmail.com>

On Wed, 9 May 2018 17:22:50 -0700, Michael Chan wrote:
> On Wed, May 9, 2018 at 4:15 PM, Jakub Kicinski wrote:
> > On Wed,  9 May 2018 07:21:41 -0400, Michael Chan wrote:  
> >> VF Queue resources are always limited and there is currently no
> >> infrastructure to allow the admin. on the host to add or reduce queue
> >> resources for any particular VF.  With ever increasing number of VFs
> >> being supported, it is desirable to allow the admin. to configure queue
> >> resources differently for the VFs.  Some VFs may require more or fewer
> >> queues due to different bandwidth requirements or different number of
> >> vCPUs in the VM.  This patch adds the infrastructure to do that by
> >> adding IFLA_VF_QUEUES netlink attribute and a new .ndo_set_vf_queues()
> >> to the net_device_ops.
> >>
> >> Four parameters are exposed for each VF:
> >>
> >> o min_tx_queues - Guaranteed or current tx queues assigned to the VF.  
> >
> > This muxing of semantics may be a little awkward and unnecessary, would
> > it make sense for struct ifla_vf_info to have a separate fields for
> > current number of queues and the admin-set guaranteed min?  
> 
> The loose semantics is mainly to allow some flexibility in
> implementation.  Sure, we can tighten the definitions or add
> additional fields.

I would appreciate that, if others don't disagree.  I personally don't
see the need for flexibility (AKA per-vendor behaviour) here, quite the
opposite, min/max/current number of queues seems quite self-explanatory.

Or at least don't allow min to mean current?  Otherwise the API gets a
bit asymmetrical :(

> > Is there a real world use case for the min value or are you trying to
> > make the API feature complete?  
> 
> In this proposal, these parameters are mainly viewed as the bounds for
> the queues that each VF can potentially allocate.  The actual number
> of queues chosen by the VF driver or modified by the VF user can be
> any number within the bounds.

Perhaps you have misspoken here - these are not allowed bounds, right?
min is the guarantee that queues will be available, not requirement.
Similar to bandwidth allocation.

IOW if the bounds are set [4, 16] the VF may still choose to use 1
queue, event thought that's not within bounds.

> We currently need to have min and max parameters to support the
> different modes we use to distribute the queue resources to the VFs.
> In one mode, for example, resources are statically divided and each VF
> has a small number of guaranteed queues (min = max).  In a different
> mode, we allow more flexible resource allocation with each VF having a
> small number of guaranteed queues but a higher number of
> non-guaranteed queues (min < max).  Some VFs may be able to allocate
> queues much higher than min when resources are still available, while
> others may only be able to allocate min queues when resources are used
> up.
> 
> With min and max exposed, the PF user can properly tweak the resources
> for each VF described above.

Right, I was just looking for a real world scenario where this
flexibility is going to be used - mainly because the switchdev model I
described below won't allow it.  I'm not sure users will leave a
portion of queues to be allocated by chance.

> >> o max_tx_queues - Maximum but not necessarily guaranteed tx queues
> >>   available to the VF.
> >>
> >> o min_rx_queues - Guaranteed or current rx queues assigned to the VF.
> >>
> >> o max_rx_queues - Maximum but not necessarily guaranteed rx queues
> >>   available to the VF.
> >>
> >> The "ip link set" command will subsequently be patched to support the new
> >> operation to set the above parameters.
> >>
> >> After the admin. makes a change to the above parameters, the corresponding
> >> VF will have a new range of channels to set using ethtool -L.
> >>
> >> Signed-off-by: Michael Chan <michael.chan@broadcom.com>  
> >
> > In switchdev mode we can use number of queues on the representor as a
> > proxy for max number of queues allowed for the ASIC port.  This works
> > better when representors are muxed in the first place than when they
> > have actual queues backing them.  WDYT about such scheme, Or?  A very
> > pleasant side-effect is that one can configure qdiscs and get stats
> > per-HW queue.  
> 
> This is an interesting approach.  But it doesn't have the min and max
> for each VF, and also only works in switchdev mode.

It allows controlling all ports of the switch with the same, existing
and well known API (incl. PFs).  But sadly I don't think we are at the
point where switchdev-mode solutions are considered an alternative, so
I'm only mentioning it to broaden the discussion :)  I'm not opposed to
your patches :)

^ permalink raw reply

* Proposal
From: Zeliha Omer Faruk @ 2018-05-10  0:39 UTC (permalink / raw)




-- 
Hello

Greetings to you please i have a business proposal for you contact me
for more detailes asap thanks.

Best Regards,
Miss.Zeliha ömer faruk
Esentepe Mahallesi Büyükdere
Caddesi Kristal Kule Binasi
No:215
Sisli - Istanbul, Turkey

^ permalink raw reply

* Re: [PATCH net-next RFC 1/3] net: Add support to configure SR-IOV VF minimum and maximum queues.
From: Michael Chan @ 2018-05-10  0:22 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Or Gerlitz, David Miller, Netdev
In-Reply-To: <20180509161509.373f8c1b@cakuba.netronome.com>

On Wed, May 9, 2018 at 4:15 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> On Wed,  9 May 2018 07:21:41 -0400, Michael Chan wrote:
>> VF Queue resources are always limited and there is currently no
>> infrastructure to allow the admin. on the host to add or reduce queue
>> resources for any particular VF.  With ever increasing number of VFs
>> being supported, it is desirable to allow the admin. to configure queue
>> resources differently for the VFs.  Some VFs may require more or fewer
>> queues due to different bandwidth requirements or different number of
>> vCPUs in the VM.  This patch adds the infrastructure to do that by
>> adding IFLA_VF_QUEUES netlink attribute and a new .ndo_set_vf_queues()
>> to the net_device_ops.
>>
>> Four parameters are exposed for each VF:
>>
>> o min_tx_queues - Guaranteed or current tx queues assigned to the VF.
>
> This muxing of semantics may be a little awkward and unnecessary, would
> it make sense for struct ifla_vf_info to have a separate fields for
> current number of queues and the admin-set guaranteed min?

The loose semantics is mainly to allow some flexibility in
implementation.  Sure, we can tighten the definitions or add
additional fields.

>
> Is there a real world use case for the min value or are you trying to
> make the API feature complete?

In this proposal, these parameters are mainly viewed as the bounds for
the queues that each VF can potentially allocate.  The actual number
of queues chosen by the VF driver or modified by the VF user can be
any number within the bounds.

We currently need to have min and max parameters to support the
different modes we use to distribute the queue resources to the VFs.
In one mode, for example, resources are statically divided and each VF
has a small number of guaranteed queues (min = max).  In a different
mode, we allow more flexible resource allocation with each VF having a
small number of guaranteed queues but a higher number of
non-guaranteed queues (min < max).  Some VFs may be able to allocate
queues much higher than min when resources are still available, while
others may only be able to allocate min queues when resources are used
up.

With min and max exposed, the PF user can properly tweak the resources
for each VF described above.

>
>> o max_tx_queues - Maximum but not necessarily guaranteed tx queues
>>   available to the VF.
>>
>> o min_rx_queues - Guaranteed or current rx queues assigned to the VF.
>>
>> o max_rx_queues - Maximum but not necessarily guaranteed rx queues
>>   available to the VF.
>>
>> The "ip link set" command will subsequently be patched to support the new
>> operation to set the above parameters.
>>
>> After the admin. makes a change to the above parameters, the corresponding
>> VF will have a new range of channels to set using ethtool -L.
>>
>> Signed-off-by: Michael Chan <michael.chan@broadcom.com>
>
> In switchdev mode we can use number of queues on the representor as a
> proxy for max number of queues allowed for the ASIC port.  This works
> better when representors are muxed in the first place than when they
> have actual queues backing them.  WDYT about such scheme, Or?  A very
> pleasant side-effect is that one can configure qdiscs and get stats
> per-HW queue.

This is an interesting approach.  But it doesn't have the min and max
for each VF, and also only works in switchdev mode.

^ permalink raw reply

* [PATCH net 2/2] bonding: send learning packets for vlans on slave
From: Debabrata Banerjee @ 2018-05-09 23:32 UTC (permalink / raw)
  To: David S . Miller, netdev, Vlad Yasevich
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, dbanerje
In-Reply-To: <20180509233211.28207-1-dbanerje@akamai.com>

There was a regression at some point from the intended functionality of
commit f60c3704e87d ("bonding: Fix alb mode to only use first level
vlans.")

Given the return value vlan_get_encap_level() we need to store the nest
level of the bond device, and then compare the vlan's encap level to
this. Without this, this check always fails and learning packets are
never sent.

In addition, this same commit caused a regression in the behavior of
balance_alb, which requires learning packets be sent for all interfaces
using the slave's mac in order to load balance properly. For vlan's
that have not set a user mac, we can send after checking one bit.
Otherwise we need send the set mac, albeit defeating rx load balancing
for that vlan.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
 drivers/net/bonding/bond_alb.c  | 13 ++++++++-----
 drivers/net/bonding/bond_main.c |  2 ++
 include/net/bonding.h           |  1 +
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 3f6faa657360..5eb0df2e5464 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -943,6 +943,10 @@ static void alb_send_lp_vid(struct slave *slave, u8 mac_addr[],
 	skb->priority = TC_PRIO_CONTROL;
 	skb->dev = slave->dev;
 
+	netdev_dbg(slave->bond->dev,
+		   "Send learning packet: dev %s mac %pM vlan %d\n",
+		   slave->dev->name, mac_addr, vid);
+
 	if (vid)
 		__vlan_hwaccel_put_tag(skb, vlan_proto, vid);
 
@@ -965,14 +969,13 @@ static int alb_upper_dev_walk(struct net_device *upper, void *_data)
 	u8 *mac_addr = data->mac_addr;
 	struct bond_vlan_tag *tags;
 
-	if (is_vlan_dev(upper) && vlan_get_encap_level(upper) == 0) {
-		if (strict_match &&
-		    ether_addr_equal_64bits(mac_addr,
-					    upper->dev_addr)) {
+	if (is_vlan_dev(upper) &&
+	    bond->nest_level == vlan_get_encap_level(upper) - 1) {
+		if (upper->addr_assign_type == NET_ADDR_STOLEN) {
 			alb_send_lp_vid(slave, mac_addr,
 					vlan_dev_vlan_proto(upper),
 					vlan_dev_vlan_id(upper));
-		} else if (!strict_match) {
+		} else {
 			alb_send_lp_vid(slave, upper->dev_addr,
 					vlan_dev_vlan_proto(upper),
 					vlan_dev_vlan_id(upper));
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 718e4914e3a0..1f1e97b26f95 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1738,6 +1738,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 	if (bond_mode_uses_xmit_hash(bond))
 		bond_update_slave_arr(bond, NULL);
 
+	bond->nest_level = dev_get_nest_level(bond_dev);
+
 	netdev_info(bond_dev, "Enslaving %s as %s interface with %s link\n",
 		    slave_dev->name,
 		    bond_is_active_slave(new_slave) ? "an active" : "a backup",
diff --git a/include/net/bonding.h b/include/net/bonding.h
index f801fc940b29..b52235158836 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -198,6 +198,7 @@ struct bonding {
 	struct   slave __rcu *primary_slave;
 	struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
 	bool     force_primary;
+	u32      nest_level;
 	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
 	int     (*recv_probe)(const struct sk_buff *, struct bonding *,
 			      struct slave *);
-- 
2.17.0

^ permalink raw reply related

* Proposal
From: Zeliha Omer Faruk @ 2018-05-09 23:57 UTC (permalink / raw)




-- 
Hello

Greetings to you please i have a business proposal for you contact me
for more detailes asap thanks.

Best Regards,
Miss.Zeliha ömer faruk
Esentepe Mahallesi Büyükdere
Caddesi Kristal Kule Binasi
No:215
Sisli - Istanbul, Turkey

^ permalink raw reply

* [PATCH net 1/2] bonding: do not allow rlb updates to invalid mac
From: Debabrata Banerjee @ 2018-05-09 23:32 UTC (permalink / raw)
  To: David S . Miller, netdev, Vlad Yasevich
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, dbanerje
In-Reply-To: <20180509233211.28207-1-dbanerje@akamai.com>

Make sure multicast, broadcast, and zero mac's cannot be the output of rlb
updates, which should all be directed arps. Receive load balancing will be
collapsed if any of these happen, as the switch will broadcast.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
 drivers/net/bonding/bond_alb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 1ed9529e7bd1..3f6faa657360 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -450,7 +450,7 @@ static void rlb_update_client(struct rlb_client_info *client_info)
 {
 	int i;

-	if (!client_info->slave)
+	if (!client_info->slave || !is_valid_ether_addr(client_info->mac_dst))
 		return;

 	for (i = 0; i < RLB_ARP_BURST_SIZE; i++) {
-- 
2.17.0

^ permalink raw reply related

* [PATCH net 0/2] bonding: bug fixes and regressions
From: Debabrata Banerjee @ 2018-05-09 23:32 UTC (permalink / raw)
  To: David S . Miller, netdev, Vlad Yasevich
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, dbanerje

Fixes to bonding driver for balance-alb mode, suitable for stable.

Debabrata Banerjee (2):
  bonding: do not allow rlb updates to invalid mac
  bonding: send learning packets for vlans on slave

 drivers/net/bonding/bond_alb.c  | 15 +++++++++------
 drivers/net/bonding/bond_main.c |  2 ++
 include/net/bonding.h           |  1 +
 3 files changed, 12 insertions(+), 6 deletions(-)

-- 
2.17.0

^ permalink raw reply

* Re: linux-next: Tree for May 9 (mlx5)
From: Saeed Mahameed @ 2018-05-09 23:21 UTC (permalink / raw)
  To: sfr@canb.auug.org.au, rdunlap@infradead.org, Israel Rukshin,
	linux-next@vger.kernel.org, Max Gurtovoy
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	Leon Romanovsky, Matan Barak
In-Reply-To: <826ab15b-f2bc-0c5d-8d5e-6466badcd3e0@infradead.org>

On Wed, 2018-05-09 at 08:31 -0700, Randy Dunlap wrote:
> On 05/09/2018 04:21 AM, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Changes since 20180508:
> > 
> 
> on x86_64:
> # CONFIG_SMP is not set
> 
> In file included from
> ../drivers/net/ethernet/mellanox/mlx5/core/main.c:43:0:
> ../include/linux/mlx5/driver.h: In function
> 'mlx5_get_vector_affinity_hint':
> ../include/linux/mlx5/driver.h:1299:13: error: 'struct irq_desc' has
> no member named 'affinity_hint'
>   return desc->affinity_hint;
>              ^
> 
> 

Hi Stephen,

Israel and Max are working on a solution, will provide it ASAP.

Thanks,
Saeed.

^ permalink raw reply

* Re: [PATCH] net/mlx5: Fix mlx5_get_vector_affinity function
From: Saeed Mahameed @ 2018-05-09 23:19 UTC (permalink / raw)
  To: linux@roeck-us.net, tglx@linutronix.de
  Cc: netdev@vger.kernel.org, Max Gurtovoy, Israel Rukshin,
	linux-rdma@vger.kernel.org, Matan Barak, dledford@redhat.com,
	linux-kernel@vger.kernel.org
In-Reply-To: <20180509221906.GA7548@roeck-us.net>

On Wed, 2018-05-09 at 15:19 -0700, Guenter Roeck wrote:
> On Sun, May 06, 2018 at 09:33:26AM +0200, Thomas Gleixner wrote:
> > On Sat, 5 May 2018, Guenter Roeck wrote:
> > 
> > > On Thu, Apr 12, 2018 at 09:49:11AM +0000, Israel Rukshin wrote:
> > > > Adding the vector offset when calling to mlx5_vector2eqn() is
> > > > wrong.
> > > > This is because mlx5_vector2eqn() checks if EQ index is equal
> > > > to vector number
> > > > and the fact that the internal completion vectors that mlx5
> > > > allocates
> > > > don't get an EQ index.
> > > > 
> > > > The second problem here is that using effective_affinity_mask
> > > > gives the same
> > > > CPU for different vectors.
> > > > This leads to unmapped queues when calling it from
> > > > blk_mq_rdma_map_queues().
> > > > This doesn't happen when using affinity_hint mask.
> > > > 
> > > 
> > > Except that affinity_hint is only defined if SMP is enabled.
> > > Without:
> > > 
> > > include/linux/mlx5/driver.h: In function
> > > ‘mlx5_get_vector_affinity_hint’:
> > > include/linux/mlx5/driver.h:1299:13: error:
> > >         ‘struct irq_desc’ has no member named ‘affinity_hint’
> > > 
> > > Note that this is the only use of affinity_hint outside
> > > kernel/irq.
> > > Don't other drivers have similar problems ?
> > 
> > Aside of that.
> > 
> > > >  static inline const struct cpumask *
> > > > -mlx5_get_vector_affinity(struct mlx5_core_dev *dev, int
> > > > vector)
> > > > +mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int
> > > > vector)
> > > >  {
> > > > -	const struct cpumask *mask;
> > > >  	struct irq_desc *desc;
> > > >  	unsigned int irq;
> > > >  	int eqn;
> > > >  	int err;
> > > >  
> > > > -	err = mlx5_vector2eqn(dev, MLX5_EQ_VEC_COMP_BASE +
> > > > vector, &eqn, &irq);
> > > > +	err = mlx5_vector2eqn(dev, vector, &eqn, &irq);
> > > >  	if (err)
> > > >  		return NULL;
> > > >  
> > > >  	desc = irq_to_desc(irq);
> > > > -#ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
> > > > -	mask = irq_data_get_effective_affinity_mask(&desc-
> > > > >irq_data);
> > > > -#else
> > > > -	mask = desc->irq_common_data.affinity;
> > > > -#endif
> > > > -	return mask;
> > > > +	return desc->affinity_hint;
> > 
> > NAK.
> > 
> 
> The offending patch is upstream, breaking non-SMP test builds, and I
> have
> not seen any feedback from the submitter. Any suggestion how to
> proceed ?
> 
> Guenter
> 
> > 
> 

Hi Guenter and Thomas,

Max and Israel are handling this internally to find a solution that
provides the needed functionality for rdma and addresses your comments.

Thanks,
Saeed.

^ permalink raw reply

* Re: [PATCH net-next RFC 1/3] net: Add support to configure SR-IOV VF minimum and maximum queues.
From: Jakub Kicinski @ 2018-05-09 23:15 UTC (permalink / raw)
  To: Michael Chan, Or Gerlitz; +Cc: davem, netdev
In-Reply-To: <1525864903-32619-2-git-send-email-michael.chan@broadcom.com>

On Wed,  9 May 2018 07:21:41 -0400, Michael Chan wrote:
> VF Queue resources are always limited and there is currently no
> infrastructure to allow the admin. on the host to add or reduce queue
> resources for any particular VF.  With ever increasing number of VFs
> being supported, it is desirable to allow the admin. to configure queue
> resources differently for the VFs.  Some VFs may require more or fewer
> queues due to different bandwidth requirements or different number of
> vCPUs in the VM.  This patch adds the infrastructure to do that by
> adding IFLA_VF_QUEUES netlink attribute and a new .ndo_set_vf_queues()
> to the net_device_ops.
> 
> Four parameters are exposed for each VF:
> 
> o min_tx_queues - Guaranteed or current tx queues assigned to the VF.

This muxing of semantics may be a little awkward and unnecessary, would
it make sense for struct ifla_vf_info to have a separate fields for
current number of queues and the admin-set guaranteed min?

Is there a real world use case for the min value or are you trying to
make the API feature complete?

> o max_tx_queues - Maximum but not necessarily guaranteed tx queues
>   available to the VF.
> 
> o min_rx_queues - Guaranteed or current rx queues assigned to the VF.
> 
> o max_rx_queues - Maximum but not necessarily guaranteed rx queues
>   available to the VF.
> 
> The "ip link set" command will subsequently be patched to support the new
> operation to set the above parameters.
> 
> After the admin. makes a change to the above parameters, the corresponding
> VF will have a new range of channels to set using ethtool -L.
> 
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>

In switchdev mode we can use number of queues on the representor as a
proxy for max number of queues allowed for the ASIC port.  This works
better when representors are muxed in the first place than when they
have actual queues backing them.  WDYT about such scheme, Or?  A very
pleasant side-effect is that one can configure qdiscs and get stats
per-HW queue.

^ permalink raw reply

* Proposal
From: Zeliha Omer Faruk @ 2018-05-09 23:04 UTC (permalink / raw)




-- 
Hello

Greetings to you please i have a business proposal for you contact me
for more detailes asap thanks.

Best Regards,
Miss.Zeliha ömer faruk
Esentepe Mahallesi Büyükdere
Caddesi Kristal Kule Binasi
No:215
Sisli - Istanbul, Turkey

^ permalink raw reply

* Re: [PATCH v6 05/13] firmware_loader: enhance Kconfig documentation over FW_LOADER
From: Kees Cook @ 2018-05-09 23:01 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Greg KH, Andrew Morton, Josh Triplett, maco, Andy Gross,
	David Brown, bjorn.andersson, Tom Gundersen, wagi, Hans de Goede,
	Andres Rodriguez, Mimi Zohar, Jakub Kicinski, Shuah Khan,
	Martin Fuzzey, David Howells, pali.rohar, Takashi Iwai,
	Kalle Valo, Arend van Spriel, Rafał Miłecki,
	Nicolas Broeking
In-Reply-To: <20180509205544.GW27853@wotan.suse.de>

On Wed, May 9, 2018 at 1:55 PM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> On Tue, May 08, 2018 at 03:42:33PM -0700, Kees Cook wrote:
>> On Tue, May 8, 2018 at 11:12 AM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
>> > +         This used to be the default firmware loading facility, and udev used
>> > +         to listen for uvents to load firmware for the kernel. The firmware
>> > +         loading facility functionality in udev has been removed, as such it
>> > +         can no longer be relied upon as a fallback mechanism. Linux no longer
>> > +         relies on or uses a fallback mechanism in userspace. If you need to
>> > +         rely on one refer to the permissively licensed firmwared:
>>
>> Typo: firmware
>
> Thanks fixed all typos except this one, this one is meant to be firmwared as
> that is the name of the project, the url is below.
>
>>
>> > +
>> > +         https://github.com/teg/firmwared

Oh! Yes, hah. :) Thanks!

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: sync tools bpf.h uapi header
From: Daniel Borkmann @ 2018-05-09 22:59 UTC (permalink / raw)
  To: Prashant Bhole, Alexei Starovoitov; +Cc: David S . Miller, netdev
In-Reply-To: <20180509020459.6564-1-bhole_prashant_q7@lab.ntt.co.jp>

On 05/09/2018 04:04 AM, Prashant Bhole wrote:
> sync the header from include/uapi/linux/bpf.h which was updated to add
> fib lookup helper function. This fixes selftests/bpf build failure
> 
> Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
> ---
>  tools/include/uapi/linux/bpf.h | 84 +++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 83 insertions(+), 1 deletion(-)

This doesn't apply cleanly to bpf-next, please respin.

^ permalink raw reply

* Re: [PATCH v4 0/2] selftests/bpf
From: Daniel Borkmann @ 2018-05-09 22:56 UTC (permalink / raw)
  To: Sirio Balmelli; +Cc: netdev
In-Reply-To: <20180508133544.6kwfglryatxkzatp@vm4>

On 05/08/2018 03:35 PM, Sirio Balmelli wrote:
> Review of v3 patch much appreciated.
> 
> Respun the series to omit the Makefile include, will work on
> a separate patch for that;
> replied to the v3 thread with queries specific to the include issue.
> 
> best,
> 
> Sirio
> 
> Sirio Balmelli (2):
>   selftests/bpf: add architecture-agnostic headers
>   selftests/bpf: ignore build products
> 
>  tools/bpf/bpftool/.gitignore           |  3 +++
>  tools/include/uapi/asm/bitsperlong.h   | 18 ++++++++++++++++++
>  tools/include/uapi/asm/errno.h         | 17 +++++++++++++++++
>  tools/testing/selftests/bpf/.gitignore |  1 +
>  4 files changed, 39 insertions(+)
>  create mode 100644 tools/bpf/bpftool/.gitignore
>  create mode 100644 tools/include/uapi/asm/bitsperlong.h
>  create mode 100644 tools/include/uapi/asm/errno.h

Applied to bpf-next, thanks Sirio!

^ permalink raw reply

* Re: [PATCH bpf-next v5 3/4] bpf: selftest additions for SOCKHASH
From: Daniel Borkmann @ 2018-05-09 22:53 UTC (permalink / raw)
  To: John Fastabend, borkmann, ast; +Cc: netdev
In-Reply-To: <1525562710-11603-4-git-send-email-john.fastabend@gmail.com>

Hi John,

On 05/06/2018 01:25 AM, John Fastabend wrote:
> This runs existing SOCKMAP tests with SOCKHASH map type. To do this
> we push programs into include file and build two BPF programs. One
> for SOCKHASH and one for SOCKMAP.
> 
> We then run the entire test suite with each type.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Acked-by: David S. Miller <davem@davemloft.net>
[...]
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 9d76218..28316f1 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -33,7 +33,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
>  	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
>  	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
>  	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
> -	test_get_stack_rawtp.o
> +	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o
>  
>  # Order correspond to 'make run_tests' order
>  TEST_PROGS := test_kmod.sh \
> diff --git a/tools/testing/selftests/bpf/test_sockhash_kern.c b/tools/testing/selftests/bpf/test_sockhash_kern.c
> new file mode 100644
> index 0000000..3bf4ad4
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_sockhash_kern.c
> @@ -0,0 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
> +#define TEST_MAP_TYPE BPF_MAP_TYPE_SOCKHASH
> +#include "./test_sockmap_kern.h"
> diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
> index 29c022d..df7afc7 100644
> --- a/tools/testing/selftests/bpf/test_sockmap.c
> +++ b/tools/testing/selftests/bpf/test_sockmap.c
> @@ -47,7 +47,8 @@
>  #define S1_PORT 10000
>  #define S2_PORT 10001
>  
> -#define BPF_FILENAME "test_sockmap_kern.o"
> +#define BPF_SOCKMAP_FILENAME "test_sockmap_kern.o"
> +#define BPF_SOCKHASH_FILENAME "test_sockmap_kern.o"

Is this testing the right thing? Shouldn't above BPF_SOCKHASH_FILENAME say "test_sockhash_kern.o"
in order to select the correct BPF prog for hashmap? Seems here we're testing sock/arraymap twice.

>  #define CG_PATH "/sockmap"
>  
[...]
> +static int test_suite(void)
> +{
> +	int err;
> +
> +	err = __test_suite(BPF_SOCKMAP_FILENAME);
> +	if (err)
> +		goto out;
> +	err = __test_suite(BPF_SOCKHASH_FILENAME);
> +out:
> +	return err;
> +}
> +
Thanks,
Daniel

^ permalink raw reply

* Re: libbpf backward compatibility (was: [PATCH bpf-next v5 09/10] bpf: btf: Add BTF support to libbpf)
From: Jakub Kicinski @ 2018-05-09 22:20 UTC (permalink / raw)
  To: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
	Andrey Ignatov
  Cc: netdev, kernel-team, David Beckett
In-Reply-To: <20180509151712.7d6826f7@cakuba.netronome.com>

On Wed, 9 May 2018 15:17:12 -0700, Jakub Kicinski wrote:
> On Wed, 18 Apr 2018 15:56:05 -0700, Martin KaFai Lau wrote:
> > diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> > index 39f6a0d64a3b..01bda076310f 100644
> > --- a/tools/lib/bpf/bpf.h
> > +++ b/tools/lib/bpf/bpf.h
> > @@ -26,6 +26,20 @@
> >  #include <linux/bpf.h>
> >  #include <stddef.h>
> >  
> > +struct bpf_create_map_attr {
> > +	const char *name;
> > +	enum bpf_map_type map_type;
> > +	__u32 map_flags;
> > +	__u32 key_size;
> > +	__u32 value_size;
> > +	__u32 max_entries;
> > +	__u32 numa_node;
> > +	__u32 btf_fd;
> > +	__u32 btf_key_id;
> > +	__u32 btf_value_id;
> > +};
> > +
> > +int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr);
> >  int bpf_create_map_node(enum bpf_map_type map_type, const char *name,
> >  			int key_size, int value_size, int max_entries,
> >  			__u32 map_flags, int node);
> > @@ -87,4 +101,6 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len);
> >  int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
> >  		   __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt);
> >  int bpf_raw_tracepoint_open(const char *name, int prog_fd);
> > +int bpf_load_btf(void *btf, __u32 btf_size, char *log_buf, __u32 log_buf_size,
> > +		 bool do_log);
> >  #endif  
> 
> Does libbpf have to provide backward compatibility?  Are the function
> prototypes not supposed to change?
> 
> Recently a number of *_xattr functions were added, these are nice for
> limiting the number of parameters passed to functions, but they don't
> buy us anything in terms of extensibility unless we can grow the
> structures.  Should structure size be passed in as one of parameters?

Or is the backward compatibility not at the binary level so we can grow
the structures at will?

^ permalink raw reply

* Re: [PATCH] net/mlx5: Fix mlx5_get_vector_affinity function
From: Guenter Roeck @ 2018-05-09 22:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Israel Rukshin, Max Gurtovoy, Matan Barak, Doug Ledford,
	linux-rdma, linux-kernel, netdev
In-Reply-To: <alpine.DEB.2.21.1805060929540.1685@nanos.tec.linutronix.de>

On Sun, May 06, 2018 at 09:33:26AM +0200, Thomas Gleixner wrote:
> On Sat, 5 May 2018, Guenter Roeck wrote:
> 
> > On Thu, Apr 12, 2018 at 09:49:11AM +0000, Israel Rukshin wrote:
> > > Adding the vector offset when calling to mlx5_vector2eqn() is wrong.
> > > This is because mlx5_vector2eqn() checks if EQ index is equal to vector number
> > > and the fact that the internal completion vectors that mlx5 allocates
> > > don't get an EQ index.
> > > 
> > > The second problem here is that using effective_affinity_mask gives the same
> > > CPU for different vectors.
> > > This leads to unmapped queues when calling it from blk_mq_rdma_map_queues().
> > > This doesn't happen when using affinity_hint mask.
> > > 
> > Except that affinity_hint is only defined if SMP is enabled. Without:
> > 
> > include/linux/mlx5/driver.h: In function ‘mlx5_get_vector_affinity_hint’:
> > include/linux/mlx5/driver.h:1299:13: error:
> >         ‘struct irq_desc’ has no member named ‘affinity_hint’
> > 
> > Note that this is the only use of affinity_hint outside kernel/irq.
> > Don't other drivers have similar problems ?
> 
> Aside of that.
> 
> > >  static inline const struct cpumask *
> > > -mlx5_get_vector_affinity(struct mlx5_core_dev *dev, int vector)
> > > +mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector)
> > >  {
> > > -	const struct cpumask *mask;
> > >  	struct irq_desc *desc;
> > >  	unsigned int irq;
> > >  	int eqn;
> > >  	int err;
> > >  
> > > -	err = mlx5_vector2eqn(dev, MLX5_EQ_VEC_COMP_BASE + vector, &eqn, &irq);
> > > +	err = mlx5_vector2eqn(dev, vector, &eqn, &irq);
> > >  	if (err)
> > >  		return NULL;
> > >  
> > >  	desc = irq_to_desc(irq);
> > > -#ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
> > > -	mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
> > > -#else
> > > -	mask = desc->irq_common_data.affinity;
> > > -#endif
> > > -	return mask;
> > > +	return desc->affinity_hint;
> 
> NAK.
> 

The offending patch is upstream, breaking non-SMP test builds, and I have
not seen any feedback from the submitter. Any suggestion how to proceed ?

Guenter

> Nothing in regular device drivers is supposed to ever fiddle with struct
> irq_desc. The existing code is already a violation of that rule and needs
> to be fixed, but not in that way.
> 
> The logic here is completely screwed. affinity_hint is set by the driver,
> so the driver already knows what it is. If the driver does not set it, then
> the thing is NULL.
> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply

* libbpf backward compatibility (was: [PATCH bpf-next v5 09/10] bpf: btf: Add BTF support to libbpf)
From: Jakub Kicinski @ 2018-05-09 22:17 UTC (permalink / raw)
  To: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
	Andrey Ignatov
  Cc: netdev, kernel-team, David Beckett
In-Reply-To: <20180418225606.2771620-10-kafai@fb.com>

On Wed, 18 Apr 2018 15:56:05 -0700, Martin KaFai Lau wrote:
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 39f6a0d64a3b..01bda076310f 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -26,6 +26,20 @@
>  #include <linux/bpf.h>
>  #include <stddef.h>
>  
> +struct bpf_create_map_attr {
> +	const char *name;
> +	enum bpf_map_type map_type;
> +	__u32 map_flags;
> +	__u32 key_size;
> +	__u32 value_size;
> +	__u32 max_entries;
> +	__u32 numa_node;
> +	__u32 btf_fd;
> +	__u32 btf_key_id;
> +	__u32 btf_value_id;
> +};
> +
> +int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr);
>  int bpf_create_map_node(enum bpf_map_type map_type, const char *name,
>  			int key_size, int value_size, int max_entries,
>  			__u32 map_flags, int node);
> @@ -87,4 +101,6 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len);
>  int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
>  		   __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt);
>  int bpf_raw_tracepoint_open(const char *name, int prog_fd);
> +int bpf_load_btf(void *btf, __u32 btf_size, char *log_buf, __u32 log_buf_size,
> +		 bool do_log);
>  #endif

Does libbpf have to provide backward compatibility?  Are the function
prototypes not supposed to change?

Recently a number of *_xattr functions were added, these are nice for
limiting the number of parameters passed to functions, but they don't
buy us anything in terms of extensibility unless we can grow the
structures.  Should structure size be passed in as one of parameters?

^ permalink raw reply

* Re: [PATCH] selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()
From: Paul Moore @ 2018-05-09 22:02 UTC (permalink / raw)
  To: Stephen Smalley, Alexey Kodanev, Richard Haines
  Cc: selinux, Eric Paris, linux-security-module, netdev
In-Reply-To: <CAHC9VhSg9UAmXaK0xsy_h9HfzM4uUBqCNKN5qtf3hHLSi=ZENw@mail.gmail.com>

On Wed, May 9, 2018 at 11:34 AM, Paul Moore <paul@paul-moore.com> wrote:
> On Wed, May 9, 2018 at 11:11 AM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
>> On 05/09/2018 11:01 AM, Paul Moore wrote:
>>> On Wed, May 9, 2018 at 8:37 AM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
>>>> On 05/08/2018 08:25 PM, Paul Moore wrote:
>>>>> On Tue, May 8, 2018 at 2:40 PM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
>>>>>> On 05/08/2018 01:05 PM, Paul Moore wrote:
>>>>>>> On Tue, May 8, 2018 at 10:05 AM, Alexey Kodanev
>>>>>>> <alexey.kodanev@oracle.com> wrote:
>>>>>>>> Commit d452930fd3b9 ("selinux: Add SCTP support") breaks compatibility
>>>>>>>> with the old programs that can pass sockaddr_in with AF_UNSPEC and
>>>>>>>> INADDR_ANY to bind(). As a result, bind() returns EAFNOSUPPORT error.
>>>>>>>> It was found with LTP/asapi_01 test.
>>>>>>>>
>>>>>>>> Similar to commit 29c486df6a20 ("net: ipv4: relax AF_INET check in
>>>>>>>> bind()"), which relaxed AF_INET check for compatibility, add AF_UNSPEC
>>>>>>>> case to AF_INET and make sure that the address is INADDR_ANY.
>>>>>>>>
>>>>>>>> Also, in the end of selinux_socket_bind(), instead of adding AF_UNSPEC
>>>>>>>> to 'address->sa_family == AF_INET', verify AF_INET6 first.
>>>>>>>>
>>>>>>>> Fixes: d452930fd3b9 ("selinux: Add SCTP support")
>>>>>>>> Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
>>>>>>>> ---
>>>>>>>>  security/selinux/hooks.c | 12 +++++++++---
>>>>>>>>  1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> Thanks for finding and reporting this regression.
>>>>>>>
>>>>>>> I think I would prefer to avoid having to duplicate the
>>>>>>> AF_UNSPEC/INADDR_ANY checking logic in the SELinux hook, even though
>>>>>>> it is a small bit of code and unlikely to change.  I'm wondering if it
>>>>>>> would be better to check both the socket and sockaddr address family
>>>>>>> in the main if conditional inside selinux_socket_bind(), what do you
>>>>>>> think?  Another option would be to go back to just checking the
>>>>>>> soackaddr address family; we moved away from that for a reason which
>>>>>>> escapes at the moment (code cleanliness?), but perhaps that was a
>>>>>>> mistake.
>>>>>>
>>>>>> We've always used the sk->sk_family there; it was only the recent code from Richard that started
>>>>>> using the socket address family.
>>>>>
>>>>> Yes I know, I thought I was the one that suggested it at some point
>>>>> (I'll take the blame) ... although now that I'm looking at the git
>>>>> log, maybe I'm confusing it with something else.
>>>>>
>>>>>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>>>>>>> index 4cafe6a19167..a3789b167667 100644
>>>>>>> --- a/security/selinux/hooks.c
>>>>>>> +++ b/security/selinux/hooks.c
>>>>>>> @@ -4577,6 +4577,7 @@ static int selinux_socket_bind(struct socket *sock, struc>
>>>>>>> {
>>>>>>>        struct sock *sk = sock->sk;
>>>>>>>        u16 family;
>>>>>>> +       u16 family_sa;
>>>>>>>        int err;
>>>>>>>
>>>>>>>        err = sock_has_perm(sk, SOCKET__BIND);
>>>>>>> @@ -4585,7 +4586,9 @@ static int selinux_socket_bind(struct socket *sock, struc>
>>>>>>>
>>>>>>>        /* If PF_INET or PF_INET6, check name_bind permission for the port. */
>>>>>>>        family = sk->sk_family;
>>>>>>> -       if (family == PF_INET || family == PF_INET6) {
>>>>>>> +       family_sa = address->sa_family;
>>>>>>> +       if ((family == PF_INET || family == PF_INET6) &&
>>>>>>> +           (family_sa == PF_INET || family_sa == PF_INET6)) {
>>>>>>
>>>>>> Wouldn't this allow bypassing the name_bind permission check by passing in AF_UNSPEC?
>>>>>
>>>>> I believe these name_bind permission checkis skipped for AF_UNSPEC
>>>>> already, isn't it?  The only way the name_bind check would be
>>>>> triggered is if the source port, snum, was non-zero and I didn't think
>>>>> that was really legal for AF_UNSPEC/INADDR_ANY, is it?
>>>>
>>>> 1) What in inet_bind() prevents that from occurring?
>>>> 2) Regardless, what about the node_bind check?
>>>
>>> Fair enough.  As mentioned above, perhaps the right fix is to move the
>>> address family checking back to how it was pre-SCTP.
>>
>> It isn't clear to me how to do that without breaking SCTP multiple address binding, which is why
>> Richard had to switch to checking address->sa_family instead of just using the sk->sk_family.
>> Alexey's original fix might be the simplest solution.
>
> I'm going to have to apologize, I'm traveling at the moment and more
> distracted than usual as a result.  Let me take a closer look later
> today.  It may be that Alexey's original fix the only practical
> solution, but I really would like to avoid having to duplicate checks
> like that in the SELinux code.

I just had a better look at this and I believe that Alexey and Stephen
are right: this is the best option.  My apologies for the noise
earlier.  However, while looking at the code I think there are some
additional necessary changes:

* In the case of an SCTP socket, we should return -EINVAL, just as we
do with other address families.
* While not strictly related to AF_UNSPEC, we really should be passing
the address family of the sockaddr, and not the socket, to functions
that need to interpret the bind address/port.

I'm waiting for my kernel to compile so I haven't given this any
sanity testing, but the patch below is what I think we need ...

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 4cafe6a19167..5f30045b2053 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4576,6 +4576,7 @@ static int selinux_socket_post_create(struct socket *sock,
int family,
static int selinux_socket_bind(struct socket *sock, struct sockaddr *address, i
nt addrlen)
{
       struct sock *sk = sock->sk;
+       struct sk_security_struct *sksec = sk->sk_security;
       u16 family;
       int err;

@@ -4587,13 +4588,13 @@ static int selinux_socket_bind(struct socket *sock, stru
ct sockaddr *address, in
       family = sk->sk_family;
       if (family == PF_INET || family == PF_INET6) {
               char *addrp;
-               struct sk_security_struct *sksec = sk->sk_security;
               struct common_audit_data ad;
               struct lsm_network_audit net = {0,};
               struct sockaddr_in *addr4 = NULL;
               struct sockaddr_in6 *addr6 = NULL;
               unsigned short snum;
               u32 sid, node_perm;
+               u16 family_sa = address->sa_family;

               /*
                * sctp_bindx(3) calls via selinux_sctp_bind_connect()
@@ -4601,11 +4602,19 @@ static int selinux_socket_bind(struct socket *sock, stru
ct sockaddr *address, in
                * need to check address->sa_family as it is possible to have
                * sk->sk_family = PF_INET6 with addr->sa_family = AF_INET.
                */
-               switch (address->sa_family) {
+               switch (family_sa) {
+               case AF_UNSPEC:
               case AF_INET:
                       if (addrlen < sizeof(struct sockaddr_in))
                               return -EINVAL;
                       addr4 = (struct sockaddr_in *)address;
+                       if (family_sa == AF_UNSPEC) {
+                               /* see "__inet_bind()", we only want to allow
+                                * AF_UNSPEC if the address is INADDR_ANY */
+                               if (addr4->sin_addr.s_addr != htonl(INADDR_ANY))
+                                       goto err_af;
+                               family_sa = AF_INET;
+                       }
                       snum = ntohs(addr4->sin_port);
                       addrp = (char *)&addr4->sin_addr.s_addr;
                       break;
@@ -4617,15 +4626,14 @@ static int selinux_socket_bind(struct socket *sock, stru
ct sockaddr *address, in
                       addrp = (char *)&addr6->sin6_addr.s6_addr;
                       break;
               default:
-                       /* Note that SCTP services expect -EINVAL, whereas
-                        * others expect -EAFNOSUPPORT.
-                        */
-                       if (sksec->sclass == SECCLASS_SCTP_SOCKET)
-                               return -EINVAL;
-                       else
-                               return -EAFNOSUPPORT;
+                       goto err_af;
               }

+               ad.type = LSM_AUDIT_DATA_NET;
+               ad.u.net = &net;
+               ad.u.net->sport = htons(snum);
+               ad.u.net->family = family_sa;
+
               if (snum) {
                       int low, high;

@@ -4637,10 +4645,6 @@ static int selinux_socket_bind(struct socket *sock, struc
t sockaddr *address, in
                                                     snum, &sid);
                               if (err)
                                       goto out;
-                               ad.type = LSM_AUDIT_DATA_NET;
-                               ad.u.net = &net;
-                               ad.u.net->sport = htons(snum);
-                               ad.u.net->family = family;
                               err = avc_has_perm(&selinux_state,
                                                  sksec->sid, sid,
                                                  sksec->sclass,
@@ -4672,16 +4676,11 @@ static int selinux_socket_bind(struct socket *sock, stru
ct sockaddr *address, in
                       break;
               }

-               err = sel_netnode_sid(addrp, family, &sid);
+               err = sel_netnode_sid(addrp, family_sa, &sid);
               if (err)
                       goto out;

-               ad.type = LSM_AUDIT_DATA_NET;
-               ad.u.net = &net;
-               ad.u.net->sport = htons(snum);
-               ad.u.net->family = family;
-
-               if (address->sa_family == AF_INET)
+               if (family_sa == AF_INET)
                       ad.u.net->v4info.saddr = addr4->sin_addr.s_addr;
               else
                       ad.u.net->v6info.saddr = addr6->sin6_addr;
@@ -4694,6 +4693,12 @@ static int selinux_socket_bind(struct socket *sock, struc
t sockaddr *address, in
       }
out:
       return err;
+err_af:
+       /* Note that SCTP services expect -EINVAL, others -EAFNOSUPPORT. */
+       if (sksec->sclass == SECCLASS_SCTP_SOCKET)
+               return -EINVAL;
+       else
+               return -EAFNOSUPPORT;
}

/* This supports connect(2) and SCTP connect services such as sctp_connectx(3)

-- 
paul moore
www.paul-moore.com

^ permalink raw reply related

* Re: [PATCH net] tc-testing: fix tdc tests for 'bpf' action
From: Lucas Bates @ 2018-05-09 21:51 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Roman Mashak, David Miller, Linux Kernel Network Developers
In-Reply-To: <54d69bac92e9c0a216997261ef7b1e0eb0dd28c9.1525884149.git.dcaratti@redhat.com>

On Wed, May 9, 2018 at 12:45 PM, Davide Caratti <dcaratti@redhat.com> wrote:
> - correct a typo in the value of 'matchPattern' of test 282d, potentially
>  causing false negative
> - allow errors when 'teardown' executes '$TC action flush action bpf' in
>  test 282d, to fix false positive when it is run with act_bpf unloaded
> - correct the value of 'matchPattern' in test e939, causing false positive
>  in case the BPF JIT is enabled
>
> Fixes: 440ea4ae1828 ("tc-testing: add selftests for 'bpf' action")
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>
> ---
>  tools/testing/selftests/tc-testing/tc-tests/actions/bpf.json | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/bpf.json b/tools/testing/selftests/tc-testing/tc-tests/actions/bpf.json
> index 5b012f4981d4..6f289a49e5ec 100644
> --- a/tools/testing/selftests/tc-testing/tc-tests/actions/bpf.json
> +++ b/tools/testing/selftests/tc-testing/tc-tests/actions/bpf.json
> @@ -66,7 +66,7 @@
>          "cmdUnderTest": "$TC action add action bpf object-file _b.o index 667",
>          "expExitCode": "0",
>          "verifyCmd": "$TC action get action bpf index 667",
> -        "matchPattern": "action order [0-9]*: bpf _b.o:\\[action\\] id [0-9]* tag 3b185187f1855c4c default-action pipe.*index 667 ref",
> +        "matchPattern": "action order [0-9]*: bpf _b.o:\\[action\\] id [0-9]* tag 3b185187f1855c4c( jited)? default-action pipe.*index 667 ref",
>          "matchCount": "1",
>          "teardown": [
>              "$TC action flush action bpf",
> @@ -92,10 +92,15 @@
>          "cmdUnderTest": "$TC action add action bpf object-file _c.o index 667",
>          "expExitCode": "255",
>          "verifyCmd": "$TC action get action bpf index 667",
> -        "matchPattern": "action order [0-9]*: bpf _b.o:\\[action\\] id [0-9].*index 667 ref",
> +        "matchPattern": "action order [0-9]*: bpf _c.o:\\[action\\] id [0-9].*index 667 ref",
>          "matchCount": "0",
>          "teardown": [
> -            "$TC action flush action bpf",
> +            [
> +                "$TC action flush action bpf",
> +                0,
> +                1,
> +                255
> +            ],
>              "rm -f _c.o"
>          ]
>      },
> --
> 2.14.3
>

Acked-by: Lucas Bates <lucasb@mojatatu.com>

^ permalink raw reply

* Re: [bpf-next v2 8/9] bpf: Provide helper to do forwarding lookups in kernel FIB table
From: Daniel Borkmann @ 2018-05-09 21:49 UTC (permalink / raw)
  To: David Ahern, netdev, borkmann, ast
  Cc: davem, shm, roopa, brouer, toke, john.fastabend
In-Reply-To: <d6d01920-efd0-7e8b-f9ba-5e3b642160d3@gmail.com>

On 05/09/2018 11:39 PM, David Ahern wrote:
> On 5/9/18 3:29 PM, David Ahern wrote:
>> On 5/9/18 2:44 PM, Daniel Borkmann wrote:
>>> Generally, no objection. However, could we get rid of the two extra includes altogether
>>> to avoid running into any such dependency issue? Right now the only includes we have in
>>> the bpf uapi header is linux/types.h and linux/bpf_common.h (latter has no extra deps
>>> by itself). Both the ETH_ALEN and struct in6_addr are in uapi and therefore never allowed
>>> to change so we can e.g. avoid to use ETH_ALEN and just have the value instead. In the
>>> other places of the header we use __u32 remote_ipv6[4], __u32 src_ip6[4] etc to denote
>>> a v6 address, we could do the same here and should be all good then.
>>
>> I was able to drop the include of linux/in6.h and still use in6_addr. I
>> would prefer to keep in6_addr since it works and avoid the need to add
>> typecasts.
> 
> Never mind; that was working because if_ether.h was pulling in skbuff.h
> which included in6.h.
>
>> As for ETH_ALEN, I could redefine it but it just kicks the can down the
>> road. If if_ether.h is included after bpf.h, it will cause redefinition
>> warnings.
> 
> I guess I will continue the open coded magic numbers for mac and ipv6
> addresses.

Agree, it will avoid breakage. We cannot assume that every BPF prog out there has
one specific ordering of if_ether.h and bpf.h includes. Open coding the numbers
seems best here.

^ permalink raw reply

* Re: [bpf-next v2 8/9] bpf: Provide helper to do forwarding lookups in kernel FIB table
From: Alexei Starovoitov @ 2018-05-09 21:49 UTC (permalink / raw)
  To: David Ahern
  Cc: Daniel Borkmann, netdev, borkmann, ast, davem, shm, roopa, brouer,
	toke, john.fastabend
In-Reply-To: <d6d01920-efd0-7e8b-f9ba-5e3b642160d3@gmail.com>

On Wed, May 09, 2018 at 03:39:52PM -0600, David Ahern wrote:
> On 5/9/18 3:29 PM, David Ahern wrote:
> > On 5/9/18 2:44 PM, Daniel Borkmann wrote:
> >> Generally, no objection. However, could we get rid of the two extra includes altogether
> >> to avoid running into any such dependency issue? Right now the only includes we have in
> >> the bpf uapi header is linux/types.h and linux/bpf_common.h (latter has no extra deps
> >> by itself). Both the ETH_ALEN and struct in6_addr are in uapi and therefore never allowed
> >> to change so we can e.g. avoid to use ETH_ALEN and just have the value instead. In the
> >> other places of the header we use __u32 remote_ipv6[4], __u32 src_ip6[4] etc to denote
> >> a v6 address, we could do the same here and should be all good then.
> > 
> > I was able to drop the include of linux/in6.h and still use in6_addr. I
> > would prefer to keep in6_addr since it works and avoid the need to add
> > typecasts.
> 
> Never mind; that was working because if_ether.h was pulling in skbuff.h
> which included in6.h.
> > 
> > As for ETH_ALEN, I could redefine it but it just kicks the can down the
> > road. If if_ether.h is included after bpf.h, it will cause redefinition
> > warnings.
> > 
> 
> I guess I will continue the open coded magic numbers for mac and ipv6
> addresses.

That's the only way.
Adding
+#include <linux/if_ether.h>
+#include <linux/in6.h>
to uapi/bpf.h is no-go. It will cause all sorts of breakage
not only to kernel build as we realized, but to various user space apps too.
Please use be32 ipv6[4] and hard coded mac instead.

^ permalink raw reply

* Re: [bpf-next v2 8/9] bpf: Provide helper to do forwarding lookups in kernel FIB table
From: David Ahern @ 2018-05-09 21:39 UTC (permalink / raw)
  To: Daniel Borkmann, netdev, borkmann, ast
  Cc: davem, shm, roopa, brouer, toke, john.fastabend
In-Reply-To: <b2ebbdde-3198-9c58-b4cb-e44602568134@gmail.com>

On 5/9/18 3:29 PM, David Ahern wrote:
> On 5/9/18 2:44 PM, Daniel Borkmann wrote:
>> Generally, no objection. However, could we get rid of the two extra includes altogether
>> to avoid running into any such dependency issue? Right now the only includes we have in
>> the bpf uapi header is linux/types.h and linux/bpf_common.h (latter has no extra deps
>> by itself). Both the ETH_ALEN and struct in6_addr are in uapi and therefore never allowed
>> to change so we can e.g. avoid to use ETH_ALEN and just have the value instead. In the
>> other places of the header we use __u32 remote_ipv6[4], __u32 src_ip6[4] etc to denote
>> a v6 address, we could do the same here and should be all good then.
> 
> I was able to drop the include of linux/in6.h and still use in6_addr. I
> would prefer to keep in6_addr since it works and avoid the need to add
> typecasts.

Never mind; that was working because if_ether.h was pulling in skbuff.h
which included in6.h.
> 
> As for ETH_ALEN, I could redefine it but it just kicks the can down the
> road. If if_ether.h is included after bpf.h, it will cause redefinition
> warnings.
> 

I guess I will continue the open coded magic numbers for mac and ipv6
addresses.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox