Netdev List
 help / color / mirror / Atom feed
* [net-next  9/9] igb: Add ethtool support to configure number of channels
From: Jeff Kirsher @ 2013-10-01 11:33 UTC (permalink / raw)
  To: davem; +Cc: Laura Mihaela Vasilescu, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1380627236-3190-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Laura Mihaela Vasilescu <laura.vasilescu@rosedu.org>

This patch adds the ethtool callbacks necessary to configure the
number of RSS queues.

The maximum number of queues is in accordance with the datasheets.

Signed-off-by: Laura Mihaela Vasilescu <laura.vasilescu@rosedu.org>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/igb.h         |  1 +
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 84 ++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/igb/igb_main.c    | 22 ++++++++
 3 files changed, 107 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index cdaa2bc..5e9ed89 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -487,6 +487,7 @@ int igb_up(struct igb_adapter *);
 void igb_down(struct igb_adapter *);
 void igb_reinit_locked(struct igb_adapter *);
 void igb_reset(struct igb_adapter *);
+int igb_reinit_queues(struct igb_adapter *);
 void igb_write_rss_indir_tbl(struct igb_adapter *);
 int igb_set_spd_dplx(struct igb_adapter *, u32, u8);
 int igb_setup_tx_resources(struct igb_ring *);
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index e4c77c0..c8f65e5 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2874,6 +2874,88 @@ static int igb_set_rxfh_indir(struct net_device *netdev, const u32 *indir)
 	return 0;
 }
 
+static unsigned int igb_max_channels(struct igb_adapter *adapter)
+{
+	struct e1000_hw *hw = &adapter->hw;
+	unsigned int max_combined = 0;
+
+	switch (hw->mac.type) {
+	case e1000_i211:
+		max_combined = IGB_MAX_RX_QUEUES_I211;
+		break;
+	case e1000_82575:
+	case e1000_i210:
+		max_combined = IGB_MAX_RX_QUEUES_82575;
+		break;
+	case e1000_i350:
+		if (!!adapter->vfs_allocated_count) {
+			max_combined = 1;
+			break;
+		}
+		/* fall through */
+	case e1000_82576:
+		if (!!adapter->vfs_allocated_count) {
+			max_combined = 2;
+			break;
+		}
+		/* fall through */
+	case e1000_82580:
+	case e1000_i354:
+	default:
+		max_combined = IGB_MAX_RX_QUEUES;
+		break;
+	}
+
+	return max_combined;
+}
+
+static void igb_get_channels(struct net_device *netdev,
+			     struct ethtool_channels *ch)
+{
+	struct igb_adapter *adapter = netdev_priv(netdev);
+
+	/* Report maximum channels */
+	ch->max_combined = igb_max_channels(adapter);
+
+	/* Report info for other vector */
+	if (adapter->msix_entries) {
+		ch->max_other = NON_Q_VECTORS;
+		ch->other_count = NON_Q_VECTORS;
+	}
+
+	ch->combined_count = adapter->rss_queues;
+}
+
+static int igb_set_channels(struct net_device *netdev,
+			    struct ethtool_channels *ch)
+{
+	struct igb_adapter *adapter = netdev_priv(netdev);
+	unsigned int count = ch->combined_count;
+
+	/* Verify they are not requesting separate vectors */
+	if (!count || ch->rx_count || ch->tx_count)
+		return -EINVAL;
+
+	/* Verify other_count is valid and has not been changed */
+	if (ch->other_count != NON_Q_VECTORS)
+		return -EINVAL;
+
+	/* Verify the number of channels doesn't exceed hw limits */
+	if (count > igb_max_channels(adapter))
+		return -EINVAL;
+
+	if (count != adapter->rss_queues) {
+		adapter->rss_queues = count;
+
+		/* Hardware has to reinitialize queues and interrupts to
+		 * match the new configuration.
+		 */
+		return igb_reinit_queues(adapter);
+	}
+
+	return 0;
+}
+
 static const struct ethtool_ops igb_ethtool_ops = {
 	.get_settings		= igb_get_settings,
 	.set_settings		= igb_set_settings,
@@ -2910,6 +2992,8 @@ static const struct ethtool_ops igb_ethtool_ops = {
 	.get_rxfh_indir_size	= igb_get_rxfh_indir_size,
 	.get_rxfh_indir		= igb_get_rxfh_indir,
 	.set_rxfh_indir		= igb_set_rxfh_indir,
+	.get_channels		= igb_get_channels,
+	.set_channels		= igb_set_channels,
 	.begin			= igb_ethtool_begin,
 	.complete		= igb_ethtool_complete,
 };
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 8cf44f2..a56266e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -7838,4 +7838,26 @@ s32 igb_write_i2c_byte(struct e1000_hw *hw, u8 byte_offset,
 		return E1000_SUCCESS;
 
 }
+
+int igb_reinit_queues(struct igb_adapter *adapter)
+{
+	struct net_device *netdev = adapter->netdev;
+	struct pci_dev *pdev = adapter->pdev;
+	int err = 0;
+
+	if (netif_running(netdev))
+		igb_close(netdev);
+
+	igb_clear_interrupt_scheme(adapter);
+
+	if (igb_init_interrupt_scheme(adapter, true)) {
+		dev_err(&pdev->dev, "Unable to allocate memory for queues\n");
+		return -ENOMEM;
+	}
+
+	if (netif_running(netdev))
+		err = igb_open(netdev);
+
+	return err;
+}
 /* igb_main.c */
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net 0/4] bridge: Fix problems around the PVID
From: Toshiaki Makita @ 2013-10-01 11:56 UTC (permalink / raw)
  To: vyasevic
  Cc: Toshiaki Makita, David Miller, netdev, Fernando Luis Vazquez Cao,
	Patrick McHardy
In-Reply-To: <5249A03E.4080208@redhat.com>

On Mon, 2013-09-30 at 12:01 -0400, Vlad Yasevich wrote:
> On 09/30/2013 07:46 AM, Toshiaki Makita wrote:
> > On Fri, 2013-09-27 at 14:10 -0400, Vlad Yasevich wrote:
> >> On 09/27/2013 01:11 PM, Toshiaki Makita wrote:
> >>> On Thu, 2013-09-26 at 10:22 -0400, Vlad Yasevich wrote:
> >>>> On 09/26/2013 06:38 AM, Toshiaki Makita wrote:
> >>>>> On Tue, 2013-09-24 at 13:55 -0400, Vlad Yasevich wrote:
> >>>>>> On 09/24/2013 01:30 PM, Toshiaki Makita wrote:
> >>>>>>> On Tue, 2013-09-24 at 09:35 -0400, Vlad Yasevich wrote:
> >>>>>>>> On 09/24/2013 07:45 AM, Toshiaki Makita wrote:
> >>>>>>>>> On Mon, 2013-09-23 at 10:41 -0400, Vlad Yasevich
> >>>>>>>>> wrote:
> >>>>>>>>>> On 09/17/2013 04:12 AM, Toshiaki Makita wrote:
> >>>>>>>>>>> On Mon, 2013-09-16 at 13:49 -0400, Vlad Yasevich
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>> On 09/13/2013 08:06 AM, Toshiaki Makita wrote:
> >>>>>>>>>>>>> On Thu, 2013-09-12 at 16:00 -0400, David
> >>>>>>>>>>>>> Miller wrote:
> >>>>>>>>>>>>>> From: Toshiaki Makita
> >>>>>>>>>>>>>> <makita.toshiaki@lab.ntt.co.jp> Date: Tue,
> >>>>>>>>>>>>>> 10 Sep 2013 19:27:54 +0900
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> There seem to be some undesirable
> >>>>>>>>>>>>>>> behaviors related with PVID. 1. It has no
> >>>>>>>>>>>>>>> effect assigning PVID to a port. PVID
> >>>>>>>>>>>>>>> cannot be applied to any frame regardless
> >>>>>>>>>>>>>>> of whether we set it or not. 2. FDB
> >>>>>>>>>>>>>>> entries learned via frames applied PVID
> >>>>>>>>>>>>>>> are registered with VID 0 rather than VID
> >>>>>>>>>>>>>>> value of PVID. 3. We can set 0 or 4095 as
> >>>>>>>>>>>>>>> a PVID that are not allowed in IEEE
> >>>>>>>>>>>>>>> 802.1Q. This leads interoperational
> >>>>>>>>>>>>>>> problems such as sending frames with VID
> >>>>>>>>>>>>>>> 4095, which is not allowed in IEEE
> >>>>>>>>>>>>>>> 802.1Q, and treating frames with VID 0 as
> >>>>>>>>>>>>>>> they belong to VLAN 0, which is expected
> >>>>>>>>>>>>>>> to be handled as they have no VID
> >>>>>>>>>>>>>>> according to IEEE 802.1Q.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Note: 2nd and 3rd problems are potential
> >>>>>>>>>>>>>>> and not exposed unless 1st problem is
> >>>>>>>>>>>>>>> fixed, because we cannot activate PVID
> >>>>>>>>>>>>>>> due to it.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please work out the issues in patch #2 with
> >>>>>>>>>>>>>> Vlad and resubmit this series.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thank you.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm hovering between whether we should fix
> >>>>>>>>>>>>> the issue by changing vlan 0 interface
> >>>>>>>>>>>>> behavior in 8021q module or enabling a bridge
> >>>>>>>>>>>>> port to sending priority-tagged frames, or
> >>>>>>>>>>>>> another better way.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If you could comment it, I'd appreciate it
> >>>>>>>>>>>>> :)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> BTW, I think what is discussed in patch #2 is
> >>>>>>>>>>>>> another problem about handling priority-tags,
> >>>>>>>>>>>>> and it exists without this patch set
> >>>>>>>>>>>>> applied. It looks like that we should prepare
> >>>>>>>>>>>>> another patch set than this to fix that
> >>>>>>>>>>>>> problem.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Should I include patches that fix the
> >>>>>>>>>>>>> priority-tags problem in this patch set and
> >>>>>>>>>>>>> resubmit them all together?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am thinking that we might need to do it in
> >>>>>>>>>>>> bridge and it looks like the simplest way to do
> >>>>>>>>>>>> it is to have default priority regeneration
> >>>>>>>>>>>> table (table 6-5 from 802.1Q doc).
> >>>>>>>>>>>>
> >>>>>>>>>>>> That way I think we would conform to the spec.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -vlad
> >>>>>>>>>>>
> >>>>>>>>>>> Unfortunately I don't think the default priority
> >>>>>>>>>>> regeneration table resolves the problem because
> >>>>>>>>>>> IEEE 802.1Q says that a VLAN-aware bridge can
> >>>>>>>>>>> transmit untagged or VLAN-tagged frames only (the
> >>>>>>>>>>> end of section 7.5 and 8.1.7).
> >>>>>>>>>>>
> >>>>>>>>>>> No mechanism to send priority-tagged frames is
> >>>>>>>>>>> found as far as I can see the standard. I think
> >>>>>>>>>>> the regenerated priority is used for outgoing
> >>>>>>>>>>> PCP field only if egress policy is not untagged
> >>>>>>>>>>> (i.e. transmitting as VLAN-tagged), and unused if
> >>>>>>>>>>> untagged (Section 6.9.2 3rd/4th Paragraph).
> >>>>>>>>>>>
> >>>>>>>>>>> If we want to transmit priority-tagged frames
> >>>>>>>>>>> from a bridge port, I think we need to implement
> >>>>>>>>>>> a new (optional) feature that is above the
> >>>>>>>>>>> standard, as I stated previously.
> >>>>>>>>>>>
> >>>>>>>>>>> How do you feel about adding a per-port policy
> >>>>>>>>>>> that enables a bridge to send priority-tagged
> >>>>>>>>>>> frames instead of untagged frames when egress
> >>>>>>>>>>> policy for the port is untagged? With this
> >>>>>>>>>>> change, we can transmit frames for a given vlan
> >>>>>>>>>>> as either all untagged, all priority-tagged or
> >>>>>>>>>>> all VLAN-tagged.
> >>>>>>>>>>
> >>>>>>>>>> That would work.  What I am thinking is that we do
> >>>>>>>>>> it by special casing the vid 0 egress policy
> >>>>>>>>>> specification.  Let it be untagged by default and
> >>>>>>>>>> if it is tagged, then we preserve the priority
> >>>>>>>>>> field and forward it on.
> >>>>>>>>>>
> >>>>>>>>>> This keeps the API stable and doesn't require
> >>>>>>>>>> user/admin from knowing exactly what happens.
> >>>>>>>>>> Default operation conforms to the spec and allows
> >>>>>>>>>> simple change to make it backward-compatible.
> >>>>>>>>>>
> >>>>>>>>>> What do you think.  I've done a simple prototype of
> >>>>>>>>>> this an it seems to work with the VMs I am testing
> >>>>>>>>>> with.
> >>>>>>>>>
> >>>>>>>>> Are you saying that - by default, set the 0th bit of
> >>>>>>>>> untagged_bitmap; and - if we unset the 0th bit and
> >>>>>>>>> set the "vid"th bit, we transmit frames classified as
> >>>>>>>>> belonging to VLAN "vid" as priority-tagged?
> >>>>>>>>>
> >>>>>>>>> If so, though it's attractive to keep current API,
> >>>>>>>>> I'm worried about if it could be a bit confusing and
> >>>>>>>>> not intuitive for kernel/iproute2 developers that VID
> >>>>>>>>> 0 has a special meaning only in the egress policy.
> >>>>>>>>> Wouldn't it be better to adding a new member to
> >>>>>>>>> struct net_port_vlans instead of using VID 0 of
> >>>>>>>>> untagged_bitmap?
> >>>>>>>>>
> >>>>>>>>> Or are you saying that we use a new flag in struct
> >>>>>>>>> net_port_vlans but use the BRIDGE_VLAN_INFO_UNTAGGED
> >>>>>>>>> bit with VID 0 in netlink to set the flag?
> >>>>>>>>>
> >>>>>>>>> Even in that case, I'm afraid that it might be
> >>>>>>>>> confusing for developers for the same reason. We are
> >>>>>>>>> going to prohibit to specify VID with 0 (and 4095) in
> >>>>>>>>> adding/deleting a FDB entry or a vlan filtering
> >>>>>>>>> entry, but it would allow us to use VID 0 only when a
> >>>>>>>>> vlan filtering entry is configured. I am thinking a
> >>>>>>>>> new nlattr is a straightforward approach to
> >>>>>>>>> configure it.
> >>>>>>>>
> >>>>>>>> By making this an explicit attribute it makes vid 0 a
> >>>>>>>> special case for any automatic tool that would
> >>>>>>>> provision such filtering.  Seeing vid 0 would mean that
> >>>>>>>> these tools would have to know that this would have to
> >>>>>>>> be translated to a different attribute instead of
> >>>>>>>> setting the policy values.
> >>>>>>>
> >>>>>>> Yes, I agree with you that we can do it by the way you
> >>>>>>> explained. What I don't understand is the advantage of
> >>>>>>> using vid 0 over another way such as adding a new
> >>>>>>> nlattr. I think we can indicate transmitting
> >>>>>>> priority-tags explicitly by such a nlattr. Using vid 0
> >>>>>>> seems to be easier to implement than a new nlattr, but,
> >>>>>>> for me, it looks less intuitive and more difficult to
> >>>>>>> maintain because we have to care about vid 0 instead of
> >>>>>>> simply ignoring it.
> >>>>>>>
> >>>>>>
> >>>>>> The point I am trying to make is that regardless of the
> >>>>>> approach someone has to know what to do when enabling
> >>>>>> priority tagged frames.  You proposal would require the
> >>>>>> administrator or config tool to have that knowledge.
> >>>>>> Example is: Admin does: bridge vlan set priority on dev
> >>>>>> eth0 Automated app: if (vid == 0) /* Turn on priority
> >>>>>> tagged frame support */
> >>>>>>
> >>>>>> My proposal would require the bridge filtering
> >>>>>> implementation to have it. user tool: bridge vlan add vid 0
> >>>>>> tagged Automated app:  No special case.
> >>>>>>
> >>>>>> IMO its better to have 1 piece code handling the special
> >>>>>> case then putting it multiple places.
> >>>>>
> >>>>> Thank you for the detailed explanation. Now I understand your
> >>>>> intention.
> >>>>>
> >>>>> I have one question about your proposal. I guess the way to
> >>>>> enable priority-tagged is something like bridge vlan add vid
> >>>>> 10 dev eth0 bridge vlan add vid 10 dev vnet0 pvid untagged
> >>>>> bridge vlan add vid 0 dev vnet0 tagged where vnet0 has sub
> >>>>> interface vnet0.0.
> >>>>>
> >>>>> Here the admin have to know the egress policy is applied to a
> >>>>> frame twice in a certain order when it is transmitted from
> >>>>> the port vnet0 attached, that is, first, a frame with vid 10
> >>>>> get untagged, and then, an untagged frame get
> >>>>> priority-tagged.
> >>>>>
> >>>>> This behavior looks difficult to know without previous
> >>>>> knowledge. Any good idea to avoid such a need for the admin's
> >>>>> additional knowledge?
> >>>>
> >>>> To me, the fact that there is vnet0.0 (or typically, there is
> >>>> eth0.0 in the guest or on the remote host) already tells the
> >>>> admin vlan 0 has to be tagged.  The fact that we codify this in
> >>>> the policy makes it explicit.
> >>>
> >>> My worry is that the admin might not be able to guess how to use
> >>> bridge commands to enable priority-tag without any additional
> >>> hint in "man bridge", "bridge vlan help", etc. I actually
> >>> couldn't hit upon such a usage before seeing example commands you
> >>> gave, because I had never think the egress policy could be
> >>> applied twice.
> >>>
> >>>>
> >>>> However, I can see strong argument to be made for an addition
> >>>> egress policy attribute that could be for instance:
> >>>>
> >>>> bridge vlan add vid 10 dev eth0 pvid bridge vlan add vid 10 dev
> >>>> vnet0 pvid untagged prio_tag
> >>>>
> >>>> But this has the same connotations as wrt to egress policy.
> >>>> The 2 policies are applied: (1) untag the frame. (2) add
> >>>> priority_tag.
> >>>>
> >>>> (2) only happens if initial fame received on eth0 was priority
> >>>> tagged.
> >>>
> >>> If we do so, we will not be able to communicate using vlan 0
> >>> interface under a certain circumstance. Eth0 can be receive mixed
> >>> untagged and priority-tagged frames according to the network
> >>> element it is connected to: for example, Open vSwitch can send
> >>> such two kinds of frames from the same port even if original
> >>> incoming frames belong to the same vlan.
> >>
> >> Which priority would you assign to the frame that was received
> >> untagged?
> >
> > Untagged frame's priority is by default 0, so I think 0 is likely.
> >
> > 802.1Q 6.9.1 i) The received priority value and the drop_eligible
> > parameter value are the values in the M_UNITDATA.indication.
> >
> > M_UNITDATA is passed from ISS.
> >
> > 802.1Q 6.7.1 The priority parameter provided in a data indication
> > primitive shall take the value of the Default User Priority parameter
> > for the Port through which the MAC frame was received. The default
> > value of this parameter is 0, it may be set by management in which
> > case the capability to set it to any of the values 0 through 7 shall
> > be provided.
> >
> >>
> >>> In this situation, we can only receive frames that is
> >>> priority-tagged when received on eth0.
> >>
> >> Not sure I understand.  Let's look at this config: bridge vlan add
> >> vid 10 dev eth0 pvid bridge vlan add vid 10 dev vnet0 pvid untagged
> >> prio_tag
> >>
> >> Here, eth0 is allowed to receive vid 10 tagged, untagged, and
> >> prio_tagged (if we look at the patch 2 from this set). Now, frame
> >> is forwarded to vnet0. 1) if the frame had vid 10 in the tag or was
> >> untagged, it should probably be sent untagged. 2) if the frame had
> >> a priority tag, it should probably be sent as such.
> >>
> >> Now, I think a case could be made that if the frame had any
> >> priority markings in the vlan header, we should try to preserve
> >> those markings if prio_tag is turned on.  We can assume value of 0
> >> means not set.
> >
> > If we don't insert prio_tag when PCP is 0, we might receive mixed
> > priority-tagged and untagged frames on eth0.
> 
> Right, and that's what you were trying to handle in your patch:
> 
> > +		/* PVID is set on this port.  Any untagged or priority-tagged +
> > * ingress frame is considered to belong to this vlan. */
> 
> So, in this case we are prepared to handle the "mixed" scenario on ingress.
> 
> > Even if we are sending frames from eth0.0 with some priority other
> > than 0, we could receive frames with priority 0 or untagged on the
> > other side of the bridge.
> > For example, if we receive untagged arp reply on the bridge port, we
> > migit not be able to communicate with such an end station, because
> > untagged reply will not be passed to eth0.0.
> 
> So the ARP request was sent tagged, but the reply came back untagged?

Yes, it can happen.
These are problematic cases.

Example 1:
            prio_tagged         prio_tagged
+------------+ ---> +------------+ ---> +----------+
|guest eth0.0|------|host1 Bridge|------|host2 eth0|
+------------+ <--- +------------+ <--- +----------+
             untagged            untagged

Note: Host2 eth0, which is an interface on Linux, can receive
priority-tagged frames if it doesn't have vlan 0 interface (eth0.0).


Example 2:

            prio_tagged         prio_tagged       untagged
+------------+ ---> +------------+ ---> +---------+ ---> +-----+
|guest eth0.0|------|host1 Bridge|------|HW switch|------|host2|
+------------+ <--- +------------+ <--- +---------+ <--- +-----+
             untagged            untagged       prio/untagged

Note: 802.1Q conformed HW switch transmits only untagged or VLAN-tagged
frames.

> 
> How does that work when the end station is attached directly to the
> HW switch instead of a linux bridge?
> 
> The station configures eth0.0 and sends priority-tagged traffic to
> the HW switch.  If the HW switch sends back untagged traffic, then
> the untagged traffic will never reach eth0.0.

Currently we cannot communicate using eth0.0 via directly connected
802.1Q conformed switch, because we never receive priority-tagged frames
from the switch.
It is not a problem of Linux bridge and is why I wondered whether it
should be fixed by bridge or vlan 0 interface.

> 
> >
> >>
> >>> IMO, if prio_tag is configured, the bridge should send any
> >>> untagged frame as priority-tagged regardless of whatever it is on
> >>> eth0.
> >>
> >> Which priority would you use, 0?  You are not guaranteed to
> >> properly deliver the traffic then for a configuration such as: VM:
> >> eth0: 10.0.0.1/24 eth0.0: 10.0.1.1/24
> >
> > I'd like to use priority 0 for untagged frames.
> >
> > I am assuming that one of our goals is at least that eth0.0 comes to
> > be able to communicate with another end station. It seems to be hard
> > to use both eth0 and eth0.0 simultaneously.
> 
> I understand, but I don't agree that we should always tag.
> 
> Consider config:
> 
>     hw switch <---> (eth0: Linux Bridge: eth1) <--- (em1.0:end station)
> 
> If the end station sends priority-tagged traffic it should receive
> priority tagged traffic back.  Otherwise, untagged traffic may be 
> dropped by the end station.  This is true whether it is connected to
> the hw switch or Linux bridge.

Though such a behavior is generally not necessary as far as I can read
802.1Q spec, it is essential for vlan 0 interface on Linux, I think.
My proposal aims to resolve it at least when we use Linux bridge.

Example configuration:
	bridge vlan add vid 10 dev eth1 pvid untagged
	bridge vlan add vid 10 dev eth0
	bridge vlan set prio_tag on dev eth1

Intended behavior:

        VID10-tagged                     prio_tagged
+---------+ <--- +------------------------+ <--- +-----------------+
|hw switch|------|eth0: Linux Bridge: eth1|------|em1.0:end station|
+---------+ ---> +------------------------+ ---> +-----------------+
        VID10-tagged                     prio_tagged
                              (always if egress policy untagged)

Thanks,

Toshiaki Makita

> 
> -vlad
> 
> >
> > Thanks,
> >
> > Toshiaki Makita
> >
> >>
> >> -vlad
> >>
> >>>
> >>>>
> >>>> I think I am ok with either approach.  Explicit vid 0 policy is
> >>>> easier for automatic provisioning.   The flag based one is
> >>>> easier for admin/ manual provisioning.
> >>>
> >>> Supposing we have to add something to help or man in any case, I
> >>> think flag based is better. The format below seems to suitable
> >>> for a per-port policy. bridge vlan set prio_tag on dev vnet0
> >>>
> >>> Thanks,
> >>>
> >>> Toshiaki Makita
> >>>
> >>>>
> >>>> -vlad.
> >>>>
> >>>> -vlad
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> Thanks -vlad
> >>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Toshiaki Makita
> >>>>>>>
> >>>>>>>>
> >>>>>>>> How it is implemented internally in the kernel isn't as
> >>>>>>>> big of an issue. We can do it as a separate flag or as
> >>>>>>>> part of existing policy.
> >>>>>>>>
> >>>>>>>> -vlad
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Toshiaki Makita
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> -vlad
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Toshiaki Makita
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Toshiaki Makita
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -- To unsubscribe from this list: send the
> >>>>>>>>>>>>>> line "unsubscribe netdev" in the body of a
> >>>>>>>>>>>>>> message to majordomo@vger.kernel.org More
> >>>>>>>>>>>>>> majordomo info at
> >>>>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> -- To unsubscribe from this list: send the line
> >>>>>>>>>>> "unsubscribe netdev" in the body of a message to
> >>>>>>>>>>> majordomo@vger.kernel.org More majordomo info at
> >>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >
> >

^ permalink raw reply

* Re: [PATCH] ipv6: udp packets following an UFO enqueued packet need also be handled by UFO
From: Hannes Frederic Sowa @ 2013-10-01 12:09 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, yoshfuji, davem, kuznet, jmorris, kaber, herbert,
	eric.dumazet
In-Reply-To: <20131001105837.GA1424@minipsycho.brq.redhat.com>

On Tue, Oct 01, 2013 at 12:58:37PM +0200, Jiri Pirko wrote:
> >> 
> >> What if non-ufo-path-created skb is peeked?
> >
> >You mean like:
> >first append a frame < mtu
> >second frame > mtu so it gets handles by ip6_ufo_append?
> >
> >Currently I don't see a problem with it but I may be wrong. What is
> >your suspicion?
> 
> Well the skb will have gso_size==0 and ip_summed==CHECKSUM_NONE
> Because of that it will be not threated as ufo skb.
> 
> Following patch fixes it:
> 
> Subject: [patch net] ip6_output: do skb ufo init for peeked non ufo skb as well
> 
> Now, if user application does:
> sendto len<mtu flag MSG_MORE
> sendto len>mtu flag 0
> The skb is not treated as fragmented one because it is not initialized
> that way. So move the initialization to fix this.
> 
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>

My understanding is that the frame is only a valid gso frame iff the first skb
queued up is setup as an UFO frame. In other cases we only need to make sure
we append to the fragment list without updating the gso field and the skb will
get linearized (if needed) later in the output path.

This seems not to matter for virtio_net and loopback because we seem to pass
the skb as is. But the remaining ethernet card which sets NETIF_F_UFO is
neterion and we have to verify if this assumption is correct, there.

This is also the way the IPv4 stack deals with UFO.

Either way, this needs a look from Eric Dumazet and Herbert Xu, so I added
them in Cc.

Greetings,

  Hannes

^ permalink raw reply

* RE: [PATCH net RFC 2/2] ixgbe: fix sleep bug caused by napi_disable inside local_bh_disable()d context
From: Yuval Mintz @ 2013-10-01 12:11 UTC (permalink / raw)
  To: Keller, Jacob E, Francois Romieu
  Cc: netdev@vger.kernel.org, Duyck, Alexander H, Hyong-Youb Kim,
	Dmitry Kravkov, Amir Vadai, Eliezer Tamir
In-Reply-To: <02874ECE860811409154E81DA85FBB58565F5E52@ORSMSX104.amr.corp.intel.com>

> > > I have to move the local_bh_disable in order to put napi_disable
> > outside
> > > of the call since napi_disable could sleep, causing a scheduling while
> > > atomic BUG.
> >
> > I am in violent agreement with this part.
> > --
> > Ueimor
> 
> Regards,
> Jake
> --

It seem like we've hit the same issue with the bnx2x driver.
Is there anything new about the RFC?

Thanks,
Yuval

^ permalink raw reply

* Re: [PATCH] ipv6: udp packets following an UFO enqueued packet need also be handled by UFO
From: Hannes Frederic Sowa @ 2013-10-01 12:32 UTC (permalink / raw)
  To: Jiri Pirko, netdev, yoshfuji, davem, kuznet, jmorris, kaber,
	herbert, eric.dumazet
In-Reply-To: <20131001120907.GH10771@order.stressinduktion.org>

On Tue, Oct 01, 2013 at 02:09:07PM +0200, Hannes Frederic Sowa wrote:
> On Tue, Oct 01, 2013 at 12:58:37PM +0200, Jiri Pirko wrote:
> > >> 
> > >> What if non-ufo-path-created skb is peeked?
> > >
> > >You mean like:
> > >first append a frame < mtu
> > >second frame > mtu so it gets handles by ip6_ufo_append?
> > >
> > >Currently I don't see a problem with it but I may be wrong. What is
> > >your suspicion?
> > 
> > Well the skb will have gso_size==0 and ip_summed==CHECKSUM_NONE
> > Because of that it will be not threated as ufo skb.
> > 
> > Following patch fixes it:
> > 
> > Subject: [patch net] ip6_output: do skb ufo init for peeked non ufo skb as well
> > 
> > Now, if user application does:
> > sendto len<mtu flag MSG_MORE
> > sendto len>mtu flag 0
> > The skb is not treated as fragmented one because it is not initialized
> > that way. So move the initialization to fix this.
> > 
> > Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> 
> My understanding is that the frame is only a valid gso frame iff the first skb
> queued up is setup as an UFO frame. In other cases we only need to make sure
> we append to the fragment list without updating the gso field and the skb will
> get linearized (if needed) later in the output path.
> 
> This seems not to matter for virtio_net and loopback because we seem to pass
> the skb as is. But the remaining ethernet card which sets NETIF_F_UFO is
> neterion and we have to verify if this assumption is correct, there.

Hm, ip_append_page seems to violate my assumption. I need to read up on the
code later today.

^ permalink raw reply

* RE: Question regarding failure utilizing bonding mode 5 (balance-tlb)
From: Yuval Mintz @ 2013-10-01 12:56 UTC (permalink / raw)
  To: Veaceslav Falico; +Cc: Jay Vosburgh, netdev@vger.kernel.org, Ariel Elior
In-Reply-To: <20130930212440.GC24830@redhat.com>

> On Mon, Sep 30, 2013 at 11:30:40AM +0000, Yuval Mintz wrote:
> >> > >Again, I think the permanent address is restored only when the bond
> >> > >releases the slave, which I don't think happens when the slave is
> unloaded.
> >> >
> >> > 	Given a bond0 with two slaves, eth0 and eth1, in tlb mode, eth0
> >> > being the active,
> >> >
> >> > 	1) "ip link set dev eth0 down" which will fail over to eth1
> >> > 		(swapping the contents of their dev_addr fields).
> >> >
> >> > 	2) "ip link set dev eth0 up" eth0 comes back up, reprograms its
> >> > 		MAC to the wrong thing (what was in dev_addr).
> >> >
> >> > 	3) repeat steps 1 and 2 for eth1
> >> >
> >> > 	Is this correct?
> >> >
> >>
> >> Yes, sorry for the earlier confusion.
> >> I think in the case described `alb_swap_mac_addr()' will be called,
> >> replacing eth0 and eth1's dev_addr, causing eth0 to have dev_addr
> >> which defers from  the bond device's. Once eth0 reloads, it will use
> >> the different MAC address for configuring FW/HW.
> >
> >Hi,
> >
> >Did you by any chance had the time to look at this issue?
> 
> Hi Yuval,
> 
> Sorry for getting into the discussion - but I've tried to understand the
> problem and, possibly, find a fix.
> 
> I'm not sure that I completely understand it, and I don't have currently
> hardware on which to test it (though I might have it in the nearest
> future), so, again, I really am not sure that I won't suggest something
> completely stupid.
> 
> Anyway, that being said, I hope that the following patch might fix the
> problem. I've described the bug and the fix in the changelog, and the code
> is pretty self-explanitory.
> 
> And even if the patch fixes the issue - I'm not sure that it's the proper
> and correct way to do it. But it's definitely worth a try... So, if it's
> possible, could you please test this patch and see if it fixes it?
> 
> Warning: I've just compile-tested it.
> 
> So, FWIW...
> 

Like you, I don't know if yours is the proper way of fixing the issue - but it did
seem to fix it (the scenario that was described, at least)

Tested-by: Yuval Mintz <yuvalmin@broadcom.com>

Thanks,
Yuval

>  From 87e6c584b0ae0f0261610d60cf83778feb9c1edb Mon Sep 17
> 00:00:00 2001
> From: Veaceslav Falico <vfalico@redhat.com>
> Date: Mon, 30 Sep 2013 23:14:23 +0200
> Subject: [PATCH] bonding: ensure that TLB mode's active slave has correct
> mac filter
> 
> Currently, in TLB mode we change mac addresses only by memcpy-ing the to
> net_device->dev_addr, without actually setting them via
> dev_set_mac_address(). This permits us to receive all the traffic always on
> one mac address.
> 
> However, in case the interface flips, some drivers might enforce the
> mac filtering for its FW/HW based on current ->dev_addr, and thus we won't
> be able to receive traffic on that interface, in case it will be selected
> as active in TLB mode.
> 
> Fix it by setting the mac address forcefully on every new active slave that
> we select in TLB mode.
> 
> CC: Jay Vosburgh <fubar@us.ibm.com>
> CC: Andy Gospodarek <andy@greyhouse.net>
> Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
> ---
>   drivers/net/bonding/bond_alb.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
> index e960418..576ccea 100644
> --- a/drivers/net/bonding/bond_alb.c
> +++ b/drivers/net/bonding/bond_alb.c
> @@ -1699,6 +1699,23 @@ void bond_alb_handle_active_change(struct
> bonding *bond, struct slave *new_slave
> 
>   	ASSERT_RTNL();
> 
> +	/* in TLB mode, the slave might flip down/up with the old dev_addr,
> +	 * and thus filter bond->dev_addr's packets, so force bond's mac
> +	 */
> +	if (bond->params.mode == BOND_MODE_TLB) {
> +		struct sockaddr sa;
> +		u8 tmp_addr[ETH_ALEN];
> +
> +		memcpy(tmp_addr, new_slave->dev->dev_addr, ETH_ALEN);
> +
> +		memcpy(sa.sa_data, bond->dev->dev_addr, bond->dev-
> >addr_len);
> +		sa.sa_family = bond->dev->type;
> +		/* we don't care if it can't change its mac, best effort */
> +		dev_set_mac_address(new_slave->dev, &sa);
> +
> +		memcpy(new_slave->dev->dev_addr, tmp_addr, ETH_ALEN);
> +	}
> +
>   	/* curr_active_slave must be set before calling alb_swap_mac_addr
> */
>   	if (swap_slave) {
>   		/* swap mac address */
> --
> 1.8.4
> 
> 
> >
> >Thanks,
> >Yuval
> >
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Question regarding failure utilizing bonding mode 5 (balance-tlb)
From: Veaceslav Falico @ 2013-10-01 12:58 UTC (permalink / raw)
  To: Yuval Mintz; +Cc: Jay Vosburgh, netdev@vger.kernel.org, Ariel Elior
In-Reply-To: <979A8436335E3744ADCD3A9F2A2B68A52ADFFFCF@SJEXCHMB10.corp.ad.broadcom.com>

On Tue, Oct 01, 2013 at 12:56:43PM +0000, Yuval Mintz wrote:
>> On Mon, Sep 30, 2013 at 11:30:40AM +0000, Yuval Mintz wrote:
>> >> > >Again, I think the permanent address is restored only when the bond
>> >> > >releases the slave, which I don't think happens when the slave is
>> unloaded.
>> >> >
>> >> > 	Given a bond0 with two slaves, eth0 and eth1, in tlb mode, eth0
>> >> > being the active,
>> >> >
>> >> > 	1) "ip link set dev eth0 down" which will fail over to eth1
>> >> > 		(swapping the contents of their dev_addr fields).
>> >> >
>> >> > 	2) "ip link set dev eth0 up" eth0 comes back up, reprograms its
>> >> > 		MAC to the wrong thing (what was in dev_addr).
>> >> >
>> >> > 	3) repeat steps 1 and 2 for eth1
>> >> >
>> >> > 	Is this correct?
>> >> >
>> >>
>> >> Yes, sorry for the earlier confusion.
>> >> I think in the case described `alb_swap_mac_addr()' will be called,
>> >> replacing eth0 and eth1's dev_addr, causing eth0 to have dev_addr
>> >> which defers from  the bond device's. Once eth0 reloads, it will use
>> >> the different MAC address for configuring FW/HW.
>> >
>> >Hi,
>> >
>> >Did you by any chance had the time to look at this issue?
>>
>> Hi Yuval,
>>
>> Sorry for getting into the discussion - but I've tried to understand the
>> problem and, possibly, find a fix.
>>
>> I'm not sure that I completely understand it, and I don't have currently
>> hardware on which to test it (though I might have it in the nearest
>> future), so, again, I really am not sure that I won't suggest something
>> completely stupid.
>>
>> Anyway, that being said, I hope that the following patch might fix the
>> problem. I've described the bug and the fix in the changelog, and the code
>> is pretty self-explanitory.
>>
>> And even if the patch fixes the issue - I'm not sure that it's the proper
>> and correct way to do it. But it's definitely worth a try... So, if it's
>> possible, could you please test this patch and see if it fixes it?
>>
>> Warning: I've just compile-tested it.
>>
>> So, FWIW...
>>
>
>Like you, I don't know if yours is the proper way of fixing the issue - but it did
>seem to fix it (the scenario that was described, at least)
>
>Tested-by: Yuval Mintz <yuvalmin@broadcom.com>

Thank you!

I'll then try now to dig it a big futher, and if it happens that this fix is
really the one we need - I'll use your Tested-by, hope that you're ok with
that. Otherwise I'll send a new patch(set) with you in the CC.

Thanks again!

>
>Thanks,
>Yuval
>
>>  From 87e6c584b0ae0f0261610d60cf83778feb9c1edb Mon Sep 17
>> 00:00:00 2001
>> From: Veaceslav Falico <vfalico@redhat.com>
>> Date: Mon, 30 Sep 2013 23:14:23 +0200
>> Subject: [PATCH] bonding: ensure that TLB mode's active slave has correct
>> mac filter
>>
>> Currently, in TLB mode we change mac addresses only by memcpy-ing the to
>> net_device->dev_addr, without actually setting them via
>> dev_set_mac_address(). This permits us to receive all the traffic always on
>> one mac address.
>>
>> However, in case the interface flips, some drivers might enforce the
>> mac filtering for its FW/HW based on current ->dev_addr, and thus we won't
>> be able to receive traffic on that interface, in case it will be selected
>> as active in TLB mode.
>>
>> Fix it by setting the mac address forcefully on every new active slave that
>> we select in TLB mode.
>>
>> CC: Jay Vosburgh <fubar@us.ibm.com>
>> CC: Andy Gospodarek <andy@greyhouse.net>
>> Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
>> ---
>>   drivers/net/bonding/bond_alb.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>> index e960418..576ccea 100644
>> --- a/drivers/net/bonding/bond_alb.c
>> +++ b/drivers/net/bonding/bond_alb.c
>> @@ -1699,6 +1699,23 @@ void bond_alb_handle_active_change(struct
>> bonding *bond, struct slave *new_slave
>>
>>   	ASSERT_RTNL();
>>
>> +	/* in TLB mode, the slave might flip down/up with the old dev_addr,
>> +	 * and thus filter bond->dev_addr's packets, so force bond's mac
>> +	 */
>> +	if (bond->params.mode == BOND_MODE_TLB) {
>> +		struct sockaddr sa;
>> +		u8 tmp_addr[ETH_ALEN];
>> +
>> +		memcpy(tmp_addr, new_slave->dev->dev_addr, ETH_ALEN);
>> +
>> +		memcpy(sa.sa_data, bond->dev->dev_addr, bond->dev-
>> >addr_len);
>> +		sa.sa_family = bond->dev->type;
>> +		/* we don't care if it can't change its mac, best effort */
>> +		dev_set_mac_address(new_slave->dev, &sa);
>> +
>> +		memcpy(new_slave->dev->dev_addr, tmp_addr, ETH_ALEN);
>> +	}
>> +
>>   	/* curr_active_slave must be set before calling alb_swap_mac_addr
>> */
>>   	if (swap_slave) {
>>   		/* swap mac address */
>> --
>> 1.8.4
>>
>>
>> >
>> >Thanks,
>> >Yuval
>> >
>> >--
>> >To unsubscribe from this list: send the line "unsubscribe netdev" in
>> >the body of a message to majordomo@vger.kernel.org
>> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply

* [PATCH 1/2] net, vxlan Fix compile warning [v4]
From: Prarit Bhargava @ 2013-10-01 13:24 UTC (permalink / raw)
  To: netdev; +Cc: Prarit Bhargava, jpirko

Fix a unintialized variable warning.

drivers/net/vxlan.c: In function ‘vxlan_sock_add’:
drivers/net/vxlan.c:2240:11: error: ‘sock’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  vs->sock = sock;
           ^
drivers/net/vxlan.c:2217:17: note: ‘sock’ was declared here
  struct socket *sock;
                 ^
[v2]: davem suggested resolving this by making create_v{4,6}_sock() return an err
pointer.
[v3]: Ben Hutchings pointed out a missed conversion

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: jpirko@redhat.com
---
 drivers/net/vxlan.c |   31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index d1292fe..cdc78b4 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2182,7 +2182,7 @@ static void vxlan_del_work(struct work_struct *work)
  * could be used for both IPv4 and IPv6 communications, but
  * users may set bindv6only=1.
  */
-static int create_v6_sock(struct net *net, __be16 port, struct socket **psock)
+static struct socket *create_v6_sock(struct net *net, __be16 port)
 {
 	struct sock *sk;
 	struct socket *sock;
@@ -2195,7 +2195,7 @@ static int create_v6_sock(struct net *net, __be16 port, struct socket **psock)
 	rc = sock_create_kern(AF_INET6, SOCK_DGRAM, IPPROTO_UDP, &sock);
 	if (rc < 0) {
 		pr_debug("UDPv6 socket create failed\n");
-		return rc;
+		return ERR_PTR(rc);
 	}
 
 	/* Put in proper namespace */
@@ -2210,28 +2210,27 @@ static int create_v6_sock(struct net *net, __be16 port, struct socket **psock)
 		pr_debug("bind for UDPv6 socket %pI6:%u (%d)\n",
 			 &vxlan_addr.sin6_addr, ntohs(vxlan_addr.sin6_port), rc);
 		sk_release_kernel(sk);
-		return rc;
+		return ERR_PTR(rc);
 	}
 	/* At this point, IPv6 module should have been loaded in
 	 * sock_create_kern().
 	 */
 	BUG_ON(!ipv6_stub);
 
-	*psock = sock;
 	/* Disable multicast loopback */
 	inet_sk(sk)->mc_loop = 0;
-	return 0;
+	return sock;
 }
 
 #else
 
-static int create_v6_sock(struct net *net, __be16 port, struct socket **psock)
+static struct socket *create_v6_sock(struct net *net, __be16 port)
 {
-		return -EPFNOSUPPORT;
+		return ERR_PTR(-EPFNOSUPPORT);
 }
 #endif
 
-static int create_v4_sock(struct net *net, __be16 port, struct socket **psock)
+static struct socket *create_v4_sock(struct net *net, __be16 port)
 {
 	struct sock *sk;
 	struct socket *sock;
@@ -2246,7 +2245,7 @@ static int create_v4_sock(struct net *net, __be16 port, struct socket **psock)
 	rc = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
 	if (rc < 0) {
 		pr_debug("UDP socket create failed\n");
-		return rc;
+		return ERR_PTR(rc);
 	}
 
 	/* Put in proper namespace */
@@ -2259,13 +2258,12 @@ static int create_v4_sock(struct net *net, __be16 port, struct socket **psock)
 		pr_debug("bind for UDP socket %pI4:%u (%d)\n",
 			 &vxlan_addr.sin_addr, ntohs(vxlan_addr.sin_port), rc);
 		sk_release_kernel(sk);
-		return rc;
+		return ERR_PTR(rc);
 	}
 
-	*psock = sock;
 	/* Disable multicast loopback */
 	inet_sk(sk)->mc_loop = 0;
-	return 0;
+	return sock;
 }
 
 /* Create new listen socket if needed */
@@ -2276,7 +2274,6 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
 	struct vxlan_sock *vs;
 	struct socket *sock;
 	struct sock *sk;
-	int rc = 0;
 	unsigned int h;
 
 	vs = kmalloc(sizeof(*vs), GFP_KERNEL);
@@ -2289,12 +2286,12 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
 	INIT_WORK(&vs->del_work, vxlan_del_work);
 
 	if (ipv6)
-		rc = create_v6_sock(net, port, &sock);
+		sock = create_v6_sock(net, port);
 	else
-		rc = create_v4_sock(net, port, &sock);
-	if (rc < 0) {
+		sock = create_v4_sock(net, port);
+	if (IS_ERR(sock)) {
 		kfree(vs);
-		return ERR_PTR(rc);
+		return ERR_CAST(sock);
 	}
 
 	vs->sock = sock;
-- 
1.7.9.3

^ permalink raw reply related

* [PATCH 2/2] net, mellanox mlx4 Fix compile warnings [v4]
From: Prarit Bhargava @ 2013-10-01 13:24 UTC (permalink / raw)
  To: netdev; +Cc: Prarit Bhargava, dledford, amirv, davem, ogerlitz, jackm
In-Reply-To: <1380633877-24034-1-git-send-email-prarit@redhat.com>

Fix unitialized variable warnings.

drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function ‘mlx4_HW2SW_CQ_wrapper’:
drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:2551:16: error: ‘cq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  atomic_dec(&cq->mtt->ref_count);
                ^
drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function ‘mlx4_HW2SW_SRQ_wrapper’:
drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:2734:17: error: ‘srq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  atomic_dec(&srq->mtt->ref_count);

[v2]: davem suggested making cq_res_start_move_to() return 'cq' as an error
pointer instead of setting 'cq' by reference.  I also did the same for
srq.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: dledford@redhat.com
Cc: amirv@mellanox.com
Cc: davem@davemloft.net
Cc: ogerlitz@mellanox.com
Cc: jackm@dev.mellanox.co.il
---
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |   46 ++++++++++----------
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index dd68763..343206b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -1073,8 +1073,9 @@ static int eq_res_start_move_to(struct mlx4_dev *dev, int slave, int index,
 	return err;
 }
 
-static int cq_res_start_move_to(struct mlx4_dev *dev, int slave, int cqn,
-				enum res_cq_states state, struct res_cq **cq)
+static struct res_cq *cq_res_start_move_to(struct mlx4_dev *dev,
+						  int slave, int cqn,
+						  enum res_cq_states state)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_resource_tracker *tracker = &priv->mfunc.master.res_tracker;
@@ -1117,18 +1118,19 @@ static int cq_res_start_move_to(struct mlx4_dev *dev, int slave, int cqn,
 			r->com.from_state = r->com.state;
 			r->com.to_state = state;
 			r->com.state = RES_CQ_BUSY;
-			if (cq)
-				*cq = r;
+		} else {
+			r = ERR_PTR(err);
 		}
 	}
 
 	spin_unlock_irq(mlx4_tlock(dev));
 
-	return err;
+	return r;
 }
 
-static int srq_res_start_move_to(struct mlx4_dev *dev, int slave, int index,
-				 enum res_cq_states state, struct res_srq **srq)
+static struct res_srq *srq_res_start_move_to(struct mlx4_dev *dev, int slave,
+					     int index,
+					     enum res_cq_states state)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_resource_tracker *tracker = &priv->mfunc.master.res_tracker;
@@ -1167,14 +1169,14 @@ static int srq_res_start_move_to(struct mlx4_dev *dev, int slave, int index,
 			r->com.from_state = r->com.state;
 			r->com.to_state = state;
 			r->com.state = RES_SRQ_BUSY;
-			if (srq)
-				*srq = r;
+		} else {
+			r = ERR_PTR(err);
 		}
 	}
 
 	spin_unlock_irq(mlx4_tlock(dev));
 
-	return err;
+	return r;
 }
 
 static void res_abort_move(struct mlx4_dev *dev, int slave,
@@ -2530,9 +2532,9 @@ int mlx4_SW2HW_CQ_wrapper(struct mlx4_dev *dev, int slave,
 	struct res_cq *cq;
 	struct res_mtt *mtt;
 
-	err = cq_res_start_move_to(dev, slave, cqn, RES_CQ_HW, &cq);
-	if (err)
-		return err;
+	cq = cq_res_start_move_to(dev, slave, cqn, RES_CQ_HW);
+	if (IS_ERR(cq))
+		return PTR_ERR(cq);
 	err = get_res(dev, slave, mtt_base, RES_MTT, &mtt);
 	if (err)
 		goto out_move;
@@ -2565,9 +2567,9 @@ int mlx4_HW2SW_CQ_wrapper(struct mlx4_dev *dev, int slave,
 	int cqn = vhcr->in_modifier;
 	struct res_cq *cq;
 
-	err = cq_res_start_move_to(dev, slave, cqn, RES_CQ_ALLOCATED, &cq);
-	if (err)
-		return err;
+	cq = cq_res_start_move_to(dev, slave, cqn, RES_CQ_ALLOCATED);
+	if (IS_ERR(cq))
+		return PTR_ERR(cq);
 	err = mlx4_DMA_wrapper(dev, slave, vhcr, inbox, outbox, cmd);
 	if (err)
 		goto out_move;
@@ -2709,9 +2711,9 @@ int mlx4_SW2HW_SRQ_wrapper(struct mlx4_dev *dev, int slave,
 	if (srqn != (be32_to_cpu(srqc->state_logsize_srqn) & 0xffffff))
 		return -EINVAL;
 
-	err = srq_res_start_move_to(dev, slave, srqn, RES_SRQ_HW, &srq);
-	if (err)
-		return err;
+	srq = srq_res_start_move_to(dev, slave, srqn, RES_SRQ_ALLOCATED);
+	if (IS_ERR(srq))
+		return PTR_ERR(srq);
 	err = get_res(dev, slave, mtt_base, RES_MTT, &mtt);
 	if (err)
 		goto ex_abort;
@@ -2748,9 +2750,9 @@ int mlx4_HW2SW_SRQ_wrapper(struct mlx4_dev *dev, int slave,
 	int srqn = vhcr->in_modifier;
 	struct res_srq *srq;
 
-	err = srq_res_start_move_to(dev, slave, srqn, RES_SRQ_ALLOCATED, &srq);
-	if (err)
-		return err;
+	srq = srq_res_start_move_to(dev, slave, srqn, RES_SRQ_ALLOCATED);
+	if (IS_ERR(srq))
+		return PTR_ERR(srq);
 	err = mlx4_DMA_wrapper(dev, slave, vhcr, inbox, outbox, cmd);
 	if (err)
 		goto ex_abort;
-- 
1.7.9.3

^ permalink raw reply related

* Re: Established sockets remain open after iface down or address lost
From: Chris Verges @ 2013-10-01 13:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1380203383.3165.172.camel@edumazet-glaptop>

On Thu, Sep 26, 2013 at 06:49:43AM -0700, Eric Dumazet wrote:
> On Wed, 2013-09-25 at 23:04 -0700, Chris Verges wrote:
> >   (6) Physically disconnect the USB/Ethernet adapter from the USB
> >       bus.
> >   (7) Linux removes the 'eth0' interface and associated IP address.
> > 
> > At this point, the socket _still_ shows as ESTABLISHED under
> > netstat.
> > 
> > This is the paradox.  Why is the blocking read not interrupted with
> > a socket error to indicate that the socket is no longer viable?
> 
> Because TCP layer is not sensitive to such temporary events.  You can
> plug again your iface, and IP is valid again.  Why should we give a
> permanent error for such case ?

Yes, the interface could be reconnected ... or it could not be.
Consider an embedded device where a PPP-based radio module is powered
off for a decent amount of time (hours).

  +--------+          +-----------------+
  | Client |----------| Embedded Server |
  +--------+          +-----------------+

The client establishes a connection to the server.  It requests some
data and gets a response.  The socket remains open.  The server then
decides, through some asynchronous process, that the radio needs to be
duty cycled.  So the radio is turned off.

The client attempts to make another request to the device, but
determines that the connection is dead through the normal retry
mechanisms.  It's write() operation returns something like EPIPE.  So on
the client's side, the connection is dead.

But on the server side, the socket is still open and waiting for some
more data.  The interface and IP address and even the remote client are
long gone, but the socket still persists and uses system resources.

This is primarily an issue when the server binds/listens/accepts without
using a socket pool to process sockets after the accept().  That is, the
server processes only one socket.  If the socket resource is held by
this now-dead connection, there is generally no way to reset it without
lengthy or costly keepalives (depending on whether the keepalive timer
is set long or short, respectively.)

> If network communication is cut somewhere, TCP is not supposed to
> immediately react. Normal timeouts and retransmits take place. 

I agree in the sense that "somewhere" is between the remote station
(inclusive) and the local station (exclusive.)  I would argue that the
local station could be aware of its own state changes and may choose to
respond accordingly.

I can see the arguments for why the existing behavior would be viewed
favorably.  From this particular real-world scenario that I am
encountering, I can also see why a modified behavior would be useful.
Would you consider accepting a patch that adds a new socket option to
optionally control this?  The effect would be to cause the socket to
automatically close (interrupting any blocking reads) if the underlying
address used by the socket is unregistered from the stack.  Default
behavior would be maintained.

Thanks,
Chris

^ permalink raw reply

* RE: [PATCH 07/11] igb: Avoid uninitialized advertised variable in eee_set_cur
From: Wyborny, Carolyn @ 2013-10-01 15:00 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel@vger.kernel.org
  Cc: Andi Kleen, Kirsher, Jeffrey T, netdev@vger.kernel.org
In-Reply-To: <1380572952-30729-8-git-send-email-andi@firstfloor.org>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Andi Kleen
> Sent: Monday, September 30, 2013 1:29 PM
> To: linux-kernel@vger.kernel.org
> Cc: Andi Kleen; Kirsher, Jeffrey T; netdev@vger.kernel.org
> Subject: [PATCH 07/11] igb: Avoid uninitialized advertised variable in
> eee_set_cur
> 
> From: Andi Kleen <ak@linux.intel.com>
> 
> eee_get_cur assumes that the output data is already zeroed. It can read-modify-
> write the advertised field:
> 
>               if (ipcnfg & E1000_IPCNFG_EEE_100M_AN)
> 2594			edata->advertised |= ADVERTISED_100baseT_Full;
> 
> This is ok for the normal ethtool eee_get call, which always zeroes the input
> data before.
> 
> But eee_set_cur also calls eee_get_cur and it did not zero the input field. Later
> on it then compares agsinst the field, which can contain partial stack garbage.
> 
> Zero the input field in eee_set_cur() too.
> 
> Cc: jeffrey.t.kirsher@intel.com
> Cc: netdev@vger.kernel.org
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c
> b/drivers/net/ethernet/intel/igb/igb_ethtool.c
> index 48cbc83..41e37ff 100644
> --- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
> +++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
> @@ -2652,6 +2652,8 @@ static int igb_set_eee(struct net_device *netdev,
>  	    (hw->phy.media_type != e1000_media_type_copper))
>  		return -EOPNOTSUPP;
> 
> +	memset(&eee_curr, 0, sizeof(struct ethtool_eee));
> +
>  	ret_val = igb_get_eee(netdev, &eee_curr);
>  	if (ret_val)
>  		return ret_val;
> --
> 1.8.3.1
> 

ACK

Thanks,

Carolyn 

^ permalink raw reply

* Re: [PATCH 2/2] net, mellanox mlx4 Fix compile warnings [v4]
From: Jack Morgenstein @ 2013-10-01 15:17 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: netdev, dledford, amirv, davem, ogerlitz
In-Reply-To: <1380633877-24034-2-git-send-email-prarit@redhat.com>

On Tue,  1 Oct 2013 09:24:37 -0400
Prarit Bhargava <prarit@redhat.com> wrote:

Hi,

You missed some needed changes (You did not get compile warnings,
because indeed "r" was initialized in the paths below.  However,
in these error paths, "err" was ignored.
You should be setting "r" to the error return values:

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c index
343206b..c4a0a32 100644 ---
a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c +++
b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c @@ -1085,9
+1085,9 @@ static struct res_cq *cq_res_start_move_to(struct mlx4_dev
*dev,
 	spin_lock_irq(mlx4_tlock(dev));
 	r = res_tracker_lookup(&tracker->res_tree[RES_CQ], cqn);
 	if (!r)
-               err = -ENOENT;
+               r = ERR_PTR(-ENOENT);
        else if (r->com.owner != slave)
-               err = -EPERM;
+               r = ERR_PTR(-EPERM);
        else {
                switch (state) {
                case RES_CQ_BUSY:
@@ -1140,9 +1140,9 @@ static struct res_srq
*srq_res_start_move_to(struct mlx4_dev *dev, int slave,
 	spin_lock_irq(mlx4_tlock(dev));
 	r = res_tracker_lookup(&tracker->res_tree[RES_SRQ], index);
 	if (!r)
-               err = -ENOENT;
+               r = ERR_PTR(-ENOENT);
        else if (r->com.owner != slave)
-               err = -EPERM;
+               r = ERR_PTR(-EPERM);
        else {
                switch (state) {
                case RES_SRQ_BUSY:

There are also other calls which need to be changed in a similar
fashion (eq_res_start_move_to(), and one or two others -- I'm
surprised that these others did not also generate uninitialized-var
warnings). If we change cq and srq, we should also change the others.

Since we are in the middle of submitting patches which touch the file
"resource_tracker.c", I would really like to hold off on these warning
fixes for a bit, and I'll handle the changes for all the functions at
once to conform (correctly!) to the format suggested by Dave Miller.

 -Jack

> Fix unitialized variable warnings.
> 
> drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function
> ‘mlx4_HW2SW_CQ_wrapper’:
> drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:2551:16: error:
> ‘cq’ may be used uninitialized in this function
> [-Werror=maybe-uninitialized] atomic_dec(&cq->mtt->ref_count); ^
> drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function
> ‘mlx4_HW2SW_SRQ_wrapper’:
> drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:2734:17: error:
> ‘srq’ may be used uninitialized in this function
> [-Werror=maybe-uninitialized] atomic_dec(&srq->mtt->ref_count);
> 
> [v2]: davem suggested making cq_res_start_move_to() return 'cq' as an
> error pointer instead of setting 'cq' by reference.  I also did the
> same for srq.
> 
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> Cc: dledford@redhat.com
> Cc: amirv@mellanox.com
> Cc: davem@davemloft.net
> Cc: ogerlitz@mellanox.com
> Cc: jackm@dev.mellanox.co.il
> ---
>  .../net/ethernet/mellanox/mlx4/resource_tracker.c  |   46
> ++++++++++---------- 1 file changed, 24 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c index
> dd68763..343206b 100644 ---
> a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c +++
> b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c @@ -1073,8
> +1073,9 @@ static int eq_res_start_move_to(struct mlx4_dev *dev, int
> slave, int index, return err; }
>  
> -static int cq_res_start_move_to(struct mlx4_dev *dev, int slave, int
> cqn,
> -				enum res_cq_states state, struct
> res_cq **cq) +static struct res_cq *cq_res_start_move_to(struct
> mlx4_dev *dev,
> +						  int slave, int cqn,
> +						  enum res_cq_states
> state) {
>  	struct mlx4_priv *priv = mlx4_priv(dev);
>  	struct mlx4_resource_tracker *tracker =
> &priv->mfunc.master.res_tracker; @@ -1117,18 +1118,19 @@ static int
> cq_res_start_move_to(struct mlx4_dev *dev, int slave, int cqn,
> r->com.from_state = r->com.state; r->com.to_state = state;
>  			r->com.state = RES_CQ_BUSY;
> -			if (cq)
> -				*cq = r;
> +		} else {
> +			r = ERR_PTR(err);
>  		}
>  	}
>  
>  	spin_unlock_irq(mlx4_tlock(dev));
>  
> -	return err;
> +	return r;
>  }
>  
> -static int srq_res_start_move_to(struct mlx4_dev *dev, int slave,
> int index,
> -				 enum res_cq_states state, struct
> res_srq **srq) +static struct res_srq *srq_res_start_move_to(struct
> mlx4_dev *dev, int slave,
> +					     int index,
> +					     enum res_cq_states
> state) {
>  	struct mlx4_priv *priv = mlx4_priv(dev);
>  	struct mlx4_resource_tracker *tracker =
> &priv->mfunc.master.res_tracker; @@ -1167,14 +1169,14 @@ static int
> srq_res_start_move_to(struct mlx4_dev *dev, int slave, int index,
> r->com.from_state = r->com.state; r->com.to_state = state;
>  			r->com.state = RES_SRQ_BUSY;
> -			if (srq)
> -				*srq = r;
> +		} else {
> +			r = ERR_PTR(err);
>  		}
>  	}
>  
>  	spin_unlock_irq(mlx4_tlock(dev));
>  
> -	return err;
> +	return r;
>  }
>  
>  static void res_abort_move(struct mlx4_dev *dev, int slave,
> @@ -2530,9 +2532,9 @@ int mlx4_SW2HW_CQ_wrapper(struct mlx4_dev *dev,
> int slave, struct res_cq *cq;
>  	struct res_mtt *mtt;
>  
> -	err = cq_res_start_move_to(dev, slave, cqn, RES_CQ_HW, &cq);
> -	if (err)
> -		return err;
> +	cq = cq_res_start_move_to(dev, slave, cqn, RES_CQ_HW);
> +	if (IS_ERR(cq))
> +		return PTR_ERR(cq);
>  	err = get_res(dev, slave, mtt_base, RES_MTT, &mtt);
>  	if (err)
>  		goto out_move;
> @@ -2565,9 +2567,9 @@ int mlx4_HW2SW_CQ_wrapper(struct mlx4_dev *dev,
> int slave, int cqn = vhcr->in_modifier;
>  	struct res_cq *cq;
>  
> -	err = cq_res_start_move_to(dev, slave, cqn,
> RES_CQ_ALLOCATED, &cq);
> -	if (err)
> -		return err;
> +	cq = cq_res_start_move_to(dev, slave, cqn, RES_CQ_ALLOCATED);
> +	if (IS_ERR(cq))
> +		return PTR_ERR(cq);
>  	err = mlx4_DMA_wrapper(dev, slave, vhcr, inbox, outbox, cmd);
>  	if (err)
>  		goto out_move;
> @@ -2709,9 +2711,9 @@ int mlx4_SW2HW_SRQ_wrapper(struct mlx4_dev
> *dev, int slave, if (srqn != (be32_to_cpu(srqc->state_logsize_srqn) &
> 0xffffff)) return -EINVAL;
>  
> -	err = srq_res_start_move_to(dev, slave, srqn, RES_SRQ_HW,
> &srq);
> -	if (err)
> -		return err;
> +	srq = srq_res_start_move_to(dev, slave, srqn,
> RES_SRQ_ALLOCATED);
> +	if (IS_ERR(srq))
> +		return PTR_ERR(srq);
>  	err = get_res(dev, slave, mtt_base, RES_MTT, &mtt);
>  	if (err)
>  		goto ex_abort;
> @@ -2748,9 +2750,9 @@ int mlx4_HW2SW_SRQ_wrapper(struct mlx4_dev
> *dev, int slave, int srqn = vhcr->in_modifier;
>  	struct res_srq *srq;
>  
> -	err = srq_res_start_move_to(dev, slave, srqn,
> RES_SRQ_ALLOCATED, &srq);
> -	if (err)
> -		return err;
> +	srq = srq_res_start_move_to(dev, slave, srqn,
> RES_SRQ_ALLOCATED);
> +	if (IS_ERR(srq))
> +		return PTR_ERR(srq);
>  	err = mlx4_DMA_wrapper(dev, slave, vhcr, inbox, outbox, cmd);
>  	if (err)
>  		goto ex_abort;

^ permalink raw reply

* Re: [PATCH 2/2] net, mellanox mlx4 Fix compile warnings [v4]
From: Prarit Bhargava @ 2013-10-01 15:25 UTC (permalink / raw)
  To: Jack Morgenstein; +Cc: netdev, dledford, amirv, davem, ogerlitz
In-Reply-To: <20131001181742.72e603d1@jpm-OptiPlex-GX620>



On 10/01/2013 11:17 AM, Jack Morgenstein wrote:
> Since we are in the middle of submitting patches which touch the file
> "resource_tracker.c", I would really like to hold off on these warning
> fixes for a bit, and I'll handle the changes for all the functions at
> once to conform (correctly!) to the format suggested by Dave Miller.
> 
>  -Jack

Hey Jack, np.  Thanks for looking.

P.

^ permalink raw reply

* Re: [PATCH ethtool 1/2] Revert "Fix reporting of VLAN tag offload flags on Linux < 2.6.37"
From: Ben Hutchings @ 2013-10-01 15:35 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers
In-Reply-To: <1379978698.2485.85.camel@bwh-desktop.uk.level5networks.com>

On Tue, 2013-09-24 at 00:24 +0100, Ben Hutchings wrote:
> This reverts commit 1cddbe64cfc66b58988c85086d6df3a77c0c61a5.
> It causes 'ethtool -K' to pass the VLAN tag offload flags to
> ETHTOOL_SFLAGS when they are not supported and will be rejected.
> ---
>  ethtool.c | 41 -----------------------------------------
>  1 file changed, 41 deletions(-)

I've now applied this.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH ethtool 2/2] Hide state of VLAN tag offload and LRO if the kernel is too old
From: Ben Hutchings @ 2013-10-01 15:36 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers
In-Reply-To: <1379978791.2485.87.camel@bwh-desktop.uk.level5networks.com>

On Tue, 2013-09-24 at 00:26 +0100, Ben Hutchings wrote:
> Starting with Linux 2.6.37 and ethtool 2.6.36 it was possible to show
> the state of VLAN tag offload (using ETHTOOL_GFLAGS).  But the state
> would always be shown as 'off' for older kernel versions, even though
> VLAN tag offload had been implemented long before this.
> 
> In ethtool 3.4.2 I attempted to fix this by also reading the state of
> VLAN tag offload from the 'features' attribute in sysfs.  But this had
> to be reverted because it causes 'ethtool -K' to pass the flags back
> into ETHTOOL_SFLAGS.
> 
> Instead, hide the VLAN tag offload flags if the kernel is older than
> 2.6.37.
> 
> Similarly, LRO was implemented some time before it was exposed through
> ETHTOOL_GFLAGS in Linux 2.6.24.  So hide the LRO flag if the kernel
> is older than that.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---
>  ethtool.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 46 insertions(+), 12 deletions(-)

I've now applied this.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH ethtool] sfc: Add support for EF10 registers
From: Ben Hutchings @ 2013-10-01 15:36 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers
In-Reply-To: <1379952298.2485.35.camel@bwh-desktop.uk.level5networks.com>

On Mon, 2013-09-23 at 17:04 +0100, Ben Hutchings wrote:
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---
> This works in conjunction with the recently submitted driver change
> 'sfc: Add EF10 registers to register dump'.
> 
> Ben.
> 
>  sfc.c | 159 +++++++++++++++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 109 insertions(+), 50 deletions(-)

I've now applied this.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH ethtool] sfc: Stop using bitfields in register definition structures
From: Ben Hutchings @ 2013-10-01 15:37 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers, David Laight
In-Reply-To: <1379960098.2485.48.camel@bwh-desktop.uk.level5networks.com>

On Mon, 2013-09-23 at 19:14 +0100, Ben Hutchings wrote:
> This doesn't make any difference to the static data size, and
> decreases the code size a little.
> 
> Reported-by: David Laight <David.Laight@ACULAB.COM>
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---
>  sfc.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)

I've now applied this.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: Established sockets remain open after iface down or address lost
From: Rick Jones @ 2013-10-01 15:44 UTC (permalink / raw)
  To: Chris Verges
  Cc: Eric Dumazet, davem, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <20131001132707.GA7442@cverges-dev-lnx.sentient-energy.com>

On 10/01/2013 06:27 AM, Chris Verges wrote:
> The client establishes a connection to the server.  It requests some
> data and gets a response.  The socket remains open.  The server then
> decides, through some asynchronous process, that the radio needs to be
> duty cycled.  So the radio is turned off.
>
> The client attempts to make another request to the device, but
> determines that the connection is dead through the normal retry
> mechanisms.  It's write() operation returns something like EPIPE.  So on
> the client's side, the connection is dead.
>
> But on the server side, the socket is still open and waiting for some
> more data.  The interface and IP address and even the remote client are
> long gone, but the socket still persists and uses system resources.

The protocol between client and server needs to have an 
application-layer "keepalive" mechanism added, and then the server will 
be able to detect a dangling connection without need of any further 
kernel modifications.

If that is not possible, the server can/should set SO_KEEPALIVE and 
perhaps tweak the TCP keepalive settings.  Not as good (IMO) as an 
application-layer keepalive because it only shows that the connection is 
good as far as TCP, but I suppose it could do in a pinch.

rick jones

^ permalink raw reply

* Re: [net-next  9/9] igb: Add ethtool support to configure number of channels
From: Ben Hutchings @ 2013-10-01 15:59 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, Laura Mihaela Vasilescu, netdev, gospo, sassmann
In-Reply-To: <1380627236-3190-10-git-send-email-jeffrey.t.kirsher@intel.com>

On Tue, 2013-10-01 at 04:33 -0700, Jeff Kirsher wrote:
> From: Laura Mihaela Vasilescu <laura.vasilescu@rosedu.org>
> 
> This patch adds the ethtool callbacks necessary to configure the
> number of RSS queues.
> 
> The maximum number of queues is in accordance with the datasheets.
> 
> Signed-off-by: Laura Mihaela Vasilescu <laura.vasilescu@rosedu.org>
> Tested-by: Aaron Brown <aaron.f.brown@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
[...]
> --- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
> +++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
[...]
> +static void igb_get_channels(struct net_device *netdev,
> +			     struct ethtool_channels *ch)
[...]
> +static int igb_set_channels(struct net_device *netdev,
> +			    struct ethtool_channels *ch)

These functions look fine to me.

[...]
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -7838,4 +7838,26 @@ s32 igb_write_i2c_byte(struct e1000_hw *hw, u8 byte_offset,
>  		return E1000_SUCCESS;
>  
>  }
> +
> +int igb_reinit_queues(struct igb_adapter *adapter)
> +{
> +	struct net_device *netdev = adapter->netdev;
> +	struct pci_dev *pdev = adapter->pdev;
> +	int err = 0;
> +
> +	if (netif_running(netdev))
> +		igb_close(netdev);
> +
> +	igb_clear_interrupt_scheme(adapter);
> +
> +	if (igb_init_interrupt_scheme(adapter, true)) {
> +		dev_err(&pdev->dev, "Unable to allocate memory for queues\n");
> +		return -ENOMEM;
> +	}
> +
> +	if (netif_running(netdev))
> +		err = igb_open(netdev);
> +
> +	return err;
> +}
>  /* igb_main.c */

In case this fails, is the interface in a consistent state where is it
safe to reconfigure the interface again or to unbind the driver?

If it fails, and the interface was up, shouldn't it call dev_close() so
that it's obviously down and the user can then try to bring it up again?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH net 1/2] sit: allow to use rtnl ops on fb tunnel
From: Nicolas Dichtel @ 2013-10-01 16:04 UTC (permalink / raw)
  To: davem; +Cc: netdev, steffen.klassert, pshelar, Nicolas Dichtel
In-Reply-To: <524AB26B.1070700@6wind.com>

rtnl ops where introduced by ba3e3f50a0e5 ("sit: advertise tunnel param via
rtnl"), but I forget to assign rtnl ops to fb tunnels.

Now that it is done, we must remove the explicit call to
unregister_netdevice_queue(), because  the fallback tunnel is added to the queue
in sit_destroy_tunnels() when checking rtnl_link_ops of all netdevices (this
is valid since commit 5e6700b3bf98 ("sit: add support of x-netns")).

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/sit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index afd5605aea7c..19269453a8ea 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1666,6 +1666,7 @@ static int __net_init sit_init_net(struct net *net)
 		goto err_alloc_dev;
 	}
 	dev_net_set(sitn->fb_tunnel_dev, net);
+	sitn->fb_tunnel_dev->rtnl_link_ops = &sit_link_ops;
 	/* FB netdevice is special: we have one, and only one per netns.
 	 * Allowing to move it to another netns is clearly unsafe.
 	 */
@@ -1700,7 +1701,6 @@ static void __net_exit sit_exit_net(struct net *net)
 
 	rtnl_lock();
 	sit_destroy_tunnels(sitn, &list);
-	unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
-- 
1.8.2.1

^ permalink raw reply related

* [PATCH net 2/2] ip6tnl: allow to use rtnl ops on fb tunnel
From: Nicolas Dichtel @ 2013-10-01 16:05 UTC (permalink / raw)
  To: davem; +Cc: netdev, steffen.klassert, pshelar, Nicolas Dichtel
In-Reply-To: <1380643500-5018-1-git-send-email-nicolas.dichtel@6wind.com>

rtnl ops where introduced by c075b13098b3 ("ip6tnl: advertise tunnel param via
rtnl"), but I forget to assign rtnl ops to fb tunnels.

Now that it is done, we must remove the explicit call to
unregister_netdevice_queue(), because  the fallback tunnel is added to the queue
in ip6_tnl_destroy_tunnels() when checking rtnl_link_ops of all netdevices (this
is valid since commit 0bd8762824e7 ("ip6tnl: add x-netns support")).

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/ip6_tunnel.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 2d8f4829575b..a791552e0422 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1731,8 +1731,6 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n)
 		}
 	}
 
-	t = rtnl_dereference(ip6n->tnls_wc[0]);
-	unregister_netdevice_queue(t->dev, &list);
 	unregister_netdevice_many(&list);
 }
 
@@ -1752,6 +1750,7 @@ static int __net_init ip6_tnl_init_net(struct net *net)
 	if (!ip6n->fb_tnl_dev)
 		goto err_alloc_dev;
 	dev_net_set(ip6n->fb_tnl_dev, net);
+	ip6n->fb_tnl_dev->rtnl_link_ops = &ip6_link_ops;
 	/* FB netdevice is special: we have one, and only one per netns.
 	 * Allowing to move it to another netns is clearly unsafe.
 	 */
-- 
1.8.2.1

^ permalink raw reply related

* Re: Established sockets remain open after iface down or address lost
From: Chris Verges @ 2013-10-01 16:08 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, davem, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <524AEDD1.9010709@hp.com>

On Tue, Oct 01, 2013 at 08:44:17AM -0700, Rick Jones wrote:
> The protocol between client and server needs to have an
> application-layer "keepalive" mechanism added, and then the server
> will be able to detect a dangling connection without need of any
> further kernel modifications.
> 
> If that is not possible, the server can/should set SO_KEEPALIVE and
> perhaps tweak the TCP keepalive settings.  Not as good (IMO) as an
> application-layer keepalive because it only shows that the connection
> is good as far as TCP, but I suppose it could do in a pinch.

I agree that some form of keepalives would solve the problem where
blocking reads need to be interrupted.  However, this creates traffic
across the link -- directly proportional to the keepalive interval.

The underlying physical layer is such that we pay for all traffic going
across it -- including any keepalives at either the application or TCP
layers.  Paying for this keepalive traffic when the link is operational
is not desired.

Thanks for the suggestion.  I, too, am hoping that a kernel mod isn't
required to do what is being asked.  But I'm also willing to put in the
work if needed.  My hope on engaging with the netdev list early in this
process is to (1) figure out whether this has already been fully solved
and (2) determine whether there would be interest in this patch for
general use.

Thanks,
Chris

^ permalink raw reply

* [PATCH] pkt_sched: fq: rate limiting improvements
From: Eric Dumazet @ 2013-10-01 16:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Steinar H. Gunderson

From: Eric Dumazet <edumazet@google.com>

FQ rate limiting suffers from two problems, reported
by Steinar :

1) FQ enforces a delay when flow quantum is exhausted in order
to reduce cpu overhead. But if packets are small, current
delay computation is slightly wrong, and observed rates can
be too high.

Steinar had this problem because he disabled TSO and GSO,
and default FQ quantum is 2*1514.

(Of course, I wish recent TSO auto sizing changes will help
to not having to disable TSO in the first place)

2) maxrate was not used for forwarded flows (skbs not attached
to a socket)

Tested:

tc qdisc add dev eth0 root est 1sec 4sec fq maxrate 8Mbit
netperf -H lpq84 -l 1000 &
sleep 10 ; tc -s qdisc show dev eth0
qdisc fq 8003: root refcnt 32 limit 10000p flow_limit 100p buckets 1024
 quantum 3028 initial_quantum 15140 maxrate 8000Kbit 
 Sent 16819357 bytes 11258 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 7831Kbit 653pps backlog 7570b 5p requeues 0 
  44 flows (43 inactive, 1 throttled), next packet delay 2977352 ns
  0 gc, 0 highprio, 5545 throttled

lpq83:~# tcpdump -p -i eth0 host lpq84 -c 12
09:02:52.079484 IP lpq83 > lpq84: . 1389536928:1389538376(1448) ack 3808678021 win 457 <nop,nop,timestamp 961812 572609068>
09:02:52.079499 IP lpq83 > lpq84: . 1448:2896(1448) ack 1 win 457 <nop,nop,timestamp 961812 572609068>
09:02:52.079906 IP lpq84 > lpq83: . ack 2896 win 16384 <nop,nop,timestamp 572609080 961812>
09:02:52.082568 IP lpq83 > lpq84: . 2896:4344(1448) ack 1 win 457 <nop,nop,timestamp 961815 572609071>
09:02:52.082581 IP lpq83 > lpq84: . 4344:5792(1448) ack 1 win 457 <nop,nop,timestamp 961815 572609071>
09:02:52.083017 IP lpq84 > lpq83: . ack 5792 win 16384 <nop,nop,timestamp 572609083 961815>
09:02:52.085678 IP lpq83 > lpq84: . 5792:7240(1448) ack 1 win 457 <nop,nop,timestamp 961818 572609074>
09:02:52.085693 IP lpq83 > lpq84: . 7240:8688(1448) ack 1 win 457 <nop,nop,timestamp 961818 572609074>
09:02:52.086117 IP lpq84 > lpq83: . ack 8688 win 16384 <nop,nop,timestamp 572609086 961818>
09:02:52.088792 IP lpq83 > lpq84: . 8688:10136(1448) ack 1 win 457 <nop,nop,timestamp 961821 572609077>
09:02:52.088806 IP lpq83 > lpq84: . 10136:11584(1448) ack 1 win 457 <nop,nop,timestamp 961821 572609077>
09:02:52.089217 IP lpq84 > lpq83: . ack 11584 win 16384 <nop,nop,timestamp 572609090 961821>

Reported-by: Steinar H. Gunderson <sesse@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_fq.c |   45 ++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index fc6de56..a2fef8b 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -420,6 +420,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 	struct fq_flow_head *head;
 	struct sk_buff *skb;
 	struct fq_flow *f;
+	u32 rate;
 
 	skb = fq_dequeue_head(sch, &q->internal);
 	if (skb)
@@ -468,28 +469,34 @@ begin:
 	f->time_next_packet = now;
 	f->credit -= qdisc_pkt_len(skb);
 
-	if (f->credit <= 0 &&
-	    q->rate_enable &&
-	    skb->sk && skb->sk->sk_state != TCP_TIME_WAIT) {
-		u32 rate = skb->sk->sk_pacing_rate ?: q->flow_default_rate;
+	if (f->credit > 0 || !q->rate_enable)
+		goto out;
 
-		rate = min(rate, q->flow_max_rate);
-		if (rate) {
-			u64 len = (u64)qdisc_pkt_len(skb) * NSEC_PER_SEC;
-
-			do_div(len, rate);
-			/* Since socket rate can change later,
-			 * clamp the delay to 125 ms.
-			 * TODO: maybe segment the too big skb, as in commit
-			 * e43ac79a4bc ("sch_tbf: segment too big GSO packets")
-			 */
-			if (unlikely(len > 125 * NSEC_PER_MSEC)) {
-				len = 125 * NSEC_PER_MSEC;
-				q->stat_pkts_too_long++;
-			}
+	if (skb->sk && skb->sk->sk_state != TCP_TIME_WAIT) {
+		rate = skb->sk->sk_pacing_rate ?: q->flow_default_rate;
 
-			f->time_next_packet = now + len;
+		rate = min(rate, q->flow_max_rate);
+	} else {
+		rate = q->flow_max_rate;
+		if (rate == ~0U)
+			goto out;
+	}
+	if (rate) {
+		u32 plen = max(qdisc_pkt_len(skb), q->quantum);
+		u64 len = (u64)plen * NSEC_PER_SEC;
+
+		do_div(len, rate);
+		/* Since socket rate can change later,
+		 * clamp the delay to 125 ms.
+		 * TODO: maybe segment the too big skb, as in commit
+		 * e43ac79a4bc ("sch_tbf: segment too big GSO packets")
+		 */
+		if (unlikely(len > 125 * NSEC_PER_MSEC)) {
+			len = 125 * NSEC_PER_MSEC;
+			q->stat_pkts_too_long++;
 		}
+
+		f->time_next_packet = now + len;
 	}
 out:
 	qdisc_bstats_update(sch, skb);

^ permalink raw reply related

* Re: [PATCH v2] ll_temac: Reset dma descriptors indexes on ndo_open
From: David Miller @ 2013-10-01 16:32 UTC (permalink / raw)
  To: ricardo.ribalda; +Cc: joe, jg1.han, gregkh, wfp5p, netdev, linux-kernel
In-Reply-To: <1380608230-29183-1-git-send-email-ricardo.ribalda@gmail.com>

From: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Date: Tue,  1 Oct 2013 08:17:10 +0200

> The dma descriptors indexes are only initialized on the probe function.
> 
> If a packet is on the buffer when temac_stop is called, the dma
> descriptors indexes can be left on a incorrect state where no other
> package can be sent.
> 
> So an interface could be left in an usable state after ifdow/ifup.
> 
> This patch makes sure that the descriptors indexes are in a proper
> status when the device is open.
> 
> Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: Established sockets remain open after iface down or address lost
From: Alexey Kuznetsov @ 2013-10-01 16:33 UTC (permalink / raw)
  To: Chris Verges; +Cc: David Miller, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <20130926060433.GA9170@cverges-dev-lnx.sentient-energy.com>

Hello!

> P.S.  I apologize in advance if I missed this answer in the netdev archives.

FYI googling f.e. "netdev tcp remove local address" instantly  finds
all that you want to know. Namrly, subj: Re: "TCP shutdown behaviour
when deleting local IP addresses"

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox