* Re: [PATCH 7/18] netfilter: nfnetlink_log: Move away from NLMSG_PUT().
From: Ben Hutchings @ 2012-06-27 16:49 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20120626.220223.1090653207727010874.davem@davemloft.net>
On Tue, 2012-06-26 at 22:02 -0700, David Miller wrote:
> And use nlmsg_data() while we're here too.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> net/netfilter/nfnetlink_log.c | 29 ++++++++++++++++-------------
> 1 file changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
> index 3c3cfc0..169ab59 100644
> --- a/net/netfilter/nfnetlink_log.c
> +++ b/net/netfilter/nfnetlink_log.c
> @@ -326,18 +326,20 @@ __nfulnl_send(struct nfulnl_instance *inst)
> {
> int status = -1;
>
> - if (inst->qlen > 1)
> - NLMSG_PUT(inst->skb, 0, 0,
> - NLMSG_DONE,
> - sizeof(struct nfgenmsg));
> -
> + if (inst->qlen > 1) {
> + struct nlmsghdr *nlh = nlmsg_put(inst->skb, 0, 0,
> + NLMSG_DONE,
> + sizeof(struct nfgenmsg),
> + 0);
> + if (!nlh)
> + goto out;
> + }
> status = nfnetlink_unicast(inst->skb, &init_net, inst->peer_pid,
> MSG_DONTWAIT);
>
> inst->qlen = 0;
> inst->skb = NULL;
> -
> -nlmsg_failure:
> +out:
> return status;
> }
[...]
It looks like this also leaks the skb on failure. At least,
__nfulnl_flush(inst) is expected to dipose of inst->skb.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH v2] sctp: be more restrictive in transport selection on bundled sacks
From: Neil Horman @ 2012-06-27 17:28 UTC (permalink / raw)
To: Vlad Yasevich; +Cc: netdev, David S. Miller
In-Reply-To: <4FEB2262.1030803@gmail.com>
On Wed, Jun 27, 2012 at 11:10:26AM -0400, Vlad Yasevich wrote:
> On 06/27/2012 10:23 AM, Neil Horman wrote:
> >It was noticed recently that when we send data on a transport, its possible that
> >we might bundle a sack that arrived on a different transport. While this isn't
> >a major problem, it does go against the SHOULD requirement in section 6.4 of RFC
> >2960:
> >
> > An endpoint SHOULD transmit reply chunks (e.g., SACK, HEARTBEAT ACK,
> > etc.) to the same destination transport address from which it
> > received the DATA or control chunk to which it is replying. This
> > rule should also be followed if the endpoint is bundling DATA chunks
> > together with the reply chunk.
> >
> >This patch seeks to correct that. It restricts the bundling of sack operations
> >to only those transports which have moved the ctsn of the association forward
> >since the last sack. By doing this we guarantee that we only bundle outbound
> >saks on a transport that has received a chunk since the last sack. This brings
> >us into stricter compliance with the RFC.
> >
> >Vlad had initially suggested that we strictly allow only sack bundling on the
> >transport that last moved the ctsn forward. While this makes sense, I was
> >concerned that doing so prevented us from bundling in the case where we had
> >received chunks that moved the ctsn on multiple transports. In those cases, the
> >RFC allows us to select any of the transports having received chunks to bundle
> >the sack on. so I've modified the approach to allow for that, by adding a state
> >variable to each transport that tracks weather it has moved the ctsn since the
> >last sack. This I think keeps our behavior (and performance), close enough to
> >our current profile that I think we can do this without a sysctl knob to
> >enable/disable it.
> >
> >Signed-off-by: Neil Horman<nhorman@tuxdriver.com>
> >CC: Vlad Yaseivch<vyasevich@gmail.com>
> >CC: David S. Miller<davem@davemloft.net>
> >Reported-by: Michele Baldessari<michele@redhat.com>
> >Reported-by: sorin serban<sserban@redhat.com>
> >
> >---
> >Change Notes:
> >V2)
> > * Removed unused variable as per Dave M. Request
> > * Delayed rwnd adjustment until we are sure we will sack (Vlad Y.)
> >---
> > include/net/sctp/structs.h | 5 ++++-
> > include/net/sctp/tsnmap.h | 3 ++-
> > net/sctp/output.c | 10 +++++++---
> > net/sctp/sm_make_chunk.c | 7 +++++++
> > net/sctp/sm_sideeffect.c | 2 +-
> > net/sctp/tsnmap.c | 5 ++++-
> > net/sctp/ulpevent.c | 3 ++-
> > net/sctp/ulpqueue.c | 2 +-
> > 8 files changed, 28 insertions(+), 9 deletions(-)
> >
> >diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> >index e4652fe..712bf09 100644
> >--- a/include/net/sctp/structs.h
> >+++ b/include/net/sctp/structs.h
> >@@ -910,7 +910,10 @@ struct sctp_transport {
> > pmtu_pending:1,
> >
> > /* Is this structure kfree()able? */
> >- malloced:1;
> >+ malloced:1,
> >+
> >+ /* Has this transport moved the ctsn since we last sacked */
> >+ moved_ctsn:1;
> >
> > struct flowi fl;
> >
> >diff --git a/include/net/sctp/tsnmap.h b/include/net/sctp/tsnmap.h
> >index e7728bc..2c5d2b4 100644
> >--- a/include/net/sctp/tsnmap.h
> >+++ b/include/net/sctp/tsnmap.h
> >@@ -117,7 +117,8 @@ void sctp_tsnmap_free(struct sctp_tsnmap *map);
> > int sctp_tsnmap_check(const struct sctp_tsnmap *, __u32 tsn);
> >
> > /* Mark this TSN as seen. */
> >-int sctp_tsnmap_mark(struct sctp_tsnmap *, __u32 tsn);
> >+int sctp_tsnmap_mark(struct sctp_tsnmap *, __u32 tsn,
> >+ struct sctp_transport *trans);
> >
> > /* Mark this TSN and all lower as seen. */
> > void sctp_tsnmap_skip(struct sctp_tsnmap *map, __u32 tsn);
> >diff --git a/net/sctp/output.c b/net/sctp/output.c
> >index f1b7d4b..6c8cb9e 100644
> >--- a/net/sctp/output.c
> >+++ b/net/sctp/output.c
> >@@ -240,17 +240,21 @@ static sctp_xmit_t sctp_packet_bundle_sack(struct sctp_packet *pkt,
> > */
> > if (sctp_chunk_is_data(chunk)&& !pkt->has_sack&&
> > !pkt->has_cookie_echo) {
> >- struct sctp_association *asoc;
> > struct timer_list *timer;
> >- asoc = pkt->transport->asoc;
> >+ struct sctp_association *asoc = pkt->transport->asoc;
> >+
> > timer =&asoc->timers[SCTP_EVENT_TIMEOUT_SACK];
> >
> > /* If the SACK timer is running, we have a pending SACK */
> > if (timer_pending(timer)) {
> > struct sctp_chunk *sack;
> >- asoc->a_rwnd = asoc->rwnd;
> >+
> >+ if (chunk->transport&& !chunk->transport->moved_ctsn)
> >+ return retval;
> >+
>
> I didn't think of this yesterday, but I think it would be much
> better to use pkt->transport here since you are adding the chunk to
> the packet and it will go out on the transport of the packet. You
> are also guaranteed that pkt->transport is set.
>
I don't think it really matters, as the chunk transport is used to lookup the
packet that we append to, and if the chunk transport is unset, its somewhat
questionable as to weather we should bundle, but if packet->transport is set,
its probably worth it to avoid the extra conditional.
> > sack = sctp_make_sack(asoc);
> > if (sack) {
> >+ asoc->a_rwnd = asoc->rwnd;
> > retval = sctp_packet_append_chunk(pkt, sack);
> > asoc->peer.sack_needed = 0;
> > if (del_timer(timer))
> >diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> >index a85eeeb..805babe 100644
> >--- a/net/sctp/sm_make_chunk.c
> >+++ b/net/sctp/sm_make_chunk.c
> >@@ -736,6 +736,7 @@ struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc)
> > __u16 num_gabs, num_dup_tsns;
> > struct sctp_tsnmap *map = (struct sctp_tsnmap *)&asoc->peer.tsn_map;
> > struct sctp_gap_ack_block gabs[SCTP_MAX_GABS];
> >+ struct sctp_transport *trans;
> >
> > memset(gabs, 0, sizeof(gabs));
> > ctsn = sctp_tsnmap_get_ctsn(map);
> >@@ -805,6 +806,12 @@ struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc)
> > sctp_addto_chunk(retval, sizeof(__u32) * num_dup_tsns,
> > sctp_tsnmap_get_dups(map));
> >
> >+ /*
> >+ * Once we have a sack generated, clear the moved_tsn information
> >+ * from all the transports
> >+ */
> >+ list_for_each_entry(trans,&asoc->peer.transport_addr_list, transports)
> >+ trans->moved_ctsn = 0;
> > nodata:
> > return retval;
> > }
> >diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> >index c96d1a8..8716da1 100644
> >--- a/net/sctp/sm_sideeffect.c
> >+++ b/net/sctp/sm_sideeffect.c
> >@@ -1268,7 +1268,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
> > case SCTP_CMD_REPORT_TSN:
> > /* Record the arrival of a TSN. */
> > error = sctp_tsnmap_mark(&asoc->peer.tsn_map,
> >- cmd->obj.u32);
> >+ cmd->obj.u32, NULL);
> > break;
> >
> > case SCTP_CMD_REPORT_FWDTSN:
> >diff --git a/net/sctp/tsnmap.c b/net/sctp/tsnmap.c
> >index f1e40ceb..619c638 100644
> >--- a/net/sctp/tsnmap.c
> >+++ b/net/sctp/tsnmap.c
> >@@ -114,7 +114,8 @@ int sctp_tsnmap_check(const struct sctp_tsnmap *map, __u32 tsn)
> >
> >
> > /* Mark this TSN as seen. */
> >-int sctp_tsnmap_mark(struct sctp_tsnmap *map, __u32 tsn)
> >+int sctp_tsnmap_mark(struct sctp_tsnmap *map, __u32 tsn,
> >+ struct sctp_transport *trans)
> > {
> > u16 gap;
> >
> >@@ -133,6 +134,8 @@ int sctp_tsnmap_mark(struct sctp_tsnmap *map, __u32 tsn)
> > */
> > map->max_tsn_seen++;
> > map->cumulative_tsn_ack_point++;
> >+ if (trans)
> >+ trans->moved_ctsn = 1;
> > map->base_tsn++;
> > } else {
> > /* Either we already have a gap, or about to record a gap, so
> >diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
> >index 8a84017..33d8947 100644
> >--- a/net/sctp/ulpevent.c
> >+++ b/net/sctp/ulpevent.c
> >@@ -715,7 +715,8 @@ struct sctp_ulpevent *sctp_ulpevent_make_rcvmsg(struct sctp_association *asoc,
> > * can mark it as received so the tsn_map is updated correctly.
> > */
> > if (sctp_tsnmap_mark(&asoc->peer.tsn_map,
> >- ntohl(chunk->subh.data_hdr->tsn)))
> >+ ntohl(chunk->subh.data_hdr->tsn),
> >+ chunk->transport))
> > goto fail_mark;
> >
> > /* First calculate the padding, so we don't inadvertently
> >diff --git a/net/sctp/ulpqueue.c b/net/sctp/ulpqueue.c
> >index f2d1de7..f5a6a4f 100644
> >--- a/net/sctp/ulpqueue.c
> >+++ b/net/sctp/ulpqueue.c
> >@@ -1051,7 +1051,7 @@ void sctp_ulpq_renege(struct sctp_ulpq *ulpq, struct sctp_chunk *chunk,
> > if (chunk&& (freed>= needed)) {
> > __u32 tsn;
> > tsn = ntohl(chunk->subh.data_hdr->tsn);
> >- sctp_tsnmap_mark(&asoc->peer.tsn_map, tsn);
> >+ sctp_tsnmap_mark(&asoc->peer.tsn_map, tsn, chunk->transport);
> > sctp_ulpq_tail_data(ulpq, chunk, gfp);
> >
> > sctp_ulpq_partial_delivery(ulpq, chunk, gfp);
>
> Also, I think you need to reset this bit in sctp_transport_reset().
> Consider a potential association restart after SACKs have been
> received.
>
Yeah, thats true. I'll add that in.
Thanks!
Neil
> -vlad
>
^ permalink raw reply
* [PATCH v2 0/4] netdev/phy: 10G PHY support.
From: David Daney @ 2012-06-27 17:33 UTC (permalink / raw)
To: Grant Likely, Rob Herring,
devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ, David S. Miller,
netdev-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mips-6z/3iImG2C8G8FEW9MqTrA,
afleming-Re5JQEeQqe8AvxtiuMwx3w,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, David Daney
From: David Daney <david.daney-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org>
The only non-cosmetic change from v1 is to pass an additional argument
to get_phy_device() that indicates that the PHY uses 802.3 clause 45
signaling, previously I had been using a high order bit of the addr
parameter for this.
There are also changes from v1 in the code and comment formatting.
These should now be closer to what David Miller prefers.
>From v1:
The existing PHY driver infrastructure supports IEEE 802.3 Clause 22
PHYs used with 10/100/1000MB Ethernet. For 10G Ethernet, many PHYs
use 802.3 Clause 45. These patches attempt to add core support for
this as well as a driver for BCM87XX 10G PHY devices.
This is reworked from patches I send about 9 months ago:
http://marc.info/?l=linux-netdev&m=131844282403852
Several of the patches have device tree bindings in them, so the
device tree people get to enjoy them too.
David Daney (4):
netdev/phy: Handle IEEE802.3 clause 45 Ethernet PHYs
netdev/phy/of: Handle IEEE802.3 clause 45 Ethernet PHYs in
of_mdiobus_register()
netdev/phy/of: Add more methods for binding PHY devices to drivers.
netdev/phy: Add driver for Broadcom BCM87XX 10G Ethernet PHYs
.../devicetree/bindings/net/broadcom-bcm87xx.txt | 29 +++
Documentation/devicetree/bindings/net/phy.txt | 12 +-
drivers/net/phy/Kconfig | 5 +
drivers/net/phy/Makefile | 1 +
drivers/net/phy/bcm87xx.c | 238 ++++++++++++++++++++
drivers/net/phy/mdio_bus.c | 9 +-
drivers/net/phy/phy_device.c | 105 ++++++++-
drivers/of/of_mdio.c | 16 +-
include/linux/phy.h | 24 ++-
9 files changed, 424 insertions(+), 15 deletions(-)
create mode 100644 Documentation/devicetree/bindings/net/broadcom-bcm87xx.txt
create mode 100644 drivers/net/phy/bcm87xx.c
--
1.7.2.3
^ permalink raw reply
* [PATCH v2 1/4] netdev/phy: Handle IEEE802.3 clause 45 Ethernet PHYs
From: David Daney @ 2012-06-27 17:33 UTC (permalink / raw)
To: Grant Likely, Rob Herring, devicetree-discuss, David S. Miller,
netdev
Cc: linux-kernel, linux-mips, afleming, David Daney
In-Reply-To: <1340818418-10382-1-git-send-email-ddaney.cavm@gmail.com>
From: David Daney <david.daney@cavium.com>
The IEEE802.3 clause 45 MDIO bus protocol allows for directly
addressing PHY registers using a 21 bit address, and is used by many
10G Ethernet PHYS. Already existing is the ability of MDIO bus
drivers to use clause 45, with the MII_ADDR_C45 flag. Here we add
struct phy_c45_device_ids to hold the device identifier registers
present in clause 45. struct phy_device gets a couple of new fields:
c45_ids to hold the identifiers and is_c45 to signal that it is clause
45.
get_phy_device() gets a new parameter is_c45 to indicate that the PHY
device should use the clause 45 protocol, and its callers are adjusted
to pass false. The follow-on patch to of_mdio.c will pass true where
appropriate.
EXPORT phy_device_create() so that the follow-on patch to of_mdio.c
can use it to create phy devices for PHYs, that have non-standard
device identifier registers, based on the device tree bindings.
Signed-off-by: David Daney <david.daney@cavium.com>
---
drivers/net/phy/mdio_bus.c | 2 +-
drivers/net/phy/phy_device.c | 105 ++++++++++++++++++++++++++++++++++++++---
drivers/of/of_mdio.c | 2 +-
include/linux/phy.h | 18 +++++++-
4 files changed, 116 insertions(+), 11 deletions(-)
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 31470b0..2cee6d2 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -232,7 +232,7 @@ struct phy_device *mdiobus_scan(struct mii_bus *bus, int addr)
struct phy_device *phydev;
int err;
- phydev = get_phy_device(bus, addr);
+ phydev = get_phy_device(bus, addr, false);
if (IS_ERR(phydev) || phydev == NULL)
return phydev;
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 18ab0da..ef4cdee 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -152,8 +152,8 @@ int phy_scan_fixups(struct phy_device *phydev)
}
EXPORT_SYMBOL(phy_scan_fixups);
-static struct phy_device* phy_device_create(struct mii_bus *bus,
- int addr, int phy_id)
+struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
+ bool is_c45, struct phy_c45_device_ids *c45_ids)
{
struct phy_device *dev;
@@ -174,8 +174,11 @@ static struct phy_device* phy_device_create(struct mii_bus *bus,
dev->autoneg = AUTONEG_ENABLE;
+ dev->is_c45 = is_c45;
dev->addr = addr;
dev->phy_id = phy_id;
+ if (c45_ids)
+ dev->c45_ids = *c45_ids;
dev->bus = bus;
dev->dev.parent = bus->parent;
dev->dev.bus = &mdio_bus_type;
@@ -200,20 +203,99 @@ static struct phy_device* phy_device_create(struct mii_bus *bus,
return dev;
}
+EXPORT_SYMBOL(phy_device_create);
+
+/**
+ * get_phy_c45_ids - reads the specified addr for its 802.3-c45 IDs.
+ * @bus: the target MII bus
+ * @addr: PHY address on the MII bus
+ * @phy_id: where to store the ID retrieved.
+ * @c45_ids: where to store the c45 ID information.
+ *
+ * If the PHY devices-in-package appears to be valid, it and the
+ * corresponding identifiers are stored in @c45_ids, zero is stored
+ * in @phy_id. Otherwise 0xffffffff is stored in @phy_id. Returns
+ * zero on success.
+ *
+ */
+static int get_phy_c45_ids(struct mii_bus *bus, int addr, u32 *phy_id,
+ struct phy_c45_device_ids *c45_ids) {
+ int phy_reg;
+ int i, reg_addr;
+ const int num_ids = ARRAY_SIZE(c45_ids->device_ids);
+
+ /* Find first non-zero Devices In package. Device
+ * zero is reserved, so don't probe it.
+ */
+ for (i = 1;
+ i < num_ids && c45_ids->devices_in_package == 0;
+ i++) {
+ reg_addr = MII_ADDR_C45 | i << 16 | 6;
+ phy_reg = mdiobus_read(bus, addr, reg_addr);
+ if (phy_reg < 0)
+ return -EIO;
+ c45_ids->devices_in_package = (phy_reg & 0xffff) << 16;
+
+ reg_addr = MII_ADDR_C45 | i << 16 | 5;
+ phy_reg = mdiobus_read(bus, addr, reg_addr);
+ if (phy_reg < 0)
+ return -EIO;
+ c45_ids->devices_in_package |= (phy_reg & 0xffff);
+
+ /* If mostly Fs, there is no device there,
+ * let's get out of here.
+ */
+ if ((c45_ids->devices_in_package & 0x1fffffff) == 0x1fffffff) {
+ *phy_id = 0xffffffff;
+ return 0;
+ }
+ }
+
+ /* Now probe Device Identifiers for each device present. */
+ for (i = 1; i < num_ids; i++) {
+ if (!(c45_ids->devices_in_package & (1 << i)))
+ continue;
+
+ reg_addr = MII_ADDR_C45 | i << 16 | MII_PHYSID1;
+ phy_reg = mdiobus_read(bus, addr, reg_addr);
+ if (phy_reg < 0)
+ return -EIO;
+ c45_ids->device_ids[i] = (phy_reg & 0xffff) << 16;
+
+ reg_addr = MII_ADDR_C45 | i << 16 | MII_PHYSID2;
+ phy_reg = mdiobus_read(bus, addr, reg_addr);
+ if (phy_reg < 0)
+ return -EIO;
+ c45_ids->device_ids[i] |= (phy_reg & 0xffff);
+ }
+ *phy_id = 0;
+ return 0;
+}
/**
* get_phy_id - reads the specified addr for its ID.
* @bus: the target MII bus
* @addr: PHY address on the MII bus
* @phy_id: where to store the ID retrieved.
+ * @is_c45: If true the PHY uses the 802.3 clause 45 protocol
+ * @c45_ids: where to store the c45 ID information.
+ *
+ * Description: In the case of a 802.3-c22 PHY, reads the ID registers
+ * of the PHY at @addr on the @bus, stores it in @phy_id and returns
+ * zero on success.
+ *
+ * In the case of a 802.3-c45 PHY, get_phy_c45_ids() is invoked, and
+ * its return value is in turn returned.
*
- * Description: Reads the ID registers of the PHY at @addr on the
- * @bus, stores it in @phy_id and returns zero on success.
*/
-static int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id)
+static int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id,
+ bool is_c45, struct phy_c45_device_ids *c45_ids)
{
int phy_reg;
+ if (is_c45)
+ return get_phy_c45_ids(bus, addr, phy_id, c45_ids);
+
/* Grab the bits from PHYIR1, and put them
* in the upper half */
phy_reg = mdiobus_read(bus, addr, MII_PHYSID1);
@@ -238,17 +320,19 @@ static int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id)
* get_phy_device - reads the specified PHY device and returns its @phy_device struct
* @bus: the target MII bus
* @addr: PHY address on the MII bus
+ * @is_c45: If true the PHY uses the 802.3 clause 45 protocol
*
* Description: Reads the ID registers of the PHY at @addr on the
* @bus, then allocates and returns the phy_device to represent it.
*/
-struct phy_device * get_phy_device(struct mii_bus *bus, int addr)
+struct phy_device *get_phy_device(struct mii_bus *bus, int addr, bool is_c45)
{
struct phy_device *dev = NULL;
u32 phy_id;
+ struct phy_c45_device_ids c45_ids = {0};
int r;
- r = get_phy_id(bus, addr, &phy_id);
+ r = get_phy_id(bus, addr, &phy_id, is_c45, &c45_ids);
if (r)
return ERR_PTR(r);
@@ -256,7 +340,7 @@ struct phy_device * get_phy_device(struct mii_bus *bus, int addr)
if ((phy_id & 0x1fffffff) == 0x1fffffff)
return NULL;
- dev = phy_device_create(bus, addr, phy_id);
+ dev = phy_device_create(bus, addr, phy_id, is_c45, &c45_ids);
return dev;
}
@@ -449,6 +533,11 @@ static int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
/* Assume that if there is no driver, that it doesn't
* exist, and we should use the genphy driver. */
if (NULL == d->driver) {
+ if (phydev->is_c45) {
+ pr_err("No driver for phy %x\n", phydev->phy_id);
+ return -ENODEV;
+ }
+
d->driver = &genphy_driver.driver;
err = d->driver->probe(d);
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 2574abd..6c24cad 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -79,7 +79,7 @@ int of_mdiobus_register(struct mii_bus *mdio, struct device_node *np)
mdio->irq[addr] = PHY_POLL;
}
- phy = get_phy_device(mdio, addr);
+ phy = get_phy_device(mdio, addr, false);
if (!phy || IS_ERR(phy)) {
dev_err(&mdio->dev, "error probing PHY at address %i\n",
addr);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index c291cae..597d05d 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -243,6 +243,15 @@ enum phy_state {
PHY_RESUMING
};
+/**
+ * struct phy_c45_device_ids - 802.3-c45 Device Identifiers
+ * @devices_in_package: Bit vector of devices present.
+ * @device_ids: The device identifer for each present device.
+ */
+struct phy_c45_device_ids {
+ u32 devices_in_package;
+ u32 device_ids[8];
+};
/* phy_device: An instance of a PHY
*
@@ -250,6 +259,8 @@ enum phy_state {
* bus: Pointer to the bus this PHY is on
* dev: driver model device structure for this PHY
* phy_id: UID for this device found during discovery
+ * c45_ids: 802.3-c45 Device Identifers if is_c45.
+ * is_c45: Set to true if this phy uses clause 45 addressing.
* state: state of the PHY for management purposes
* dev_flags: Device-specific flags used by the PHY driver.
* addr: Bus address of PHY
@@ -285,6 +296,9 @@ struct phy_device {
u32 phy_id;
+ struct phy_c45_device_ids c45_ids;
+ bool is_c45;
+
enum phy_state state;
u32 dev_flags;
@@ -480,7 +494,9 @@ static inline int phy_write(struct phy_device *phydev, u32 regnum, u16 val)
return mdiobus_write(phydev->bus, phydev->addr, regnum, val);
}
-struct phy_device* get_phy_device(struct mii_bus *bus, int addr);
+struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
+ bool is_c45, struct phy_c45_device_ids *c45_ids);
+struct phy_device *get_phy_device(struct mii_bus *bus, int addr, bool is_c45);
int phy_device_register(struct phy_device *phy);
int phy_init_hw(struct phy_device *phydev);
struct phy_device * phy_attach(struct net_device *dev,
--
1.7.2.3
^ permalink raw reply related
* [PATCH v2 2/4] netdev/phy/of: Handle IEEE802.3 clause 45 Ethernet PHYs in of_mdiobus_register()
From: David Daney @ 2012-06-27 17:33 UTC (permalink / raw)
To: Grant Likely, Rob Herring, devicetree-discuss, David S. Miller,
netdev
Cc: linux-kernel, linux-mips, afleming, David Daney
In-Reply-To: <1340818418-10382-1-git-send-email-ddaney.cavm@gmail.com>
From: David Daney <david.daney@cavium.com>
Define two new "compatible" values for Ethernet
PHYs. "ethernet-phy-ieee802.3-c22" and "ethernet-phy-ieee802.3-c45"
are used to indicate a PHY uses the corresponding protocol.
If a PHY is "compatible" with "ethernet-phy-ieee802.3-c45", we
indicate this so that get_phy_device() can properly probe the device.
If get_phy_device() fails, it was probably due to failing the probe of
the PHY identifier registers. Since we have the device tree telling
us the PHY exists, go ahead and add it anyhow with a phy_id of zero.
There may be a driver match based on the "compatible" property.
Signed-off-by: David Daney <david.daney@cavium.com>
---
Documentation/devicetree/bindings/net/phy.txt | 12 +++++++++++-
drivers/of/of_mdio.c | 16 ++++++++++++----
2 files changed, 23 insertions(+), 5 deletions(-)
diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt
index bb8c742..7cd18fb 100644
--- a/Documentation/devicetree/bindings/net/phy.txt
+++ b/Documentation/devicetree/bindings/net/phy.txt
@@ -14,10 +14,20 @@ Required properties:
- linux,phandle : phandle for this node; likely referenced by an
ethernet controller node.
+Optional Properties:
+
+- compatible: Compatible list, may contain
+ "ethernet-phy-ieee802.3-c22" or "ethernet-phy-ieee802.3-c45" for
+ PHYs that implement IEEE802.3 clause 22 or IEEE802.3 clause 45
+ specifications. If neither of these are specified, the default is to
+ assume clause 22. The compatible list may also contain other
+ elements.
+
Example:
ethernet-phy@0 {
- linux,phandle = <2452000>
+ compatible = "ethernet-phy-ieee802.3-c22";
+ linux,phandle = <2452000>;
interrupt-parent = <40000>;
interrupts = <35 1>;
reg = <0>;
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 6c24cad..8e6c25f 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -57,6 +57,7 @@ int of_mdiobus_register(struct mii_bus *mdio, struct device_node *np)
const __be32 *paddr;
u32 addr;
int len;
+ bool is_c45;
/* A PHY must have a reg property in the range [0-31] */
paddr = of_get_property(child, "reg", &len);
@@ -79,11 +80,18 @@ int of_mdiobus_register(struct mii_bus *mdio, struct device_node *np)
mdio->irq[addr] = PHY_POLL;
}
- phy = get_phy_device(mdio, addr, false);
+ is_c45 = of_device_is_compatible(child,
+ "ethernet-phy-ieee802.3-c45");
+ phy = get_phy_device(mdio, addr, is_c45);
+
if (!phy || IS_ERR(phy)) {
- dev_err(&mdio->dev, "error probing PHY at address %i\n",
- addr);
- continue;
+ phy = phy_device_create(mdio, addr, 0, false, NULL);
+ if (!phy || IS_ERR(phy)) {
+ dev_err(&mdio->dev,
+ "error creating PHY at address %i\n",
+ addr);
+ continue;
+ }
}
/* Associate the OF node with the device structure so it
--
1.7.2.3
^ permalink raw reply related
* [PATCH v2 3/4] netdev/phy/of: Add more methods for binding PHY devices to drivers.
From: David Daney @ 2012-06-27 17:33 UTC (permalink / raw)
To: Grant Likely, Rob Herring, devicetree-discuss, David S. Miller,
netdev
Cc: linux-kernel, linux-mips, afleming, David Daney
In-Reply-To: <1340818418-10382-1-git-send-email-ddaney.cavm@gmail.com>
From: David Daney <david.daney@cavium.com>
Allow PHY drivers to supply their own device matching function
(match_phy_device()), or to be matched OF compatible properties.
PHYs following IEEE802.3 clause 45 have more than one device
identifier constants, which breaks the default device matching code.
Other 10G PHYs don't follow the standard manufacturer/device
identifier register layout standards, but they do use the standard
MDIO bus protocols for register access. Both of these require
adjustments to the PHY driver to device matching code.
If the there is an of_node associated with such a PHY, we can match it
to its driver using the "compatible" properties, just as we do with
certain platform devices. If the "compatible" property match fails,
first check if there is a driver supplied matching function, and if
not fall back to the existing identifier matching rules.
Signed-off-by: David Daney <david.daney@cavium.com>
---
drivers/net/phy/mdio_bus.c | 7 +++++++
include/linux/phy.h | 6 ++++++
2 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 2cee6d2..170eb41 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -25,6 +25,7 @@
#include <linux/init.h>
#include <linux/delay.h>
#include <linux/device.h>
+#include <linux/of_device.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>
@@ -308,6 +309,12 @@ static int mdio_bus_match(struct device *dev, struct device_driver *drv)
struct phy_device *phydev = to_phy_device(dev);
struct phy_driver *phydrv = to_phy_driver(drv);
+ if (of_driver_match_device(dev, drv))
+ return 1;
+
+ if (phydrv->match_phy_device)
+ return phydrv->match_phy_device(phydev);
+
return ((phydrv->phy_id & phydrv->phy_id_mask) ==
(phydev->phy_id & phydrv->phy_id_mask));
}
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 597d05d..7eac80a 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -426,6 +426,12 @@ struct phy_driver {
/* Clears up any memory if needed */
void (*remove)(struct phy_device *phydev);
+ /* Returns true if this is a suitable driver for the given
+ * phydev. If NULL, matching is based on phy_id and
+ * phy_id_mask.
+ */
+ int (*match_phy_device)(struct phy_device *phydev);
+
/* Handles ethtool queries for hardware time stamping. */
int (*ts_info)(struct phy_device *phydev, struct ethtool_ts_info *ti);
--
1.7.2.3
^ permalink raw reply related
* [PATCH v2 4/4] netdev/phy: Add driver for Broadcom BCM87XX 10G Ethernet PHYs
From: David Daney @ 2012-06-27 17:33 UTC (permalink / raw)
To: Grant Likely, Rob Herring, devicetree-discuss, David S. Miller,
netdev
Cc: linux-kernel, linux-mips, afleming, David Daney
In-Reply-To: <1340818418-10382-1-git-send-email-ddaney.cavm@gmail.com>
From: David Daney <david.daney@cavium.com>
Add a driver for BCM8706 and BCM8727 devices. These are a 10Gig PHYs
which use MII_ADDR_C45 addressing. They are always 10G full duplex, so
there is no autonegotiation. All we do is report link state and send
interrupts when it changes.
If the PHY has a device tree of_node associated with it, the
"broadcom,c45-reg-init" property is used to supply register
initialization values when config_init() is called.
Signed-off-by: David Daney <david.daney@cavium.com>
---
.../devicetree/bindings/net/broadcom-bcm87xx.txt | 29 +++
drivers/net/phy/Kconfig | 5 +
drivers/net/phy/Makefile | 1 +
drivers/net/phy/bcm87xx.c | 238 ++++++++++++++++++++
4 files changed, 273 insertions(+), 0 deletions(-)
create mode 100644 Documentation/devicetree/bindings/net/broadcom-bcm87xx.txt
create mode 100644 drivers/net/phy/bcm87xx.c
diff --git a/Documentation/devicetree/bindings/net/broadcom-bcm87xx.txt b/Documentation/devicetree/bindings/net/broadcom-bcm87xx.txt
new file mode 100644
index 0000000..7c86d5e
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/broadcom-bcm87xx.txt
@@ -0,0 +1,29 @@
+The Broadcom BCM87XX devices are a family of 10G Ethernet PHYs. They
+have these bindings in addition to the standard PHY bindings.
+
+Compatible: Should contain "broadcom,bcm8706" or "broadcom,bcm8727" and
+ "ethernet-phy-ieee802.3-c45"
+
+Optional Properties:
+
+- broadcom,c45-reg-init : one of more sets of 4 cells. The first cell
+ is the MDIO Manageable Device (MMD) address, the second a register
+ address within the MMD, the third cell contains a mask to be ANDed
+ with the existing register value, and the fourth cell is ORed with
+ he result to yield the new register value. If the third cell has a
+ value of zero, no read of the existing value is performed.
+
+Example:
+
+ ethernet-phy@5 {
+ reg = <5>;
+ compatible = "broadcom,bcm8706", "ethernet-phy-ieee802.3-c45";
+ interrupt-parent = <&gpio>;
+ interrupts = <12 8>; /* Pin 12, active low */
+ /*
+ * Set PMD Digital Control Register for
+ * GPIO[1] Tx/Rx
+ * GPIO[0] R64 Sync Acquired
+ */
+ broadcom,c45-reg-init = <1 0xc808 0xff8f 0x70>;
+ };
diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 944cdfb..3090dc6 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -67,6 +67,11 @@ config BCM63XX_PHY
---help---
Currently supports the 6348 and 6358 PHYs.
+config BCM87XX_PHY
+ tristate "Driver for Broadcom BCM8706 and BCM8727 PHYs"
+ help
+ Currently supports the BCM8706 and BCM8727 10G Ethernet PHYs.
+
config ICPLUS_PHY
tristate "Drivers for ICPlus PHYs"
---help---
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index f51af68..6d2dc6c 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_SMSC_PHY) += smsc.o
obj-$(CONFIG_VITESSE_PHY) += vitesse.o
obj-$(CONFIG_BROADCOM_PHY) += broadcom.o
obj-$(CONFIG_BCM63XX_PHY) += bcm63xx.o
+obj-$(CONFIG_BCM87XX_PHY) += bcm87xx.o
obj-$(CONFIG_ICPLUS_PHY) += icplus.o
obj-$(CONFIG_REALTEK_PHY) += realtek.o
obj-$(CONFIG_LSI_ET1011C_PHY) += et1011c.o
diff --git a/drivers/net/phy/bcm87xx.c b/drivers/net/phy/bcm87xx.c
new file mode 100644
index 0000000..f5f0562
--- /dev/null
+++ b/drivers/net/phy/bcm87xx.c
@@ -0,0 +1,238 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2011 - 2012 Cavium, Inc.
+ */
+
+#include <linux/module.h>
+#include <linux/phy.h>
+#include <linux/of.h>
+
+#define PHY_ID_BCM8706 0x0143bdc1
+#define PHY_ID_BCM8727 0x0143bff0
+
+#define BCM87XX_PMD_RX_SIGNAL_DETECT (MII_ADDR_C45 | 0x1000a)
+#define BCM87XX_10GBASER_PCS_STATUS (MII_ADDR_C45 | 0x30020)
+#define BCM87XX_XGXS_LANE_STATUS (MII_ADDR_C45 | 0x40018)
+
+#define BCM87XX_LASI_CONTROL (MII_ADDR_C45 | 0x39002)
+#define BCM87XX_LASI_STATUS (MII_ADDR_C45 | 0x39005)
+
+#if IS_ENABLED(CONFIG_OF_MDIO)
+/* Set and/or override some configuration registers based on the
+ * marvell,reg-init property stored in the of_node for the phydev.
+ *
+ * broadcom,c45-reg-init = <devid reg mask value>,...;
+ *
+ * There may be one or more sets of <devid reg mask value>:
+ *
+ * devid: which sub-device to use.
+ * reg: the register.
+ * mask: if non-zero, ANDed with existing register value.
+ * value: ORed with the masked value and written to the regiser.
+ *
+ */
+static int bcm87xx_of_reg_init(struct phy_device *phydev)
+{
+ const __be32 *paddr;
+ const __be32 *paddr_end;
+ int len, ret;
+
+ if (!phydev->dev.of_node)
+ return 0;
+
+ paddr = of_get_property(phydev->dev.of_node,
+ "broadcom,c45-reg-init", &len);
+ if (!paddr)
+ return 0;
+
+ paddr_end = paddr + (len /= sizeof(*paddr));
+
+ ret = 0;
+
+ while (paddr + 3 < paddr_end) {
+ u16 devid = be32_to_cpup(paddr++);
+ u16 reg = be32_to_cpup(paddr++);
+ u16 mask = be32_to_cpup(paddr++);
+ u16 val_bits = be32_to_cpup(paddr++);
+ int val;
+ u32 regnum = MII_ADDR_C45 | (devid << 16) | reg;
+ val = 0;
+ if (mask) {
+ val = phy_read(phydev, regnum);
+ if (val < 0) {
+ ret = val;
+ goto err;
+ }
+ val &= mask;
+ }
+ val |= val_bits;
+
+ ret = phy_write(phydev, regnum, val);
+ if (ret < 0)
+ goto err;
+ }
+err:
+ return ret;
+}
+#else
+static int bcm87xx_of_reg_init(struct phy_device *phydev)
+{
+ return 0;
+}
+#endif /* CONFIG_OF_MDIO */
+
+static int bcm87xx_config_init(struct phy_device *phydev)
+{
+ phydev->supported = SUPPORTED_10000baseR_FEC;
+ phydev->advertising = ADVERTISED_10000baseR_FEC;
+ phydev->state = PHY_NOLINK;
+
+ bcm87xx_of_reg_init(phydev);
+
+ return 0;
+}
+
+static int bcm87xx_config_aneg(struct phy_device *phydev)
+{
+ return -EINVAL;
+}
+
+static int bcm87xx_read_status(struct phy_device *phydev)
+{
+ int rx_signal_detect;
+ int pcs_status;
+ int xgxs_lane_status;
+
+ rx_signal_detect = phy_read(phydev, BCM87XX_PMD_RX_SIGNAL_DETECT);
+ if (rx_signal_detect < 0)
+ return rx_signal_detect;
+
+ if ((rx_signal_detect & 1) == 0)
+ goto no_link;
+
+ pcs_status = phy_read(phydev, BCM87XX_10GBASER_PCS_STATUS);
+ if (pcs_status < 0)
+ return pcs_status;
+
+ if ((pcs_status & 1) == 0)
+ goto no_link;
+
+ xgxs_lane_status = phy_read(phydev, BCM87XX_XGXS_LANE_STATUS);
+ if (xgxs_lane_status < 0)
+ return xgxs_lane_status;
+
+ if ((xgxs_lane_status & 0x1000) == 0)
+ goto no_link;
+
+ phydev->speed = 10000;
+ phydev->link = 1;
+ phydev->duplex = 1;
+ return 0;
+
+no_link:
+ phydev->link = 0;
+ return 0;
+}
+
+static int bcm87xx_config_intr(struct phy_device *phydev)
+{
+ int reg, err;
+
+ reg = phy_read(phydev, BCM87XX_LASI_CONTROL);
+
+ if (reg < 0)
+ return reg;
+
+ if (phydev->interrupts == PHY_INTERRUPT_ENABLED)
+ reg |= 1;
+ else
+ reg &= ~1;
+
+ err = phy_write(phydev, BCM87XX_LASI_CONTROL, reg);
+ return err;
+}
+
+static int bcm87xx_did_interrupt(struct phy_device *phydev)
+{
+ int reg;
+
+ reg = phy_read(phydev, BCM87XX_LASI_STATUS);
+
+ if (reg < 0) {
+ dev_err(&phydev->dev,
+ "Error: Read of BCM87XX_LASI_STATUS failed: %d\n", reg);
+ return 0;
+ }
+ return (reg & 1) != 0;
+}
+
+static int bcm87xx_ack_interrupt(struct phy_device *phydev)
+{
+ /* Reading the LASI status clears it. */
+ bcm87xx_did_interrupt(phydev);
+ return 0;
+}
+
+static int bcm8706_match_phy_device(struct phy_device *phydev)
+{
+ return phydev->c45_ids.device_ids[4] == PHY_ID_BCM8706;
+}
+
+static int bcm8727_match_phy_device(struct phy_device *phydev)
+{
+ return phydev->c45_ids.device_ids[4] == PHY_ID_BCM8727;
+}
+
+static struct phy_driver bcm8706_driver = {
+ .phy_id = PHY_ID_BCM8706,
+ .phy_id_mask = 0xffffffff,
+ .name = "Broadcom BCM8706",
+ .flags = PHY_HAS_INTERRUPT,
+ .config_init = bcm87xx_config_init,
+ .config_aneg = bcm87xx_config_aneg,
+ .read_status = bcm87xx_read_status,
+ .ack_interrupt = bcm87xx_ack_interrupt,
+ .config_intr = bcm87xx_config_intr,
+ .did_interrupt = bcm87xx_did_interrupt,
+ .match_phy_device = bcm8706_match_phy_device,
+ .driver = { .owner = THIS_MODULE },
+};
+
+static struct phy_driver bcm8727_driver = {
+ .phy_id = PHY_ID_BCM8727,
+ .phy_id_mask = 0xffffffff,
+ .name = "Broadcom BCM8727",
+ .flags = PHY_HAS_INTERRUPT,
+ .config_init = bcm87xx_config_init,
+ .config_aneg = bcm87xx_config_aneg,
+ .read_status = bcm87xx_read_status,
+ .ack_interrupt = bcm87xx_ack_interrupt,
+ .config_intr = bcm87xx_config_intr,
+ .did_interrupt = bcm87xx_did_interrupt,
+ .match_phy_device = bcm8727_match_phy_device,
+ .driver = { .owner = THIS_MODULE },
+};
+
+static int __init bcm87xx_init(void)
+{
+ int ret;
+
+ ret = phy_driver_register(&bcm8706_driver);
+ if (ret)
+ goto err;
+
+ ret = phy_driver_register(&bcm8727_driver);
+err:
+ return ret;
+}
+module_init(bcm87xx_init);
+
+static void __exit bcm87xx_exit(void)
+{
+ phy_driver_unregister(&bcm8706_driver);
+ phy_driver_unregister(&bcm8727_driver);
+}
+module_exit(bcm87xx_exit);
--
1.7.2.3
^ permalink raw reply related
* Re: [PATCH] Build fix in drivers/net/wireless/ath/ath9k/main.c
From: John W. Linville @ 2012-06-27 18:27 UTC (permalink / raw)
To: Arvydas Sidorenko
Cc: mcgrof, jouni, vthiagar, senthilb, linux-wireless, ath9k-devel,
netdev, linux-kernel
In-Reply-To: <1340824779-5157-1-git-send-email-asido4@gmail.com>
Thanks, but Sujith beat you to the same fix.
John
On Wed, Jun 27, 2012 at 09:19:39PM +0200, Arvydas Sidorenko wrote:
> Commit fad29cd2f59949581050a937786c2c9bc78b2f04 broke the build if
> no CONFIG_ATH9K_BTCOEX_SUPPORT is enabled.
>
> Signed-off-by: Arvydas Sidorenko <asido4@gmail.com>
> ---
> drivers/net/wireless/ath/ath9k/main.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c
> index c14cf5a..e4e73f0 100644
> --- a/drivers/net/wireless/ath/ath9k/main.c
> +++ b/drivers/net/wireless/ath/ath9k/main.c
> @@ -151,8 +151,10 @@ static void __ath_cancel_work(struct ath_softc *sc)
> cancel_delayed_work_sync(&sc->tx_complete_work);
> cancel_delayed_work_sync(&sc->hw_pll_work);
>
> +#ifdef CONFIG_ATH9K_BTCOEX_SUPPORT
> if (ath9k_hw_mci_is_enabled(sc->sc_ah))
> cancel_work_sync(&sc->mci_work);
> +#endif
> }
>
> static void ath_cancel_work(struct ath_softc *sc)
> --
> 1.7.8.6
>
>
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Eric Dumazet @ 2012-06-27 19:03 UTC (permalink / raw)
To: Tom Parkin; +Cc: netdev, David.Laight, James Chapman
In-Reply-To: <1340798457-28270-1-git-send-email-tparkin@katalix.com>
On Wed, 2012-06-27 at 13:00 +0100, Tom Parkin wrote:
> This patch fixes a race condition in l2tp when updating tunnel and
> session statistics. Previously it was possible for multiple threads
> to concurrently call u64_stats_update*(), which lead to statistics
> readers blocking forever.
>
> This race was discovered on an AMD64 SMP machine running a 32bit
> kernel. Running "ip l2tp" while sending data over an Ethernet
> pseudowire resulted in an occasional soft lockup in
> u64_stats_fetch_begin() called from l2tp_nl_session_send().
>
> For safe lockless update of l2tp stats, data is now stored in per-cpu
> variables. These per-cpu datasets are then summed at read time via.
> an extra helper function l2tp_stats_copy() which has been added to
> l2tp_core.c.
>
Do we really need 64bits stats on 32bit arches for l2tp ?
> Signed-off-by: Tom Parkin <tparkin@katalix.com>
> Signed-off-by: James Chapman <jchapman@katalix.com>
> ---
> net/l2tp/l2tp_core.c | 286 ++++++++++++++++++++++++++++-------------------
> net/l2tp/l2tp_core.h | 44 ++++++--
> net/l2tp/l2tp_debugfs.c | 42 ++++---
> net/l2tp/l2tp_netlink.c | 64 ++++-------
> net/l2tp/l2tp_ppp.c | 75 ++++++++-----
> 5 files changed, 301 insertions(+), 210 deletions(-)
>
> diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
> index 32b2155..ab2ffc0 100644
> --- a/net/l2tp/l2tp_core.c
> +++ b/net/l2tp/l2tp_core.c
> @@ -320,6 +320,43 @@ struct l2tp_tunnel *l2tp_tunnel_find_nth(struct net *net, int nth)
> }
> EXPORT_SYMBOL_GPL(l2tp_tunnel_find_nth);
>
> +/*
> + * Sum tunnel/session statistics across all CPUs
> + */
> +int l2tp_stats_copy(struct l2tp_stats *cpustats, struct l2tp_stats *dest)
> +{
> + int i;
> + unsigned int start;
> +
> + if (!cpustats || !dest)
> + return 1;
> +
> + memset(dest, 0, sizeof(struct l2tp_stats));
> +
> + for_each_possible_cpu(i) {
> + struct l2tp_stats *stats = per_cpu_ptr(cpustats, i);
> +
> + do {
> + start = u64_stats_fetch_begin(&stats->tx.syncp);
> + dest->tx.packets += stats->tx.packets;
> + dest->tx.bytes += stats->tx.bytes;
> + dest->tx.errors += stats->tx.errors;
You cant do the sum in 'dest', since if loop is restarted, you'll have
accumulation of all values.
> + } while (u64_stats_fetch_retry(&stats->tx.syncp, start));
> +
> + do {
> + start = u64_stats_fetch_begin(&stats->rx.syncp);
> + dest->rx.packets += stats->rx.packets;
> + dest->rx.bytes += stats->rx.bytes;
> + dest->rx.errors += stats->rx.errors;
> + dest->rx.seq_discards += stats->rx.seq_discards;
> + dest->rx.oos_packets += stats->rx.oos_packets;
same problem
> + } while (u64_stats_fetch_retry(&stats->rx.syncp, start));
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(l2tp_stats_copy);
> +
>
> static void pppol2tp_copy_stats(struct pppol2tp_ioc_stats *dest,
> - struct l2tp_stats *stats)
> + struct l2tp_stats *cpustats)
> {
> - dest->tx_packets = stats->tx_packets;
> - dest->tx_bytes = stats->tx_bytes;
> - dest->tx_errors = stats->tx_errors;
> - dest->rx_packets = stats->rx_packets;
> - dest->rx_bytes = stats->rx_bytes;
> - dest->rx_seq_discards = stats->rx_seq_discards;
> - dest->rx_oos_packets = stats->rx_oos_packets;
> - dest->rx_errors = stats->rx_errors;
> + struct l2tp_stats tmp;
> +
> + if (0 != l2tp_stats_copy(cpustats, &tmp))
> + return;
if (l2tp_stats_copy(cpustats, &tmp) != 0)
return;
> +
> + dest->tx_packets = tmp.tx.packets;
> + dest->tx_bytes = tmp.tx.bytes;
> + dest->tx_errors = tmp.tx.errors;
> + dest->rx_packets = tmp.rx.packets;
> + dest->rx_bytes = tmp.rx.bytes;
> + dest->rx_seq_discards = tmp.rx.seq_discards;
> + dest->rx_oos_packets = tmp.rx.oos_packets;
> + dest->rx_errors = tmp.rx.errors;
>
^ permalink raw reply
* [PATCH] Build fix in drivers/net/wireless/ath/ath9k/main.c
From: Arvydas Sidorenko @ 2012-06-27 19:19 UTC (permalink / raw)
To: mcgrof
Cc: jouni, vthiagar, senthilb, linville, linux-wireless, ath9k-devel,
netdev, linux-kernel, asido4
Commit fad29cd2f59949581050a937786c2c9bc78b2f04 broke the build if
no CONFIG_ATH9K_BTCOEX_SUPPORT is enabled.
Signed-off-by: Arvydas Sidorenko <asido4@gmail.com>
---
drivers/net/wireless/ath/ath9k/main.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c
index c14cf5a..e4e73f0 100644
--- a/drivers/net/wireless/ath/ath9k/main.c
+++ b/drivers/net/wireless/ath/ath9k/main.c
@@ -151,8 +151,10 @@ static void __ath_cancel_work(struct ath_softc *sc)
cancel_delayed_work_sync(&sc->tx_complete_work);
cancel_delayed_work_sync(&sc->hw_pll_work);
+#ifdef CONFIG_ATH9K_BTCOEX_SUPPORT
if (ath9k_hw_mci_is_enabled(sc->sc_ah))
cancel_work_sync(&sc->mci_work);
+#endif
}
static void ath_cancel_work(struct ath_softc *sc)
--
1.7.8.6
^ permalink raw reply related
* Re: [PATCH v2] sctp: be more restrictive in transport selection on bundled sacks
From: Vlad Yasevich @ 2012-06-27 19:44 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, David S. Miller, linux-sctp
In-Reply-To: <20120627172802.GA9250@neilslaptop.think-freely.org>
On 06/27/2012 01:28 PM, Neil Horman wrote:
> On Wed, Jun 27, 2012 at 11:10:26AM -0400, Vlad Yasevich wrote:
>> On 06/27/2012 10:23 AM, Neil Horman wrote:
>>> It was noticed recently that when we send data on a transport, its possible that
>>> we might bundle a sack that arrived on a different transport. While this isn't
>>> a major problem, it does go against the SHOULD requirement in section 6.4 of RFC
>>> 2960:
>>>
>>> An endpoint SHOULD transmit reply chunks (e.g., SACK, HEARTBEAT ACK,
>>> etc.) to the same destination transport address from which it
>>> received the DATA or control chunk to which it is replying. This
>>> rule should also be followed if the endpoint is bundling DATA chunks
>>> together with the reply chunk.
>>>
>>> This patch seeks to correct that. It restricts the bundling of sack operations
>>> to only those transports which have moved the ctsn of the association forward
>>> since the last sack. By doing this we guarantee that we only bundle outbound
>>> saks on a transport that has received a chunk since the last sack. This brings
>>> us into stricter compliance with the RFC.
>>>
>>> Vlad had initially suggested that we strictly allow only sack bundling on the
>>> transport that last moved the ctsn forward. While this makes sense, I was
>>> concerned that doing so prevented us from bundling in the case where we had
>>> received chunks that moved the ctsn on multiple transports. In those cases, the
>>> RFC allows us to select any of the transports having received chunks to bundle
>>> the sack on. so I've modified the approach to allow for that, by adding a state
>>> variable to each transport that tracks weather it has moved the ctsn since the
>>> last sack. This I think keeps our behavior (and performance), close enough to
>>> our current profile that I think we can do this without a sysctl knob to
>>> enable/disable it.
>>>
>>> Signed-off-by: Neil Horman<nhorman@tuxdriver.com>
>>> CC: Vlad Yaseivch<vyasevich@gmail.com>
>>> CC: David S. Miller<davem@davemloft.net>
>>> Reported-by: Michele Baldessari<michele@redhat.com>
>>> Reported-by: sorin serban<sserban@redhat.com>
>>>
>>> ---
>>> Change Notes:
>>> V2)
>>> * Removed unused variable as per Dave M. Request
>>> * Delayed rwnd adjustment until we are sure we will sack (Vlad Y.)
>>> ---
>>> include/net/sctp/structs.h | 5 ++++-
>>> include/net/sctp/tsnmap.h | 3 ++-
>>> net/sctp/output.c | 10 +++++++---
>>> net/sctp/sm_make_chunk.c | 7 +++++++
>>> net/sctp/sm_sideeffect.c | 2 +-
>>> net/sctp/tsnmap.c | 5 ++++-
>>> net/sctp/ulpevent.c | 3 ++-
>>> net/sctp/ulpqueue.c | 2 +-
>>> 8 files changed, 28 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
>>> index e4652fe..712bf09 100644
>>> --- a/include/net/sctp/structs.h
>>> +++ b/include/net/sctp/structs.h
>>> @@ -910,7 +910,10 @@ struct sctp_transport {
>>> pmtu_pending:1,
>>>
>>> /* Is this structure kfree()able? */
>>> - malloced:1;
>>> + malloced:1,
>>> +
>>> + /* Has this transport moved the ctsn since we last sacked */
>>> + moved_ctsn:1;
>>>
>>> struct flowi fl;
>>>
>>> diff --git a/include/net/sctp/tsnmap.h b/include/net/sctp/tsnmap.h
>>> index e7728bc..2c5d2b4 100644
>>> --- a/include/net/sctp/tsnmap.h
>>> +++ b/include/net/sctp/tsnmap.h
>>> @@ -117,7 +117,8 @@ void sctp_tsnmap_free(struct sctp_tsnmap *map);
>>> int sctp_tsnmap_check(const struct sctp_tsnmap *, __u32 tsn);
>>>
>>> /* Mark this TSN as seen. */
>>> -int sctp_tsnmap_mark(struct sctp_tsnmap *, __u32 tsn);
>>> +int sctp_tsnmap_mark(struct sctp_tsnmap *, __u32 tsn,
>>> + struct sctp_transport *trans);
>>>
>>> /* Mark this TSN and all lower as seen. */
>>> void sctp_tsnmap_skip(struct sctp_tsnmap *map, __u32 tsn);
>>> diff --git a/net/sctp/output.c b/net/sctp/output.c
>>> index f1b7d4b..6c8cb9e 100644
>>> --- a/net/sctp/output.c
>>> +++ b/net/sctp/output.c
>>> @@ -240,17 +240,21 @@ static sctp_xmit_t sctp_packet_bundle_sack(struct sctp_packet *pkt,
>>> */
>>> if (sctp_chunk_is_data(chunk)&& !pkt->has_sack&&
>>> !pkt->has_cookie_echo) {
>>> - struct sctp_association *asoc;
>>> struct timer_list *timer;
>>> - asoc = pkt->transport->asoc;
>>> + struct sctp_association *asoc = pkt->transport->asoc;
>>> +
>>> timer =&asoc->timers[SCTP_EVENT_TIMEOUT_SACK];
>>>
>>> /* If the SACK timer is running, we have a pending SACK */
>>> if (timer_pending(timer)) {
>>> struct sctp_chunk *sack;
>>> - asoc->a_rwnd = asoc->rwnd;
>>> +
>>> + if (chunk->transport&& !chunk->transport->moved_ctsn)
>>> + return retval;
>>> +
>>
>> I didn't think of this yesterday, but I think it would be much
>> better to use pkt->transport here since you are adding the chunk to
>> the packet and it will go out on the transport of the packet. You
>> are also guaranteed that pkt->transport is set.
>>
> I don't think it really matters, as the chunk transport is used to lookup the
> packet that we append to, and if the chunk transport is unset, its somewhat
> questionable as to weather we should bundle, but if packet->transport is set,
> its probably worth it to avoid the extra conditional.
>
Just looked at the code flow. chunk->transport may not be set until the
end of sctp_packet_append_chunk. For new data, transport may not be
set. For retransmitted data, transport is set to last transport data
was sent on. So, we could be looking at the wrong transport. What you
are trying to decided is if the current transport we will be used can
take the SACK, but you may not be looking at the current transport.
Looking at packet->transport is the correct thing to do.
-vlad
>>> sack = sctp_make_sack(asoc);
>>> if (sack) {
>>> + asoc->a_rwnd = asoc->rwnd;
>>> retval = sctp_packet_append_chunk(pkt, sack);
>>> asoc->peer.sack_needed = 0;
>>> if (del_timer(timer))
>>> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
>>> index a85eeeb..805babe 100644
>>> --- a/net/sctp/sm_make_chunk.c
>>> +++ b/net/sctp/sm_make_chunk.c
>>> @@ -736,6 +736,7 @@ struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc)
>>> __u16 num_gabs, num_dup_tsns;
>>> struct sctp_tsnmap *map = (struct sctp_tsnmap *)&asoc->peer.tsn_map;
>>> struct sctp_gap_ack_block gabs[SCTP_MAX_GABS];
>>> + struct sctp_transport *trans;
>>>
>>> memset(gabs, 0, sizeof(gabs));
>>> ctsn = sctp_tsnmap_get_ctsn(map);
>>> @@ -805,6 +806,12 @@ struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc)
>>> sctp_addto_chunk(retval, sizeof(__u32) * num_dup_tsns,
>>> sctp_tsnmap_get_dups(map));
>>>
>>> + /*
>>> + * Once we have a sack generated, clear the moved_tsn information
>>> + * from all the transports
>>> + */
>>> + list_for_each_entry(trans,&asoc->peer.transport_addr_list, transports)
>>> + trans->moved_ctsn = 0;
>>> nodata:
>>> return retval;
>>> }
>>> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
>>> index c96d1a8..8716da1 100644
>>> --- a/net/sctp/sm_sideeffect.c
>>> +++ b/net/sctp/sm_sideeffect.c
>>> @@ -1268,7 +1268,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
>>> case SCTP_CMD_REPORT_TSN:
>>> /* Record the arrival of a TSN. */
>>> error = sctp_tsnmap_mark(&asoc->peer.tsn_map,
>>> - cmd->obj.u32);
>>> + cmd->obj.u32, NULL);
>>> break;
>>>
>>> case SCTP_CMD_REPORT_FWDTSN:
>>> diff --git a/net/sctp/tsnmap.c b/net/sctp/tsnmap.c
>>> index f1e40ceb..619c638 100644
>>> --- a/net/sctp/tsnmap.c
>>> +++ b/net/sctp/tsnmap.c
>>> @@ -114,7 +114,8 @@ int sctp_tsnmap_check(const struct sctp_tsnmap *map, __u32 tsn)
>>>
>>>
>>> /* Mark this TSN as seen. */
>>> -int sctp_tsnmap_mark(struct sctp_tsnmap *map, __u32 tsn)
>>> +int sctp_tsnmap_mark(struct sctp_tsnmap *map, __u32 tsn,
>>> + struct sctp_transport *trans)
>>> {
>>> u16 gap;
>>>
>>> @@ -133,6 +134,8 @@ int sctp_tsnmap_mark(struct sctp_tsnmap *map, __u32 tsn)
>>> */
>>> map->max_tsn_seen++;
>>> map->cumulative_tsn_ack_point++;
>>> + if (trans)
>>> + trans->moved_ctsn = 1;
>>> map->base_tsn++;
>>> } else {
>>> /* Either we already have a gap, or about to record a gap, so
>>> diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
>>> index 8a84017..33d8947 100644
>>> --- a/net/sctp/ulpevent.c
>>> +++ b/net/sctp/ulpevent.c
>>> @@ -715,7 +715,8 @@ struct sctp_ulpevent *sctp_ulpevent_make_rcvmsg(struct sctp_association *asoc,
>>> * can mark it as received so the tsn_map is updated correctly.
>>> */
>>> if (sctp_tsnmap_mark(&asoc->peer.tsn_map,
>>> - ntohl(chunk->subh.data_hdr->tsn)))
>>> + ntohl(chunk->subh.data_hdr->tsn),
>>> + chunk->transport))
>>> goto fail_mark;
>>>
>>> /* First calculate the padding, so we don't inadvertently
>>> diff --git a/net/sctp/ulpqueue.c b/net/sctp/ulpqueue.c
>>> index f2d1de7..f5a6a4f 100644
>>> --- a/net/sctp/ulpqueue.c
>>> +++ b/net/sctp/ulpqueue.c
>>> @@ -1051,7 +1051,7 @@ void sctp_ulpq_renege(struct sctp_ulpq *ulpq, struct sctp_chunk *chunk,
>>> if (chunk&& (freed>= needed)) {
>>> __u32 tsn;
>>> tsn = ntohl(chunk->subh.data_hdr->tsn);
>>> - sctp_tsnmap_mark(&asoc->peer.tsn_map, tsn);
>>> + sctp_tsnmap_mark(&asoc->peer.tsn_map, tsn, chunk->transport);
>>> sctp_ulpq_tail_data(ulpq, chunk, gfp);
>>>
>>> sctp_ulpq_partial_delivery(ulpq, chunk, gfp);
>>
>> Also, I think you need to reset this bit in sctp_transport_reset().
>> Consider a potential association restart after SACKs have been
>> received.
>>
> Yeah, thats true. I'll add that in.
>
> Thanks!
> Neil
>
>> -vlad
>>
^ permalink raw reply
* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: Florian Westphal @ 2012-06-27 19:50 UTC (permalink / raw)
To: David Miller
Cc: brouer, eric.dumazet, hans.schillstrom, subramanian.vijay,
dave.taht, netdev, ncardwell, therbert, mph
In-Reply-To: <20120626.235423.588696200884989114.davem@davemloft.net>
David Miller <davem@davemloft.net> wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Wed, 27 Jun 2012 08:32:13 +0200
>
> > Using it as default, might be "dangerous" and open an attack vector
> > on SYN cookies in Linux.
>
> If it's dangerous for syncookies then it's just as dangerous for
> the routing hash and the socket hashes where we use it already.
> Therefore, this sounds like a baseless claim to me.
I doubt using jhash is safe for syncookies.
There a several differences to other uses in kernel:
- all hash input except u32 cookie_secret[2] is known
- we transmit hash result (i.e, its visible to 3rd party)
- we do not re-seed the secret, ever
it should be quite easy to recompute cookie_secret[] from known syncookie
values?
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Rick Jones @ 2012-06-27 20:21 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Tom Parkin, netdev, David.Laight, James Chapman
In-Reply-To: <1340823810.26242.81.camel@edumazet-glaptop>
On 06/27/2012 12:03 PM, Eric Dumazet wrote:
> On Wed, 2012-06-27 at 13:00 +0100, Tom Parkin wrote:
>> This patch fixes a race condition in l2tp when updating tunnel and
>> session statistics. Previously it was possible for multiple threads
>> to concurrently call u64_stats_update*(), which lead to statistics
>> readers blocking forever.
>>
>> This race was discovered on an AMD64 SMP machine running a 32bit
>> kernel. Running "ip l2tp" while sending data over an Ethernet
>> pseudowire resulted in an occasional soft lockup in
>> u64_stats_fetch_begin() called from l2tp_nl_session_send().
>>
>> For safe lockless update of l2tp stats, data is now stored in per-cpu
>> variables. These per-cpu datasets are then summed at read time via.
>> an extra helper function l2tp_stats_copy() which has been added to
>> l2tp_core.c.
>>
>
> Do we really need 64bits stats on 32bit arches for l2tp ?
It is a question of the speed of the communications more than the
bitness of the processor no?
rick jones
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Eric Dumazet @ 2012-06-27 20:39 UTC (permalink / raw)
To: Rick Jones; +Cc: Tom Parkin, netdev, David.Laight, James Chapman
In-Reply-To: <4FEB6B64.5060708@hp.com>
On Wed, 2012-06-27 at 13:21 -0700, Rick Jones wrote:
> It is a question of the speed of the communications more than the
> bitness of the processor no?
Why ?
In 2012 or 2013, 64bits kernels are the norm, and 32bit the exception.
Should we add complex code to l2tp just for being able to run it on
32bit kernels with 64bit stats ?
Considering this code is buggy at the v1 & v2, I am really wondering.
All sane SNMP applications are ready to cope with 32bits counters
wrapping.
Machines that could wrap the 32bit counter several times per second are
probably running on 64bit kernels.
Also percpu stats are overkill unless a device is really meant to be
used in // by many cpus.
^ permalink raw reply
* Server Rental Promotion
From: borislamsv2 @ 2012-06-27 20:18 UTC (permalink / raw)
Dear All,
We have our own datacenter in Hong Kong & provide email/application/web rental service to clients.We are APNIC member & provide clean IP to clients.
Dell? PowerEdge? EnterpriseRack Mount Server
-Intel(R) Xeon(R) E3-1240 Processor (3.3GHz, 8M Cache, Turbo, 4C/8T, 80W)
-8GB RAM, 2x4GB, 1333MHz, DDR-3, Dual Ranked UDIMMs
-500GB, 3.5", 6Gbps SAS x 2
-Raid 1 Mirroring Protection
Every Dedicated Server Hosting Solution Also Includes:
Software Specification
- CentOS / Fedora / Debian / FreeBSD / Ubuntu / Redhat Linux
- Full root-level access
- Data Center Facilities
- Shared Local & International Bandwidth
- 2 IP Addresses Allocation
- Un-interruptible Power Supply (UPS) backed up by private diesel generator
- FM200¡§based fire suppression system
- 24x7 CRAC Air Conditioning and Humidity Control
- 24x7 Security Control
- 24x7 Remote Hand Service
Pls send us email for further information.Thanks,
Boris
boris@dedicatedserver.co.hk
If you do not wish to further receive this event message, email "borislamsv2@gmail.com" to unsubscribe this message or remove your email from the list.
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Stephen Hemminger @ 2012-06-27 20:50 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Rick Jones, Tom Parkin, netdev, David.Laight, James Chapman
In-Reply-To: <1340829541.26242.90.camel@edumazet-glaptop>
On Wed, 27 Jun 2012 22:39:01 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> All sane SNMP applications are ready to cope with 32bits counters
> wrapping.
Actually that statement depends on the data rate. SNMP daemons work
by polling at periodic intervals. The limit for detecting roll over depends
on the rate and the interval. I believe the ubiquitous net-snmp code uses
something a 30 second polling interval for lots of it's caches. This means
it rolls over too fast at 10G. Polling faster can help but net-snmp is
a pig about updates.
I just realized the whole x32 (running 32 bit apps on 64 bit kernel) is broken
for things like /proc/net/dev where 64 bit kernel will give 64 bit values and
the 32 bit app (like net-snmp) is expecting unsigned long (32 bits).
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Ben Greear @ 2012-06-27 20:58 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Eric Dumazet, Rick Jones, Tom Parkin, netdev, David.Laight,
James Chapman
In-Reply-To: <20120627135034.7db7d0eb@nehalam.linuxnetplumber.net>
On 06/27/2012 01:50 PM, Stephen Hemminger wrote:
> On Wed, 27 Jun 2012 22:39:01 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> All sane SNMP applications are ready to cope with 32bits counters
>> wrapping.
>
> Actually that statement depends on the data rate. SNMP daemons work
> by polling at periodic intervals. The limit for detecting roll over depends
> on the rate and the interval. I believe the ubiquitous net-snmp code uses
> something a 30 second polling interval for lots of it's caches. This means
> it rolls over too fast at 10G. Polling faster can help but net-snmp is
> a pig about updates.
>
> I just realized the whole x32 (running 32 bit apps on 64 bit kernel) is broken
> for things like /proc/net/dev where 64 bit kernel will give 64 bit values and
> the 32 bit app (like net-snmp) is expecting unsigned long (32 bits).
It's worse than that: Even on 64-bit kernels, counters that are returned by
netlink and /proc/net/dev as 64-bit may still wrap themselves at 32-bit
intervals.
I found that I just have to be very paranoid, and assume that if a '64-bit'
number wraps its high 32-bits between polls then it is really just a 32-bit
number and do that wrap properly (and poll more often).
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Rick Jones @ 2012-06-27 21:00 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Tom Parkin, netdev, David.Laight, James Chapman
In-Reply-To: <1340829541.26242.90.camel@edumazet-glaptop>
On 06/27/2012 01:39 PM, Eric Dumazet wrote:
> On Wed, 2012-06-27 at 13:21 -0700, Rick Jones wrote:
>
>> It is a question of the speed of the communications more than the
>> bitness of the processor no?
>
> Why ?
The desire to have a greater than 32 bit counter stems from how quickly
it will wrap, and how quickly it will wrap depends on the speed of
communication.
> In 2012 or 2013, 64bits kernels are the norm, and 32bit the exception.
In servers and laptops I would be inclined to agree. Elsewhere I am not
so sure. And while I don't know of its status, there is at least one
hit on an RFC where, back in 2001, when a great many things were 32 bit,
various 64-bit counters were added for L2TP Domain statistics:
http://tools.ietf.org/html/draft-ietf-l2tpext-l2tp-mib-03
(I think that became http://tools.ietf.org/html/rfc3371 in 2002)
So, it is possible people were simply painting a bikeshed, but 10 years
ago, when the prevalence of 64 bit processors was not nearly so great,
folks involved in L2TP MIB definitions found it worthwhile to make
various counters 64-bit.
> Should we add complex code to l2tp just for being able to run it on
> 32bit kernels with 64bit stats ?
>
> Considering this code is buggy at the v1 & v2, I am really wondering.
>
> All sane SNMP applications are ready to cope with 32bits counters
> wrapping.
Once (maybe twice?) but no more than that.
> Machines that could wrap the 32bit counter several times per second are
> probably running on 64bit kernels.
I don't think it requires wrapping a 32 bit counter several times a
second to warrant a > 32 bit counter.
> Also percpu stats are overkill unless a device is really meant to be
> used in // by many cpus.
I take it "//" is used as a shorthand for parallel?
rick
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Eric Dumazet @ 2012-06-27 21:20 UTC (permalink / raw)
To: Ben Greear
Cc: Stephen Hemminger, Rick Jones, Tom Parkin, netdev, David.Laight,
James Chapman
In-Reply-To: <4FEB73EF.9090702@candelatech.com>
On Wed, 2012-06-27 at 13:58 -0700, Ben Greear wrote:
> It's worse than that: Even on 64-bit kernels, counters that are returned by
> netlink and /proc/net/dev as 64-bit may still wrap themselves at 32-bit
> intervals.
Really ?
Thats incredible you dont send a bug report then.
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Ben Greear @ 2012-06-27 21:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stephen Hemminger, Rick Jones, Tom Parkin, netdev, David.Laight,
James Chapman
In-Reply-To: <1340832022.26242.131.camel@edumazet-glaptop>
On 06/27/2012 02:20 PM, Eric Dumazet wrote:
> On Wed, 2012-06-27 at 13:58 -0700, Ben Greear wrote:
>
>> It's worse than that: Even on 64-bit kernels, counters that are returned by
>> netlink and /proc/net/dev as 64-bit may still wrap themselves at 32-bit
>> intervals.
>
> Really ?
>
> Thats incredible you dont send a bug report then.
I bitched and moaned a while back...didn't get a lot
of positive feedback :)
For an example, see the VLAN code. rx-errors and tx-dropped are only 32-bit
counters. Now, in the real world, we wouldn't expect those counters to
increase at high rates, but they are still 32-bit counters masquerading
as 64, and they could wrap after a while, so any code that expected a wrap
to mean a 64-bit wrap would be wrong.
At the time I was last complaining, there were lots more cases
of this that I was fighting with, but I don't recall exactly what they
were. Once my user-space code got paranoid enough, it was able to
at least mostly deal with 32 and 64 wraps.
static struct rtnl_link_stats64 *vlan_dev_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
{
if (vlan_dev_priv(dev)->vlan_pcpu_stats) {
struct vlan_pcpu_stats *p;
u32 rx_errors = 0, tx_dropped = 0, collisions = 0;
int i;
for_each_possible_cpu(i) {
u64 rxpackets, rxbytes, rxmulticast, txpackets, txbytes;
unsigned int start;
p = per_cpu_ptr(vlan_dev_priv(dev)->vlan_pcpu_stats, i);
do {
start = u64_stats_fetch_begin_bh(&p->syncp);
rxpackets = p->rx_packets;
rxbytes = p->rx_bytes;
rxmulticast = p->rx_multicast;
txpackets = p->tx_packets;
txbytes = p->tx_bytes;
} while (u64_stats_fetch_retry_bh(&p->syncp, start));
stats->rx_packets += rxpackets;
stats->rx_bytes += rxbytes;
stats->multicast += rxmulticast;
stats->tx_packets += txpackets;
stats->tx_bytes += txbytes;
/* rx_errors & tx_dropped are u32 */
rx_errors += p->rx_errors;
tx_dropped += p->tx_dropped;
collisions += p->collisions;
}
stats->rx_errors = rx_errors;
stats->tx_dropped = tx_dropped;
stats->collisions = collisions;
}
return stats;
}
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Eric Dumazet @ 2012-06-27 21:32 UTC (permalink / raw)
To: Ben Greear
Cc: Stephen Hemminger, Rick Jones, Tom Parkin, netdev, David.Laight,
James Chapman
In-Reply-To: <1340832022.26242.131.camel@edumazet-glaptop>
On Wed, 2012-06-27 at 23:20 +0200, Eric Dumazet wrote:
> On Wed, 2012-06-27 at 13:58 -0700, Ben Greear wrote:
>
> > It's worse than that: Even on 64-bit kernels, counters that are returned by
> > netlink and /proc/net/dev as 64-bit may still wrap themselves at 32-bit
> > intervals.
>
> Really ?
>
> Thats incredible you dont send a bug report then.
A bug report to the application author, not the kernel.
/proc/net/dev is an ASCII file, and nothing gives the width of a field.
Therefore, an application should cope with all cases (counter being a
32bit or 64bit integer), wrapping included.
Note that this has little to do with the application or kernel being
32/64 bit code.
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Eric Dumazet @ 2012-06-27 21:35 UTC (permalink / raw)
To: Ben Greear
Cc: Stephen Hemminger, Rick Jones, Tom Parkin, netdev, David.Laight,
James Chapman
In-Reply-To: <4FEB7BB9.1080905@candelatech.com>
On Wed, 2012-06-27 at 14:31 -0700, Ben Greear wrote:
> For an example, see the VLAN code. rx-errors and tx-dropped are only 32-bit
> counters. Now, in the real world, we wouldn't expect those counters to
> increase at high rates, but they are still 32-bit counters masquerading
> as 64, and they could wrap after a while, so any code that expected a wrap
> to mean a 64-bit wrap would be wrong.
>
> At the time I was last complaining, there were lots more cases
> of this that I was fighting with, but I don't recall exactly what they
> were. Once my user-space code got paranoid enough, it was able to
> at least mostly deal with 32 and 64 wraps.
Good, you now know how to deal correctly with these things.
Using 64bit fields for tx_dropped is what I call kernel bloat.
If only all drops were correctly accounted...
I believe 50% of drivers are buggy in this aspect.
^ permalink raw reply
* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: Eric Dumazet @ 2012-06-27 21:39 UTC (permalink / raw)
To: Florian Westphal
Cc: David Miller, brouer, hans.schillstrom, subramanian.vijay,
dave.taht, netdev, ncardwell, therbert, mph
In-Reply-To: <20120627195032.GI1269@breakpoint.cc>
On Wed, 2012-06-27 at 21:50 +0200, Florian Westphal wrote:
> I doubt using jhash is safe for syncookies.
>
> There a several differences to other uses in kernel:
> - all hash input except u32 cookie_secret[2] is known
> - we transmit hash result (i.e, its visible to 3rd party)
> - we do not re-seed the secret, ever
>
> it should be quite easy to recompute cookie_secret[] from known syncookie
> values?
We could re-seed the secrets every MSL seconds a bit like in
tcp_cookie_generator()
This would require check_tcp_syn_cookie() doing two checks (most recent
seed, and previous one if first check failed)
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Ben Greear @ 2012-06-27 21:40 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stephen Hemminger, Rick Jones, Tom Parkin, netdev, David.Laight,
James Chapman
In-Reply-To: <1340832771.26242.160.camel@edumazet-glaptop>
On 06/27/2012 02:32 PM, Eric Dumazet wrote:
> On Wed, 2012-06-27 at 23:20 +0200, Eric Dumazet wrote:
>> On Wed, 2012-06-27 at 13:58 -0700, Ben Greear wrote:
>>
>>> It's worse than that: Even on 64-bit kernels, counters that are returned by
>>> netlink and /proc/net/dev as 64-bit may still wrap themselves at 32-bit
>>> intervals.
>>
>> Really ?
>>
>> Thats incredible you dont send a bug report then.
>
> A bug report to the application author, not the kernel.
Yeah, application author was me.
> /proc/net/dev is an ASCII file, and nothing gives the width of a field.
>
> Therefore, an application should cope with all cases (counter being a
> 32bit or 64bit integer), wrapping included.
>
> Note that this has little to do with the application or kernel being
> 32/64 bit code.
Notice that the netlink stats are claiming 64-bit and they are not
(always) 64-bit.
That is a nice binary API that is still wrapping before its time
in many cases. There may be good performance reasons for keeping
some counters at 32-bit, but it plays merry hell with anyone wanting
to optimize an application to poll less often for stats that are
supposedly 64-bit.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply
* Re: [PATCH v2] l2tp: use per-cpu variables for u64_stats updates
From: Eric Dumazet @ 2012-06-27 21:50 UTC (permalink / raw)
To: Ben Greear
Cc: Stephen Hemminger, Rick Jones, Tom Parkin, netdev, David.Laight,
James Chapman
In-Reply-To: <4FEB7DC0.9090404@candelatech.com>
On Wed, 2012-06-27 at 14:40 -0700, Ben Greear wrote:
> Notice that the netlink stats are claiming 64-bit and they are not
> (always) 64-bit.
There is no such claim.
netlink provides a 64bit generic binary holder, even for legacy drivers
still using 'unsigned long' stats, so obviously 32bit values on 32bit
kernels.
> That is a nice binary API that is still wrapping before its time
> in many cases. There may be good performance reasons for keeping
> some counters at 32-bit, but it plays merry hell with anyone wanting
> to optimize an application to poll less often for stats that are
> supposedly 64-bit.
We dont want hundred of patches to bring 64bit stats on legacy drivers.
Nobody cares, because its way too late to try to 'fix' this.
If you want your application running on linux-2.6.x, or linux-3.0, you
are forced to correctly detect each counter being 32 or 64, not because
netlink API is 64bit, but because a driver provides a 32 or 64 bit
value.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox