* Re: [PATCH net-next-2.6 2/6] sfc: Implement generic features interface
From: Ben Hutchings @ 2011-04-03 20:27 UTC (permalink / raw)
To: Michał Mirosław; +Cc: David Miller, netdev, linux-net-drivers
In-Reply-To: <20110403201322.GA13122@rere.qmqm.pl>
On Sun, 2011-04-03 at 22:13 +0200, Michał Mirosław wrote:
> On Sun, Apr 03, 2011 at 08:51:21PM +0100, Ben Hutchings wrote:
> > Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> > ---
> > drivers/net/sfc/efx.c | 17 ++++++++-
> > drivers/net/sfc/ethtool.c | 78 ------------------------------------------
> > drivers/net/sfc/net_driver.h | 2 -
> > drivers/net/sfc/rx.c | 2 +-
> > 4 files changed, 16 insertions(+), 83 deletions(-)
> >
> [cut patch]
>
> Looks ok to me.
>
> BTW, I noticed that TSO6 is not enabled in vlan_features. Is this intentional?
Well spotted. It's not intentional.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
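The vlan_features gap raised in this exchange can be illustrated with a small standalone sketch. It is not the fix that was eventually applied: the flag values are made-up stand-ins rather than the kernel's real NETIF_F_* bits, and build_vlan_features() is a hypothetical helper. It only mirrors the way efx_pci_probe() adds TSO6 to 'features' when IPv6 checksum offload is available, and applies the same condition to the VLAN mask.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in flag bits for illustration only; NOT the kernel's real values. */
#define F_SG       (1u << 0)
#define F_IP_CSUM  (1u << 1)
#define F_V6_CSUM  (1u << 2)
#define F_HIGHDMA  (1u << 3)
#define F_TSO      (1u << 4)
#define F_TSO6     (1u << 5)
#define F_ALL_CSUM (F_IP_CSUM | F_V6_CSUM)

/* Hypothetical helper: build the VLAN feature mask, adding TSO6 only
 * when the NIC type offers IPv6 checksum offload. */
uint32_t build_vlan_features(uint32_t offload_features)
{
	uint32_t vlan = F_ALL_CSUM | F_SG | F_HIGHDMA | F_TSO;

	if (offload_features & F_V6_CSUM)
		vlan |= F_TSO6;
	return vlan;
}
```

Keeping the TSO6 condition identical for both masks would avoid advertising IPv6 segmentation offload on VLAN devices when the underlying hardware cannot checksum IPv6.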
* Re: [PATCH net-next-2.6 2/6] sfc: Implement generic features interface
From: Michał Mirosław @ 2011-04-03 20:13 UTC (permalink / raw)
To: Ben Hutchings; +Cc: David Miller, netdev, linux-net-drivers
In-Reply-To: <1301860281.2935.25.camel@localhost>
On Sun, Apr 03, 2011 at 08:51:21PM +0100, Ben Hutchings wrote:
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---
> drivers/net/sfc/efx.c | 17 ++++++++-
> drivers/net/sfc/ethtool.c | 78 ------------------------------------------
> drivers/net/sfc/net_driver.h | 2 -
> drivers/net/sfc/rx.c | 2 +-
> 4 files changed, 16 insertions(+), 83 deletions(-)
>
[cut patch]
Looks ok to me.
BTW, I noticed that TSO6 is not enabled in vlan_features. Is this intentional?
Best Regards,
Michał Mirosław
* [PATCH net-next-2.6 6/6] sfc: Implement ethtool_ops::set_phys_id instead of ethtool_ops::phys_id
From: Ben Hutchings @ 2011-04-03 19:55 UTC (permalink / raw)
To: David Miller
Cc: netdev, linux-net-drivers, Stephen Hemminger,
Michał Mirosław
In-Reply-To: <1301859889.2935.23.camel@localhost>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
drivers/net/sfc/ethtool.c | 28 ++++++++++++++++++----------
1 files changed, 18 insertions(+), 10 deletions(-)
diff --git a/drivers/net/sfc/ethtool.c b/drivers/net/sfc/ethtool.c
index 0d55439..644f7c1 100644
--- a/drivers/net/sfc/ethtool.c
+++ b/drivers/net/sfc/ethtool.c
@@ -178,19 +178,27 @@ static struct efx_ethtool_stat efx_ethtool_stats[] = {
*/
/* Identify device by flashing LEDs */
-static int efx_ethtool_phys_id(struct net_device *net_dev, u32 count)
+static int efx_ethtool_phys_id(struct net_device *net_dev,
+ enum ethtool_phys_id_state state)
{
struct efx_nic *efx = netdev_priv(net_dev);
+ enum efx_led_mode mode;
- do {
- efx->type->set_id_led(efx, EFX_LED_ON);
- schedule_timeout_interruptible(HZ / 2);
-
- efx->type->set_id_led(efx, EFX_LED_OFF);
- schedule_timeout_interruptible(HZ / 2);
- } while (!signal_pending(current) && --count != 0);
+ switch (state) {
+ case ETHTOOL_ID_ON:
+ mode = EFX_LED_ON;
+ break;
+ case ETHTOOL_ID_OFF:
+ mode = EFX_LED_OFF;
+ break;
+ case ETHTOOL_ID_INACTIVE:
+ mode = EFX_LED_DEFAULT;
+ break;
+ default:
+ return -EINVAL;
+ }
- efx->type->set_id_led(efx, EFX_LED_DEFAULT);
+ efx->type->set_id_led(efx, mode);
return 0;
}
@@ -1007,7 +1015,7 @@ const struct ethtool_ops efx_ethtool_ops = {
.get_sset_count = efx_ethtool_get_sset_count,
.self_test = efx_ethtool_self_test,
.get_strings = efx_ethtool_get_strings,
- .phys_id = efx_ethtool_phys_id,
+ .set_phys_id = efx_ethtool_phys_id,
.get_ethtool_stats = efx_ethtool_get_stats,
.get_wol = efx_ethtool_get_wol,
.set_wol = efx_ethtool_set_wol,
--
1.5.4
* [PATCH net-next-2.6 5/6] ethtool: Change ETHTOOL_PHYS_ID implementation to allow dropping RTNL
From: Ben Hutchings @ 2011-04-03 19:53 UTC (permalink / raw)
To: David Miller
Cc: netdev, linux-net-drivers, Stephen Hemminger,
Michał Mirosław
In-Reply-To: <1301859889.2935.23.camel@localhost>
The ethtool ETHTOOL_PHYS_ID command runs for an arbitrarily long
period of time, holding the RTNL lock. This blocks routing updates,
device enumeration, and various important operations that one might
want to keep running while hunting for the flashing LED.
We need to drop the RTNL lock during this operation, but currently the
core implementation is a thin wrapper around a driver operation and
drivers may well depend upon holding the lock.
Define a new driver operation 'set_phys_id' with an argument that sets
the ID indicator on/off/inactive/active (the last optional, for any
driver or firmware that prefers to handle blinking asynchronously).
When this is defined, the ethtool core drops the lock while waiting
and only acquires it around calls to this operation.
Deprecate the 'phys_id' operation in favour of this. It can be
removed once all in-tree drivers are converted.
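The calling sequence described above can be sketched in plain user-space C. This is a simplified model, not the kernel implementation: it omits the RTNL locking, the half-second sleeps and signal handling, and the names run_phys_id(), set_phys_id_fn, struct id_log and simple_driver() are hypothetical. Only the four states and the -EINVAL fallback are taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Mirrors the four states proposed in this patch (names shortened). */
enum phys_id_state { ID_INACTIVE, ID_ACTIVE, ID_ON, ID_OFF };

/* Hypothetical driver callback type; returns 0 or a negative errno. */
typedef int (*set_phys_id_fn)(void *priv, enum phys_id_state state);

/* Simplified core loop: no locking, no sleeping, no signal handling.
 * 'cycles' stands in for the number of on/off periods requested and
 * must be at least 1 in this sketch. */
int run_phys_id(set_phys_id_fn set, void *priv, unsigned int cycles)
{
	int rc = set(priv, ID_ACTIVE);

	if (rc && rc != -EINVAL)
		return rc;		/* hard failure, nothing to undo */

	if (rc == -EINVAL) {
		/* Driver cannot blink by itself: toggle it from here. */
		do {
			rc = set(priv, ID_ON);
			if (rc)
				break;
			rc = set(priv, ID_OFF);
			if (rc)
				break;
		} while (--cycles != 0);
	}
	/* If ACTIVE returned 0, the real core would simply wait here. */

	(void)set(priv, ID_INACTIVE);	/* always restore the indicator */
	return rc;
}

/* Example driver without asynchronous blinking; it logs each state. */
struct id_log {
	enum phys_id_state seen[16];
	int n;
};

int simple_driver(void *priv, enum phys_id_state state)
{
	struct id_log *log = priv;

	if (state == ID_ACTIVE)
		return -EINVAL;		/* ask the core to drive ON/OFF */
	assert(log->n < 16);
	log->seen[log->n++] = state;
	return 0;
}
```

With two cycles, simple_driver() would see ON, OFF, ON, OFF and finally INACTIVE, which is the sequence a driver such as sfc (patch 6/6) has to handle.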
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
include/linux/ethtool.h | 30 +++++++++++++++++++++++++++++-
net/core/ethtool.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 73 insertions(+), 3 deletions(-)
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 6da626e..ed39d90 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -663,6 +663,22 @@ struct ethtool_rx_ntuple_list {
unsigned int count;
};
+/**
+ * enum ethtool_phys_id_state - indicator state for physical identification
+ * @ETHTOOL_ID_INACTIVE: Physical ID indicator should be deactivated
+ * @ETHTOOL_ID_ACTIVE: Physical ID indicator should be activated
+ * @ETHTOOL_ID_ON: LED should be turned on (used iff %ETHTOOL_ID_ACTIVE
+ * is not supported)
+ * @ETHTOOL_ID_OFF: LED should be turned off (used iff %ETHTOOL_ID_ACTIVE
+ * is not supported)
+ */
+enum ethtool_phys_id_state {
+ ETHTOOL_ID_INACTIVE,
+ ETHTOOL_ID_ACTIVE,
+ ETHTOOL_ID_ON,
+ ETHTOOL_ID_OFF
+};
+
struct net_device;
/* Some generic methods drivers may use in their ethtool_ops */
@@ -741,7 +757,18 @@ bool ethtool_invalid_flags(struct net_device *dev, u32 data, u32 supported);
* segmentation offload on or off. Returns a negative error code or zero.
* @self_test: Run specified self-tests
* @get_strings: Return a set of strings that describe the requested objects
- * @phys_id: Identify the physical device, e.g. by flashing an LED
+ * @set_phys_id: Identify the physical device, e.g. by flashing an LED
+ * attached to it. The implementation may update the indicator
+ * asynchronously or synchronously, but in either case it must return
+ * quickly. It is initially called with the argument %ETHTOOL_ID_ACTIVE,
+ * and must either activate asynchronous updates or return -%EINVAL.
+ * If it returns -%EINVAL then it will be called again at intervals with
+ * argument %ETHTOOL_ID_ON or %ETHTOOL_ID_OFF and must set the state of
+ * the indicator accordingly. Finally, it is called with the argument
+ * %ETHTOOL_ID_INACTIVE and must deactivate the indicator. Returns a
+ * negative error code or zero.
+ * @phys_id: Deprecated in favour of @set_phys_id.
+ * Identify the physical device, e.g. by flashing an LED
* attached to it until interrupted by a signal or the given time
* (in seconds) elapses. If the given time is zero, use a default
* time limit. Returns a negative error code or zero. Being
@@ -827,6 +854,7 @@ struct ethtool_ops {
int (*set_tso)(struct net_device *, u32);
void (*self_test)(struct net_device *, struct ethtool_test *, u64 *);
void (*get_strings)(struct net_device *, u32 stringset, u8 *);
+ int (*set_phys_id)(struct net_device *, enum ethtool_phys_id_state);
int (*phys_id)(struct net_device *, u32);
void (*get_ethtool_stats)(struct net_device *,
struct ethtool_stats *, u64 *);
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 74ead9e..d1c729d 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -21,6 +21,8 @@
#include <linux/uaccess.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>
+#include <linux/rtnetlink.h>
+#include <linux/sched.h>
/*
* Some useful ethtool_ops methods that're device independent.
@@ -1618,14 +1620,54 @@ out:
static int ethtool_phys_id(struct net_device *dev, void __user *useraddr)
{
struct ethtool_value id;
+ int rc;
- if (!dev->ethtool_ops->phys_id)
+ if (!dev->ethtool_ops->set_phys_id && !dev->ethtool_ops->phys_id)
return -EOPNOTSUPP;
if (copy_from_user(&id, useraddr, sizeof(id)))
return -EFAULT;
- return dev->ethtool_ops->phys_id(dev, id.data);
+ if (!dev->ethtool_ops->set_phys_id)
+ /* Do it the old way */
+ return dev->ethtool_ops->phys_id(dev, id.data);
+
+ rc = dev->ethtool_ops->set_phys_id(dev, ETHTOOL_ID_ACTIVE);
+ if (rc && rc != -EINVAL)
+ return rc;
+
+ dev_hold(dev);
+ rtnl_unlock();
+
+ if (rc == 0) {
+ /* Driver will handle this itself */
+ schedule_timeout_interruptible(
+ id.data ? id.data : MAX_SCHEDULE_TIMEOUT);
+ } else {
+ /* Driver expects to be called periodically */
+ do {
+ rtnl_lock();
+ rc = dev->ethtool_ops->set_phys_id(dev, ETHTOOL_ID_ON);
+ rtnl_unlock();
+ if (rc)
+ break;
+ schedule_timeout_interruptible(HZ / 2);
+
+ rtnl_lock();
+ rc = dev->ethtool_ops->set_phys_id(dev, ETHTOOL_ID_OFF);
+ rtnl_unlock();
+ if (rc)
+ break;
+ schedule_timeout_interruptible(HZ / 2);
+ } while (!signal_pending(current) &&
+ (id.data == 0 || --id.data != 0));
+ }
+
+ rtnl_lock();
+ dev_put(dev);
+
+ (void)dev->ethtool_ops->set_phys_id(dev, ETHTOOL_ID_INACTIVE);
+ return rc;
}
static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
--
1.5.4
* [PATCH net-next-2.6 4/6] ethtool: Fill out and update comment for struct ethtool_ops
From: Ben Hutchings @ 2011-04-03 19:52 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1301859889.2935.23.camel@localhost>
Briefly document all operations (except get_rx_ntuple), including
whether they may return an error code and whether they are deprecated.
Also mention some things that should be handled by the ethtool core
rather than by drivers.
Briefly document general requirements for callers.
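As a rough illustration of the caller requirements documented in this patch (every operation may be NULL, @begin runs first, @complete runs even if the operation failed), here is a minimal user-space sketch. struct demo_ops, demo_nway_reset() and the sample callbacks are hypothetical and heavily cut down; the real ethtool core also handles locking and user-space copies, and its exact ordering of checks may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down ops table; the real struct ethtool_ops is larger. */
struct demo_ops {
	int (*begin)(void *dev);
	void (*complete)(void *dev);
	int (*nway_reset)(void *dev);
};

/* Caller-side pattern: any of the pointers may be NULL. */
int demo_nway_reset(const struct demo_ops *ops, void *dev)
{
	int rc;

	if (!ops->nway_reset)
		return -EOPNOTSUPP;

	if (ops->begin) {
		rc = ops->begin(dev);
		if (rc < 0)
			return rc;
	}

	rc = ops->nway_reset(dev);

	/* Runs even when the operation itself failed. */
	if (ops->complete)
		ops->complete(dev);

	return rc;
}

/* Sample callbacks used to exercise the pattern. */
int demo_calls;				/* counts complete() invocations */

int demo_fail(void *dev)
{
	(void)dev;
	return -EIO;
}

void demo_complete(void *dev)
{
	(void)dev;
	demo_calls++;
}
```

A table with no nway_reset pointer yields -EOPNOTSUPP, while a failing operation still triggers the complete callback before its error code is returned.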
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
include/linux/ethtool.h | 121 +++++++++++++++++++++++++++++++++++------------
1 files changed, 90 insertions(+), 31 deletions(-)
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ab12f84..6da626e 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -683,22 +683,28 @@ void ethtool_ntuple_flush(struct net_device *dev);
bool ethtool_invalid_flags(struct net_device *dev, u32 data, u32 supported);
/**
- * struct ethtool_ops - Alter and report network device settings
- * @get_settings: Get device-specific settings.
- * @get_settings is passed an &ethtool_cmd to fill in. It returns
- * an negative errno or zero.
- * @set_settings: Set device-specific settings.
- * @set_settings is passed an &ethtool_cmd and should attempt to set
- * all the settings this device supports. It may return an error value
- * if something goes wrong (otherwise 0).
- * @get_drvinfo: Report driver information
+ * struct ethtool_ops - optional netdev operations
+ * @get_settings: Get various device settings including Ethernet link
+ * settings. Returns a negative error code or zero.
+ * @set_settings: Set various device settings including Ethernet link
+ * settings. Returns a negative error code or zero.
+ * @get_drvinfo: Report driver/device information. Should only set the
+ * @driver, @version, @fw_version and @bus_info fields. If not
+ * implemented, the @driver and @bus_info fields will be filled in
+ * according to the netdev's parent device.
+ * @get_regs_len: Get buffer length required for @get_regs
* @get_regs: Get device registers
* @get_wol: Report whether Wake-on-Lan is enabled
- * @set_wol: Turn Wake-on-Lan on or off
- * @get_msglevel: Report driver message level
+ * @set_wol: Turn Wake-on-Lan on or off. Returns a negative error code
+ * or zero.
+ * @get_msglevel: Report driver message level. This should be the value
+ * of the @msg_enable field used by netif logging functions.
* @set_msglevel: Set driver message level
- * @nway_reset: Restart autonegotiation
- * @get_link: Get link status
+ * @nway_reset: Restart autonegotiation. Returns a negative error code
+ * or zero.
+ * @get_link: Report whether physical link is up. Will only be called if
+ * the netdev is up. Should usually be set to ethtool_op_get_link(),
+ * which uses netif_carrier_ok().
* @get_eeprom: Read data from the device EEPROM.
* Should fill in the magic field. Don't need to check len for zero
* or wraparound. Fill in the data argument with the eeprom values
@@ -708,28 +714,81 @@ bool ethtool_invalid_flags(struct net_device *dev, u32 data, u32 supported);
* Should validate the magic field. Don't need to check len for zero
* or wraparound. Update len to the amount written. Returns an error
* or zero.
- * @get_coalesce: Get interrupt coalescing parameters
- * @set_coalesce: Set interrupt coalescing parameters
+ * @get_coalesce: Get interrupt coalescing parameters. Returns a negative
+ * error code or zero.
+ * @set_coalesce: Set interrupt coalescing parameters. Returns a negative
+ * error code or zero.
* @get_ringparam: Report ring sizes
- * @set_ringparam: Set ring sizes
+ * @set_ringparam: Set ring sizes. Returns a negative error code or zero.
* @get_pauseparam: Report pause parameters
- * @set_pauseparam: Set pause parameters
- * @get_rx_csum: Report whether receive checksums are turned on or off
- * @set_rx_csum: Turn receive checksum on or off
- * @get_tx_csum: Report whether transmit checksums are turned on or off
- * @set_tx_csum: Turn transmit checksums on or off
- * @get_sg: Report whether scatter-gather is enabled
- * @set_sg: Turn scatter-gather on or off
- * @get_tso: Report whether TCP segmentation offload is enabled
- * @set_tso: Turn TCP segmentation offload on or off
- * @get_ufo: Report whether UDP fragmentation offload is enabled
- * @set_ufo: Turn UDP fragmentation offload on or off
+ * @set_pauseparam: Set pause parameters. Returns a negative error code
+ * or zero.
+ * @get_rx_csum: Deprecated in favour of the netdev feature %NETIF_F_RXCSUM.
+ * Report whether receive checksums are turned on or off.
+ * @set_rx_csum: Deprecated in favour of the netdev op ndo_set_flags. Turn
+ * receive checksum on or off. Returns a negative error code or zero.
+ * @get_tx_csum: Deprecated as redundant. Report whether transmit checksums
+ * are turned on or off.
+ * @set_tx_csum: Deprecated in favour of the netdev op ndo_set_flags. Turn
+ * transmit checksums on or off. Returns a negative error code or zero.
+ * @get_sg: Deprecated as redundant. Report whether scatter-gather is
+ * enabled.
+ * @set_sg: Deprecated in favour of the netdev op ndo_set_flags. Turn
+ * scatter-gather on or off. Returns a negative error code or zero.
+ * @get_tso: Deprecated as redundant. Report whether TCP segmentation
+ * offload is enabled.
+ * @set_tso: Deprecated in favour of the netdev op ndo_set_flags. Turn TCP
+ * segmentation offload on or off. Returns a negative error code or zero.
* @self_test: Run specified self-tests
* @get_strings: Return a set of strings that describe the requested objects
- * @phys_id: Identify the device
- * @get_stats: Return statistics about the device
- * @get_flags: get 32-bit flags bitmap
- * @set_flags: set 32-bit flags bitmap
+ * @phys_id: Identify the physical device, e.g. by flashing an LED
+ * attached to it until interrupted by a signal or the given time
+ * (in seconds) elapses. If the given time is zero, use a default
+ * time limit. Returns a negative error code or zero. Being
+ * interrupted by a signal is not an error.
+ * @get_ethtool_stats: Return extended statistics about the device.
+ * This is only useful if the device maintains statistics not
+ * included in &struct rtnl_link_stats64.
+ * @begin: Function to be called before any other operation. Returns a
+ * negative error code or zero.
+ * @complete: Function to be called after any other operation except
+ * @begin. Will be called even if the other operation failed.
+ * @get_ufo: Deprecated as redundant. Report whether UDP fragmentation
+ * offload is enabled.
+ * @set_ufo: Deprecated in favour of the netdev op ndo_set_flags. Turn UDP
+ * fragmentation offload on or off. Returns a negative error code or zero.
+ * @get_flags: Deprecated as redundant. Report features included in
+ * &enum ethtool_flags that are enabled.
+ * @set_flags: Deprecated in favour of the netdev op ndo_set_flags. Turn
+ * features included in &enum ethtool_flags on or off. Returns a
+ * negative error code or zero.
+ * @get_priv_flags: Report driver-specific feature flags.
+ * @set_priv_flags: Set driver-specific feature flags. Returns a negative
+ * error code or zero.
+ * @get_sset_count: Get number of strings that @get_strings will write.
+ * @get_rxnfc: Get RX flow classification rules. Returns a negative
+ * error code or zero.
+ * @set_rxnfc: Set RX flow classification rules. Returns a negative
+ * error code or zero.
+ * @flash_device: Write a firmware image to device's flash memory.
+ * Returns a negative error code or zero.
+ * @reset: Reset (part of) the device, as specified by a bitmask of
+ * flags from &enum ethtool_reset_flags. Returns a negative
+ * error code or zero.
+ * @set_rx_ntuple: Set an RX n-tuple rule. Returns a negative error code
+ * or zero.
+ * @get_rx_ntuple: Deprecated.
+ * @get_rxfh_indir: Get the contents of the RX flow hash indirection table.
+ * Returns a negative error code or zero.
+ * @set_rxfh_indir: Set the contents of the RX flow hash indirection table.
+ * Returns a negative error code or zero.
+ *
+ * All operations are optional (i.e. the function pointer may be set
+ * to %NULL) and callers must take this into account. Callers must
+ * hold the RTNL, except that for @get_drvinfo the caller may or may
+ * not hold the RTNL.
+ *
+ * See the structures used by these operations for further documentation.
*/
struct ethtool_ops {
int (*get_settings)(struct net_device *, struct ethtool_cmd *);
--
1.5.4
* [PATCH net-next-2.6 3/6] ethtool: Convert struct ethtool_ops comment to kernel-doc format
From: Ben Hutchings @ 2011-04-03 19:51 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1301859889.2935.23.camel@localhost>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
include/linux/ethtool.h | 80 ++++++++++++++++++++--------------------------
1 files changed, 35 insertions(+), 45 deletions(-)
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index c8fcbdd..ab12f84 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -683,63 +683,53 @@ void ethtool_ntuple_flush(struct net_device *dev);
bool ethtool_invalid_flags(struct net_device *dev, u32 data, u32 supported);
/**
- * &ethtool_ops - Alter and report network device settings
- * get_settings: Get device-specific settings
- * set_settings: Set device-specific settings
- * get_drvinfo: Report driver information
- * get_regs: Get device registers
- * get_wol: Report whether Wake-on-Lan is enabled
- * set_wol: Turn Wake-on-Lan on or off
- * get_msglevel: Report driver message level
- * set_msglevel: Set driver message level
- * nway_reset: Restart autonegotiation
- * get_link: Get link status
- * get_eeprom: Read data from the device EEPROM
- * set_eeprom: Write data to the device EEPROM
- * get_coalesce: Get interrupt coalescing parameters
- * set_coalesce: Set interrupt coalescing parameters
- * get_ringparam: Report ring sizes
- * set_ringparam: Set ring sizes
- * get_pauseparam: Report pause parameters
- * set_pauseparam: Set pause parameters
- * get_rx_csum: Report whether receive checksums are turned on or off
- * set_rx_csum: Turn receive checksum on or off
- * get_tx_csum: Report whether transmit checksums are turned on or off
- * set_tx_csum: Turn transmit checksums on or off
- * get_sg: Report whether scatter-gather is enabled
- * set_sg: Turn scatter-gather on or off
- * get_tso: Report whether TCP segmentation offload is enabled
- * set_tso: Turn TCP segmentation offload on or off
- * get_ufo: Report whether UDP fragmentation offload is enabled
- * set_ufo: Turn UDP fragmentation offload on or off
- * self_test: Run specified self-tests
- * get_strings: Return a set of strings that describe the requested objects
- * phys_id: Identify the device
- * get_stats: Return statistics about the device
- * get_flags: get 32-bit flags bitmap
- * set_flags: set 32-bit flags bitmap
- *
- * Description:
- *
- * get_settings:
+ * struct ethtool_ops - Alter and report network device settings
+ * @get_settings: Get device-specific settings.
* @get_settings is passed an &ethtool_cmd to fill in. It returns
* an negative errno or zero.
- *
- * set_settings:
+ * @set_settings: Set device-specific settings.
* @set_settings is passed an &ethtool_cmd and should attempt to set
* all the settings this device supports. It may return an error value
* if something goes wrong (otherwise 0).
- *
- * get_eeprom:
+ * @get_drvinfo: Report driver information
+ * @get_regs: Get device registers
+ * @get_wol: Report whether Wake-on-Lan is enabled
+ * @set_wol: Turn Wake-on-Lan on or off
+ * @get_msglevel: Report driver message level
+ * @set_msglevel: Set driver message level
+ * @nway_reset: Restart autonegotiation
+ * @get_link: Get link status
+ * @get_eeprom: Read data from the device EEPROM.
* Should fill in the magic field. Don't need to check len for zero
* or wraparound. Fill in the data argument with the eeprom values
* from offset to offset + len. Update len to the amount read.
* Returns an error or zero.
- *
- * set_eeprom:
+ * @set_eeprom: Write data to the device EEPROM.
* Should validate the magic field. Don't need to check len for zero
* or wraparound. Update len to the amount written. Returns an error
* or zero.
+ * @get_coalesce: Get interrupt coalescing parameters
+ * @set_coalesce: Set interrupt coalescing parameters
+ * @get_ringparam: Report ring sizes
+ * @set_ringparam: Set ring sizes
+ * @get_pauseparam: Report pause parameters
+ * @set_pauseparam: Set pause parameters
+ * @get_rx_csum: Report whether receive checksums are turned on or off
+ * @set_rx_csum: Turn receive checksum on or off
+ * @get_tx_csum: Report whether transmit checksums are turned on or off
+ * @set_tx_csum: Turn transmit checksums on or off
+ * @get_sg: Report whether scatter-gather is enabled
+ * @set_sg: Turn scatter-gather on or off
+ * @get_tso: Report whether TCP segmentation offload is enabled
+ * @set_tso: Turn TCP segmentation offload on or off
+ * @get_ufo: Report whether UDP fragmentation offload is enabled
+ * @set_ufo: Turn UDP fragmentation offload on or off
+ * @self_test: Run specified self-tests
+ * @get_strings: Return a set of strings that describe the requested objects
+ * @phys_id: Identify the device
+ * @get_stats: Return statistics about the device
+ * @get_flags: get 32-bit flags bitmap
+ * @set_flags: set 32-bit flags bitmap
*/
struct ethtool_ops {
int (*get_settings)(struct net_device *, struct ethtool_cmd *);
--
1.5.4
* [PATCH net-next-2.6 2/6] sfc: Implement generic features interface
From: Ben Hutchings @ 2011-04-03 19:51 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers, Michał Mirosław
In-Reply-To: <1301859889.2935.23.camel@localhost>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
drivers/net/sfc/efx.c | 17 ++++++++-
drivers/net/sfc/ethtool.c | 78 ------------------------------------------
drivers/net/sfc/net_driver.h | 2 -
drivers/net/sfc/rx.c | 2 +-
4 files changed, 16 insertions(+), 83 deletions(-)
diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index d890679..98da250 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -1874,6 +1874,17 @@ static void efx_set_multicast_list(struct net_device *net_dev)
/* Otherwise efx_start_port() will do this */
}
+static int efx_set_features(struct net_device *net_dev, u32 data)
+{
+ struct efx_nic *efx = netdev_priv(net_dev);
+
+ /* If disabling RX n-tuple filtering, clear existing filters */
+ if (net_dev->features & ~data & NETIF_F_NTUPLE)
+ efx_filter_clear_rx(efx, EFX_FILTER_PRI_MANUAL);
+
+ return 0;
+}
+
static const struct net_device_ops efx_netdev_ops = {
.ndo_open = efx_net_open,
.ndo_stop = efx_net_stop,
@@ -1885,6 +1896,7 @@ static const struct net_device_ops efx_netdev_ops = {
.ndo_change_mtu = efx_change_mtu,
.ndo_set_mac_address = efx_set_mac_address,
.ndo_set_multicast_list = efx_set_multicast_list,
+ .ndo_set_features = efx_set_features,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = efx_netpoll,
#endif
@@ -2269,7 +2281,6 @@ static int efx_init_struct(struct efx_nic *efx, struct efx_nic_type *type,
strlcpy(efx->name, pci_name(pci_dev), sizeof(efx->name));
efx->net_dev = net_dev;
- efx->rx_checksum_enabled = true;
spin_lock_init(&efx->stats_lock);
mutex_init(&efx->mac_lock);
efx->mac_op = type->default_mac_ops;
@@ -2452,12 +2463,14 @@ static int __devinit efx_pci_probe(struct pci_dev *pci_dev,
return -ENOMEM;
net_dev->features |= (type->offload_features | NETIF_F_SG |
NETIF_F_HIGHDMA | NETIF_F_TSO |
- NETIF_F_GRO);
+ NETIF_F_GRO | NETIF_F_RXCSUM);
if (type->offload_features & NETIF_F_V6_CSUM)
net_dev->features |= NETIF_F_TSO6;
/* Mask for features that also apply to VLAN devices */
net_dev->vlan_features |= (NETIF_F_ALL_CSUM | NETIF_F_SG |
NETIF_F_HIGHDMA | NETIF_F_TSO);
+ /* All offloads can be toggled */
+ net_dev->hw_features = net_dev->features & ~NETIF_F_HIGHDMA;
efx = netdev_priv(net_dev);
pci_set_drvdata(pci_dev, efx);
SET_NETDEV_DEV(net_dev, &pci_dev->dev);
diff --git a/drivers/net/sfc/ethtool.c b/drivers/net/sfc/ethtool.c
index 807178e..0d55439 100644
--- a/drivers/net/sfc/ethtool.c
+++ b/drivers/net/sfc/ethtool.c
@@ -518,72 +518,6 @@ static void efx_ethtool_get_stats(struct net_device *net_dev,
}
}
-static int efx_ethtool_set_tso(struct net_device *net_dev, u32 enable)
-{
- struct efx_nic *efx __attribute__ ((unused)) = netdev_priv(net_dev);
- u32 features;
-
- features = NETIF_F_TSO;
- if (efx->type->offload_features & NETIF_F_V6_CSUM)
- features |= NETIF_F_TSO6;
-
- if (enable)
- net_dev->features |= features;
- else
- net_dev->features &= ~features;
-
- return 0;
-}
-
-static int efx_ethtool_set_tx_csum(struct net_device *net_dev, u32 enable)
-{
- struct efx_nic *efx = netdev_priv(net_dev);
- u32 features = efx->type->offload_features & NETIF_F_ALL_CSUM;
-
- if (enable)
- net_dev->features |= features;
- else
- net_dev->features &= ~features;
-
- return 0;
-}
-
-static int efx_ethtool_set_rx_csum(struct net_device *net_dev, u32 enable)
-{
- struct efx_nic *efx = netdev_priv(net_dev);
-
- /* No way to stop the hardware doing the checks; we just
- * ignore the result.
- */
- efx->rx_checksum_enabled = !!enable;
-
- return 0;
-}
-
-static u32 efx_ethtool_get_rx_csum(struct net_device *net_dev)
-{
- struct efx_nic *efx = netdev_priv(net_dev);
-
- return efx->rx_checksum_enabled;
-}
-
-static int efx_ethtool_set_flags(struct net_device *net_dev, u32 data)
-{
- struct efx_nic *efx = netdev_priv(net_dev);
- u32 supported = (efx->type->offload_features &
- (ETH_FLAG_RXHASH | ETH_FLAG_NTUPLE));
- int rc;
-
- rc = ethtool_op_set_flags(net_dev, data, supported);
- if (rc)
- return rc;
-
- if (!(data & ETH_FLAG_NTUPLE))
- efx_filter_clear_rx(efx, EFX_FILTER_PRI_MANUAL);
-
- return 0;
-}
-
static void efx_ethtool_self_test(struct net_device *net_dev,
struct ethtool_test *test, u64 *data)
{
@@ -1070,18 +1004,6 @@ const struct ethtool_ops efx_ethtool_ops = {
.set_ringparam = efx_ethtool_set_ringparam,
.get_pauseparam = efx_ethtool_get_pauseparam,
.set_pauseparam = efx_ethtool_set_pauseparam,
- .get_rx_csum = efx_ethtool_get_rx_csum,
- .set_rx_csum = efx_ethtool_set_rx_csum,
- .get_tx_csum = ethtool_op_get_tx_csum,
- /* Need to enable/disable IPv6 too */
- .set_tx_csum = efx_ethtool_set_tx_csum,
- .get_sg = ethtool_op_get_sg,
- .set_sg = ethtool_op_set_sg,
- .get_tso = ethtool_op_get_tso,
- /* Need to enable/disable TSO-IPv6 too */
- .set_tso = efx_ethtool_set_tso,
- .get_flags = ethtool_op_get_flags,
- .set_flags = efx_ethtool_set_flags,
.get_sset_count = efx_ethtool_get_sset_count,
.self_test = efx_ethtool_self_test,
.get_strings = efx_ethtool_get_strings,
diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index 215d5c5..f0f8ca5 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -681,7 +681,6 @@ struct efx_filter_state;
* @port_inhibited: If set, the netif_carrier is always off. Hold the mac_lock
* @port_initialized: Port initialized?
* @net_dev: Operating system network device. Consider holding the rtnl lock
- * @rx_checksum_enabled: RX checksumming enabled
* @stats_buffer: DMA buffer for statistics
* @mac_op: MAC interface
* @phy_type: PHY type
@@ -771,7 +770,6 @@ struct efx_nic {
bool port_initialized;
struct net_device *net_dev;
- bool rx_checksum_enabled;
struct efx_buffer stats_buffer;
diff --git a/drivers/net/sfc/rx.c b/drivers/net/sfc/rx.c
index fb402c5..b7dc891 100644
--- a/drivers/net/sfc/rx.c
+++ b/drivers/net/sfc/rx.c
@@ -605,7 +605,7 @@ void __efx_rx_packet(struct efx_channel *channel,
skb_record_rx_queue(skb, channel->channel);
}
- if (unlikely(!efx->rx_checksum_enabled))
+ if (unlikely(!(efx->net_dev->features & NETIF_F_RXCSUM)))
checksummed = false;
if (likely(checksummed || rx_buf->is_page)) {
--
1.5.4
* [PATCH net-next-2.6 1/6] sfc: Move test of rx_checksum_enabled from nic.c to rx.c
From: Ben Hutchings @ 2011-04-03 19:50 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1301859889.2935.23.camel@localhost>
This is preparation for using the generic netdev features interface,
and should have no effect in itself.
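The "no effect" claim can be checked with a small standalone sketch. The function names are hypothetical and the hardware header-type test is reduced to a single boolean; the sketch also assumes nothing between the event handler and the RX path reads 'checksummed' in the meantime.

```c
#include <assert.h>
#include <stdbool.h>

/* Before: the event handler folds the enable flag into 'checksummed'. */
bool checksummed_before(bool rx_checksum_enabled, bool hw_verified)
{
	return rx_checksum_enabled && hw_verified;
}

/* After: the event handler reports only the hardware result... */
bool event_checksummed(bool hw_verified)
{
	return hw_verified;
}

/* ...and the RX path clears it when checksumming is disabled. */
bool checksummed_after(bool rx_checksum_enabled, bool hw_verified)
{
	bool checksummed = event_checksummed(hw_verified);

	if (!rx_checksum_enabled)
		checksummed = false;
	return checksummed;
}

/* Returns true if both orderings agree for every input combination. */
bool orderings_agree(void)
{
	for (int en = 0; en <= 1; en++)
		for (int hw = 0; hw <= 1; hw++)
			if (checksummed_before(en, hw) !=
			    checksummed_after(en, hw))
				return false;
	return true;
}
```

Because the flag is now tested in one place in rx.c, patch 2/6 can swap it for a test of NETIF_F_RXCSUM without touching nic.c.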
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
drivers/net/sfc/nic.c | 6 ++----
drivers/net/sfc/rx.c | 3 +++
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/net/sfc/nic.c b/drivers/net/sfc/nic.c
index e839661..2594f39 100644
--- a/drivers/net/sfc/nic.c
+++ b/drivers/net/sfc/nic.c
@@ -850,7 +850,6 @@ efx_handle_rx_event(struct efx_channel *channel, const efx_qword_t *event)
unsigned expected_ptr;
bool rx_ev_pkt_ok, discard = false, checksummed;
struct efx_rx_queue *rx_queue;
- struct efx_nic *efx = channel->efx;
/* Basic packet information */
rx_ev_byte_cnt = EFX_QWORD_FIELD(*event, FSF_AZ_RX_EV_BYTE_CNT);
@@ -873,9 +872,8 @@ efx_handle_rx_event(struct efx_channel *channel, const efx_qword_t *event)
* UDP/IP, then we can rely on the hardware checksum.
*/
checksummed =
- likely(efx->rx_checksum_enabled) &&
- (rx_ev_hdr_type == FSE_CZ_RX_EV_HDR_TYPE_IPV4V6_TCP ||
- rx_ev_hdr_type == FSE_CZ_RX_EV_HDR_TYPE_IPV4V6_UDP);
+ rx_ev_hdr_type == FSE_CZ_RX_EV_HDR_TYPE_IPV4V6_TCP ||
+ rx_ev_hdr_type == FSE_CZ_RX_EV_HDR_TYPE_IPV4V6_UDP;
} else {
efx_handle_rx_not_ok(rx_queue, event, &rx_ev_pkt_ok, &discard);
checksummed = false;
diff --git a/drivers/net/sfc/rx.c b/drivers/net/sfc/rx.c
index c0fdb59..fb402c5 100644
--- a/drivers/net/sfc/rx.c
+++ b/drivers/net/sfc/rx.c
@@ -605,6 +605,9 @@ void __efx_rx_packet(struct efx_channel *channel,
skb_record_rx_queue(skb, channel->channel);
}
+ if (unlikely(!efx->rx_checksum_enabled))
+ checksummed = false;
+
if (likely(checksummed || rx_buf->is_page)) {
efx_rx_packet_gro(channel, rx_buf, eh, checksummed);
return;
--
1.5.4
--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply related
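Taken together, patches 1/6 and 2/6 above leave a single test in the RX path: the hardware's checksum verdict is ignored unless RX checksum offload is enabled in the device features. A minimal userspace sketch of that logic follows. The sketch_ names and the bit value are assumptions made for illustration; this is not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit only; the kernel defines the real NETIF_F_RXCSUM. */
#define SKETCH_F_RXCSUM (UINT32_C(1) << 0)

/* Hypothetical stand-in for the net_device field the driver tests. */
struct sketch_netdev {
	uint32_t features;
};

/*
 * Model of the test that patch 1/6 moves into __efx_rx_packet() and
 * patch 2/6 rewrites against the device feature flags: the hardware
 * verdict is trusted only while RX checksum offload is enabled.
 */
static bool sketch_rx_checksummed(const struct sketch_netdev *dev,
				  bool hw_checksummed)
{
	assert(dev != NULL);
	if (!(dev->features & SKETCH_F_RXCSUM))
		return false;
	return hw_checksummed;
}
```

With the feature bit cleared the function always reports "not checksummed", which mirrors the `checksummed = false` assignment in the hunks above.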
* pull request: sfc-next-2.6 2011-04-03
From: Ben Hutchings @ 2011-04-03 19:44 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-net-drivers
The following changes since commit 9b12c75bf4d58dd85c987ee7b6a4356fdc7c1222:
David S. Miller (1):
net: Order ports in same order as addresses in flow objects.
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next-2.6.git master
1. Implementation of the generic features interface in sfc.
2. Update ethtool_ops documentation.
3. Reimplement ETHTOOL_PHYS_ID as discussed, dropping the RTNL lock.
Please allow some time for others to review before pulling.
Ben.
Ben Hutchings (6):
sfc: Move test of rx_checksum_enabled from nic.c to rx.c
sfc: Implement generic features interface
ethtool: Convert struct ethtool_ops comment to kernel-doc format
ethtool: Fill out and update comment for struct ethtool_ops
ethtool: Change ETHTOOL_PHYS_ID implementation to allow dropping RTNL
sfc: Implement ethtool_ops::set_phys_id instead of ethtool_ops::phys_id
drivers/net/sfc/efx.c | 17 ++++-
drivers/net/sfc/ethtool.c | 106 ++++---------------------
drivers/net/sfc/net_driver.h | 2 -
drivers/net/sfc/nic.c | 6 +-
drivers/net/sfc/rx.c | 3 +
include/linux/ethtool.h | 177 ++++++++++++++++++++++++++++++------------
net/core/ethtool.c | 46 +++++++++++-
7 files changed, 209 insertions(+), 148 deletions(-)
--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Netxen packet loss with VLANs and LRO (was: [PATCH] netxen: fix LRO disable warning)
From: Marc Haber @ 2011-04-03 19:05 UTC (permalink / raw)
To: Amit Kumar Salecha; +Cc: davem, netdev, ameen.rahman, Rajesh Borundia
In-Reply-To: <1300703828-6291-1-git-send-email-amit.salecha@qlogic.com>
Hi,
On Mon, Mar 21, 2011 at 03:37:08AM -0700, Amit Kumar Salecha wrote:
> netxen_nic_set_flags() rejects data if other flag than ETH_FLAG_LRO is set.
> Driver also supports NETIF_F_HW_VLAN_TX.
> Now compare data with ethtool_op_get_flags(), to get all supported features.
Could that be the cause for packet loss on kernel 2.6.38.2 if:
- receiving card is NX3031 [4040:0100]
- frames are received with VLAN tags
- large receive offload (LRO) is on?
Packet loss of this kind shows up during TCP data transfers towards the
host with the Netxen interface, when the TCP session is terminated on
the Netxen host itself. TCP sessions routed through the Netxen host are
not affected.
My ethtool doesn't allow me to change the LRO setting alone: LRO is
disabled when I set rx off, but it doesn't come back on when rx is set
to on again. So running ethtool -K rx off followed by ethtool -K rx on
fixes the issue.
Is this a known bug, maybe with an available patch?
Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature | How to make an American Quilt | Fax: *49 3221 2323190
^ permalink raw reply
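The flag check described in the patch quoted in the message above can be read as follows: the old code refused any request containing a bit other than LRO, while the new code compares the request against the flags currently in effect. The sketch below is one possible reading under that assumption. The sketch_ names and bit positions are hypothetical and are not taken from the netxen driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions for this sketch only. */
#define SKETCH_FLAG_TXVLAN (UINT32_C(1) << 7)
#define SKETCH_FLAG_LRO    (UINT32_C(1) << 15)

/* Hypothetical device state: the flags currently in effect. */
struct sketch_dev {
	uint32_t flags;
};

/* Old check as described: any bit other than LRO is refused. */
static bool sketch_old_flags_ok(uint32_t requested)
{
	return (requested & ~SKETCH_FLAG_LRO) == 0;
}

/* Assumed new check: only the LRO bit may differ from the current flags. */
static bool sketch_new_flags_ok(const struct sketch_dev *dev,
				uint32_t requested)
{
	assert(dev != NULL);
	return ((dev->flags ^ requested) & ~SKETCH_FLAG_LRO) == 0;
}
```

Under this reading, a request that keeps the VLAN bit as it is and only clears LRO fails the old check but passes the new one.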
* Re: [PATCH] xen: netfront: fix declaration order
From: Michał Mirosław @ 2011-04-03 17:35 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, netdev, jeremy.fitzhardinge, konrad.wilk,
Ian.Campbell, xen-devel, virtualization
In-Reply-To: <1301828839.2837.143.camel@edumazet-laptop>
On Sun, Apr 03, 2011 at 01:07:19PM +0200, Eric Dumazet wrote:
> Le vendredi 01 avril 2011 à 20:54 -0700, David Miller a écrit :
> > From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > Date: Thu, 31 Mar 2011 13:01:35 +0200 (CEST)
> >
> > > Not tested in any way. The original code for offload setting seems broken
> > > as it resets the features on every netback reconnect.
> > >
> > > This will set GSO_ROBUST at device creation time (earlier than connect time).
> > >
> > > RX checksum offload is forced on - so advertise as it is.
> > >
> > > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > Applied.
> Hmm... I had to apply following patch to make it actually compile.
> [PATCH] xen: netfront: fix declaration order
>
> Must declare xennet_fix_features() and xennet_set_features() before
> using them.
Hmm. Sorry for that. Looks like x86 allyesconfig doesn't include this
driver in the build. :/
There really needs to be something like CONFIG_LINT...
Best Regards,
Michał Mirosław
^ permalink raw reply
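The declaration-order problem discussed in the message above is the usual C rule that a name must be declared before it is used in an initializer. A minimal sketch, assuming a hypothetical ops table (the sketch_ names are illustrative, not the netfront code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_dev;

/* Hypothetical ops table, loosely modelled on a subset of driver callbacks. */
struct sketch_ops {
	uint32_t (*fix_features)(struct sketch_dev *dev, uint32_t features);
	int (*set_features)(struct sketch_dev *dev, uint32_t features);
};

struct sketch_dev {
	const struct sketch_ops *ops;
	uint32_t features;
	uint32_t supported;
};

/* Forward declarations: the table below needs these names in scope. */
static uint32_t sketch_fix_features(struct sketch_dev *dev, uint32_t features);
static int sketch_set_features(struct sketch_dev *dev, uint32_t features);

static const struct sketch_ops sketch_netdev_ops = {
	.fix_features = sketch_fix_features,
	.set_features = sketch_set_features,
};

/* Mask the request down to what the device supports. */
static uint32_t sketch_fix_features(struct sketch_dev *dev, uint32_t features)
{
	assert(dev != NULL);
	return features & dev->supported;
}

/* Record the accepted feature set. */
static int sketch_set_features(struct sketch_dev *dev, uint32_t features)
{
	assert(dev != NULL);
	dev->features = features;
	return 0;
}
```

Removing the two forward declarations makes the table initializer fail to compile, which is the kind of error the follow-up patch addresses.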
* Re: [patch net-next-2.6] net: vlan: make non-hw-accel rx path similar to hw-accel
From: Nicolas de Pesloüan @ 2011-04-03 15:23 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, davem, shemminger, kaber, fubar, eric.dumazet, andy,
xiaosuo, jesse
In-Reply-To: <1301739966-7604-1-git-send-email-jpirko@redhat.com>
Le 02/04/2011 12:26, Jiri Pirko a écrit :
> Now there are 2 paths for rx vlan frames. When rx-vlan-hw-accel is
> enabled, skb is untagged by NIC, vlan_tci is set and the skb gets into
> vlan code in __netif_receive_skb - vlan_hwaccel_do_receive.
>
> For non-rx-vlan-hw-accel however, tagged skb goes thru whole
> __netif_receive_skb, it's untagged in ptype_base handler and reinjected.
>
> This inconsistency is fixed by this patch. Vlan untagging happens early in
> __netif_receive_skb so the rest of code (ptype_all handlers, rx_handlers)
> see the skb like it was untagged by hw.
>
> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
<snip>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 3da9fb0..bfe9fce 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3130,6 +3130,12 @@ another_round:
>
> __this_cpu_inc(softnet_data.processed);
>
> + if (skb->protocol == cpu_to_be16(ETH_P_8021Q)) {
> + skb = vlan_untag(skb);
> + if (unlikely(!skb))
> + goto out;
> + }
> +
I like the general idea of this patch, but I don't like the idea of re-inserting specific code
inside __netif_receive_skb.
You did great work removing most, if not all, device-specific parts from __netif_receive_skb
by introducing rx_handler.
I think the above part (and vlan_untag) should be moved to a vlan_rx_handler that would be set on
the net_devices that are the parent of a vlan net_device and are NOT hwaccel.
vlan_rx_handler would return RX_HANDLER_ANOTHER if skb holds a tagged frame (skb->dev changed) and
RX_HANDLER_PASS if skb holds an untagged frame (skb->dev unchanged).
This would also cause protocol handlers to receive the untouched (tagged) frame, if no setup
required the frame to be untagged, which I think is the right thing to do.
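The proposed vlan_rx_handler behaviour can be sketched in userspace as below. This is only a model of the idea in the preceding paragraphs: the sketch_ types, the single-entry VLAN map, and the choice to pass tagged frames for an unconfigured VLAN through untouched are assumptions, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Result names follow the mail; the enum itself is a stand-in. */
enum sketch_rx_result {
	SKETCH_RX_HANDLER_ANOTHER,	/* skb->dev changed, run another round */
	SKETCH_RX_HANDLER_PASS,		/* skb->dev unchanged, continue as usual */
};

struct sketch_dev {
	int ifindex;
};

struct sketch_skb {
	struct sketch_dev *dev;
	bool tagged;		/* frame still carries an 802.1Q tag */
	uint16_t vlan_id;
};

/* Hypothetical lookup table: one VLAN device stacked on the parent. */
struct sketch_vlan_map {
	uint16_t vlan_id;
	struct sketch_dev *vlan_dev;
};

static enum sketch_rx_result
sketch_vlan_rx_handler(struct sketch_skb *skb,
		       const struct sketch_vlan_map *map)
{
	assert(skb != NULL && map != NULL);

	/* Untagged frame: nothing to do for the VLAN layer. */
	if (!skb->tagged)
		return SKETCH_RX_HANDLER_PASS;

	/* Assumption: no VLAN device for this id, so leave the tag in place. */
	if (skb->vlan_id != map->vlan_id)
		return SKETCH_RX_HANDLER_PASS;

	/* Untag and hand the frame to the VLAN device. */
	skb->tagged = false;
	skb->dev = map->vlan_dev;
	return SKETCH_RX_HANDLER_ANOTHER;
}
```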
> @@ -3177,7 +3183,7 @@ ncls:
> ret = deliver_skb(skb, pt_prev, orig_dev);
> pt_prev = NULL;
> }
> - if (vlan_hwaccel_do_receive(&skb)) {
> + if (vlan_do_receive(&skb)) {
> ret = __netif_receive_skb(skb);
> goto out;
> } else if (unlikely(!skb))
Why are you calling __netif_receive_skb here? Can't we simply goto another_round?
I really think vlan_untag and vlan_do_receive could be merged into a vlan_rx_handler.
And if someone considers that rx_handler processing happens too late for ptype_all handlers, maybe it is
time to have a look at one of my previously proposed patches: http://patchwork.ozlabs.org/patch/85578/
Nicolas.
^ permalink raw reply
* [PATCH v2] net: filter: Just In Time compiler
From: Eric Dumazet @ 2011-04-03 13:56 UTC (permalink / raw)
To: David Miller
Cc: netdev, Arnaldo Carvalho de Melo, Ben Hutchings,
Hagen Paul Pfeifer
In-Reply-To: <1301784797.3110.4.camel@localhost>
To speed up packet filtering, here is an implementation of a JIT
compiler for x86_64.
It is disabled by default, and must be enabled by the admin.
echo 1 >/proc/sys/net/core/bpf_jit_enable
It uses module_alloc() and module_free() to get memory in the 2GB kernel
text range, since we call helper functions from the generated code.
EAX : BPF A accumulator
EBX : BPF X accumulator
RDI : pointer to skb (first argument given to JIT function)
RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
r9d : skb->len - skb->data_len (headlen)
r8 : skb->data
To get a trace of generated code, use :
echo 2 >/proc/sys/net/core/bpf_jit_enable
Example of generated code :
# tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24
flen=18 proglen=147 pass=3 image=ffffffffa00b5000
JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
JIT code: ffffffffa00b5090: c0 c9 c3
The BPF program is 144 bytes long, so the native program is almost the same size ;)
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 8
(002) ld [26]
(003) and #0xffffff00
(004) jeq #0xc0a81400 jt 16 jf 5
(005) ld [30]
(006) and #0xffffff00
(007) jeq #0xc0a81400 jt 16 jf 17
(008) jeq #0x806 jt 10 jf 9
(009) jeq #0x8035 jt 10 jf 17
(010) ld [28]
(011) and #0xffffff00
(012) jeq #0xc0a81400 jt 16 jf 13
(013) ld [38]
(014) and #0xffffff00
(015) jeq #0xc0a81400 jt 16 jf 17
(016) ret #65535
(017) ret #0
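To make the listing above concrete, here is a small userspace interpreter for just the five operations it uses (18 instructions at 8 bytes each give the 144 bytes mentioned). It is a sketch, not the kernel's sk_run_filter(): the sketch_ names are hypothetical, jump targets are stored as the absolute indices printed in the listing, and an out-of-range load is assumed to end the filter with 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_op { OP_LDH, OP_LD, OP_AND, OP_JEQ, OP_RET };

struct sketch_insn {
	enum sketch_op op;
	uint32_t k;
	uint8_t jt, jf;		/* absolute targets, as printed in the listing */
};

/* The 18 instructions of the "host 192.168.20.0/24" example. */
static const struct sketch_insn sketch_prog[18] = {
	{ OP_LDH, 12, 0, 0 },		{ OP_JEQ, 0x800, 2, 8 },
	{ OP_LD, 26, 0, 0 },		{ OP_AND, 0xffffff00, 0, 0 },
	{ OP_JEQ, 0xc0a81400, 16, 5 },	{ OP_LD, 30, 0, 0 },
	{ OP_AND, 0xffffff00, 0, 0 },	{ OP_JEQ, 0xc0a81400, 16, 17 },
	{ OP_JEQ, 0x806, 10, 9 },	{ OP_JEQ, 0x8035, 10, 17 },
	{ OP_LD, 28, 0, 0 },		{ OP_AND, 0xffffff00, 0, 0 },
	{ OP_JEQ, 0xc0a81400, 16, 13 },	{ OP_LD, 38, 0, 0 },
	{ OP_AND, 0xffffff00, 0, 0 },	{ OP_JEQ, 0xc0a81400, 16, 17 },
	{ OP_RET, 65535, 0, 0 },	{ OP_RET, 0, 0, 0 },
};

/* Run the filter over a raw Ethernet frame; returns the RET value. */
static uint32_t sketch_run(const uint8_t *pkt, size_t len)
{
	uint32_t a = 0;		/* BPF A accumulator */
	size_t pc = 0;

	assert(pkt != NULL);
	while (pc < 18) {
		const struct sketch_insn *in = &sketch_prog[pc];
		size_t k = in->k;

		switch (in->op) {
		case OP_LDH:	/* big-endian 16-bit load */
			if (k + 2 > len)
				return 0;
			a = (uint32_t)pkt[k] << 8 | pkt[k + 1];
			pc++;
			break;
		case OP_LD:	/* big-endian 32-bit load */
			if (k + 4 > len)
				return 0;
			a = (uint32_t)pkt[k] << 24 | (uint32_t)pkt[k + 1] << 16 |
			    (uint32_t)pkt[k + 2] << 8 | pkt[k + 3];
			pc++;
			break;
		case OP_AND:
			a &= in->k;
			pc++;
			break;
		case OP_JEQ:
			pc = (a == in->k) ? in->jt : in->jf;
			break;
		case OP_RET:
			return in->k;
		}
	}
	return 0;
}
```

Offsets 26 and 30 are the IPv4 source and destination addresses in an untagged Ethernet frame; 28 and 38 are the ARP/RARP sender and target protocol addresses.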
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
---
perf tool might need some changes to take into account JIT.
V2: BPF_S_ALU_AND_K optimizations, BPF_S_ANC_QUEUE support
Move x86 files to arch/x86/net
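The EMIT macros in the patch below pack opcode bytes into one integer, least significant byte first, and store it into the code buffer. The following userspace sketch reproduces the first twelve bytes of the trace shown earlier (55 48 89 e5 48 83 ec 60 48 89 5d f8). It writes byte by byte so that it does not depend on host endianness, which is a deliberate difference from the patch; the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append the low `len` bytes of `bytes`, least significant byte first. */
static uint8_t *sketch_emit(uint8_t *ptr, uint32_t bytes, unsigned int len)
{
	unsigned int i;

	assert(ptr != NULL && len >= 1 && len <= 4);
	for (i = 0; i < len; i++)
		ptr[i] = (uint8_t)(bytes >> (8 * i));
	return ptr + len;
}

/* Same packing as EMIT4 in the patch: b1 ends up first in memory. */
#define SKETCH_EMIT4(p, b1, b2, b3, b4) \
	sketch_emit((p), (uint32_t)(b1) | (uint32_t)(b2) << 8 | \
			 (uint32_t)(b3) << 16 | (uint32_t)(b4) << 24, 4)

/* Emit the start of the prologue; returns the number of bytes written. */
static size_t sketch_emit_prologue(uint8_t *buf)
{
	uint8_t *p = buf;

	p = SKETCH_EMIT4(p, 0x55, 0x48, 0x89, 0xe5); /* push %rbp; mov %rsp,%rbp */
	p = SKETCH_EMIT4(p, 0x48, 0x83, 0xec, 96);   /* subq $96,%rsp */
	p = SKETCH_EMIT4(p, 0x48, 0x89, 0x5d, 0xf8); /* mov %rbx,-8(%rbp) */
	return (size_t)(p - buf);
}
```

The caller is expected to supply a buffer of at least 12 bytes.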
Documentation/sysctl/net.txt | 11
MAINTAINERS | 1
arch/x86/Kbuild | 1
arch/x86/Kconfig | 1
arch/x86/net/bpf_jit.S | 142 +++++++
arch/x86/net/bpf_jit_comp.c | 655 +++++++++++++++++++++++++++++++++
include/linux/filter.h | 76 +++
include/linux/netdevice.h | 1
include/linux/skbuff.h | 4
net/Kconfig | 13
net/core/filter.c | 65 ---
net/core/sysctl_net_core.c | 9
net/packet/af_packet.c | 2
13 files changed, 917 insertions(+), 64 deletions(-)
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index cbd05ff..3201a70 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -32,6 +32,17 @@ Table : Subdirectories in /proc/sys/net
1. /proc/sys/net/core - Network core options
-------------------------------------------------------
+bpf_jit_enable
+--------------
+
+This enables Berkeley Packet Filter Just in Time compiler.
+Currently supported on x86_64 architecture, bpf_jit provides a framework
+to speed packet filtering, the one used by tcpdump/libpcap for example.
+Values :
+ 0 - disable the JIT (default value)
+ 1 - enable the JIT
+ 2 - enable the JIT and ask the compiler to emit traces on kernel log.
+
rmem_default
------------
diff --git a/MAINTAINERS b/MAINTAINERS
index 6b4b9cd..32898ea 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4372,6 +4372,7 @@ S: Maintained
F: net/ipv4/
F: net/ipv6/
F: include/net/ip*
+F: arch/x86/net/*
NETWORKING [LABELED] (NetLabel, CIPSO, Labeled IPsec, SECMARK)
M: Paul Moore <paul.moore@hp.com>
diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index 0e10323..0e9dec6 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -15,3 +15,4 @@ obj-y += vdso/
obj-$(CONFIG_IA32_EMULATION) += ia32/
obj-y += platform/
+obj-y += net/
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..855a1bd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -72,6 +72,7 @@ config X86
select IRQ_FORCED_THREADING
select USE_GENERIC_SMP_HELPERS if SMP
select ARCH_NO_SYSDEV_OPS
+ select HAVE_BPF_JIT if X86_64
config INSTRUCTION_DECODER
def_bool (KPROBES || PERF_EVENTS)
diff --git a/arch/x86/net/bpf_jit.S b/arch/x86/net/bpf_jit.S
new file mode 100644
index 0000000..a0a9843
--- /dev/null
+++ b/arch/x86/net/bpf_jit.S
@@ -0,0 +1,142 @@
+/* bpf_jit.S : BPF JIT helper functions
+ *
+ * Copyright (C) 2011 Eric Dumazet (eric.dumazet@gmail.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/*
+ * Calling convention :
+ * rdi : skb pointer
+ * esi : offset of byte(s) to fetch in skb (can be scratched)
+ * r8 : copy of skb->data
+ * r9d : hlen = skb->len - skb->data_len
+ */
+#define SKBDATA %r8
+
+sk_load_word_ind:
+ .globl sk_load_word_ind
+
+ add %ebx,%esi /* offset += X */
+# test %esi,%esi /* if (offset < 0) goto bpf_error; */
+ js bpf_error
+
+sk_load_word:
+ .globl sk_load_word
+
+ mov %r9d,%eax # hlen
+ sub %esi,%eax # hlen - offset
+ cmp $3,%eax
+ jle bpf_slow_path_word
+ mov (SKBDATA,%rsi),%eax
+ bswap %eax /* ntohl() */
+ ret
+
+
+sk_load_half_ind:
+ .globl sk_load_half_ind
+
+ add %ebx,%esi /* offset += X */
+ js bpf_error
+
+sk_load_half:
+ .globl sk_load_half
+
+ mov %r9d,%eax
+ sub %esi,%eax # hlen - offset
+ cmp $1,%eax
+ jle bpf_slow_path_half
+ movzwl (SKBDATA,%rsi),%eax
+ rol $8,%ax # ntohs()
+ ret
+
+sk_load_byte_ind:
+ .globl sk_load_byte_ind
+ add %ebx,%esi /* offset += X */
+ js bpf_error
+
+sk_load_byte:
+ .globl sk_load_byte
+
+ cmp %esi,%r9d /* if (offset >= hlen) goto bpf_slow_path_byte */
+ jle bpf_slow_path_byte
+ movzbl (SKBDATA,%rsi),%eax
+ ret
+
+/**
+ * sk_load_byte_msh - BPF_S_LDX_B_MSH helper
+ *
+ * Implements BPF_S_LDX_B_MSH : ldxb 4*([offset]&0xf)
+ * Must preserve A accumulator (%eax)
+ * Inputs : %esi is the offset value, already known positive
+ */
+ENTRY(sk_load_byte_msh)
+ CFI_STARTPROC
+ cmp %esi,%r9d /* if (offset >= hlen) goto bpf_slow_path_byte_msh */
+ jle bpf_slow_path_byte_msh
+ movzbl (SKBDATA,%rsi),%ebx
+ and $15,%bl
+ shl $2,%bl
+ ret
+ CFI_ENDPROC
+ENDPROC(sk_load_byte_msh)
+
+bpf_error:
+# force a return 0 from jit handler
+ xor %eax,%eax
+ mov -8(%rbp),%rbx
+ leaveq
+ ret
+
+/* rsi contains offset and can be scratched */
+#define bpf_slow_path_common(LEN) \
+ push %rdi; /* save skb */ \
+ push %r9; \
+ push SKBDATA; \
+/* rsi already has offset */ \
+ mov $LEN,%ecx; /* len */ \
+ lea -12(%rbp),%rdx; \
+ call skb_copy_bits; \
+ test %eax,%eax; \
+ pop SKBDATA; \
+ pop %r9; \
+ pop %rdi
+
+
+bpf_slow_path_word:
+ bpf_slow_path_common(4)
+ js bpf_error
+ mov -12(%rbp),%eax
+ bswap %eax
+ ret
+
+bpf_slow_path_half:
+ bpf_slow_path_common(2)
+ js bpf_error
+ mov -12(%rbp),%ax
+ rol $8,%ax
+ movzwl %ax,%eax
+ ret
+
+bpf_slow_path_byte:
+ bpf_slow_path_common(1)
+ js bpf_error
+ movzbl -12(%rbp),%eax
+ ret
+
+bpf_slow_path_byte_msh:
+ xchg %eax,%ebx /* dont lose A , X is about to be scratched */
+ bpf_slow_path_common(1)
+ js bpf_error
+ movzbl -12(%rbp),%eax
+ and $15,%al
+ shl $2,%al
+ xchg %eax,%ebx
+ ret
+
+
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
new file mode 100644
index 0000000..a276816
--- /dev/null
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -0,0 +1,655 @@
+/* bpf_jit_comp.c : BPF JIT compiler
+ *
+ * Copyright (C) 2011 Eric Dumazet (eric.dumazet@gmail.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#include <linux/moduleloader.h>
+#include <asm/cacheflush.h>
+#include <linux/netdevice.h>
+#include <linux/filter.h>
+
+/*
+ * Conventions :
+ * EAX : BPF A accumulator
+ * EBX : BPF X accumulator
+ * RDI : pointer to skb (first argument given to JIT function)
+ * RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
+ * ECX,EDX,ESI : scratch registers
+ * r9d : skb->len - skb->data_len (headlen)
+ * r8 : skb->data
+ * -8(RBP) : saved RBX value
+ * -16(RBP)..-80(RBP) : BPF_MEMWORDS values
+ */
+int bpf_jit_enable __read_mostly;
+
+/*
+ * assembly code in arch/x86/net/bpf_jit.S
+ */
+extern u8 sk_load_word[], sk_load_half[], sk_load_byte[], sk_load_byte_msh[];
+extern u8 sk_load_word_ind[], sk_load_half_ind[], sk_load_byte_ind[];
+
+static inline u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
+{
+ if (len == 1)
+ *ptr = bytes;
+ else if (len == 2)
+ *(u16 *)ptr = bytes;
+ else {
+ *(u32 *)ptr = bytes;
+ barrier();
+ }
+ return ptr + len;
+}
+
+#define EMIT(bytes, len) do { prog = emit_code(prog, bytes, len); } while (0)
+
+#define EMIT1(b1) EMIT(b1, 1)
+#define EMIT2(b1, b2) EMIT((b1) + ((b2) << 8), 2)
+#define EMIT3(b1, b2, b3) EMIT((b1) + ((b2) << 8) + ((b3) << 16), 3)
+#define EMIT4(b1, b2, b3, b4) EMIT((b1) + ((b2) << 8) + ((b3) << 16) + ((b4) << 24), 4)
+#define EMIT1_off32(b1, off) do { EMIT1(b1); EMIT(off, 4);} while (0)
+
+#define CLEAR_A() EMIT2(0x31, 0xc0) /* xor %eax,%eax */
+#define CLEAR_X() EMIT2(0x31, 0xdb) /* xor %ebx,%ebx */
+
+static inline bool is_imm8(int value)
+{
+ return value <= 127 && value >= -128;
+}
+
+static inline bool is_near(int offset)
+{
+ return offset <= 127 && offset >= -128;
+}
+
+#define EMIT_JMP(offset) \
+do { \
+ if (offset) { \
+ if (is_near(offset)) \
+ EMIT2(0xeb, offset); /* jmp .+off8 */ \
+ else \
+ EMIT1_off32(0xe9, offset); /* jmp .+off32 */ \
+ } \
+} while (0)
+
+/* list of x86 cond jumps opcodes (. + s8)
+ * Add 0x10 (and an extra 0x0f) to generate far jumps (. + s32)
+ */
+#define X86_JB 0x72
+#define X86_JAE 0x73
+#define X86_JE 0x74
+#define X86_JNE 0x75
+#define X86_JBE 0x76
+#define X86_JA 0x77
+
+#define EMIT_COND_JMP(op, offset) \
+do { \
+ if (is_near(offset)) \
+ EMIT2(op, offset); /* jxx .+off8 */ \
+ else { \
+ EMIT2(0x0f, op + 0x10); \
+ EMIT(offset, 4); /* jxx .+off32 */ \
+ } \
+} while (0)
+
+#define COND_SEL(CODE, TOP, FOP) \
+ case CODE: \
+ t_op = TOP; \
+ f_op = FOP; \
+ goto cond_branch
+
+
+#define SEEN_DATAREF 1 /* might call external helpers */
+#define SEEN_XREG 2 /* ebx is used */
+#define SEEN_MEM 4 /* use mem[] for temporary storage */
+
+static inline void bpf_flush_icache(void *start, void *end)
+{
+ mm_segment_t old_fs = get_fs();
+
+ set_fs(KERNEL_DS);
+ smp_wmb();
+ flush_icache_range((unsigned long)start, (unsigned long)end);
+ set_fs(old_fs);
+}
+
+
+void bpf_jit_compile(struct sk_filter *fp)
+{
+ u8 temp[64];
+ u8 *prog;
+ unsigned int proglen, oldproglen = 0;
+ int ilen, i;
+ int t_offset, f_offset;
+ u8 t_op, f_op, seen = 0, pass;
+ u8 *image = NULL;
+ u8 *func;
+ int pc_ret0 = -1; /* bpf index of first RET #0 instruction (if any) */
+ unsigned int cleanup_addr; /* epilogue code offset */
+ unsigned int *addrs;
+ const struct sock_filter *filter = fp->insns;
+ int flen = fp->len;
+
+ if (!bpf_jit_enable)
+ return;
+
+ addrs = kmalloc(flen * sizeof(*addrs), GFP_KERNEL);
+ if (addrs == NULL)
+ return;
+
+ /* Before first pass, make a rough estimation of addrs[]
+ * each bpf instruction is translated to less than 64 bytes
+ */
+ for (proglen = 0, i = 0; i < flen; i++) {
+ proglen += 64;
+ addrs[i] = proglen;
+ }
+ cleanup_addr = proglen; /* epilogue address */
+
+ for (pass = 0; pass < 10; pass++) {
+ /* no prologue/epilogue for trivial filters (RET something) */
+ proglen = 0;
+ prog = temp;
+
+ if (seen) {
+ EMIT4(0x55, 0x48, 0x89, 0xe5); /* push %rbp; mov %rsp,%rbp */
+ EMIT4(0x48, 0x83, 0xec, 96); /* subq $96,%rsp */
+ /* note : must save %rbx in case bpf_error is hit */
+ if (seen & (SEEN_XREG | SEEN_DATAREF))
+ EMIT4(0x48, 0x89, 0x5d, 0xf8); /* mov %rbx, -8(%rbp) */
+ if (seen & SEEN_XREG)
+ CLEAR_X(); /* make sure we dont leak kernel memory */
+
+ /*
+ * If this filter needs to access skb data,
+ * loads r9 and r8 with :
+ * r9 = skb->len - skb->data_len
+ * r8 = skb->data
+ */
+ if (seen & SEEN_DATAREF) {
+ if (offsetof(struct sk_buff, len) <= 127)
+ /* mov off8(%rdi),%r9d */
+ EMIT4(0x44, 0x8b, 0x4f, offsetof(struct sk_buff, len));
+ else {
+ /* mov off32(%rdi),%r9d */
+ EMIT3(0x44, 0x8b, 0x8f);
+ EMIT(offsetof(struct sk_buff, len), 4);
+ }
+ if (is_imm8(offsetof(struct sk_buff, data_len)))
+ /* sub off8(%rdi),%r9d */
+ EMIT4(0x44, 0x2b, 0x4f, offsetof(struct sk_buff, data_len));
+ else {
+ EMIT3(0x44, 0x2b, 0x8f);
+ EMIT(offsetof(struct sk_buff, data_len), 4);
+ }
+
+ if (is_imm8(offsetof(struct sk_buff, data)))
+ /* mov off8(%rdi),%r8 */
+ EMIT4(0x4c, 0x8b, 0x47, offsetof(struct sk_buff, data));
+ else {
+ /* mov off32(%rdi),%r8 */
+ EMIT3(0x4c, 0x8b, 0x87);
+ EMIT(offsetof(struct sk_buff, data), 4);
+ }
+ }
+ }
+
+ switch (filter[0].code) {
+ case BPF_S_RET_K:
+ case BPF_S_LD_W_LEN:
+ case BPF_S_ANC_PROTOCOL:
+ case BPF_S_ANC_IFINDEX:
+ case BPF_S_ANC_MARK:
+ case BPF_S_ANC_RXHASH:
+ case BPF_S_ANC_CPU:
+ case BPF_S_ANC_QUEUE:
+ case BPF_S_LD_W_ABS:
+ case BPF_S_LD_H_ABS:
+ case BPF_S_LD_B_ABS:
+ /* first instruction sets A register (or is RET 'constant') */
+ break;
+ default:
+ /* make sure we dont leak kernel information to user */
+ CLEAR_A(); /* A = 0 */
+ }
+
+ for (i = 0; i < flen; i++) {
+ unsigned int K = filter[i].k;
+
+ switch (filter[i].code) {
+ case BPF_S_ALU_ADD_X: /* A += X; */
+ seen |= SEEN_XREG;
+ EMIT2(0x01, 0xd8); /* add %ebx,%eax */
+ break;
+ case BPF_S_ALU_ADD_K: /* A += K; */
+ if (!K)
+ break;
+ if (is_imm8(K))
+ EMIT3(0x83, 0xc0, K); /* add imm8,%eax */
+ else
+ EMIT1_off32(0x05, K); /* add imm32,%eax */
+ break;
+ case BPF_S_ALU_SUB_X: /* A -= X; */
+ seen |= SEEN_XREG;
+ EMIT2(0x29, 0xd8); /* sub %ebx,%eax */
+ break;
+ case BPF_S_ALU_SUB_K: /* A -= K */
+ if (!K)
+ break;
+ if (is_imm8(K))
+ EMIT3(0x83, 0xe8, K); /* sub imm8,%eax */
+ else
+ EMIT1_off32(0x2d, K); /* sub imm32,%eax */
+ break;
+ case BPF_S_ALU_MUL_X: /* A *= X; */
+ seen |= SEEN_XREG;
+ EMIT3(0x0f, 0xaf, 0xc3); /* imul %ebx,%eax */
+ break;
+ case BPF_S_ALU_MUL_K: /* A *= K */
+ if (is_imm8(K))
+ EMIT3(0x6b, 0xc0, K); /* imul imm8,%eax,%eax */
+ else {
+ EMIT2(0x69, 0xc0); /* imul imm32,%eax */
+ EMIT(K, 4);
+ }
+ break;
+ case BPF_S_ALU_DIV_X: /* A /= X; */
+ seen |= SEEN_XREG;
+ EMIT2(0x85, 0xdb); /* test %ebx,%ebx */
+ if (pc_ret0 != -1)
+ EMIT_COND_JMP(X86_JE, addrs[pc_ret0] - (addrs[i] - 4));
+ else {
+ EMIT_COND_JMP(X86_JNE, 2 + 5);
+ CLEAR_A();
+ EMIT1_off32(0xe9, cleanup_addr - (addrs[i] - 4)); /* jmp .+off32 */
+ }
+ EMIT4(0x31, 0xd2, 0xf7, 0xf3); /* xor %edx,%edx; div %ebx */
+ break;
+ case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */
+ EMIT3(0x48, 0x69, 0xc0); /* imul imm32,%rax,%rax */
+ EMIT(K, 4);
+ EMIT4(0x48, 0xc1, 0xe8, 0x20); /* shr $0x20,%rax */
+ break;
+ case BPF_S_ALU_AND_X:
+ seen |= SEEN_XREG;
+ EMIT2(0x21, 0xd8); /* and %ebx,%eax */
+ break;
+ case BPF_S_ALU_AND_K:
+ if (K >= 0xFFFFFF00) {
+ EMIT2(0x24, K & 0xFF); /* and imm8,%al */
+ } else if (K >= 0xFFFF0000) {
+ EMIT2(0x66, 0x25); /* and imm16,%ax */
+ EMIT(K, 2);
+ } else {
+ EMIT1_off32(0x25, K); /* and imm32,%eax */
+ }
+ break;
+ case BPF_S_ALU_OR_X:
+ seen |= SEEN_XREG;
+ EMIT2(0x09, 0xd8); /* or %ebx,%eax */
+ break;
+ case BPF_S_ALU_OR_K:
+ if (is_imm8(K))
+ EMIT3(0x83, 0xc8, K); /* or imm8,%eax */
+ else
+ EMIT1_off32(0x0d, K); /* or imm32,%eax */
+ break;
+ case BPF_S_ALU_LSH_X: /* A <<= X; */
+ seen |= SEEN_XREG;
+ EMIT4(0x89, 0xd9, 0xd3, 0xe0); /* mov %ebx,%ecx; shl %cl,%eax */
+ break;
+ case BPF_S_ALU_LSH_K:
+ if (K == 0)
+ break;
+ else if (K == 1)
+ EMIT2(0xd1, 0xe0); /* shl %eax */
+ else
+ EMIT3(0xc1, 0xe0, K);
+ break;
+ case BPF_S_ALU_RSH_X: /* A >>= X; */
+ seen |= SEEN_XREG;
+ EMIT4(0x89, 0xd9, 0xd3, 0xe8); /* mov %ebx,%ecx; shr %cl,%eax */
+ break;
+ case BPF_S_ALU_RSH_K: /* A >>= K; */
+ if (K == 0)
+ break;
+ else if (K == 1)
+ EMIT2(0xd1, 0xe8); /* shr %eax */
+ else
+ EMIT3(0xc1, 0xe8, K);
+ break;
+ case BPF_S_ALU_NEG:
+ EMIT2(0xf7, 0xd8); /* neg %eax */
+ break;
+ case BPF_S_RET_K:
+ if (!K) {
+ if (pc_ret0 == -1)
+ pc_ret0 = i;
+ CLEAR_A();
+ } else {
+ EMIT1_off32(0xb8, K); /* mov $imm32,%eax */
+ }
+ /* fallinto */
+ case BPF_S_RET_A:
+ if (seen) {
+ if (i != flen - 1) {
+ EMIT_JMP(cleanup_addr - addrs[i]);
+ break;
+ }
+ if (seen & SEEN_XREG)
+ EMIT4(0x48, 0x8b, 0x5d, 0xf8); /* mov -8(%rbp),%rbx */
+ EMIT1(0xc9); /* leaveq */
+ }
+ EMIT1(0xc3); /* ret */
+ break;
+ case BPF_S_MISC_TAX: /* X = A */
+ seen |= SEEN_XREG;
+ EMIT2(0x89, 0xc3); /* mov %eax,%ebx */
+ break;
+ case BPF_S_MISC_TXA: /* A = X */
+ seen |= SEEN_XREG;
+ EMIT2(0x89, 0xd8); /* mov %ebx,%eax */
+ break;
+ case BPF_S_LD_IMM: /* A = K */
+ if (!K)
+ CLEAR_A();
+ else
+ EMIT1_off32(0xb8, K); /* mov $imm32,%eax */
+ break;
+ case BPF_S_LDX_IMM: /* X = K */
+ seen |= SEEN_XREG;
+ if (!K)
+ CLEAR_X();
+ else
+ EMIT1_off32(0xbb, K); /* mov $imm32,%ebx */
+ break;
+ case BPF_S_LD_MEM: /* A = mem[K] : mov off8(%rbp),%eax */
+ seen |= SEEN_MEM;
+ EMIT3(0x8b, 0x45, 0xf0 - K*4);
+ break;
+ case BPF_S_LDX_MEM: /* X = mem[K] : mov off8(%rbp),%ebx */
+ seen |= SEEN_XREG | SEEN_MEM;
+ EMIT3(0x8b, 0x5d, 0xf0 - K*4);
+ break;
+ case BPF_S_ST: /* mem[K] = A : mov %eax,off8(%rbp) */
+ seen |= SEEN_MEM;
+ EMIT3(0x89, 0x45, 0xf0 - K*4);
+ break;
+ case BPF_S_STX: /* mem[K] = X : mov %ebx,off8(%rbp) */
+ seen |= SEEN_XREG | SEEN_MEM;
+ EMIT3(0x89, 0x5d, 0xf0 - K*4);
+ break;
+ case BPF_S_LD_W_LEN: /* A = skb->len; */
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
+ if (is_imm8(offsetof(struct sk_buff, len)))
+ /* mov off8(%rdi),%eax */
+ EMIT3(0x8b, 0x47, offsetof(struct sk_buff, len));
+ else {
+ EMIT2(0x8b, 0x87);
+ EMIT(offsetof(struct sk_buff, len), 4);
+ }
+ break;
+ case BPF_S_LDX_W_LEN: /* X = skb->len; */
+ seen |= SEEN_XREG;
+ if (is_imm8(offsetof(struct sk_buff, len)))
+ /* mov off8(%rdi),%ebx */
+ EMIT3(0x8b, 0x5f, offsetof(struct sk_buff, len));
+ else {
+ EMIT2(0x8b, 0x9f);
+ EMIT(offsetof(struct sk_buff, len), 4);
+ }
+ break;
+ case BPF_S_ANC_PROTOCOL: /* A = ntohs(skb->protocol); */
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, protocol) != 2);
+ if (is_imm8(offsetof(struct sk_buff, protocol))) {
+ /* movzwl off8(%rdi),%eax */
+ EMIT4(0x0f, 0xb7, 0x47, offsetof(struct sk_buff, protocol));
+ } else {
+ EMIT3(0x0f, 0xb7, 0x87); /* movzwl off32(%rdi),%eax */
+ EMIT(offsetof(struct sk_buff, protocol), 4);
+ }
+ EMIT2(0x86, 0xc4); /* ntohs() : xchg %al,%ah */
+ break;
+ case BPF_S_ANC_IFINDEX:
+ if (is_imm8(offsetof(struct sk_buff, dev))) {
+ /* movq off8(%rdi),%rax */
+ EMIT4(0x48, 0x8b, 0x47, offsetof(struct sk_buff, dev));
+ } else {
+ EMIT3(0x48, 0x8b, 0x87); /* movq off32(%rdi),%rax */
+ EMIT(offsetof(struct sk_buff, dev), 4);
+ }
+ EMIT3(0x48, 0x85, 0xc0); /* test %rax,%rax */
+ EMIT_COND_JMP(X86_JE, cleanup_addr - (addrs[i] - 6));
+ BUILD_BUG_ON(FIELD_SIZEOF(struct net_device, ifindex) != 4);
+ EMIT2(0x8b, 0x80); /* mov off32(%rax),%eax */
+ EMIT(offsetof(struct net_device, ifindex), 4);
+ break;
+ case BPF_S_ANC_MARK:
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
+ if (is_imm8(offsetof(struct sk_buff, mark))) {
+ /* mov off8(%rdi),%eax */
+ EMIT3(0x8b, 0x47, offsetof(struct sk_buff, mark));
+ } else {
+ EMIT2(0x8b, 0x87);
+ EMIT(offsetof(struct sk_buff, mark), 4);
+ }
+ break;
+ case BPF_S_ANC_RXHASH:
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, rxhash) != 4);
+ if (is_imm8(offsetof(struct sk_buff, rxhash))) {
+ /* mov off8(%rdi),%eax */
+ EMIT3(0x8b, 0x47, offsetof(struct sk_buff, rxhash));
+ } else {
+ EMIT2(0x8b, 0x87);
+ EMIT(offsetof(struct sk_buff, rxhash), 4);
+ }
+ break;
+ case BPF_S_ANC_QUEUE:
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, queue_mapping) != 2);
+ if (is_imm8(offsetof(struct sk_buff, queue_mapping))) {
+ /* movzwl off8(%rdi),%eax */
+ EMIT4(0x0f, 0xb7, 0x47, offsetof(struct sk_buff, queue_mapping));
+ } else {
+ EMIT3(0x0f, 0xb7, 0x87); /* movzwl off32(%rdi),%eax */
+ EMIT(offsetof(struct sk_buff, queue_mapping), 4);
+ }
+ break;
+ case BPF_S_ANC_CPU:
+#ifdef CONFIG_SMP
+ EMIT4(0x65, 0x8b, 0x04, 0x25); /* mov %gs:off32,%eax */
+ EMIT((u32)&cpu_number, 4); /* A = smp_processor_id(); */
+#else
+ CLEAR_A();
+#endif
+ break;
+ case BPF_S_LD_W_ABS:
+ func = sk_load_word;
+common_load: seen |= SEEN_DATAREF;
+ if ((int)K < 0)
+ goto out;
+ t_offset = func - (image + addrs[i]);
+ EMIT1_off32(0xbe, K); /* mov imm32,%esi */
+ EMIT1_off32(0xe8, t_offset); /* call */
+ break;
+ case BPF_S_LD_H_ABS:
+ func = sk_load_half;
+ goto common_load;
+ case BPF_S_LD_B_ABS:
+ func = sk_load_byte;
+ goto common_load;
+ case BPF_S_LDX_B_MSH:
+ if ((int)K < 0) {
+ if (pc_ret0 != -1) {
+ EMIT_JMP(addrs[pc_ret0] - addrs[i]);
+ break;
+ }
+ CLEAR_A();
+ EMIT_JMP(cleanup_addr - addrs[i]);
+ break;
+ }
+ seen |= SEEN_DATAREF | SEEN_XREG;
+ t_offset = sk_load_byte_msh - (image + addrs[i]);
+ EMIT1_off32(0xbe, K); /* mov imm32,%esi */
+ EMIT1_off32(0xe8, t_offset); /* call sk_load_byte_msh */
+ break;
+ case BPF_S_LD_W_IND:
+ func = sk_load_word_ind;
+common_load_ind: seen |= SEEN_DATAREF | SEEN_XREG;
+ t_offset = func - (image + addrs[i]);
+ EMIT1_off32(0xbe, K); /* mov imm32,%esi */
+ EMIT1_off32(0xe8, t_offset); /* call sk_load_xxx_ind */
+ break;
+ case BPF_S_LD_H_IND:
+ func = sk_load_half_ind;
+ goto common_load_ind;
+ case BPF_S_LD_B_IND:
+ func = sk_load_byte_ind;
+ goto common_load_ind;
+ case BPF_S_JMP_JA:
+ t_offset = addrs[i + K] - addrs[i];
+ EMIT_JMP(t_offset);
+ break;
+ COND_SEL(BPF_S_JMP_JGT_K, X86_JA, X86_JBE);
+ COND_SEL(BPF_S_JMP_JGE_K, X86_JAE, X86_JB);
+ COND_SEL(BPF_S_JMP_JEQ_K, X86_JE, X86_JNE);
+ COND_SEL(BPF_S_JMP_JSET_K,X86_JNE, X86_JE);
+ COND_SEL(BPF_S_JMP_JGT_X, X86_JA, X86_JBE);
+ COND_SEL(BPF_S_JMP_JGE_X, X86_JAE, X86_JB);
+ COND_SEL(BPF_S_JMP_JEQ_X, X86_JE, X86_JNE);
+ COND_SEL(BPF_S_JMP_JSET_X,X86_JNE, X86_JE);
+
+cond_branch: f_offset = addrs[i + filter[i].jf] - addrs[i];
+ t_offset = addrs[i + filter[i].jt] - addrs[i];
+
+ /* same targets, can avoid doing the test :) */
+ if (filter[i].jt == filter[i].jf) {
+ EMIT_JMP(t_offset);
+ break;
+ }
+
+ switch (filter[i].code) {
+ case BPF_S_JMP_JGT_X:
+ case BPF_S_JMP_JGE_X:
+ case BPF_S_JMP_JEQ_X:
+ seen |= SEEN_XREG;
+ EMIT2(0x39, 0xd8); /* cmp %ebx,%eax */
+ break;
+ case BPF_S_JMP_JSET_X:
+ seen |= SEEN_XREG;
+ EMIT2(0x85, 0xd8); /* test %ebx,%eax */
+ break;
+ case BPF_S_JMP_JEQ_K:
+ if (K == 0) {
+ EMIT2(0x85, 0xc0); /* test %eax,%eax */
+ break;
+ }
+ case BPF_S_JMP_JGT_K:
+ case BPF_S_JMP_JGE_K:
+ if (K <= 127)
+ EMIT3(0x83, 0xf8, K); /* cmp imm8,%eax */
+ else
+ EMIT1_off32(0x3d, K); /* cmp imm32,%eax */
+ break;
+ case BPF_S_JMP_JSET_K:
+ if (K <= 0xFF)
+ EMIT2(0xa8, K); /* test imm8,%al */
+ else if (!(K & 0xFFFF00FF))
+ EMIT3(0xf6, 0xc4, K >> 8); /* test imm8,%ah */
+ else if (K <= 0xFFFF) {
+ EMIT2(0x66, 0xa9); /* test imm16,%ax */
+ EMIT(K, 2);
+ } else {
+ EMIT1_off32(0xa9, K); /* test imm32,%eax */
+ }
+ break;
+ }
+ if (filter[i].jt != 0) {
+ if (filter[i].jf)
+ t_offset += is_near(f_offset) ? 2 : 6;
+ EMIT_COND_JMP(t_op, t_offset);
+ if (filter[i].jf)
+ EMIT_JMP(f_offset);
+ break;
+ }
+ EMIT_COND_JMP(f_op, f_offset);
+ break;
+ default:
+ /* hmm, too complex filter, give up with jit compiler */
+ goto out;
+ }
+ ilen = prog - temp;
+ if (image) {
+ if (unlikely(proglen + ilen > oldproglen)) {
+ pr_err("bpf_jit_compile fatal error\n");
+ kfree(addrs);
+ module_free(NULL, image);
+ return;
+ }
+ memcpy(image + proglen, temp, ilen);
+ }
+ proglen += ilen;
+ addrs[i] = proglen;
+ prog = temp;
+ }
+ /* last bpf instruction is always a RET :
+ * use it to give the cleanup instruction(s) addr
+ */
+ cleanup_addr = proglen - 1; /* ret */
+ if (seen)
+ cleanup_addr -= 1; /* leaveq */
+ if (seen & SEEN_XREG)
+ cleanup_addr -= 4; /* mov -8(%rbp),%rbx */
+
+ if (image) {
+ WARN_ON(proglen != oldproglen);
+ break;
+ }
+ if (proglen == oldproglen) {
+ image = module_alloc(max_t(unsigned int,
+ proglen,
+ sizeof(struct work_struct)));
+ if (!image)
+ goto out;
+ }
+ oldproglen = proglen;
+ }
+ if (bpf_jit_enable > 1)
+ pr_err("flen=%d proglen=%u pass=%d image=%p\n",
+ flen, proglen, pass, image);
+
+ if (image) {
+ if (bpf_jit_enable > 1)
+ print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_ADDRESS,
+ 16, 1, image, proglen, false);
+
+ bpf_flush_icache(image, image + proglen);
+
+ fp->bpf_func = (void *)image;
+ }
+out:
+ kfree(addrs);
+ return;
+}
+
+static void jit_free_defer(struct work_struct *arg)
+{
+ module_free(NULL, arg);
+}
+
+/* run from softirq, we must use a work_struct to call
+ * module_free() from process context
+ */
+void bpf_jit_free(struct sk_filter *fp)
+{
+ if (fp->bpf_func != sk_run_filter) {
+ struct work_struct *work = (struct work_struct *)fp->bpf_func;
+
+ INIT_WORK(work, jit_free_defer);
+ schedule_work(work);
+ }
+}
+
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45266b7..4609b85 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -135,6 +135,8 @@ struct sk_filter
{
atomic_t refcnt;
unsigned int len; /* Number of filter blocks */
+ unsigned int (*bpf_func)(const struct sk_buff *skb,
+ const struct sock_filter *filter);
struct rcu_head rcu;
struct sock_filter insns[0];
};
@@ -153,6 +155,80 @@ extern unsigned int sk_run_filter(const struct sk_buff *skb,
extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
extern int sk_detach_filter(struct sock *sk);
extern int sk_chk_filter(struct sock_filter *filter, int flen);
+
+#ifdef CONFIG_BPF_JIT
+extern void bpf_jit_compile(struct sk_filter *fp);
+extern void bpf_jit_free(struct sk_filter *fp);
+#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
+#else
+static inline void bpf_jit_compile(struct sk_filter *fp)
+{
+}
+static inline void bpf_jit_free(struct sk_filter *fp)
+{
+}
+#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
+#endif
+
+enum {
+ BPF_S_RET_K = 1,
+ BPF_S_RET_A,
+ BPF_S_ALU_ADD_K,
+ BPF_S_ALU_ADD_X,
+ BPF_S_ALU_SUB_K,
+ BPF_S_ALU_SUB_X,
+ BPF_S_ALU_MUL_K,
+ BPF_S_ALU_MUL_X,
+ BPF_S_ALU_DIV_X,
+ BPF_S_ALU_AND_K,
+ BPF_S_ALU_AND_X,
+ BPF_S_ALU_OR_K,
+ BPF_S_ALU_OR_X,
+ BPF_S_ALU_LSH_K,
+ BPF_S_ALU_LSH_X,
+ BPF_S_ALU_RSH_K,
+ BPF_S_ALU_RSH_X,
+ BPF_S_ALU_NEG,
+ BPF_S_LD_W_ABS,
+ BPF_S_LD_H_ABS,
+ BPF_S_LD_B_ABS,
+ BPF_S_LD_W_LEN,
+ BPF_S_LD_W_IND,
+ BPF_S_LD_H_IND,
+ BPF_S_LD_B_IND,
+ BPF_S_LD_IMM,
+ BPF_S_LDX_W_LEN,
+ BPF_S_LDX_B_MSH,
+ BPF_S_LDX_IMM,
+ BPF_S_MISC_TAX,
+ BPF_S_MISC_TXA,
+ BPF_S_ALU_DIV_K,
+ BPF_S_LD_MEM,
+ BPF_S_LDX_MEM,
+ BPF_S_ST,
+ BPF_S_STX,
+ BPF_S_JMP_JA,
+ BPF_S_JMP_JEQ_K,
+ BPF_S_JMP_JEQ_X,
+ BPF_S_JMP_JGE_K,
+ BPF_S_JMP_JGE_X,
+ BPF_S_JMP_JGT_K,
+ BPF_S_JMP_JGT_X,
+ BPF_S_JMP_JSET_K,
+ BPF_S_JMP_JSET_X,
+ /* Ancillary data */
+ BPF_S_ANC_PROTOCOL,
+ BPF_S_ANC_PKTTYPE,
+ BPF_S_ANC_IFINDEX,
+ BPF_S_ANC_NLATTR,
+ BPF_S_ANC_NLATTR_NEST,
+ BPF_S_ANC_MARK,
+ BPF_S_ANC_QUEUE,
+ BPF_S_ANC_HATYPE,
+ BPF_S_ANC_RXHASH,
+ BPF_S_ANC_CPU,
+};
+
#endif /* __KERNEL__ */
#endif /* __LINUX_FILTER_H__ */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 423a544..298ff62 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2513,6 +2513,7 @@ extern struct rtnl_link_stats64 *dev_get_stats(struct net_device *dev,
extern int netdev_max_backlog;
extern int netdev_tstamp_prequeue;
extern int weight_p;
+extern int bpf_jit_enable;
extern int netdev_set_master(struct net_device *dev, struct net_device *master);
extern int netdev_set_bond_master(struct net_device *dev,
struct net_device *master);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d9e52fa..faa0095 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -391,13 +391,11 @@ struct sk_buff {
__u32 rxhash;
- kmemcheck_bitfield_begin(flags2);
- __u16 queue_mapping:16;
+ __u16 queue_mapping;
#ifdef CONFIG_IPV6_NDISC_NODETYPE
__u8 ndisc_nodetype:2;
#endif
__u8 ooo_okay:1;
- kmemcheck_bitfield_end(flags2);
/* 0/13 bit hole */
diff --git a/net/Kconfig b/net/Kconfig
index 79cabf1..745fb02 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -232,6 +232,19 @@ config XPS
depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
default y
+config HAVE_BPF_JIT
+ bool
+
+config BPF_JIT
+ bool "enable BPF Just In Time compiler"
+ depends on HAVE_BPF_JIT
+ ---help---
+ Berkeley Packet Filter filtering capabilities are normally handled
+ by an interpreter. This option allows the kernel to generate native
+ code when a filter is loaded in memory. This should speed up
+ packet sniffing (libpcap/tcpdump). Note: the admin should enable
+ this feature by changing /proc/sys/net/core/bpf_jit_enable
+
menu "Network testing"
config NET_PKTGEN
diff --git a/net/core/filter.c b/net/core/filter.c
index 232b187..e63a794 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -39,65 +39,6 @@
#include <linux/filter.h>
#include <linux/reciprocal_div.h>
-enum {
- BPF_S_RET_K = 1,
- BPF_S_RET_A,
- BPF_S_ALU_ADD_K,
- BPF_S_ALU_ADD_X,
- BPF_S_ALU_SUB_K,
- BPF_S_ALU_SUB_X,
- BPF_S_ALU_MUL_K,
- BPF_S_ALU_MUL_X,
- BPF_S_ALU_DIV_X,
- BPF_S_ALU_AND_K,
- BPF_S_ALU_AND_X,
- BPF_S_ALU_OR_K,
- BPF_S_ALU_OR_X,
- BPF_S_ALU_LSH_K,
- BPF_S_ALU_LSH_X,
- BPF_S_ALU_RSH_K,
- BPF_S_ALU_RSH_X,
- BPF_S_ALU_NEG,
- BPF_S_LD_W_ABS,
- BPF_S_LD_H_ABS,
- BPF_S_LD_B_ABS,
- BPF_S_LD_W_LEN,
- BPF_S_LD_W_IND,
- BPF_S_LD_H_IND,
- BPF_S_LD_B_IND,
- BPF_S_LD_IMM,
- BPF_S_LDX_W_LEN,
- BPF_S_LDX_B_MSH,
- BPF_S_LDX_IMM,
- BPF_S_MISC_TAX,
- BPF_S_MISC_TXA,
- BPF_S_ALU_DIV_K,
- BPF_S_LD_MEM,
- BPF_S_LDX_MEM,
- BPF_S_ST,
- BPF_S_STX,
- BPF_S_JMP_JA,
- BPF_S_JMP_JEQ_K,
- BPF_S_JMP_JEQ_X,
- BPF_S_JMP_JGE_K,
- BPF_S_JMP_JGE_X,
- BPF_S_JMP_JGT_K,
- BPF_S_JMP_JGT_X,
- BPF_S_JMP_JSET_K,
- BPF_S_JMP_JSET_X,
- /* Ancillary data */
- BPF_S_ANC_PROTOCOL,
- BPF_S_ANC_PKTTYPE,
- BPF_S_ANC_IFINDEX,
- BPF_S_ANC_NLATTR,
- BPF_S_ANC_NLATTR_NEST,
- BPF_S_ANC_MARK,
- BPF_S_ANC_QUEUE,
- BPF_S_ANC_HATYPE,
- BPF_S_ANC_RXHASH,
- BPF_S_ANC_CPU,
-};
-
/* No hurry in this branch */
static void *__load_pointer(const struct sk_buff *skb, int k, unsigned int size)
{
@@ -145,7 +86,7 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
rcu_read_lock();
filter = rcu_dereference(sk->sk_filter);
if (filter) {
- unsigned int pkt_len = sk_run_filter(skb, filter->insns);
+ unsigned int pkt_len = SK_RUN_FILTER(filter, skb);
err = pkt_len ? pskb_trim(skb, pkt_len) : -EPERM;
}
@@ -638,6 +579,7 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
{
struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
+ bpf_jit_free(fp);
kfree(fp);
}
EXPORT_SYMBOL(sk_filter_release_rcu);
@@ -672,6 +614,7 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
atomic_set(&fp->refcnt, 1);
fp->len = fprog->len;
+ fp->bpf_func = sk_run_filter;
err = sk_chk_filter(fp->insns, fp->len);
if (err) {
@@ -679,6 +622,8 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
return err;
}
+ bpf_jit_compile(fp);
+
old_fp = rcu_dereference_protected(sk->sk_filter,
sock_owned_by_user(sk));
rcu_assign_pointer(sk->sk_filter, fp);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 385b609..a829e3f 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -122,6 +122,15 @@ static struct ctl_table net_core_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+#ifdef CONFIG_BPF_JIT
+ {
+ .procname = "bpf_jit_enable",
+ .data = &bpf_jit_enable,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+#endif
{
.procname = "netdev_tstamp_prequeue",
.data = &netdev_tstamp_prequeue,
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b5362e9..549527b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -538,7 +538,7 @@ static inline unsigned int run_filter(const struct sk_buff *skb,
rcu_read_lock();
filter = rcu_dereference(sk->sk_filter);
if (filter != NULL)
- res = sk_run_filter(skb, filter->insns);
+ res = SK_RUN_FILTER(filter, skb);
rcu_read_unlock();
return res;
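
The dispatch this patch introduces can be sketched in user space. This is only an illustration under simplified assumptions: struct filter, run_interpreter(), run_native() and the filter_*() helpers are hypothetical stand-ins, not kernel API, and no code is actually generated.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct sk_filter and its bpf_func member. */
struct filter {
	unsigned int (*func)(const unsigned char *pkt, size_t len);
};

/* Plays the role of the interpreter: accept the whole packet. */
static unsigned int run_interpreter(const unsigned char *pkt, size_t len)
{
	(void)pkt;
	return (unsigned int)len;
}

/* Plays the role of JIT output: accept at most 64 bytes. */
static unsigned int run_native(const unsigned char *pkt, size_t len)
{
	(void)pkt;
	return len > 64 ? 64 : (unsigned int)len;
}

/* Like sk_attach_filter(): the interpreter is the default. */
static void filter_init(struct filter *f)
{
	f->func = run_interpreter;
}

/* Like a successful bpf_jit_compile(): install the native entry point. */
static void filter_jit(struct filter *f)
{
	f->func = run_native;
}

/* Like SK_RUN_FILTER(): callers do not care which implementation runs. */
static unsigned int filter_run(const struct filter *f,
			       const unsigned char *pkt, size_t len)
{
	return f->func(pkt, len);
}

/* Like the test in bpf_jit_free(): only a JIT image needs freeing. */
static int filter_is_jited(const struct filter *f)
{
	return f->func != run_interpreter;
}
```

Comparing the pointer against the interpreter, as filter_is_jited() does, mirrors how the patch decides whether there is an image to release.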
* Re: [patch net-next-2.6] net: vlan: make non-hw-accel rx path similar to hw-accel
From: Jiri Pirko @ 2011-04-03 13:22 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: Stephen Hemminger, netdev, davem, kaber, fubar, eric.dumazet,
andy, xiaosuo, jesse
In-Reply-To: <4D983D81.2020500@gmail.com>
Sun, Apr 03, 2011 at 11:27:29AM CEST, nicolas.2p.debian@gmail.com wrote:
>Le 02/04/2011 20:27, Jiri Pirko a écrit :
>>Sat, Apr 02, 2011 at 05:55:24PM CEST, shemminger@linux-foundation.org wrote:
>>>On Sat, 2 Apr 2011 12:26:06 +0200
>>>Jiri Pirko<jpirko@redhat.com> wrote:
><snip>
>>>>+ rawp = skb->data;
>>>>+ if (*(unsigned short *) rawp == 0xFFFF)
>>>>+ /*
>>>>+ * This is a magic hack to spot IPX packets. Older Novell
>>>>+ * breaks the protocol design and runs IPX over 802.3 without
>>>>+ * an 802.2 LLC layer. We look for FFFF which isn't a used
>>>>+ * 802.2 SSAP/DSAP. This won't work for fault tolerant netware
>>>>+ * but does for the rest.
>>>>+ */
>>>>+ skb->protocol = htons(ETH_P_802_3);
>>>>+ else
>>>>+ /*
>>>>+ * Real 802.2 LLC
>>>>+ */
>>>>+ skb->protocol = htons(ETH_P_802_2);
>>>>+}
>>>
>>>What about doublely tagged packets?
>>
>>No problem. Once they are untagged and reinjected they are untagged
>>again and reinjected again:
>>
>>-> __netif_reveive_skb
>>vlan_untag
>>vlan_do_receive
>>-> __netif_receive_skb
>>vlan_untag
>>vlan_do_receive
>
>Hi Jiri,
>
>Instead of untagging and reinjecting, wouldn't it be possible to
>remove all 802.1Q headers in a loop and directly deliver the untagged
>skb? Are there any hw-accel implementations that remove several level
>of tagging?
It makes no sense to untag multiple tags in a loop at once. For each tag you
need to call vlan_do_receive() so that the frame is properly processed by
the vlan code.
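
A rough user-space model of that point, under stated assumptions: struct frame, struct trace, untag() and receive() are hypothetical names, and a loop stands in for reinjection into the receive path. It only shows that the per-VLAN step runs once per tag.

```c
#include <stddef.h>

#define MAX_TAGS 4

/* Hypothetical frame model: a stack of VLAN IDs, outermost first. */
struct frame {
	unsigned short tags[MAX_TAGS];
	size_t ntags;
};

/* Records which VLAN IDs were handled, in order. */
struct trace {
	unsigned short seen[MAX_TAGS];
	size_t nseen;
};

/* Stand-in for vlan_untag(): strip the outermost tag and return it. */
static unsigned short untag(struct frame *f)
{
	unsigned short vid = f->tags[0];
	size_t i;

	for (i = 1; i < f->ntags; i++)
		f->tags[i - 1] = f->tags[i];
	f->ntags--;
	return vid;
}

/*
 * Stand-in for the receive path: one tag is handled per pass, so the
 * per-VLAN step (here only a trace entry) runs once for every tag.
 * Returns the number of passes through the receive function.
 */
static size_t receive(struct frame *f, struct trace *t)
{
	size_t passes = 1;

	while (f->ntags > 0) {
		t->seen[t->nseen++] = untag(f);	/* vlan_do_receive() analogue */
		passes++;			/* frame is reinjected */
	}
	return passes;
}
```

For a doubly tagged frame this gives two per-VLAN steps and three passes in total, the last one delivering the untagged frame.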
I'm not aware of any hwaccel implementation that can untag multiple tags at
once. Honestly, I cannot imagine how that could be handled (maybe by setting
an array of vlan_tcis).
Jirka
>
> Nicolas.
* [PATCH net-next-2.6] netfilter: get rid of atomic ops in fast path
From: Eric Dumazet @ 2011-04-03 13:15 UTC (permalink / raw)
To: Patrick McHardy
Cc: David Miller, Netfilter Development Mailinglist, netdev,
Jan Engelhardt
In-Reply-To: <1300468038.2888.160.camel@edumazet-laptop>
We currently use a percpu spinlock to 'protect' rule bytes/packets
counters, after various attempts to use RCU instead.
Lately we added a seqlock so that get_counters() can run without
blocking BH or 'writers'. But we really only need the seqcount in it.
The spinlock itself is only locked by the current/owner cpu, so we can
remove it completely.
This cleans up the API, using the correct 'writer' vs 'reader' semantics.
At replace time, the get_counters() call makes sure all cpus are done
using the old table.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Patrick McHardy <kaber@trash.net>
---
This is an exact copy of the patch sent on March 18th, thanks!
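
The recursion handling can be sketched as a single-CPU user-space model. It assumes one plain global counter with no memory barriers and no per-cpu storage; recseq, recseq_begin(), recseq_end() and recseq_writer_active() are hypothetical names used only for illustration.

```c
/* Hypothetical single-CPU model of xt_recseq: no barriers, no per-cpu. */
static unsigned int recseq;

/* Returns 1 for the outermost writer, 0 when already inside a section. */
static unsigned int recseq_begin(void)
{
	/* addend is 1 only when the low bit is clear (no writer active) */
	unsigned int addend = (recseq + 1) & 1;

	recseq += addend;
	return addend;
}

/* Only the outermost writer (addend == 1) makes the count even again. */
static void recseq_end(unsigned int addend)
{
	recseq += addend;
}

/* Readers treat an odd count as "writer active, retry". */
static int recseq_writer_active(void)
{
	return recseq & 1;
}
```

A nested begin/end pair adds zero twice, so the count stays odd until the outermost section ends, which is what lets a reader keep the normal seqcount convention.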
include/linux/netfilter/x_tables.h | 96 +++++++++++----------------
net/ipv4/netfilter/arp_tables.c | 18 +++--
net/ipv4/netfilter/ip_tables.c | 28 +++----
net/ipv6/netfilter/ip6_tables.c | 19 +++--
net/netfilter/x_tables.c | 9 --
5 files changed, 80 insertions(+), 90 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 3721952..32cddf7 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -456,72 +456,60 @@ extern void xt_proto_fini(struct net *net, u_int8_t af);
extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
extern void xt_free_table_info(struct xt_table_info *info);
-/*
- * Per-CPU spinlock associated with per-cpu table entries, and
- * with a counter for the "reading" side that allows a recursive
- * reader to avoid taking the lock and deadlocking.
- *
- * "reading" is used by ip/arp/ip6 tables rule processing which runs per-cpu.
- * It needs to ensure that the rules are not being changed while the packet
- * is being processed. In some cases, the read lock will be acquired
- * twice on the same CPU; this is okay because of the count.
- *
- * "writing" is used when reading counters.
- * During replace any readers that are using the old tables have to complete
- * before freeing the old table. This is handled by the write locking
- * necessary for reading the counters.
+/**
+ * xt_recseq - recursive seqcount for netfilter use
+ *
+ * Packet processing changes the seqcount only if no recursion happened
+ * get_counters() can use read_seqcount_begin()/read_seqcount_retry(),
+ * because we use the normal seqcount convention :
+ * Low order bit set to 1 if a writer is active.
*/
-struct xt_info_lock {
- seqlock_t lock;
- unsigned char readers;
-};
-DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks);
+DECLARE_PER_CPU(seqcount_t, xt_recseq);
-/*
- * Note: we need to ensure that preemption is disabled before acquiring
- * the per-cpu-variable, so we do it as a two step process rather than
- * using "spin_lock_bh()".
- *
- * We _also_ need to disable bottom half processing before updating our
- * nesting count, to make sure that the only kind of re-entrancy is this
- * code being called by itself: since the count+lock is not an atomic
- * operation, we can allow no races.
+/**
+ * xt_write_recseq_begin - start of a write section
*
- * _Only_ that special combination of being per-cpu and never getting
- * re-entered asynchronously means that the count is safe.
+ * Begin packet processing : all readers must wait for the end
+ * 1) Must be called with preemption disabled
+ * 2) softirqs must be disabled too (or we should use irqsafe_cpu_add())
+ * Returns :
+ * 1 if no recursion on this cpu
+ * 0 if recursion detected
*/
-static inline void xt_info_rdlock_bh(void)
+static inline unsigned int xt_write_recseq_begin(void)
{
- struct xt_info_lock *lock;
+ unsigned int addend;
- local_bh_disable();
- lock = &__get_cpu_var(xt_info_locks);
- if (likely(!lock->readers++))
- write_seqlock(&lock->lock);
-}
+ /*
+ * Low order bit of sequence is set if we already
+ * called xt_write_recseq_begin().
+ */
+ addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
-static inline void xt_info_rdunlock_bh(void)
-{
- struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks);
+ /*
+ * This is kind of a write_seqcount_begin(), but addend is 0 or 1
+ * We don't check addend value to avoid a test and conditional jump,
+ * since addend is most likely 1
+ */
+ __this_cpu_add(xt_recseq.sequence, addend);
+ smp_wmb();
- if (likely(!--lock->readers))
- write_sequnlock(&lock->lock);
- local_bh_enable();
+ return addend;
}
-/*
- * The "writer" side needs to get exclusive access to the lock,
- * regardless of readers. This must be called with bottom half
- * processing (and thus also preemption) disabled.
+/**
+ * xt_write_recseq_end - end of a write section
+ * @addend: return value from previous xt_write_recseq_begin()
+ *
+ * End packet processing : all readers can proceed
+ * 1) Must be called with preemption disabled
+ * 2) softirqs must be disabled too (or we should use irqsafe_cpu_add())
*/
-static inline void xt_info_wrlock(unsigned int cpu)
-{
- write_seqlock(&per_cpu(xt_info_locks, cpu).lock);
-}
-
-static inline void xt_info_wrunlock(unsigned int cpu)
+static inline void xt_write_recseq_end(unsigned int addend)
{
- write_sequnlock(&per_cpu(xt_info_locks, cpu).lock);
+ /* this is kind of a write_seqcount_end(), but addend is 0 or 1 */
+ smp_wmb();
+ __this_cpu_add(xt_recseq.sequence, addend);
}
/*
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 4b5d457..2ea7433 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -260,6 +260,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
void *table_base;
const struct xt_table_info *private;
struct xt_action_param acpar;
+ unsigned int addend;
if (!pskb_may_pull(skb, arp_hdr_len(skb->dev)))
return NF_DROP;
@@ -267,7 +268,8 @@ unsigned int arpt_do_table(struct sk_buff *skb,
indev = in ? in->name : nulldevname;
outdev = out ? out->name : nulldevname;
- xt_info_rdlock_bh();
+ local_bh_disable();
+ addend = xt_write_recseq_begin();
private = table->private;
table_base = private->entries[smp_processor_id()];
@@ -338,7 +340,8 @@ unsigned int arpt_do_table(struct sk_buff *skb,
/* Verdict */
break;
} while (!acpar.hotdrop);
- xt_info_rdunlock_bh();
+ xt_write_recseq_end(addend);
+ local_bh_enable();
if (acpar.hotdrop)
return NF_DROP;
@@ -712,7 +715,7 @@ static void get_counters(const struct xt_table_info *t,
unsigned int i;
for_each_possible_cpu(cpu) {
- seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock;
+ seqcount_t *s = &per_cpu(xt_recseq, cpu);
i = 0;
xt_entry_foreach(iter, t->entries[cpu], t->size) {
@@ -720,10 +723,10 @@ static void get_counters(const struct xt_table_info *t,
unsigned int start;
do {
- start = read_seqbegin(lock);
+ start = read_seqcount_begin(s);
bcnt = iter->counters.bcnt;
pcnt = iter->counters.pcnt;
- } while (read_seqretry(lock, start));
+ } while (read_seqcount_retry(s, start));
ADD_COUNTER(counters[i], bcnt, pcnt);
++i;
@@ -1115,6 +1118,7 @@ static int do_add_counters(struct net *net, const void __user *user,
int ret = 0;
void *loc_cpu_entry;
struct arpt_entry *iter;
+ unsigned int addend;
#ifdef CONFIG_COMPAT
struct compat_xt_counters_info compat_tmp;
@@ -1171,12 +1175,12 @@ static int do_add_counters(struct net *net, const void __user *user,
/* Choose the copy that is on our node */
curcpu = smp_processor_id();
loc_cpu_entry = private->entries[curcpu];
- xt_info_wrlock(curcpu);
+ addend = xt_write_recseq_begin();
xt_entry_foreach(iter, loc_cpu_entry, private->size) {
ADD_COUNTER(iter->counters, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
- xt_info_wrunlock(curcpu);
+ xt_write_recseq_end(addend);
unlock_up_free:
local_bh_enable();
xt_table_unlock(t);
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index ffcea0d..2b6b700 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -68,15 +68,6 @@ void *ipt_alloc_initial_table(const struct xt_table *info)
}
EXPORT_SYMBOL_GPL(ipt_alloc_initial_table);
-/*
- We keep a set of rules for each CPU, so we can avoid write-locking
- them in the softirq when updating the counters and therefore
- only need to read-lock in the softirq; doing a write_lock_bh() in user
- context stops packets coming through and allows user context to read
- the counters or update the rules.
-
- Hence the start of any table is given by get_table() below. */
-
/* Returns whether matches rule or not. */
/* Performance critical - called for every packet */
static inline bool
@@ -311,6 +302,7 @@ ipt_do_table(struct sk_buff *skb,
unsigned int *stackptr, origptr, cpu;
const struct xt_table_info *private;
struct xt_action_param acpar;
+ unsigned int addend;
/* Initialization */
ip = ip_hdr(skb);
@@ -331,7 +323,8 @@ ipt_do_table(struct sk_buff *skb,
acpar.hooknum = hook;
IP_NF_ASSERT(table->valid_hooks & (1 << hook));
- xt_info_rdlock_bh();
+ local_bh_disable();
+ addend = xt_write_recseq_begin();
private = table->private;
cpu = smp_processor_id();
table_base = private->entries[cpu];
@@ -430,7 +423,9 @@ ipt_do_table(struct sk_buff *skb,
pr_debug("Exiting %s; resetting sp from %u to %u\n",
__func__, *stackptr, origptr);
*stackptr = origptr;
- xt_info_rdunlock_bh();
+ xt_write_recseq_end(addend);
+ local_bh_enable();
+
#ifdef DEBUG_ALLOW_ALL
return NF_ACCEPT;
#else
@@ -886,7 +881,7 @@ get_counters(const struct xt_table_info *t,
unsigned int i;
for_each_possible_cpu(cpu) {
- seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock;
+ seqcount_t *s = &per_cpu(xt_recseq, cpu);
i = 0;
xt_entry_foreach(iter, t->entries[cpu], t->size) {
@@ -894,10 +889,10 @@ get_counters(const struct xt_table_info *t,
unsigned int start;
do {
- start = read_seqbegin(lock);
+ start = read_seqcount_begin(s);
bcnt = iter->counters.bcnt;
pcnt = iter->counters.pcnt;
- } while (read_seqretry(lock, start));
+ } while (read_seqcount_retry(s, start));
ADD_COUNTER(counters[i], bcnt, pcnt);
++i; /* macro does multi eval of i */
@@ -1312,6 +1307,7 @@ do_add_counters(struct net *net, const void __user *user,
int ret = 0;
void *loc_cpu_entry;
struct ipt_entry *iter;
+ unsigned int addend;
#ifdef CONFIG_COMPAT
struct compat_xt_counters_info compat_tmp;
@@ -1368,12 +1364,12 @@ do_add_counters(struct net *net, const void __user *user,
/* Choose the copy that is on our node */
curcpu = smp_processor_id();
loc_cpu_entry = private->entries[curcpu];
- xt_info_wrlock(curcpu);
+ addend = xt_write_recseq_begin();
xt_entry_foreach(iter, loc_cpu_entry, private->size) {
ADD_COUNTER(iter->counters, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
- xt_info_wrunlock(curcpu);
+ xt_write_recseq_end(addend);
unlock_up_free:
local_bh_enable();
xt_table_unlock(t);
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 0b2af9b..ec7cf57 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -340,6 +340,7 @@ ip6t_do_table(struct sk_buff *skb,
unsigned int *stackptr, origptr, cpu;
const struct xt_table_info *private;
struct xt_action_param acpar;
+ unsigned int addend;
/* Initialization */
indev = in ? in->name : nulldevname;
@@ -358,7 +359,8 @@ ip6t_do_table(struct sk_buff *skb,
IP_NF_ASSERT(table->valid_hooks & (1 << hook));
- xt_info_rdlock_bh();
+ local_bh_disable();
+ addend = xt_write_recseq_begin();
private = table->private;
cpu = smp_processor_id();
table_base = private->entries[cpu];
@@ -442,7 +444,9 @@ ip6t_do_table(struct sk_buff *skb,
} while (!acpar.hotdrop);
*stackptr = origptr;
- xt_info_rdunlock_bh();
+
+ xt_write_recseq_end(addend);
+ local_bh_enable();
#ifdef DEBUG_ALLOW_ALL
return NF_ACCEPT;
@@ -899,7 +903,7 @@ get_counters(const struct xt_table_info *t,
unsigned int i;
for_each_possible_cpu(cpu) {
- seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock;
+ seqcount_t *s = &per_cpu(xt_recseq, cpu);
i = 0;
xt_entry_foreach(iter, t->entries[cpu], t->size) {
@@ -907,10 +911,10 @@ get_counters(const struct xt_table_info *t,
unsigned int start;
do {
- start = read_seqbegin(lock);
+ start = read_seqcount_begin(s);
bcnt = iter->counters.bcnt;
pcnt = iter->counters.pcnt;
- } while (read_seqretry(lock, start));
+ } while (read_seqcount_retry(s, start));
ADD_COUNTER(counters[i], bcnt, pcnt);
++i;
@@ -1325,6 +1329,7 @@ do_add_counters(struct net *net, const void __user *user, unsigned int len,
int ret = 0;
const void *loc_cpu_entry;
struct ip6t_entry *iter;
+ unsigned int addend;
#ifdef CONFIG_COMPAT
struct compat_xt_counters_info compat_tmp;
@@ -1381,13 +1386,13 @@ do_add_counters(struct net *net, const void __user *user, unsigned int len,
i = 0;
/* Choose the copy that is on our node */
curcpu = smp_processor_id();
- xt_info_wrlock(curcpu);
+ addend = xt_write_recseq_begin();
loc_cpu_entry = private->entries[curcpu];
xt_entry_foreach(iter, loc_cpu_entry, private->size) {
ADD_COUNTER(iter->counters, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
- xt_info_wrunlock(curcpu);
+ xt_write_recseq_end(addend);
unlock_up_free:
local_bh_enable();
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index a9adf4c..52959ef 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -762,8 +762,8 @@ void xt_compat_unlock(u_int8_t af)
EXPORT_SYMBOL_GPL(xt_compat_unlock);
#endif
-DEFINE_PER_CPU(struct xt_info_lock, xt_info_locks);
-EXPORT_PER_CPU_SYMBOL_GPL(xt_info_locks);
+DEFINE_PER_CPU(seqcount_t, xt_recseq);
+EXPORT_PER_CPU_SYMBOL_GPL(xt_recseq);
static int xt_jumpstack_alloc(struct xt_table_info *i)
{
@@ -1362,10 +1362,7 @@ static int __init xt_init(void)
int rv;
for_each_possible_cpu(i) {
- struct xt_info_lock *lock = &per_cpu(xt_info_locks, i);
-
- seqlock_init(&lock->lock);
- lock->readers = 0;
+ seqcount_init(&per_cpu(xt_recseq, i));
}
xt = kmalloc(sizeof(struct xt_af) * NFPROTO_NUMPROTO, GFP_KERNEL);
* [PATCH net-next-2.6 2/2] be2net: Fix suspend/resume operation
From: Padmanabh Ratnakar @ 2011-04-03 11:54 UTC (permalink / raw)
To: netdev, davem; +Cc: Padmanabh Ratnakar, Sarveswara Rao Mygapula
eq_next_idx is not reset to zero during suspend, which causes resume
to fail. Reset it in be_clear().
Signed-off-by: Sarveswara Rao Mygapula <sarveswararao.mygapula@emulex.com>
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
---
drivers/net/benet/be_main.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index 351c1f2..3e6c1cb 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -2356,6 +2356,7 @@ static int be_clear(struct be_adapter *adapter)
be_mcc_queues_destroy(adapter);
be_rx_queues_destroy(adapter);
be_tx_queues_destroy(adapter);
+ adapter->eq_next_idx = 0;
if (be_physfn(adapter) && adapter->sriov_enabled)
for (vf = 0; vf < num_vfs; vf++)
--
1.6.0.2
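
A small model of why a stale index counter could break a second setup. The message does not spell out the exact failure, so the out-of-range reading here is an assumption, and struct adapter, NUM_VECTORS and the queues_*() helpers are hypothetical simplifications of the driver.

```c
#define NUM_VECTORS 2

/* Hypothetical miniature of the adapter state involved. */
struct adapter {
	unsigned char eq_next_idx;	/* next event-queue index to hand out */
	unsigned char tx_eq_idx;
	unsigned char rx_eq_idx;
};

/* Queue creation hands out consecutive indices. */
static void queues_create(struct adapter *a)
{
	a->tx_eq_idx = a->eq_next_idx++;
	a->rx_eq_idx = a->eq_next_idx++;
}

/* Teardown; 'reset' selects the behaviour with or without the fix. */
static void queues_clear(struct adapter *a, int reset)
{
	if (reset)
		a->eq_next_idx = 0;
}

/* An index is usable only if it addresses an allocated vector. */
static int indices_valid(const struct adapter *a)
{
	return a->tx_eq_idx < NUM_VECTORS && a->rx_eq_idx < NUM_VECTORS;
}
```

Without the reset, the second queues_create() starts counting where the first one stopped, so the indices no longer match the vectors that were allocated.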
* [PATCH net-next-2.6 0/2] be2net: Patches for fixing suspend/resume
From: Padmanabh Ratnakar @ 2011-04-03 11:53 UTC (permalink / raw)
To: netdev, davem; +Cc: Padmanabh Ratnakar
Hi David,
The following patches fix the suspend/resume operation.
Please apply.
Thanks,
Padmanabh
Padmanabh Ratnakar (2):
be2net: Rename some struct members for clarity
be2net: Fix suspend/resume operation
drivers/net/benet/be.h | 4 ++--
drivers/net/benet/be_main.c | 11 ++++++-----
2 files changed, 8 insertions(+), 7 deletions(-)
* [PATCH net-next-2.6 1/2] be2net: Rename some struct members for clarity
From: Padmanabh Ratnakar @ 2011-04-03 11:54 UTC (permalink / raw)
To: netdev, davem; +Cc: Padmanabh Ratnakar, Sarveswara Rao Mygapula
Renamed msix_vec_idx to eq_idx in be_eq_obj struct.
Renamed msix_vec_next_idx to eq_next_idx in be_adapter structure.
These members are used in INTX mode also.
Signed-off-by: Sarveswara Rao Mygapula <sarveswararao.mygapula@emulex.com>
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
---
drivers/net/benet/be.h | 4 ++--
drivers/net/benet/be_main.c | 10 +++++-----
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/net/benet/be.h b/drivers/net/benet/be.h
index 3937bca..29f1210 100644
--- a/drivers/net/benet/be.h
+++ b/drivers/net/benet/be.h
@@ -155,7 +155,7 @@ struct be_eq_obj {
u16 min_eqd; /* in usecs */
u16 max_eqd; /* in usecs */
u16 cur_eqd; /* in usecs */
- u8 msix_vec_idx;
+ u8 eq_idx;
struct napi_struct napi;
};
@@ -292,7 +292,7 @@ struct be_adapter {
u32 num_rx_qs;
u32 big_page_size; /* Compounded page size shared by rx wrbs */
- u8 msix_vec_next_idx;
+ u8 eq_next_idx;
struct be_drv_stats drv_stats;
struct vlan_group *vlan_grp;
diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index a24fb45..351c1f2 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -1509,7 +1509,7 @@ static int be_tx_queues_create(struct be_adapter *adapter)
if (be_cmd_eq_create(adapter, eq, adapter->tx_eq.cur_eqd))
goto tx_eq_free;
- adapter->tx_eq.msix_vec_idx = adapter->msix_vec_next_idx++;
+ adapter->tx_eq.eq_idx = adapter->eq_next_idx++;
/* Alloc TX eth compl queue */
@@ -1621,7 +1621,7 @@ static int be_rx_queues_create(struct be_adapter *adapter)
if (rc)
goto err;
- rxo->rx_eq.msix_vec_idx = adapter->msix_vec_next_idx++;
+ rxo->rx_eq.eq_idx = adapter->eq_next_idx++;
/* CQ */
cq = &rxo->cq;
@@ -1697,11 +1697,11 @@ static irqreturn_t be_intx(int irq, void *dev)
if (!isr)
return IRQ_NONE;
- if ((1 << adapter->tx_eq.msix_vec_idx & isr))
+ if ((1 << adapter->tx_eq.eq_idx & isr))
event_handle(adapter, &adapter->tx_eq);
for_all_rx_queues(adapter, rxo, i) {
- if ((1 << rxo->rx_eq.msix_vec_idx & isr))
+ if ((1 << rxo->rx_eq.eq_idx & isr))
event_handle(adapter, &rxo->rx_eq);
}
}
@@ -1964,7 +1964,7 @@ static void be_sriov_disable(struct be_adapter *adapter)
static inline int be_msix_vec_get(struct be_adapter *adapter,
struct be_eq_obj *eq_obj)
{
- return adapter->msix_entries[eq_obj->msix_vec_idx].vector;
+ return adapter->msix_entries[eq_obj->eq_idx].vector;
}
static int be_request_irq(struct be_adapter *adapter,
--
1.6.0.2
* [PATCH] xen: netfront: fix declaration order
From: Eric Dumazet @ 2011-04-03 11:07 UTC (permalink / raw)
To: David Miller
Cc: mirq-linux, netdev, jeremy.fitzhardinge, konrad.wilk,
Ian.Campbell, xen-devel, virtualization
In-Reply-To: <20110401.205455.70198735.davem@davemloft.net>
Le vendredi 01 avril 2011 à 20:54 -0700, David Miller a écrit :
> From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> Date: Thu, 31 Mar 2011 13:01:35 +0200 (CEST)
>
> > Not tested in any way. The original code for offload setting seems broken
> > as it resets the features on every netback reconnect.
> >
> > This will set GSO_ROBUST at device creation time (earlier than connect time).
> >
> > RX checksum offload is forced on - so advertise as it is.
> >
> > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
>
> Applied.
Hmm... I had to apply the following patch to make it actually compile.
Thanks
[PATCH] xen: netfront: fix declaration order
Must declare xennet_fix_features() and xennet_set_features() before
using them.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
drivers/net/xen-netfront.c | 72 +++++++++++++++++------------------
1 file changed, 36 insertions(+), 36 deletions(-)
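
The ordering constraint in miniature, as a sketch only: struct ops, FEATURE_SG, demo_fix_features() and demo_ops are hypothetical names, not the driver's.

```c
/* Hypothetical miniature of a net_device_ops-style table. */
struct ops {
	unsigned int (*fix_features)(unsigned int features);
};

#define FEATURE_SG 0x1u	/* hypothetical feature bit */

/*
 * Defined before its first use. If this definition came after
 * demo_ops below, with no earlier declaration, the initializer would
 * name an undeclared identifier and the file would not compile.
 */
static unsigned int demo_fix_features(unsigned int features)
{
	return features & ~FEATURE_SG;
}

static const struct ops demo_ops = {
	.fix_features = demo_fix_features,
};
```

A forward declaration ahead of the table would also satisfy the compiler; the patch moves the definitions instead.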
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index f6e7e27..0cfe4cc 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1140,6 +1140,42 @@ static void xennet_uninit(struct net_device *dev)
gnttab_free_grant_references(np->gref_rx_head);
}
+static u32 xennet_fix_features(struct net_device *dev, u32 features)
+{
+ struct netfront_info *np = netdev_priv(dev);
+ int val;
+
+ if (features & NETIF_F_SG) {
+ if (xenbus_scanf(XBT_NIL, np->xbdev->otherend, "feature-sg",
+ "%d", &val) < 0)
+ val = 0;
+
+ if (!val)
+ features &= ~NETIF_F_SG;
+ }
+
+ if (features & NETIF_F_TSO) {
+ if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
+ "feature-gso-tcpv4", "%d", &val) < 0)
+ val = 0;
+
+ if (!val)
+ features &= ~NETIF_F_TSO;
+ }
+
+ return features;
+}
+
+static int xennet_set_features(struct net_device *dev, u32 features)
+{
+ if (!(features & NETIF_F_SG) && dev->mtu > ETH_DATA_LEN) {
+ netdev_info(dev, "Reducing MTU because no SG offload");
+ dev->mtu = ETH_DATA_LEN;
+ }
+
+ return 0;
+}
+
static const struct net_device_ops xennet_netdev_ops = {
.ndo_open = xennet_open,
.ndo_uninit = xennet_uninit,
@@ -1513,42 +1549,6 @@ again:
return err;
}
-static u32 xennet_fix_features(struct net_device *dev, u32 features)
-{
- struct netfront_info *np = netdev_priv(dev);
- int val;
-
- if (features & NETIF_F_SG) {
- if (xenbus_scanf(XBT_NIL, np->xbdev->otherend, "feature-sg",
- "%d", &val) < 0)
- val = 0;
-
- if (!val)
- features &= ~NETIF_F_SG;
- }
-
- if (features & NETIF_F_TSO) {
- if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
- "feature-gso-tcpv4", "%d", &val) < 0)
- val = 0;
-
- if (!val)
- features &= ~NETIF_F_TSO;
- }
-
- return features;
-}
-
-static int xennet_set_features(struct net_device *dev, u32 features)
-{
- if (!(features & NETIF_F_SG) && dev->mtu > ETH_DATA_LEN) {
- netdev_info(dev, "Reducing MTU because no SG offload");
- dev->mtu = ETH_DATA_LEN;
- }
-
- return 0;
-}
-
static int xennet_connect(struct net_device *dev)
{
struct netfront_info *np = netdev_priv(dev);
* Re: [patch net-next-2.6] net: vlan: make non-hw-accel rx path similar to hw-accel
From: Nicolas de Pesloüan @ 2011-04-03 9:27 UTC (permalink / raw)
To: Jiri Pirko
Cc: Stephen Hemminger, netdev, davem, kaber, fubar, eric.dumazet,
andy, xiaosuo, jesse
In-Reply-To: <20110402182711.GA11885@psychotron.redhat.com>
On 02/04/2011 20:27, Jiri Pirko wrote:
> Sat, Apr 02, 2011 at 05:55:24PM CEST, shemminger@linux-foundation.org wrote:
>> On Sat, 2 Apr 2011 12:26:06 +0200
>> Jiri Pirko<jpirko@redhat.com> wrote:
<snip>
>>> + rawp = skb->data;
>>> + if (*(unsigned short *) rawp == 0xFFFF)
>>> + /*
>>> + * This is a magic hack to spot IPX packets. Older Novell
>>> + * breaks the protocol design and runs IPX over 802.3 without
>>> + * an 802.2 LLC layer. We look for FFFF which isn't a used
>>> + * 802.2 SSAP/DSAP. This won't work for fault tolerant netware
>>> + * but does for the rest.
>>> + */
>>> + skb->protocol = htons(ETH_P_802_3);
>>> + else
>>> + /*
>>> + * Real 802.2 LLC
>>> + */
>>> + skb->protocol = htons(ETH_P_802_2);
>>> +}
>>
>> What about doubly tagged packets?
>
> No problem. Once they are untagged and reinjected they are untagged
> again and reinjected again:
>
> -> __netif_receive_skb
> vlan_untag
> vlan_do_receive
> -> __netif_receive_skb
> vlan_untag
> vlan_do_receive
Hi Jiri,
Instead of untagging and reinjecting, wouldn't it be possible to remove all 802.1Q headers in a loop
and deliver the untagged skb directly? Are there any hw-accel implementations that remove several
levels of tagging?
Nicolas.
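To make the loop idea concrete, here is a minimal userspace sketch, not kernel code. It walks a raw Ethernet frame and steps over every 802.1Q tag in a single pass. All names (strip_vlan_tags, read_be16, MAX_VLAN_TAGS) are hypothetical and exist only for illustration. It also does not attempt any per-level VLAN device lookup, which is presumably what the vlan_do_receive step in the quoted call chain is for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN      6
#define ETH_HLEN      14      /* dst MAC + src MAC + EtherType */
#define VLAN_HLEN     4       /* TPID (2 bytes) + TCI (2 bytes) */
#define ETH_P_8021Q   0x8100  /* 802.1Q TPID */
#define MAX_VLAN_TAGS 8       /* hypothetical bound on nesting depth */

/* Read a big-endian 16-bit value from the frame. */
static uint16_t read_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Step over every 802.1Q tag that follows the MAC addresses.
 * Stores the VLAN IDs in vids[] (outermost first) and returns the
 * number of tags found, or -1 if the frame is truncated.
 * *proto receives the first EtherType that is not 0x8100 (or the one
 * reached when MAX_VLAN_TAGS is hit); *payload_off receives the offset
 * of the byte that follows it.
 */
static int strip_vlan_tags(const uint8_t *frame, size_t len,
			   uint16_t vids[MAX_VLAN_TAGS],
			   uint16_t *proto, size_t *payload_off)
{
	size_t off = 2 * ETH_ALEN;	/* offset of the current EtherType */
	int depth = 0;

	assert(frame && vids && proto && payload_off);

	if (len < ETH_HLEN)
		return -1;

	while (depth < MAX_VLAN_TAGS &&
	       read_be16(frame + off) == ETH_P_8021Q) {
		/* Need TPID + TCI + the EtherType that follows the tag. */
		if (len < off + VLAN_HLEN + 2)
			return -1;
		vids[depth++] = read_be16(frame + off + 2) & 0x0fff;
		off += VLAN_HLEN;
	}

	*proto = read_be16(frame + off);
	*payload_off = off + 2;
	return depth;
}
```

For a frame carrying two tags (VID 100 outside, VID 200 inside) in front of an IPv4 EtherType, the function returns 2 and reports 0x0800 as the inner protocol. Whether the real rx path could deliver from such a loop depends on how each tag level maps to a device, which this sketch leaves out.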
^ permalink raw reply
* Re: [PATCH v1] net: filter: Just In Time compiler
From: Eric Dumazet @ 2011-04-03 9:04 UTC (permalink / raw)
To: David Miller; +Cc: netdev, acme
In-Reply-To: <20110402.224344.191403947.davem@davemloft.net>
On Saturday, 2 April 2011 at 22:43 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 03 Apr 2011 00:28:21 +0200
>
> > In order to speed up packet filtering, here is an implementation of a JIT
> > compiler for x86_64
>
> Looks great. Of course, everything should sit under arch/${ARCH}/ in the
> end, but this is an excellent proof of concept!
Yes. The real thing would be to implement iptables as JIT ;)
^ permalink raw reply
* Re: [PATCH v2] vlan: convert VLAN devices to use ndo_fix_features()
From: David Miller @ 2011-04-03 5:50 UTC (permalink / raw)
To: mirq-linux; +Cc: netdev, jesse, kaber, john.r.fastabend, eric.dumazet
In-Reply-To: <20110402144141.0666C1389B@rere.qmqm.pl>
From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Sat, 2 Apr 2011 16:41:41 +0200 (CEST)
> Note: get_flags was actually broken, because it should return the
> flags capped with vlan_features. This is now done implicitly by
> limiting netdev->hw_features.
>
> RX checksumming offload control is (and was) broken, as there was no way
> before to say whether it's done for tagged packets.
>
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Applied.
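As an illustration of the "capped with vlan_features" point in the quoted note, here is a small userspace sketch. It only shows the masking idea; the struct, the TOY_F_* bits and toy_vlan_fix_features are hypothetical stand-ins and are not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits, standing in for NETIF_F_* values. */
#define TOY_F_SG      (1u << 0)
#define TOY_F_IP_CSUM (1u << 1)
#define TOY_F_TSO     (1u << 2)
#define TOY_F_TSO6    (1u << 3)

/* Toy model of the underlying (real) device. */
struct toy_real_dev {
	uint32_t features;	/* offloads currently enabled */
	uint32_t vlan_features;	/* offloads usable on tagged packets */
};

/*
 * A VLAN device can only use an offload that is both enabled on the
 * real device and listed in its vlan_features, so the requested set
 * is capped with both masks.
 */
static uint32_t toy_vlan_fix_features(const struct toy_real_dev *real,
				      uint32_t wanted)
{
	assert(real);
	return wanted & real->features & real->vlan_features;
}
```

With a real device that enables TSO6 but leaves it out of vlan_features, a VLAN device asking for every offload ends up without TSO6. Reporting flags from the capped set, rather than from the real device directly, is the behaviour the note describes as previously missing.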
^ permalink raw reply
* Re: [PATCH] net: Call netdev_features_change() from netdev_update_features()
From: David Miller @ 2011-04-03 5:50 UTC (permalink / raw)
To: mirq-linux; +Cc: netdev, bhutchings, jesse
In-Reply-To: <20110402144141.0192613909@rere.qmqm.pl>
From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Sat, 2 Apr 2011 16:41:40 +0200 (CEST)
> Issue FEAT_CHANGE notification when features are changed by
> netdev_update_features(). This will allow changes made by extra constraints
> on e.g. MTU change to be properly propagated like changes via ethtool.
>
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Applied.
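The pattern described in the quoted changelog can be sketched in userspace as "recompute, compare, notify only on change". The code below is an assumption-laden toy: the struct, the MTU constraint on TSO and all toy_* names are invented for illustration and do not mirror the actual netdev_update_features() implementation.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_F_SG        (1u << 0)
#define TOY_F_TSO       (1u << 1)
#define TOY_MAX_TSO_MTU 1500u	/* hypothetical driver constraint */

struct toy_netdev {
	uint32_t wanted_features;	/* what the user asked for */
	uint32_t features;		/* what is currently active */
	unsigned int mtu;
	unsigned int notifications;	/* counts feature-change events */
};

/* Hypothetical constraint: this toy device cannot do TSO with a large MTU. */
static uint32_t toy_fix_features(const struct toy_netdev *dev, uint32_t features)
{
	if (dev->mtu > TOY_MAX_TSO_MTU)
		features &= ~TOY_F_TSO;
	return features;
}

/* Stands in for the feature-change notification. */
static void toy_features_change(struct toy_netdev *dev)
{
	dev->notifications++;
}

/* Recompute the active set; notify only if it actually changed. */
static int toy_update_features(struct toy_netdev *dev)
{
	uint32_t features;

	assert(dev);
	features = toy_fix_features(dev, dev->wanted_features);
	if (features == dev->features)
		return 0;

	dev->features = features;
	toy_features_change(dev);
	return 1;
}

/* An MTU change re-evaluates the constraints through the same path. */
static void toy_change_mtu(struct toy_netdev *dev, unsigned int new_mtu)
{
	assert(dev);
	dev->mtu = new_mtu;
	toy_update_features(dev);
}
```

Raising the MTU above the toy limit drops TSO and fires one notification; a second MTU change that leaves the feature set untouched fires none. Putting the notification inside the update helper means every caller gets it, not only the ethtool path.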
^ permalink raw reply
* Re: [PATCH v1] net: filter: Just In Time compiler
From: David Miller @ 2011-04-03 5:43 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, acme
In-Reply-To: <1301783301.2837.77.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 03 Apr 2011 00:28:21 +0200
> In order to speed up packet filtering, here is an implementation of a JIT
> compiler for x86_64
Looks great. Of course, everything should sit under arch/${ARCH}/ in the
end, but this is an excellent proof of concept!
^ permalink raw reply