Re: New driver "sfc" for Solarstorm SFC4000 controller.

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
       [not found] <200804301925.m3UJPc72001651@hera.kernel.org>
@ 2008-05-01 19:08 ` Andrew Morton
  2008-05-02 16:05   ` Ben Hutchings
  2008-05-02 19:10   ` Michael Brown
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2008-05-01 19:08 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev

On Wed, 30 Apr 2008 19:25:38 GMT
Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:

> Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8ceee660aacb29721e26f08e336c58dc4847d1bd
> Commit:     8ceee660aacb29721e26f08e336c58dc4847d1bd
> Parent:     358c12953b88c5a06a57c33eb27c753b2e7934d1
> Author:     Ben Hutchings <bhutchings@solarflare.com>
> AuthorDate: Sun Apr 27 12:55:59 2008 +0100
> Committer:  Jeff Garzik <jgarzik@redhat.com>
> CommitDate: Tue Apr 29 01:42:43 2008 -0400
> 
>     New driver "sfc" for Solarstorm SFC4000 controller.
>     
>     The driver supports the 10Xpress PHY and XFP modules on our reference
>     designs SFE4001 and SFE4002 and the SMC models SMC10GPCIe-XFP and
>     SMC10GPCIe-10BT.
>     
>     Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
>     Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
>
> ..
>
> --- /dev/null
> +++ b/drivers/net/sfc/bitfield.h
>

weep.

This entire (large, complex, undocumeted) header file contains quite
generally applicable code which could be used from many many other parts of
the kernel.  Yet here it all is, stuffed way down in drivers/net/sfc/.

What happens if all drivers do this?  Well, they do.  We get a mess.

btw, the implementation is riotously buggy if anyone passes any of these
macros an expression which has side-effects.

> ...
>
> --- /dev/null
> +++ b/drivers/net/sfc/boards.c
> @@ -0,0 +1,167 @@
> +/****************************************************************************
> + * Driver for Solarflare Solarstorm network controllers and boards
> + * Copyright 2007 Solarflare Communications Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation, incorporated herein by reference.
> + */
> +
> +#include "net_driver.h"
> +#include "phy.h"
> +#include "boards.h"
> +#include "efx.h"
> +
> +/* Macros for unpacking the board revision */
> +/* The revision info is in host byte order. */
> +#define BOARD_TYPE(_rev) (_rev >> 8)
> +#define BOARD_MAJOR(_rev) ((_rev >> 4) & 0xf)
> +#define BOARD_MINOR(_rev) (_rev & 0xf)
> +
> +/* Blink support. If the PHY has no auto-blink mode so we hang it off a timer */
> +#define BLINK_INTERVAL (HZ/2)
> +
> +static void blink_led_timer(unsigned long context)
> +{
> +	struct efx_nic *efx = (struct efx_nic *)context;
> +	struct efx_blinker *bl = &efx->board_info.blinker;
> +	efx->board_info.set_fault_led(efx, bl->state);
> +	bl->state = !bl->state;
> +	if (bl->resubmit) {
> +		bl->timer.expires = jiffies + BLINK_INTERVAL;
> +		add_timer(&bl->timer);

mod_timer()

> +	}
> +}
> +
> +static inline int efx_process_channel(struct efx_channel *channel, int rx_quota)
> +{
> +	int rxdmaqs;
> +	struct efx_rx_queue *rx_queue;
> +
> +	if (unlikely(channel->efx->reset_pending != RESET_TYPE_NONE ||
> +		     !channel->enabled))
> +		return rx_quota;
> +
> +	rxdmaqs = falcon_process_eventq(channel, &rx_quota);
> +
> +	/* Deliver last RX packet. */
> +	if (channel->rx_pkt) {
> +		__efx_rx_packet(channel, channel->rx_pkt,
> +				channel->rx_pkt_csummed);
> +		channel->rx_pkt = NULL;
> +	}
> +
> +	efx_flush_lro(channel);
> +	efx_rx_strategy(channel);
> +
> +	/* Refill descriptor rings as necessary */
> +	rx_queue = &channel->efx->rx_queue[0];
> +	while (rxdmaqs) {
> +		if (rxdmaqs & 0x01)
> +			efx_fast_push_rx_descriptors(rx_queue);
> +		rx_queue++;
> +		rxdmaqs >>= 1;
> +	}
> +
> +	return rx_quota;
> +}

This has two callsites and is far far too large to be inlined.

> +/* Mark channel as finished processing
> + *
> + * Note that since we will not receive further interrupts for this
> + * channel before we finish processing and call the eventq_read_ack()
> + * method, there is no need to use the interrupt hold-off timers.
> + */
> +static inline void efx_channel_processed(struct efx_channel *channel)
> +{
> +	/* Write to EVQ_RPTR_REG.  If a new event arrived in a race
> +	 * with finishing processing, a new interrupt will be raised.
> +	 */
> +	channel->work_pending = 0;
> +	smp_wmb(); /* Ensure channel updated before any new interrupt. */

Is smp_wmb() the right thing here?  I think you're trying to ensure that
the CPU has issued the store before we write the interrupt ack to the
hardware, yes?

If so, the problem which you're addressing can occur on non-SMP hardware
too?

> +	falcon_eventq_read_ack(channel);
> +}
> +
> +void efx_process_channel_now(struct efx_channel *channel)
> +{
> +	struct efx_nic *efx = channel->efx;
> +
> +	BUG_ON(!channel->used_flags);
> +	BUG_ON(!channel->enabled);
> +
> +	/* Disable interrupts and wait for ISRs to complete */
> +	falcon_disable_interrupts(efx);
> +	if (efx->legacy_irq)
> +		synchronize_irq(efx->legacy_irq);
> +	if (channel->has_interrupt && channel->irq)
> +		synchronize_irq(channel->irq);
> +
> +	/* Wait for any NAPI processing to complete */
> +	napi_disable(&channel->napi_str);
> +
> +	/* Poll the channel */
> +	(void) efx_process_channel(channel, efx->type->evq_size);

There's little point in the (void) thing ans we usually don't do it.

For a start: do all non-checked, non-void-returning functions in this
driver have the (void)?  We don't know, because there's no way to tell.  it
isn't maintainable.

> +	/* Ack the eventq. This may cause an interrupt to be generated
> +	 * when they are reenabled */
> +	efx_channel_processed(channel);
> +
> +	napi_enable(&channel->napi_str);
> +	falcon_enable_interrupts(efx);
> +}
> +
> +static void efx_calc_rx_buffer_params(struct efx_nic *efx)
> +{
> +	unsigned int order, len;
> +
> +	len = (max(EFX_PAGE_IP_ALIGN, NET_IP_ALIGN) +
> +	       EFX_MAX_FRAME_LEN(efx->net_dev->mtu) +
> +	       efx->type->rx_buffer_padding);
> +
> +	/* Calculate page-order */
> +	for (order = 0; ((1u << order) * PAGE_SIZE) < len; ++order)
> +		;

get_order()

> +	efx->rx_buffer_len = len;
> +	efx->rx_buffer_order = order;
> +}
> +
> +static void efx_start_channel(struct efx_channel *channel)
> +{
> +	struct efx_rx_queue *rx_queue;
> +
> +	EFX_LOG(channel->efx, "starting chan %d\n", channel->channel);
> +
> +	if (!(channel->efx->net_dev->flags & IFF_UP))
> +		netif_napi_add(channel->napi_dev, &channel->napi_str,
> +			       efx_poll, napi_weight);
> +
> +	channel->work_pending = 0;
> +	channel->enabled = 1;
> +	smp_wmb(); /* ensure channel updated before first interrupt */

etc.

> +	napi_enable(&channel->napi_str);
> +
> +	/* Load up RX descriptors */
> +	efx_for_each_channel_rx_queue(rx_queue, channel)
> +		efx_fast_push_rx_descriptors(rx_queue);
> +}
> +
> +/* This disables event queue processing and packet transmission.
> + * This function does not guarantee that all queue processing
> + * (e.g. RX refill) is complete.
> + */
> +static void efx_stop_channel(struct efx_channel *channel)
> +{
> +	struct efx_rx_queue *rx_queue;
> +
> +	if (!channel->enabled)
> +		return;
> +
> +	EFX_LOG(channel->efx, "stop chan %d\n", channel->channel);
> +
> +	channel->enabled = 0;
> +	napi_disable(&channel->napi_str);
> +
> +	/* Ensure that any worker threads have exited or will be no-ops */
> +	efx_for_each_channel_rx_queue(rx_queue, channel) {
> +		spin_lock_bh(&rx_queue->add_lock);
> +		spin_unlock_bh(&rx_queue->add_lock);

whee.  Does this work correctly on CONFIG_SMP=n builds?

> +	}
> +}
> +
> +static int efx_probe_port(struct efx_nic *efx)
> +{
> +	int rc;
> +
> +	EFX_LOG(efx, "create port\n");
> +
> +	/* Connect up MAC/PHY operations table and read MAC address */
> +	rc = falcon_probe_port(efx);
> +	if (rc)
> +		goto err;
> +
> +	/* Sanity check MAC address */
> +	if (is_valid_ether_addr(efx->mac_address)) {
> +		memcpy(efx->net_dev->dev_addr, efx->mac_address, ETH_ALEN);
> +	} else {
> +		DECLARE_MAC_BUF(mac);

whodidthat?  DECLARE_MAC_BUF() is a definition, not a declaration.

> +		EFX_ERR(efx, "invalid MAC address %s\n",
> +			print_mac(mac, efx->mac_address));
> +		if (!allow_bad_hwaddr) {
> +			rc = -EINVAL;
> +			goto err;
> +		}
> +		random_ether_addr(efx->net_dev->dev_addr);
> +		EFX_INFO(efx, "using locally-generated MAC %s\n",
> +			 print_mac(mac, efx->net_dev->dev_addr));
> +	}
> +
> +	return 0;
> +
> + err:
> +	efx_remove_port(efx);
> +	return rc;
> +}
>
> ...
>
> +static void efx_stop_port(struct efx_nic *efx)
> +{
> +	EFX_LOG(efx, "stop port\n");
> +
> +	mutex_lock(&efx->mac_lock);
> +	efx->port_enabled = 0;
> +	mutex_unlock(&efx->mac_lock);

A lock-around-a-single-atomic-write always get me going.  It's _sometimes_
legitimate: if the other code paths which access the storage are reading it
multiple times within the same locked region, and expect all acesses to
return the same vaule.

But often that isn't the case, and there's a buglet, or unneeded locking.

> +	/* Serialise against efx_set_multicast_list() */
> +	if (NET_DEV_REGISTERED(efx)) {
> +		netif_tx_lock_bh(efx->net_dev);
> +		netif_tx_unlock_bh(efx->net_dev);
> +	}
> +}
>
> ...
>
> +static int efx_init_io(struct efx_nic *efx)
> +{
> +	struct pci_dev *pci_dev = efx->pci_dev;
> +	dma_addr_t dma_mask = efx->type->max_dma_mask;
> +	int rc;
> +
> +	EFX_LOG(efx, "initialising I/O\n");
> +
> +	rc = pci_enable_device(pci_dev);
> +	if (rc) {
> +		EFX_ERR(efx, "failed to enable PCI device\n");
> +		goto fail1;
> +	}
> +
> +	pci_set_master(pci_dev);
> +
> +	/* Set the PCI DMA mask.  Try all possibilities from our
> +	 * genuine mask down to 32 bits, because some architectures
> +	 * (e.g. x86_64 with iommu_sac_force set) will allow 40 bit
> +	 * masks event though they reject 46 bit masks.
> +	 */
> +	while (dma_mask > 0x7fffffffUL) {
> +		if (pci_dma_supported(pci_dev, dma_mask) &&
> +		    ((rc = pci_set_dma_mask(pci_dev, dma_mask)) == 0))
> +			break;
> +		dma_mask >>= 1;
> +	}

ho hum, that kinda sucks, doesn't it.  We'll improve this one day.

> +	if (rc) {
> +		EFX_ERR(efx, "could not find a suitable DMA mask\n");
> +		goto fail2;
> +	}
> +	EFX_LOG(efx, "using DMA mask %llx\n", (unsigned long long) dma_mask);
> +	rc = pci_set_consistent_dma_mask(pci_dev, dma_mask);
> +	if (rc) {
> +		/* pci_set_consistent_dma_mask() is not *allowed* to
> +		 * fail with a mask that pci_set_dma_mask() accepted,
> +		 * but just in case...
> +		 */
> +		EFX_ERR(efx, "failed to set consistent DMA mask\n");
> +		goto fail2;
> +	}
> +
> +	efx->membase_phys = pci_resource_start(efx->pci_dev,
> +					       efx->type->mem_bar);
> +	rc = pci_request_region(pci_dev, efx->type->mem_bar, "sfc");
> +	if (rc) {
> +		EFX_ERR(efx, "request for memory BAR failed\n");
> +		rc = -EIO;
> +		goto fail3;
> +	}
> +	efx->membase = ioremap_nocache(efx->membase_phys,
> +				       efx->type->mem_map_size);
> +	if (!efx->membase) {
> +		EFX_ERR(efx, "could not map memory BAR %d at %lx+%x\n",
> +			efx->type->mem_bar, efx->membase_phys,
> +			efx->type->mem_map_size);
> +		rc = -ENOMEM;
> +		goto fail4;
> +	}
> +	EFX_LOG(efx, "memory BAR %u at %lx+%x (virtual %p)\n",
> +		efx->type->mem_bar, efx->membase_phys, efx->type->mem_map_size,
> +		efx->membase);
> +
> +	return 0;
> +
> + fail4:
> +	release_mem_region(efx->membase_phys, efx->type->mem_map_size);
> + fail3:
> +	efx->membase_phys = 0UL;

Code like this is more maintainable if you leave off the "UL".

> + fail2:
> +	pci_disable_device(efx->pci_dev);
> + fail1:
> +	return rc;
> +}
> +
> +static void efx_fini_io(struct efx_nic *efx)
> +{
> +	EFX_LOG(efx, "shutting down I/O\n");
> +
> +	if (efx->membase) {
> +		iounmap(efx->membase);
> +		efx->membase = NULL;
> +	}
> +
> +	if (efx->membase_phys) {
> +		pci_release_region(efx->pci_dev, efx->type->mem_bar);
> +		efx->membase_phys = 0UL;
> +	}
> +
> +	pci_disable_device(efx->pci_dev);
> +}
> +
> +/* Probe the number and type of interrupts we are able to obtain. */
> +static void efx_probe_interrupts(struct efx_nic *efx)
> +{
> +	int max_channel = efx->type->phys_addr_channels - 1;
> +	struct msix_entry xentries[EFX_MAX_CHANNELS];

hm, 256 bytes of stack.  Let's home nobody increases EFX_MAX_CHANNELS.

> +	int rc, i;
> +
> +	if (efx->interrupt_mode == EFX_INT_MODE_MSIX) {
> +		BUG_ON(!pci_find_capability(efx->pci_dev, PCI_CAP_ID_MSIX));

That's a bit rude.  BUG_ON is (mostly) for handling software errors.  Here
we nuke the box if the hardware is discovered to be bad.  It would be
better to blurt a message and then fail the initialisation.

> +		efx->rss_queues = rss_cpus ? rss_cpus : num_online_cpus();
> +		efx->rss_queues = min(efx->rss_queues, max_channel + 1);
> +		efx->rss_queues = min(efx->rss_queues, EFX_MAX_CHANNELS);
> +
> +		/* Request maximum number of MSI interrupts, and fill out
> +		 * the channel interrupt information the allowed allocation */
> +		for (i = 0; i < efx->rss_queues; i++)
> +			xentries[i].entry = i;
> +		rc = pci_enable_msix(efx->pci_dev, xentries, efx->rss_queues);
> +		if (rc > 0) {
> +			EFX_BUG_ON_PARANOID(rc >= efx->rss_queues);
> +			efx->rss_queues = rc;
> +			rc = pci_enable_msix(efx->pci_dev, xentries,
> +					     efx->rss_queues);
> +		}
> +
> +		if (rc == 0) {
> +			for (i = 0; i < efx->rss_queues; i++) {
> +				efx->channel[i].has_interrupt = 1;
> +				efx->channel[i].irq = xentries[i].vector;
> +			}
> +		} else {
> +			/* Fall back to single channel MSI */
> +			efx->interrupt_mode = EFX_INT_MODE_MSI;
> +			EFX_ERR(efx, "could not enable MSI-X\n");
> +		}
> +	}
> +
> +	/* Try single interrupt MSI */
> +	if (efx->interrupt_mode == EFX_INT_MODE_MSI) {
> +		efx->rss_queues = 1;
> +		rc = pci_enable_msi(efx->pci_dev);
> +		if (rc == 0) {
> +			efx->channel[0].irq = efx->pci_dev->irq;
> +			efx->channel[0].has_interrupt = 1;
> +		} else {
> +			EFX_ERR(efx, "could not enable MSI\n");
> +			efx->interrupt_mode = EFX_INT_MODE_LEGACY;
> +		}
> +	}
> +
> +	/* Assume legacy interrupts */
> +	if (efx->interrupt_mode == EFX_INT_MODE_LEGACY) {
> +		efx->rss_queues = 1;
> +		/* Every channel is interruptible */
> +		for (i = 0; i < EFX_MAX_CHANNELS; i++)
> +			efx->channel[i].has_interrupt = 1;
> +		efx->legacy_irq = efx->pci_dev->irq;
> +	}
> +}
>
> ...
>
> +static void efx_flush_all(struct efx_nic *efx)
> +{
> +	struct efx_rx_queue *rx_queue;
> +
> +	/* Make sure the hardware monitor is stopped */
> +	cancel_delayed_work_sync(&efx->monitor_work);
> +
> +	/* Ensure that all RX slow refills are complete. */
> +	efx_for_each_rx_queue(rx_queue, efx) {
> +		cancel_delayed_work_sync(&rx_queue->work);
> +	}

Normally we'd omit the braces here.  Once upon a time checkpatch.pl would
warn about this but it got broken.

Whee, checkpatch takes over half a minute to check this driver.

Oh dear, it found

#5617: FILE: drivers/net/sfc/falcon.c:1877:
+               if (*(volatile u32 *)dma_done == FALCON_STATS_DONE)

which was naughty of you.  Perhaps this was already discussed in review
with the people who actually know what they're talking about.

> +	/* Stop scheduled port reconfigurations */
> +	cancel_work_sync(&efx->reconfigure_work);
> +
> +}
> +
> +/* Quiesce hardware and software without bringing the link down.
> + * Safe to call multiple times, when the nic and interface is in any
> + * state. The caller is guaranteed to subsequently be in a position
> + * to modify any hardware and software state they see fit without
> + * taking locks. */
> +static void efx_stop_all(struct efx_nic *efx)
> +{
> +	struct efx_channel *channel;
> +
> +	EFX_ASSERT_RESET_SERIALISED(efx);
> +
> +	/* port_enabled can be read safely under the rtnl lock */
> +	if (!efx->port_enabled)
> +		return;
> +
> +	/* Disable interrupts and wait for ISR to complete */
> +	falcon_disable_interrupts(efx);
> +	if (efx->legacy_irq)
> +		synchronize_irq(efx->legacy_irq);
> +	efx_for_each_channel_with_interrupt(channel, efx)
> +		if (channel->irq)
> +			synchronize_irq(channel->irq);

Here we _would_ usually add the braces ;)

> +	/* Stop all NAPI processing and synchronous rx refills */
> +	efx_for_each_channel(channel, efx)
> +		efx_stop_channel(channel);
> +
> +	/* Stop all asynchronous port reconfigurations. Since all
> +	 * event processing has already been stopped, there is no
> +	 * window to loose phy events */
> +	efx_stop_port(efx);
> +
> +	/* Flush reconfigure_work, refill_workqueue, monitor_work */
> +	efx_flush_all(efx);
> +
> +	/* Isolate the MAC from the TX and RX engines, so that queue
> +	 * flushes will complete in a timely fashion. */
> +	falcon_deconfigure_mac_wrapper(efx);
> +	falcon_drain_tx_fifo(efx);
> +
> +	/* Stop the kernel transmit interface late, so the watchdog
> +	 * timer isn't ticking over the flush */
> +	efx_stop_queue(efx);
> +	if (NET_DEV_REGISTERED(efx)) {
> +		netif_tx_lock_bh(efx->net_dev);
> +		netif_tx_unlock_bh(efx->net_dev);
> +	}
> +}

hm, I expected NET_DEV_REGISTERED to be defined in
include/linux/netdevice.h, but it turns out to be an sfc-specific thing.

a) it's poorly named

b) please only implement in cpp that which cannot be implemented in C. 
   (aka: macros suck)

>
> ...
>
> +static void efx_monitor(struct work_struct *data)
> +{
> +	struct efx_nic *efx = container_of(data, struct efx_nic,
> +					   monitor_work.work);
> +	int rc = 0;
> +
> +	EFX_TRACE(efx, "hardware monitor executing on CPU %d\n",
> +		  raw_smp_processor_id());
> +
> +
> +	/* If the mac_lock is already held then it is likely a port
> +	 * reconfiguration is already in place, which will likely do
> +	 * most of the work of check_hw() anyway. */
> +	if (!mutex_trylock(&efx->mac_lock)) {
> +		queue_delayed_work(efx->workqueue, &efx->monitor_work,
> +				   efx_monitor_interval);
> +		return;
> +	}
> +
> +	if (efx->port_enabled)
> +		rc = falcon_check_xmac(efx);
> +	mutex_unlock(&efx->mac_lock);
> +
> +	if (rc) {
> +		if (monitor_reset) {
> +			EFX_ERR(efx, "hardware monitor detected a fault: "
> +				"triggering reset\n");
> +			efx_schedule_reset(efx, RESET_TYPE_MONITOR);
> +		} else {
> +			EFX_ERR(efx, "hardware monitor detected a fault, "
> +				"skipping reset\n");
> +		}
> +	}
> +
> +	queue_delayed_work(efx->workqueue, &efx->monitor_work,
> +			   efx_monitor_interval);
> +}

I don't see anywhere in this driver where the delayed works get flushed. 
You have a flush_workqueue() but that will only remove delayed work items
whose timer has already expired.

I might have missed it - please check for missed cancel_delayed_work()s.


> +/**************************************************************************
> + *
> + * ioctls
> + *
> + *************************************************************************/
> +
>
> ...
>
> +static struct net_device_stats *efx_net_stats(struct net_device *net_dev)
> +{
> +	struct efx_nic *efx = net_dev->priv;
> +	struct efx_mac_stats *mac_stats = &efx->mac_stats;
> +	struct net_device_stats *stats = &net_dev->stats;
> +
> +	if (!spin_trylock(&efx->stats_lock))
> +		return stats;

I'm wondering if this is correct on uniprocessor, but to determine that I'd
need to know why it's here and I cannot determine that from just the
implementation.

IOW: needs a comment.

> +	if (efx->state == STATE_RUNNING) {
> +		falcon_update_stats_xmac(efx);
> +		falcon_update_nic_stats(efx);
> +	}
> +	spin_unlock(&efx->stats_lock);
> +
> +	stats->rx_packets = mac_stats->rx_packets;
> +	stats->tx_packets = mac_stats->tx_packets;
> +	stats->rx_bytes = mac_stats->rx_bytes;
> +	stats->tx_bytes = mac_stats->tx_bytes;
> +	stats->multicast = mac_stats->rx_multicast;
> +	stats->collisions = mac_stats->tx_collision;
> +	stats->rx_length_errors = (mac_stats->rx_gtjumbo +
> +				   mac_stats->rx_length_error);
> +	stats->rx_over_errors = efx->n_rx_nodesc_drop_cnt;
> +	stats->rx_crc_errors = mac_stats->rx_bad;
> +	stats->rx_frame_errors = mac_stats->rx_align_error;
> +	stats->rx_fifo_errors = mac_stats->rx_overflow;
> +	stats->rx_missed_errors = mac_stats->rx_missed;
> +	stats->tx_window_errors = mac_stats->tx_late_collision;
> +
> +	stats->rx_errors = (stats->rx_length_errors +
> +			    stats->rx_over_errors +
> +			    stats->rx_crc_errors +
> +			    stats->rx_frame_errors +
> +			    stats->rx_fifo_errors +
> +			    stats->rx_missed_errors +
> +			    mac_stats->rx_symbol_error);
> +	stats->tx_errors = (stats->tx_window_errors +
> +			    mac_stats->tx_bad);
> +
> +	return stats;
> +}
> +
> +/* Context: netif_tx_lock held, BHs disabled. */
> +static void efx_watchdog(struct net_device *net_dev)
> +{
> +	struct efx_nic *efx = net_dev->priv;
> +
> +	EFX_ERR(efx, "TX stuck with stop_count=%d port_enabled=%d: %s\n",
> +		atomic_read(&efx->netif_stop_count), efx->port_enabled,
> +		monitor_reset ? "resetting channels" : "skipping reset");
> +
> +	if (monitor_reset)
> +		efx_schedule_reset(efx, RESET_TYPE_MONITOR);
> +}
> +
> +
> +/* Context: process, rtnl_lock() held. */

ASSERT_RTNL() beats a comment every time ;)

> +static int efx_change_mtu(struct net_device *net_dev, int new_mtu)
> +{
> +	struct efx_nic *efx = net_dev->priv;
> +	int rc = 0;
> +
> +	EFX_ASSERT_RESET_SERIALISED(efx);
> +
> +	if (new_mtu > EFX_MAX_MTU)
> +		return -EINVAL;
> +
> +	efx_stop_all(efx);
> +
> +	EFX_LOG(efx, "changing MTU to %d\n", new_mtu);
> +
> +	efx_fini_channels(efx);
> +	net_dev->mtu = new_mtu;
> +	rc = efx_init_channels(efx);
> +	if (rc)
> +		goto fail;
> +
> +	efx_start_all(efx);
> +	return rc;
> +
> + fail:
> +	efx_schedule_reset(efx, RESET_TYPE_DISABLE);
> +	return rc;
> +}
> +
>
> ...
>
> +static int efx_reset(struct efx_nic *efx)
> +{
> +	struct ethtool_cmd ecmd;
> +	enum reset_type method = efx->reset_pending;
> +	int rc;
> +
> +	/* Serialise with kernel interfaces */
> +	rtnl_lock();
> +
> +	/* If we're not RUNNING then don't reset. Leave the reset_pending
> +	 * flag set so that efx_pci_probe_main will be retried */
> +	if (efx->state != STATE_RUNNING) {
> +		EFX_INFO(efx, "scheduled reset quenched. NIC not RUNNING\n");
> +		goto unlock_rtnl;
> +	}
> +
> +	efx->state = STATE_RESETTING;
> +	EFX_INFO(efx, "resetting (%d)\n", method);
> +
> +	/* The net_dev->get_stats handler is quite slow, and will fail
> +	 * if a fetch is pending over reset. Serialise against it. */
> +	spin_lock(&efx->stats_lock);
> +	spin_unlock(&efx->stats_lock);

but but but...  efx_net_stats() can start to run immediately after the
spin_unlock()?

> +	efx_stop_all(efx);
> +	mutex_lock(&efx->mac_lock);
> +
> +	rc = efx_reset_down(efx, &ecmd);
> +	if (rc)
> +		goto fail1;
> +
> +	rc = falcon_reset_hw(efx, method);
> +	if (rc) {
> +		EFX_ERR(efx, "failed to reset hardware\n");
> +		goto fail2;
> +	}
> +
> +	/* Allow resets to be rescheduled. */
> +	efx->reset_pending = RESET_TYPE_NONE;
> +
> +	/* Reinitialise bus-mastering, which may have been turned off before
> +	 * the reset was scheduled. This is still appropriate, even in the
> +	 * RESET_TYPE_DISABLE since this driver generally assumes the hardware
> +	 * can respond to requests. */
> +	pci_set_master(efx->pci_dev);
> +
> +	/* Reinitialise device. This is appropriate in the RESET_TYPE_DISABLE
> +	 * case so the driver can talk to external SRAM */
> +	rc = falcon_init_nic(efx);
> +	if (rc) {
> +		EFX_ERR(efx, "failed to initialise NIC\n");
> +		goto fail3;
> +	}
> +
> +	/* Leave device stopped if necessary */
> +	if (method == RESET_TYPE_DISABLE) {
> +		/* Reinitialise the device anyway so the driver unload sequence
> +		 * can talk to the external SRAM */
> +		(void) falcon_init_nic(efx);
> +		rc = -EIO;
> +		goto fail4;
> +	}
> +
> +	rc = efx_reset_up(efx, &ecmd);
> +	if (rc)
> +		goto fail5;
> +
> +	mutex_unlock(&efx->mac_lock);
> +	EFX_LOG(efx, "reset complete\n");
> +
> +	efx->state = STATE_RUNNING;
> +	efx_start_all(efx);
> +
> + unlock_rtnl:
> +	rtnl_unlock();
> +	return 0;
> +
> + fail5:
> + fail4:
> + fail3:
> + fail2:
> + fail1:
> +	EFX_ERR(efx, "has been disabled\n");
> +	efx->state = STATE_DISABLED;
> +
> +	mutex_unlock(&efx->mac_lock);
> +	rtnl_unlock();
> +	efx_unregister_netdev(efx);
> +	efx_fini_port(efx);
> +	return rc;
> +}
> +
>
> ...
>
> +static void efx_pci_remove(struct pci_dev *pci_dev)
> +{
> +	struct efx_nic *efx;
> +
> +	efx = pci_get_drvdata(pci_dev);
> +	if (!efx)
> +		return;
> +
> +	/* Mark the NIC as fini, then stop the interface */
> +	rtnl_lock();
> +	efx->state = STATE_FINI;
> +	dev_close(efx->net_dev);
> +
> +	/* Allow any queued efx_resets() to complete */
> +	rtnl_unlock();

Is dev_close() supposed to be called under rtnl_lock()?  It _looks_ that
way, but it doesn't say so and it has no ASSERT_RTNL().

If "no" then therre might be deadlocks lurking here.

> +	if (efx->membase == NULL)
> +		goto out;
> +
> +	efx_unregister_netdev(efx);
> +
> +	/* Wait for any scheduled resets to complete. No more will be
> +	 * scheduled from this point because efx_stop_all() has been
> +	 * called, we are no longer registered with driverlink, and
> +	 * the net_device's have been removed. */
> +	flush_workqueue(efx->workqueue);
> +
> +	efx_pci_remove_main(efx);
> +
> +out:
> +	efx_fini_io(efx);
> +	EFX_LOG(efx, "shutdown successful\n");
> +
> +	pci_set_drvdata(pci_dev, NULL);
> +	efx_fini_struct(efx);
> +	free_netdev(efx->net_dev);
> +};
>
> ...
>
> +	rc = efx_probe_all(efx);
> +	if (rc)
> +		goto fail1;
> +
> +	rc = efx_init_napi(efx);
> +	if (rc)
> +		goto fail2;
> +
> +	/* Initialise the board */
> +	rc = efx->board_info.init(efx);
> +	if (rc) {
> +		EFX_ERR(efx, "failed to initialise board\n");
> +		goto fail3;
> +	}
> +
> +	rc = falcon_init_nic(efx);
> +	if (rc) {
> +		EFX_ERR(efx, "failed to initialise NIC\n");
> +		goto fail4;
> +	}
> +
> +	rc = efx_init_port(efx);
> +	if (rc) {
> +		EFX_ERR(efx, "failed to initialise port\n");
> +		goto fail5;
> +	}
> +
> +	rc = efx_init_channels(efx);
> +	if (rc)
> +		goto fail6;
> +
> +	rc = falcon_init_interrupt(efx);
> +	if (rc)
> +		goto fail7;
> +
> +	return 0;
> +
> + fail7:
> +	efx_fini_channels(efx);
> + fail6:
> +	efx_fini_port(efx);
> + fail5:
> + fail4:
> + fail3:
> +	efx_fini_napi(efx);
> + fail2:
> +	efx_remove_all(efx);
> + fail1:
> +	return rc;
> +}
> +
>
> ...
>
> --- /dev/null
> +++ b/drivers/net/sfc/ethtool.c
> @@ -0,0 +1,460 @@
> +/****************************************************************************
> + * Driver for Solarflare Solarstorm network controllers and boards
> + * Copyright 2005-2006 Fen Systems Ltd.
> + * Copyright 2006-2008 Solarflare Communications Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation, incorporated herein by reference.
> + */
> +
> +#include <linux/netdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/rtnetlink.h>
> +#include "net_driver.h"
> +#include "efx.h"
> +#include "ethtool.h"
> +#include "falcon.h"
> +#include "gmii.h"
> +#include "mac.h"
> +
> +static int efx_ethtool_set_tx_csum(struct net_device *net_dev, u32 enable);

Unneeded declaration.

>
> ...
>
> +static int efx_ethtool_phys_id(struct net_device *net_dev, u32 seconds)
> +{
> +	struct efx_nic *efx = net_dev->priv;
> +
> +	efx->board_info.blink(efx, 1);
> +	schedule_timeout_interruptible(seconds * HZ);
> +	efx->board_info.blink(efx, 0);
> +	return 0;
> +}

You probably have a bug here.

If the calling process has signal_pending() then
schedule_timeout_interruptible() will return immediately.  If this function
is only ever called from a kernel thread then you'll be OK.  Otherwise,
you'll need to use _uninterruptible.

>
> ...
>
> +#if BITS_PER_LONG == 64
> +#define FALCON_DMA_MASK EFX_DMA_MASK(0x00003fffffffffffUL)
> +#else
> +#define FALCON_DMA_MASK EFX_DMA_MASK(0x00003fffffffffffULL)
> +#endif

You just reinvented DMA_47BIT_MASK.

>
> ...
>
> +void falcon_push_buffers(struct efx_tx_queue *tx_queue)
> +{
> +
> +	struct efx_tx_buffer *buffer;
> +	efx_qword_t *txd;
> +	unsigned write_ptr;
> +
> +	BUG_ON(tx_queue->write_count == tx_queue->insert_count);
> +
> +	do {
> +		write_ptr = tx_queue->write_count & FALCON_TXD_RING_MASK;
> +		buffer = &tx_queue->buffer[write_ptr];
> +		txd = falcon_tx_desc(tx_queue, write_ptr);
> +		++tx_queue->write_count;
> +
> +		/* Create TX descriptor ring entry */
> +		EFX_POPULATE_QWORD_5(*txd,
> +				     TX_KER_PORT, 0,
> +				     TX_KER_CONT, buffer->continuation,
> +				     TX_KER_BYTE_CNT, buffer->len,
> +				     TX_KER_BUF_REGION, 0,
> +				     TX_KER_BUF_ADR, buffer->dma_addr);
> +	} while (tx_queue->write_count != tx_queue->insert_count);
> +
> +	wmb(); /* Ensure descriptors are written before they are fetched */

Other places used smb_wmb() for this sort of thing (fishily).

> +	falcon_notify_tx_desc(tx_queue);
> +}
> +
>
> ...
>
> +void falcon_notify_rx_desc(struct efx_rx_queue *rx_queue)
> +{
> +	efx_dword_t reg;
> +	unsigned write_ptr;
> +
> +	while (rx_queue->notified_count != rx_queue->added_count) {
> +		falcon_build_rx_desc(rx_queue,
> +				     rx_queue->notified_count &
> +				     FALCON_RXD_RING_MASK);
> +		++rx_queue->notified_count;
> +	}
> +
> +	wmb();

ditto

> +	write_ptr = rx_queue->added_count & FALCON_RXD_RING_MASK;
> +	EFX_POPULATE_DWORD_1(reg, RX_DESC_WPTR_DWORD, write_ptr);
> +	falcon_writel_page(rx_queue->efx, &reg,
> +			   RX_DESC_UPD_REG_KER_DWORD, rx_queue->queue);
> +}
> +
>
> ...
>
> +static inline void falcon_handle_tx_event(struct efx_channel *channel,
> +					  efx_qword_t *event)

Unneeded "inline".

`inline' is almost always either wrong or unneeded.  I'd suggest a general
inline audit.

> +	unsigned int tx_ev_desc_ptr;
> +	unsigned int tx_ev_q_label;
> +	struct efx_tx_queue *tx_queue;
> +	struct efx_nic *efx = channel->efx;
> +
> +	if (likely(EFX_QWORD_FIELD(*event, TX_EV_COMP))) {
> +		/* Transmit completion */
> +		tx_ev_desc_ptr = EFX_QWORD_FIELD(*event, TX_EV_DESC_PTR);
> +		tx_ev_q_label = EFX_QWORD_FIELD(*event, TX_EV_Q_LABEL);
> +		tx_queue = &efx->tx_queue[tx_ev_q_label];
> +		efx_xmit_done(tx_queue, tx_ev_desc_ptr);
> +	} else if (EFX_QWORD_FIELD(*event, TX_EV_WQ_FF_FULL)) {
> +		/* Rewrite the FIFO write pointer */
> +		tx_ev_q_label = EFX_QWORD_FIELD(*event, TX_EV_Q_LABEL);
> +		tx_queue = &efx->tx_queue[tx_ev_q_label];
> +
> +		if (NET_DEV_REGISTERED(efx))
> +			netif_tx_lock(efx->net_dev);
> +		falcon_notify_tx_desc(tx_queue);
> +		if (NET_DEV_REGISTERED(efx))
> +			netif_tx_unlock(efx->net_dev);
> +	} else if (EFX_QWORD_FIELD(*event, TX_EV_PKT_ERR) &&
> +		   EFX_WORKAROUND_10727(efx)) {
> +		efx_schedule_reset(efx, RESET_TYPE_TX_DESC_FETCH);
> +	} else {
> +		EFX_ERR(efx, "channel %d unexpected TX event "
> +			EFX_QWORD_FMT"\n", channel->channel,
> +			EFX_QWORD_VAL(*event));
> +	}
> +}
> +
>
> ...
>
> +int falcon_probe_nic(struct efx_nic *efx)
> +{
> +	struct falcon_nic_data *nic_data;
> +	int rc;
> +
> +	/* Initialise I2C interface state */
> +	efx->i2c.efx = efx;
> +	efx->i2c.op = &falcon_i2c_bit_operations;
> +	efx->i2c.sda = 1;
> +	efx->i2c.scl = 1;
> +
> +	/* Allocate storage for hardware specific data */
> +	nic_data = kzalloc(sizeof(*nic_data), GFP_KERNEL);
> +	efx->nic_data = (void *) nic_data;

Maybe efx_nic.nic_data should have type `struct falcon_nic_data*'?  It
never holds anything else, does it?

> +	/* Determine number of ports etc. */
> +	rc = falcon_probe_nic_variant(efx);
> +	if (rc)
> +		goto fail1;
> +
> +	/* Probe secondary function if expected */
> +	if (FALCON_IS_DUAL_FUNC(efx)) {
> +		struct pci_dev *dev = pci_dev_get(efx->pci_dev);
> +
> +		while ((dev = pci_get_device(EFX_VENDID_SFC, FALCON_A_S_DEVID,
> +					     dev))) {
> +			if (dev->bus == efx->pci_dev->bus &&
> +			    dev->devfn == efx->pci_dev->devfn + 1) {
> +				nic_data->pci_dev2 = dev;
> +				break;
> +			}
> +		}
> +		if (!nic_data->pci_dev2) {
> +			EFX_ERR(efx, "failed to find secondary function\n");
> +			rc = -ENODEV;
> +			goto fail2;
> +		}
> +	}
> +
> +	/* Now we can reset the NIC */
> +	rc = falcon_reset_hw(efx, RESET_TYPE_ALL);
> +	if (rc) {
> +		EFX_ERR(efx, "failed to reset NIC\n");
> +		goto fail3;
> +	}
> +
> +	/* Allocate memory for INT_KER */
> +	rc = falcon_alloc_buffer(efx, &efx->irq_status, sizeof(efx_oword_t));
> +	if (rc)
> +		goto fail4;
> +	BUG_ON(efx->irq_status.dma_addr & 0x0f);

Are we going BUG on an error in the hardware, or on a software error?

> +	EFX_LOG(efx, "INT_KER at %llx (virt %p phys %lx)\n",
> +		(unsigned long long)efx->irq_status.dma_addr,
> +		efx->irq_status.addr, virt_to_phys(efx->irq_status.addr));
> +
> +	/* Read in the non-volatile configuration */
> +	rc = falcon_probe_nvconfig(efx);
> +	if (rc)
> +		goto fail5;
> +
> +	return 0;
> +
> + fail5:
> +	falcon_free_buffer(efx, &efx->irq_status);
> + fail4:
> +	/* fall-thru */
> + fail3:
> +	if (nic_data->pci_dev2) {
> +		pci_dev_put(nic_data->pci_dev2);
> +		nic_data->pci_dev2 = NULL;
> +	}
> + fail2:
> +	/* fall-thru */
> + fail1:
> +	kfree(efx->nic_data);
> +	return rc;
> +}
> +
> +/* This call performs hardware-specific global initialisation, such as
> + * defining the descriptor cache sizes and number of RSS channels.
> + * It does not set up any buffers, descriptor rings or event queues.
> + */
> +int falcon_init_nic(struct efx_nic *efx)
> +{
> +	struct falcon_nic_data *data;
> +	efx_oword_t temp;
> +	unsigned thresh;
> +	int rc;
> +
> +	data = (struct falcon_nic_data *)efx->nic_data;

Unneeded and undesirable cast.

> +	/* Set up the address region register. This is only needed
> +	 * for the B0 FPGA, but since we are just pushing in the
> +	 * reset defaults this may as well be unconditional. */
> +	EFX_POPULATE_OWORD_4(temp, ADR_REGION0, 0,
> +				   ADR_REGION1, (1 << 16),
> +				   ADR_REGION2, (2 << 16),
> +				   ADR_REGION3, (3 << 16));
> +	falcon_write(efx, &temp, ADR_REGION_REG_KER);
> +
> +	/* Use on-chip SRAM */
> +	falcon_read(efx, &temp, NIC_STAT_REG);
> +	EFX_SET_OWORD_FIELD(temp, ONCHIP_SRAM, 1);
> +	falcon_write(efx, &temp, NIC_STAT_REG);
> +
>
> ...
>
> +#define FALCON_REV(efx) ((efx)->pci_dev->revision)

Implement not in cpp that which .... ;)

>
> ...
>
> +struct falcon_nvconfig {
> +	efx_oword_t ee_vpd_cfg_reg;			/* 0x300 */
> +	u8 mac_address[2][8];			/* 0x310 */
> +	efx_oword_t pcie_sd_ctl0123_reg;		/* 0x320 */
> +	efx_oword_t pcie_sd_ctl45_reg;			/* 0x330 */
> +	efx_oword_t pcie_pcs_ctl_stat_reg;		/* 0x340 */
> +	efx_oword_t hw_init_reg;			/* 0x350 */
> +	efx_oword_t nic_stat_reg;			/* 0x360 */
> +	efx_oword_t glb_ctl_reg;			/* 0x370 */
> +	efx_oword_t srm_cfg_reg;			/* 0x380 */
> +	efx_oword_t spare_reg;				/* 0x390 */
> +	__le16 board_magic_num;			/* 0x3A0 */
> +	__le16 board_struct_ver;
> +	__le16 board_checksum;
> +	struct falcon_nvconfig_board_v2 board_v2;
> +} __attribute__ ((packed));

Linux has a little "__packed" helper macro.

> +#endif /* EFX_FALCON_HWDEFS_H */
> diff --git a/drivers/net/sfc/falcon_io.h b/drivers/net/sfc/falcon_io.h
> new file mode 100644
> index 0000000..ea08184
> --- /dev/null
> +++ b/drivers/net/sfc/falcon_io.h
>
> ...
>
> +
> +#define _falcon_writeq(efx, value, reg) \
> +	__raw_writeq((__force u64) (value), (efx)->membase + (reg))
> +#define _falcon_writel(efx, value, reg) \
> +	__raw_writel((__force u32) (value), (efx)->membase + (reg))
> +#define _falcon_readq(efx, reg) \
> +	((__force __le64) __raw_readq((efx)->membase + (reg)))
> +#define _falcon_readl(efx, reg) \
> +	((__force __le32) __raw_readl((efx)->membase + (reg)))

I don't think these needed to be implemented as macros.

I suspect that if they were implemented in C, those mysterious casts
wouldn't be needed?

> +/* Writes to a normal 16-byte Falcon register, locking as appropriate. */
> +static inline void falcon_write(struct efx_nic *efx, efx_oword_t *value,
> +				unsigned int reg)
> +{
> +	unsigned long flags;
> +
> +	EFX_REGDUMP(efx, "writing register %x with " EFX_OWORD_FMT "\n", reg,
> +		    EFX_OWORD_VAL(*value));
> +
> +	spin_lock_irqsave(&efx->biu_lock, flags);
> +#ifdef FALCON_USE_QWORD_IO
> +	_falcon_writeq(efx, value->u64[0], reg + 0);
> +	wmb();
> +	_falcon_writeq(efx, value->u64[1], reg + 8);
> +#else
> +	_falcon_writel(efx, value->u32[0], reg + 0);
> +	_falcon_writel(efx, value->u32[1], reg + 4);
> +	_falcon_writel(efx, value->u32[2], reg + 8);
> +	wmb();
> +	_falcon_writel(efx, value->u32[3], reg + 12);
> +#endif
> +	mmiowb();
> +	spin_unlock_irqrestore(&efx->biu_lock, flags);
> +}

hoo boy.  Did you _really_ want to inline this?  It has many many
callsites.

>
> ...
>
> --- /dev/null
> +++ b/drivers/net/sfc/i2c-direct.h

There is no linkage with the kernel's own i2c layer?  Should there be?

>
> ...
>
> +struct efx_i2c_bit_operations {
> +	void (*setsda) (struct efx_i2c_interface *i2c);
> +	void (*setscl) (struct efx_i2c_interface *i2c);
> +	int (*getsda) (struct efx_i2c_interface *i2c);
> +	int (*getscl) (struct efx_i2c_interface *i2c);
> +	unsigned int udelay;
> +	unsigned int mdelay;

whoa, namespace clash.  Some architectures like to do #define udelay(x)
zot(x).  Fortunately the preprocessor considers `foo()' to be different
from `foo'.  So I guess it'll be OK.

> +};
> +
>
> ...
>
> +static inline int mdio_clause45_phyxgxs_lane_sync(struct efx_nic *efx)
> +{
> +	int i, sync, lane_status;
> +
> +	for (i = 0; i < 2; ++i)
> +		lane_status = mdio_clause45_read(efx, efx->mii.phy_id,
> +						 MDIO_MMD_PHYXS,
> +						 MDIO_PHYXS_LANE_STATE);
> +
> +	sync = (lane_status & (1 << MDIO_PHYXS_LANE_ALIGNED_LBN)) != 0;
> +	if (!sync)
> +		EFX_INFO(efx, "XGXS lane status: %x\n", lane_status);
> +	return sync;
> +}

I think this function will expand to about a terabyte of code.

Please do revisit the inlining decisions - it will end up quite a bit
tighter and, as a result, we think, faster.

>
> ...
>
> +/**************************************************************************
> + *
> + * Build definitions
> + *
> + **************************************************************************/
> +#ifndef EFX_DRIVER_NAME
> +#define EFX_DRIVER_NAME	"sfc"
> +#endif
> +#define EFX_DRIVER_VERSION	"2.2.0136"

I would suggest that you remove this.  It's just not a useful way of
establishing what version of the driver your users are running.  We use the
kernel version information for this.

For sure when other people patch your code they will forget to update this.

> +#ifdef EFX_ENABLE_DEBUG

I can't find anywhere which defines this.

It should be a Kconfig option?

> +#define EFX_BUG_ON_PARANOID(x) BUG_ON(x)
> +#define EFX_WARN_ON_PARANOID(x) WARN_ON(x)
> +#else
> +#define EFX_BUG_ON_PARANOID(x) do {} while (0)
> +#define EFX_WARN_ON_PARANOID(x) do {} while (0)
> +#endif
> +
> +#define NET_DEV_REGISTERED(efx)					\
> +	((efx)->net_dev->reg_state == NETREG_REGISTERED)
> +
> +/* Include net device name in log messages if it has been registered.
> + * Use efx->name not efx->net_dev->name so that races with (un)registration
> + * are harmless.
> + */
> +#define NET_DEV_NAME(efx) (NET_DEV_REGISTERED(efx) ? (efx)->name : "")

could be written in C ;)

>
> ...
>
> +/* Kernel headers may redefine inline anyway */
> +#ifndef inline
> +#define inline inline __attribute__ ((always_inline))
> +#endif

whoa.  What are we doing here and why?

>
> ...
>
> +/**
> + * struct efx_rx_queue - An Efx RX queue
> + * @efx: The associated Efx NIC
> + * @queue: DMA queue number
> + * @used: Queue is used by net driver
> + * @channel: The associated channel
> + * @buffer: The software buffer ring
> + * @rxd: The hardware descriptor ring
> + * @added_count: Number of buffers added to the receive queue.
> + * @notified_count: Number of buffers given to NIC (<= @added_count).
> + * @removed_count: Number of buffers removed from the receive queue.
> + * @add_lock: Receive queue descriptor add spin lock.
> + *	This lock must be held in order to add buffers to the RX
> + *	descriptor ring (rxd and buffer) and to update added_count (but
> + *	not removed_count).
> + * @max_fill: RX descriptor maximum fill level (<= ring size)
> + * @fast_fill_trigger: RX descriptor fill level that will trigger a fast fill
> + *	(<= @max_fill)
> + * @fast_fill_limit: The level to which a fast fill will fill
> + *	(@fast_fill_trigger <= @fast_fill_limit <= @max_fill)
> + * @min_fill: RX descriptor minimum non-zero fill level.
> + *	This records the minimum fill level observed when a ring
> + *	refill was triggered.
> + * @min_overfill: RX descriptor minimum overflow fill level.
> + *	This records the minimum fill level at which RX queue
> + *	overflow was observed.  It should never be set.
> + * @alloc_page_count: RX allocation strategy counter.
> + * @alloc_skb_count: RX allocation strategy counter.
> + * @work: Descriptor push work thread
> + * @buf_page: Page for next RX buffer.
> + *	We can use a single page for multiple RX buffers. This tracks
> + *	the remaining space in the allocation.
> + * @buf_dma_addr: Page's DMA address.
> + * @buf_data: Page's host address.
> + */

nice effort, but gee I dislike this.  It's better to thumb your nose at
kerneldoc and document the fields down at their definition sites, I reckon.
People will forget to update the description otherwise.

> +struct efx_rx_queue {
> +	struct efx_nic *efx;
> +	int queue;
> +	int used;
> +	struct efx_channel *channel;
> +	struct efx_rx_buffer *buffer;
> +	struct efx_special_buffer rxd;
> +
> +	int added_count;
> +	int notified_count;
> +	int removed_count;
> +	spinlock_t add_lock;
> +	unsigned int max_fill;
> +	unsigned int fast_fill_trigger;
> +	unsigned int fast_fill_limit;
> +	unsigned int min_fill;
> +	unsigned int min_overfill;
> +	unsigned int alloc_page_count;
> +	unsigned int alloc_skb_count;
> +	struct delayed_work work;
> +	unsigned int slow_fill_count;
> +
> +	struct page *buf_page;
> +	dma_addr_t buf_dma_addr;
> +	char *buf_data;
> +};
> +
>
> ...
>
> +/*
> + * Alignment of page-allocated RX buffers
> + *
> + * Controls the number of bytes inserted at the start of an RX buffer.
> + * This is the equivalent of NET_IP_ALIGN [which controls the alignment
> + * of the skb->head for hardware DMA].
> + */
> +#if defined(__i386__) || defined(__x86_64__)
> +#define EFX_PAGE_IP_ALIGN 0
> +#else
> +#define EFX_PAGE_IP_ALIGN NET_IP_ALIGN
> +#endif

Now what's going on here?  We're taking advantage of x86's unaligned-access
capabilities, I assume.

Well...  if that is a legitimate thing to do in this driver then it is
legitimate to do in _other_ drivers.  Hence the logic should not be
open-coded in a private place such as this.

Added bonus: there might be other CPU types which are good at unaligned
access, and they will also set CONFIG_ARCH_IS_GOOD_AT_UNALIGNED_ACCESSES
and your driver will run better on those archtectures.

No?

>
> ...
>
> +/* Set bit in a little-endian bitfield */
> +static inline void set_bit_le(int nr, unsigned char *addr)
> +{
> +	addr[nr / 8] |= (1 << (nr % 8));
> +}
> +
> +/* Clear bit in a little-endian bitfield */
> +static inline void clear_bit_le(int nr, unsigned char *addr)
> +{
> +	addr[nr / 8] &= ~(1 << (nr % 8));
> +}

a) please stop implementing non-driver-specific infrastructure within a
   driver!  You've identified shortcomings in Linux - let's fix Linux.

b) this duplicates stuff from include/asm-generic/bitops/le.h

>
> ...
>
> +#define RX_DATA_OFFSET(_data)				\
> +	(((unsigned long) (_data)) & (PAGE_SIZE-1))
> +#define RX_BUF_OFFSET(_rx_buf)				\
> +	RX_DATA_OFFSET((_rx_buf)->data)
> +
> +#define RX_PAGE_SIZE(_efx)				\
> +	(PAGE_SIZE * (1u << (_efx)->rx_buffer_order))

implement not in cpp...

>
> ...
>
> +static int efx_get_frag_hdr(struct skb_frag_struct *frag, void **mac_hdr,
> +			    void **ip_hdr, void **tcpudp_hdr, u64 *hdr_flags,
> +			    void *priv)
> +{
> +	struct efx_channel *channel = (struct efx_channel *)priv;

unneeded and undesirable cast of void*

> +	struct ethhdr *eh;
> +	struct iphdr *iph;
> +
> +	/* We support EtherII and VLAN encapsulated IPv4 */
> +	eh = (struct ethhdr *)(page_address(frag->page) + frag->page_offset);

page_address() returns void*...

> +	*mac_hdr = eh;
> +
> +	if (eh->h_proto == htons(ETH_P_IP)) {
> +		iph = (struct iphdr *)(eh + 1);
> +	} else {
> +		struct vlan_ethhdr *veh = (struct vlan_ethhdr *)eh;
> +		if (veh->h_vlan_encapsulated_proto != htons(ETH_P_IP))
> +			goto fail;
> +
> +		iph = (struct iphdr *)(veh + 1);
> +	}
> +	*ip_hdr = iph;
> +
> +	/* We can only do LRO over TCP */
> +	if (iph->protocol != IPPROTO_TCP)
> +		goto fail;
> +
> +	*hdr_flags = LRO_IPV4 | LRO_TCP;
> +	*tcpudp_hdr = (struct tcphdr *)((u8 *) iph + iph->ihl * 4);
> +
> +	channel->rx_alloc_level += RX_ALLOC_FACTOR_LRO;
> +	return 0;
> + fail:
> +	channel->rx_alloc_level += RX_ALLOC_FACTOR_SKB;
> +	return -1;
> +}
> +
>
> ...
>
> +static inline int efx_init_rx_buffer_page(struct efx_rx_queue *rx_queue,
> +					  struct efx_rx_buffer *rx_buf)
> +{
> +	struct efx_nic *efx = rx_queue->efx;
> +	int bytes, space, offset;
> +
> +	bytes = efx->rx_buffer_len - EFX_PAGE_IP_ALIGN;
> +
> +	/* If there is space left in the previously allocated page,
> +	 * then use it. Otherwise allocate a new one */
> +	rx_buf->page = rx_queue->buf_page;
> +	if (rx_buf->page == NULL) {
> +		dma_addr_t dma_addr;
> +
> +		rx_buf->page = alloc_pages(__GFP_COLD | __GFP_COMP | GFP_ATOMIC,
> +					   efx->rx_buffer_order);

I don't think we should be using the open-coded __GFP_COMP here.  That's
more an mm-internal thing.

atomic allocations of higher-order pages are unreliable.  Sometimes very
unreliable.  I guess there's not much we can do about that.

> +		if (unlikely(rx_buf->page == NULL))
> +			return -ENOMEM;
> +
> +		dma_addr = pci_map_page(efx->pci_dev, rx_buf->page,
> +					0, RX_PAGE_SIZE(efx),
> +					PCI_DMA_FROMDEVICE);
> +
> +		if (unlikely(pci_dma_mapping_error(dma_addr))) {
> +			__free_pages(rx_buf->page, efx->rx_buffer_order);
> +			rx_buf->page = NULL;
> +			return -EIO;
> +		}
> +
> +		rx_queue->buf_page = rx_buf->page;
> +		rx_queue->buf_dma_addr = dma_addr;
> +		rx_queue->buf_data = ((char *) page_address(rx_buf->page) +
> +				      EFX_PAGE_IP_ALIGN);
> +	}
> +
> +	offset = RX_DATA_OFFSET(rx_queue->buf_data);
> +	rx_buf->len = bytes;
> +	rx_buf->dma_addr = rx_queue->buf_dma_addr + offset;
> +	rx_buf->data = rx_queue->buf_data;
> +
> +	/* Try to pack multiple buffers per page */
> +	if (efx->rx_buffer_order == 0) {
> +		/* The next buffer starts on the next 512 byte boundary */
> +		rx_queue->buf_data += ((bytes + 0x1ff) & ~0x1ff);
> +		offset += ((bytes + 0x1ff) & ~0x1ff);
> +
> +		space = RX_PAGE_SIZE(efx) - offset;
> +		if (space >= bytes) {
> +			/* Refs dropped on kernel releasing each skb */
> +			get_page(rx_queue->buf_page);
> +			goto out;
> +		}
> +	}
> +
> +	/* This is the final RX buffer for this page, so mark it for
> +	 * unmapping */
> +	rx_queue->buf_page = NULL;
> +	rx_buf->unmap_addr = rx_queue->buf_dma_addr;
> +
> + out:
> +	return 0;
> +}
> +
>
> ...
>
> +static int tenxpress_phy_init(struct efx_nic *efx)
> +{
> +	struct tenxpress_phy_data *phy_data;
> +	int rc = 0;
> +
> +	phy_data = kzalloc(sizeof(*phy_data), GFP_KERNEL);

shouldn't we do something if this fails?

> +	efx->phy_data = phy_data;
> +
> +	tenxpress_set_state(efx, TENXPRESS_STATUS_NORMAL);
> +
> +	rc = mdio_clause45_wait_reset_mmds(efx,
> +					   TENXPRESS_REQUIRED_DEVS);
> +	if (rc < 0)
> +		goto fail;
> +
> +	rc = mdio_clause45_check_mmds(efx, TENXPRESS_REQUIRED_DEVS, 0);
> +	if (rc < 0)
> +		goto fail;
> +
> +	rc = tenxpress_init(efx);
> +	if (rc < 0)
> +		goto fail;
> +
> +	schedule_timeout_uninterruptible(HZ / 5); /* 200ms */
> +
> +	/* Let XGXS and SerDes out of reset and resets 10XPress */
> +	falcon_reset_xaui(efx);
> +
> +	return 0;
> +
> + fail:
> +	kfree(efx->phy_data);
> +	efx->phy_data = NULL;
> +	return rc;
> +}
> +
>
> ...
>


Well, it's a nice-looking driver, albeit somewhat on the small side.

Apart from drivers/net/sfc/bitfield.h and other such shouldnt-be-private
things, which are little tragedies.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-01 19:08 ` New driver "sfc" for Solarstorm SFC4000 controller Andrew Morton
@ 2008-05-02 16:05   ` Ben Hutchings
  2008-05-02 18:09     ` Andrew Morton
  2008-05-02 19:10   ` Michael Brown
  1 sibling, 1 reply; 11+ messages in thread
From: Ben Hutchings @ 2008-05-02 16:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: netdev, linux-net-drivers

Andrew Morton wrote:
> On Wed, 30 Apr 2008 19:25:38 GMT
> Linux Kernel Mailing List <linux-kernel@vger.kernel.org> wrote:
<snip>
> > --- /dev/null
> > +++ b/drivers/net/sfc/bitfield.h
> >
> 
> weep.
> 
> This entire (large, complex, undocumeted) header file contains quite
> generally applicable code which could be used from many many other parts of
> the kernel.  Yet here it all is, stuffed way down in drivers/net/sfc/.
>
> What happens if all drivers do this?  Well, they do.  We get a mess.

You're right, this does get reinvented quite a bit.  Please don't wait
for driver writers to implement common code.  We're generally less
experienced in kernel development, and are disinclined to make driver
changes become dependent on changes to common code that are more likely
to provoke objections.
 
> btw, the implementation is riotously buggy if anyone passes any of these
> macros an expression which has side-effects.
 
This is sadly true.

<snip>
> > +static void blink_led_timer(unsigned long context)
> > +{
> > +	struct efx_nic *efx = (struct efx_nic *)context;
> > +	struct efx_blinker *bl = &efx->board_info.blinker;
> > +	efx->board_info.set_fault_led(efx, bl->state);
> > +	bl->state = !bl->state;
> > +	if (bl->resubmit) {
> > +		bl->timer.expires = jiffies + BLINK_INTERVAL;
> > +		add_timer(&bl->timer);
> 
> mod_timer()
 
I'll have a look at that.

<snip>
> > +static inline int efx_process_channel(struct efx_channel *channel, int rx_quota)
<snip>
> This has two callsites and is far far too large to be inlined.

Right.  We'll do an internal review of all the inlines.

> > +/* Mark channel as finished processing
> > + *
> > + * Note that since we will not receive further interrupts for this
> > + * channel before we finish processing and call the eventq_read_ack()
> > + * method, there is no need to use the interrupt hold-off timers.
> > + */
> > +static inline void efx_channel_processed(struct efx_channel *channel)
> > +{
> > +	/* Write to EVQ_RPTR_REG.  If a new event arrived in a race
> > +	 * with finishing processing, a new interrupt will be raised.
> > +	 */
> > +	channel->work_pending = 0;
> > +	smp_wmb(); /* Ensure channel updated before any new interrupt. */
> 
> Is smp_wmb() the right thing here?  I think you're trying to ensure that
> the CPU has issued the store before we write the interrupt ack to the
> hardware, yes?
 
The comment about EVQ_RPTR_REG is really in the wrong place.  That's done
by falcon_eventq_read_ack(), not by the assignment in this function.
The memory barrier is there to synchronise with the interrupt handler,
not the hardware, so we believe it is correct.

<snip>
> > +	/* Calculate page-order */
> > +	for (order = 0; ((1u << order) * PAGE_SIZE) < len; ++order)
> > +		;
> 
> get_order()

Wonderful!

> > +	channel->work_pending = 0;
> > +	channel->enabled = 1;
> > +	smp_wmb(); /* ensure channel updated before first interrupt */
> 
> etc.
 
Again, this is synchronising with the interrupt handler not the hardware.
All hardware access is (or should be) in the falcon* source files.

<snip>
> > +	/* Ensure that any worker threads have exited or will be no-ops */
> > +	efx_for_each_channel_rx_queue(rx_queue, channel) {
> > +		spin_lock_bh(&rx_queue->add_lock);
> > +		spin_unlock_bh(&rx_queue->add_lock);
> 
> whee.  Does this work correctly on CONFIG_SMP=n builds?
 
This is supposed to synchronise with __efx_fast_push_rx_descriptors()
which will never be called after channel->enabled is false.  It might
not be correct in a non-SMP build with preemption enabled.  I must
admit we rarely test non-SMP builds as it's extremely unlikely that
anyone would use a 10G NIC in a single-processor system.

> > +	mutex_lock(&efx->mac_lock);
> > +	efx->port_enabled = 0;
> > +	mutex_unlock(&efx->mac_lock);
> 
> A lock-around-a-single-atomic-write always get me going.  It's _sometimes_
> legitimate: if the other code paths which access the storage are reading it
> multiple times within the same locked region, and expect all acesses to
> return the same vaule.

Actually a compiler may generate repeated loads even if there's only one
read in the code.  I don't know whether gcc ever does this.

> But often that isn't the case, and there's a buglet, or unneeded locking.
 
The intent is that no MAC operations will occur after this.  Obviously
just changing the enabled flag doesn't achieve that; the locking is
necessary to synchronise with MAC operations that are already running.

<snip>
> > +	int rc, i;
> > +
> > +	if (efx->interrupt_mode == EFX_INT_MODE_MSIX) {
> > +		BUG_ON(!pci_find_capability(efx->pci_dev, PCI_CAP_ID_MSIX));
> 
> That's a bit rude.  BUG_ON is (mostly) for handling software errors.  Here
> we nuke the box if the hardware is discovered to be bad.  It would be
> better to blurt a message and then fail the initialisation.
 
This is meant to check that the initialisation of efx->interrupt_mode
was sane, not that the hardware is OK.

<snip>
> > +	/* Ensure that all RX slow refills are complete. */
> > +	efx_for_each_rx_queue(rx_queue, efx) {
> > +		cancel_delayed_work_sync(&rx_queue->work);
> > +	}
> 
> Normally we'd omit the braces here.  Once upon a time checkpatch.pl would
> warn about this but it got broken.

Sorry, we're taking a while to adjust to strict kernel style.

> Whee, checkpatch takes over half a minute to check this driver.
> 
> Oh dear, it found
> 
> #5617: FILE: drivers/net/sfc/falcon.c:1877:
> +               if (*(volatile u32 *)dma_done == FALCON_STATS_DONE)
> 
> which was naughty of you.  Perhaps this was already discussed in review
> with the people who actually know what they're talking about.

There wasn't any specific discussion of this.  Is it wrong?  We want to
prevent the compiler from caching *dma_done, which is itself written by DMA.

<snip>
> I don't see anywhere in this driver where the delayed works get flushed. 
> You have a flush_workqueue() but that will only remove delayed work items
> whose timer has already expired.
> 
> I might have missed it - please check for missed cancel_delayed_work()s.
 
We use cancel_delayed_work_sync() in efx_flush_all().

<snip>
> > +static struct net_device_stats *efx_net_stats(struct net_device *net_dev)
> > +{
> > +	struct efx_nic *efx = net_dev->priv;
> > +	struct efx_mac_stats *mac_stats = &efx->mac_stats;
> > +	struct net_device_stats *stats = &net_dev->stats;
> > +
> > +	if (!spin_trylock(&efx->stats_lock))
> > +		return stats;
> 
> I'm wondering if this is correct on uniprocessor, but to determine that I'd
> need to know why it's here and I cannot determine that from just the
> implementation.
>
> IOW: needs a comment.

The function comment is not quite correct - this may be called under either
dev_base_lock (if called via /proc/net/dev) or RTNL (if called via ethtool), so
we need our own lock to serialise stats requests.  At the same time, we don't
actually want to spin waiting for new stats if an update is already in progress
- stale stats are generally acceptable as long as they will get updated soon.
I'm not sure whether this is correct on uniprocessor, so we'll review this
internally.
 
<snip>
> > +	/* The net_dev->get_stats handler is quite slow, and will fail
> > +	 * if a fetch is pending over reset. Serialise against it. */
> > +	spin_lock(&efx->stats_lock);
> > +	spin_unlock(&efx->stats_lock);
> 
> but but but...  efx_net_stats() can start to run immediately after the
> spin_unlock()?

But it will then find efx->state == STATE_RESETTING, and will return the old
stats rather than trying to update them.

> > +static void efx_pci_remove(struct pci_dev *pci_dev)
> > +{
> > +	struct efx_nic *efx;
> > +
> > +	efx = pci_get_drvdata(pci_dev);
> > +	if (!efx)
> > +		return;
> > +
> > +	/* Mark the NIC as fini, then stop the interface */
> > +	rtnl_lock();
> > +	efx->state = STATE_FINI;
> > +	dev_close(efx->net_dev);
> > +
> > +	/* Allow any queued efx_resets() to complete */
> > +	rtnl_unlock();
> 
> Is dev_close() supposed to be called under rtnl_lock()?  It _looks_ that
> way, but it doesn't say so and it has no ASSERT_RTNL().
 
Yes, see dev_change_flags().

<snip>
> > +static int efx_ethtool_phys_id(struct net_device *net_dev, u32 seconds)
> > +{
> > +	struct efx_nic *efx = net_dev->priv;
> > +
> > +	efx->board_info.blink(efx, 1);
> > +	schedule_timeout_interruptible(seconds * HZ);
> > +	efx->board_info.blink(efx, 0);
> > +	return 0;
> > +}
> 
> You probably have a bug here.
> 
> If the calling process has signal_pending() then
> schedule_timeout_interruptible() will return immediately.  If this function
> is only ever called from a kernel thread then you'll be OK.  Otherwise,
> you'll need to use _uninterruptible.

This is intentional.  The user runs ethtool -p <device> to find the port
that <device> corresponds to, then can hit control-C when they've spotted
it.

> > ...
> >
> > +#if BITS_PER_LONG == 64
> > +#define FALCON_DMA_MASK EFX_DMA_MASK(0x00003fffffffffffUL)
> > +#else
> > +#define FALCON_DMA_MASK EFX_DMA_MASK(0x00003fffffffffffULL)
> > +#endif
> 
> You just reinvented DMA_47BIT_MASK.
 
Or even 46?  I think this predates that; obviously it should be changed.

<snip>
> > +	/* Allocate storage for hardware specific data */
> > +	nic_data = kzalloc(sizeof(*nic_data), GFP_KERNEL);
> > +	efx->nic_data = (void *) nic_data;
> 
> Maybe efx_nic.nic_data should have type `struct falcon_nic_data*'?  It
> never holds anything else, does it?

Not at present.

<snip>
> > +	/* Allocate memory for INT_KER */
> > +	rc = falcon_alloc_buffer(efx, &efx->irq_status, sizeof(efx_oword_t));
> > +	if (rc)
> > +		goto fail4;
> > +	BUG_ON(efx->irq_status.dma_addr & 0x0f);
> 
> Are we going BUG on an error in the hardware, or on a software error?
 
Software error.

<snip>
> > +#define _falcon_writeq(efx, value, reg) \
> > +	__raw_writeq((__force u64) (value), (efx)->membase + (reg))
> > +#define _falcon_writel(efx, value, reg) \
> > +	__raw_writel((__force u32) (value), (efx)->membase + (reg))
> > +#define _falcon_readq(efx, reg) \
> > +	((__force __le64) __raw_readq((efx)->membase + (reg)))
> > +#define _falcon_readl(efx, reg) \
> > +	((__force __le32) __raw_readl((efx)->membase + (reg)))
> 
> I don't think these needed to be implemented as macros.
> 
> I suspect that if they were implemented in C, those mysterious casts
> wouldn't be needed?

The casts are there to keep sparse happy - we're writing/reading little-
endian values, but the raw MMIO functions don't specify endian-ness.

> > +/* Writes to a normal 16-byte Falcon register, locking as appropriate. */
> > +static inline void falcon_write(struct efx_nic *efx, efx_oword_t *value,
> > +				unsigned int reg)
> > +{
> > +	unsigned long flags;
> > +
> > +	EFX_REGDUMP(efx, "writing register %x with " EFX_OWORD_FMT "\n", reg,
> > +		    EFX_OWORD_VAL(*value));
> > +
> > +	spin_lock_irqsave(&efx->biu_lock, flags);
> > +#ifdef FALCON_USE_QWORD_IO
> > +	_falcon_writeq(efx, value->u64[0], reg + 0);
> > +	wmb();
> > +	_falcon_writeq(efx, value->u64[1], reg + 8);
> > +#else
> > +	_falcon_writel(efx, value->u32[0], reg + 0);
> > +	_falcon_writel(efx, value->u32[1], reg + 4);
> > +	_falcon_writel(efx, value->u32[2], reg + 8);
> > +	wmb();
> > +	_falcon_writel(efx, value->u32[3], reg + 12);
> > +#endif
> > +	mmiowb();
> > +	spin_unlock_irqrestore(&efx->biu_lock, flags);
> > +}
> 
> hoo boy.  Did you _really_ want to inline this?  It has many many
> callsites.
 
Maybe not, as I don't think this is speed-critical (some registers are
special-cased as allowing 32-bit writes).  We'll check the impact on code
size and speed.

> >
> > ...
> >
> > --- /dev/null
> > +++ b/drivers/net/sfc/i2c-direct.h
> 
> There is no linkage with the kernel's own i2c layer?  Should there be?

I started trying to do that, but it looked like it would result in a net
addition of code.

<snip>
> > +#define EFX_DRIVER_VERSION	"2.2.0136"
> 
> I would suggest that you remove this.  It's just not a useful way of
> establishing what version of the driver your users are running.  We use the
> kernel version information for this.
 
This is common practice for net drivers.  Maybe we are in our own little
world on netdev?

> For sure when other people patch your code they will forget to update this.
> 
> > +#ifdef EFX_ENABLE_DEBUG
> 
> I can't find anywhere which defines this.
> 
> It should be a Kconfig option?
 
Maybe we should just remove the code that's conditional on this.

> >
> > ...
> >
> > +/* Kernel headers may redefine inline anyway */
> > +#ifndef inline
> > +#define inline inline __attribute__ ((always_inline))
> > +#endif
> 
> whoa.  What are we doing here and why?
 
I think this is a remnant of kernel backward-compatibility code, which we
have generally excluded from our submission using unifdef.

<snip>
> > +/*
> > + * Alignment of page-allocated RX buffers
> > + *
> > + * Controls the number of bytes inserted at the start of an RX buffer.
> > + * This is the equivalent of NET_IP_ALIGN [which controls the alignment
> > + * of the skb->head for hardware DMA].
> > + */
> > +#if defined(__i386__) || defined(__x86_64__)
> > +#define EFX_PAGE_IP_ALIGN 0
> > +#else
> > +#define EFX_PAGE_IP_ALIGN NET_IP_ALIGN
> > +#endif
> 
> Now what's going on here?  We're taking advantage of x86's unaligned-access
> capabilities, I assume.
 
Correct.

> Well...  if that is a legitimate thing to do in this driver then it is
> legitimate to do in _other_ drivers.  Hence the logic should not be
> open-coded in a private place such as this.
> 
> Added bonus: there might be other CPU types which are good at unaligned
> access, and they will also set CONFIG_ARCH_IS_GOOD_AT_UNALIGNED_ACCESSES
> and your driver will run better on those archtectures.
>
> No?

I seem to remember a proposal for such a config macro.  We'll be happy to
use that once it's there.
 
> >
> > ...
> >
> > +/* Set bit in a little-endian bitfield */
> > +static inline void set_bit_le(int nr, unsigned char *addr)
> > +{
> > +	addr[nr / 8] |= (1 << (nr % 8));
> > +}
> > +
> > +/* Clear bit in a little-endian bitfield */
> > +static inline void clear_bit_le(int nr, unsigned char *addr)
> > +{
> > +	addr[nr / 8] &= ~(1 << (nr % 8));
> > +}
> 
> a) please stop implementing non-driver-specific infrastructure within a
>    driver!  You've identified shortcomings in Linux - let's fix Linux.
> 
> b) this duplicates stuff from include/asm-generic/bitops/le.h
 
I'm not clear whether you think <asm-generic/bitops/le.h> is sufficient.
It looks like it is.

<snip>
> > +	struct ethhdr *eh;
> > +	struct iphdr *iph;
> > +
> > +	/* We support EtherII and VLAN encapsulated IPv4 */
> > +	eh = (struct ethhdr *)(page_address(frag->page) + frag->page_offset);
> 
> page_address() returns void*...

Yes, and arithmetic on void * is commonly used in the kernel.  Just for fun,
try make KCFLAGS=-Wpointer-arith.

> > +static inline int efx_init_rx_buffer_page(struct efx_rx_queue *rx_queue,
> > +					  struct efx_rx_buffer *rx_buf)
> > +{
> > +	struct efx_nic *efx = rx_queue->efx;
> > +	int bytes, space, offset;
> > +
> > +	bytes = efx->rx_buffer_len - EFX_PAGE_IP_ALIGN;
> > +
> > +	/* If there is space left in the previously allocated page,
> > +	 * then use it. Otherwise allocate a new one */
> > +	rx_buf->page = rx_queue->buf_page;
> > +	if (rx_buf->page == NULL) {
> > +		dma_addr_t dma_addr;
> > +
> > +		rx_buf->page = alloc_pages(__GFP_COLD | __GFP_COMP | GFP_ATOMIC,
> > +					   efx->rx_buffer_order);
> 
> I don't think we should be using the open-coded __GFP_COMP here.  That's
> more an mm-internal thing.

What's the alternative?

> atomic allocations of higher-order pages are unreliable.  Sometimes very
> unreliable.  I guess there's not much we can do about that.
 
I know.  We will switch to allocating single pages some time soon for
Falcon rev B0 and later NICs.  Unfortunately Falcon rev A1 didn't include
scattering for physically-addressed buffers, so we would still need to
allocate compound pages for it when using jumbo frames.

<snip>
> Well, it's a nice-looking driver, albeit somewhat on the small side.
 
:-)

> Apart from drivers/net/sfc/bitfield.h and other such shouldnt-be-private
> things, which are little tragedies.

Thanks for taking the time to review this code.  We'll get to work on
clearing up the issues you've raised.  Also, I acknowledge the several
points I didn't respond to individually.
 
Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-02 16:05   ` Ben Hutchings
@ 2008-05-02 18:09     ` Andrew Morton
  2008-05-06 12:50       ` Ben Hutchings
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2008-05-02 18:09 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, linux-net-drivers

On Fri, 2 May 2008 17:05:35 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote:

> > Oh dear, it found
> > 
> > #5617: FILE: drivers/net/sfc/falcon.c:1877:
> > +               if (*(volatile u32 *)dma_done == FALCON_STATS_DONE)
> > 
> > which was naughty of you.  Perhaps this was already discussed in review
> > with the people who actually know what they're talking about.
> 
> There wasn't any specific discussion of this.  Is it wrong?  We want to
> prevent the compiler from caching *dma_done, which is itself written by DMA.

Documentation/volatile-considered-harmful.txt has some dicussion.

> ...
> > > +#define EFX_DRIVER_VERSION	"2.2.0136"
> > 
> > I would suggest that you remove this.  It's just not a useful way of
> > establishing what version of the driver your users are running.  We use the
> > kernel version information for this.
>  
> This is common practice for net drivers.  Maybe we are in our own little
> world on netdev?

That would make it easy for a researcher to work out how often we change a
driver versus how often we increment its private version number.

> > > +static inline int efx_init_rx_buffer_page(struct efx_rx_queue *rx_queue,
> > > +					  struct efx_rx_buffer *rx_buf)
> > > +{
> > > +	struct efx_nic *efx = rx_queue->efx;
> > > +	int bytes, space, offset;
> > > +
> > > +	bytes = efx->rx_buffer_len - EFX_PAGE_IP_ALIGN;
> > > +
> > > +	/* If there is space left in the previously allocated page,
> > > +	 * then use it. Otherwise allocate a new one */
> > > +	rx_buf->page = rx_queue->buf_page;
> > > +	if (rx_buf->page == NULL) {
> > > +		dma_addr_t dma_addr;
> > > +
> > > +		rx_buf->page = alloc_pages(__GFP_COLD | __GFP_COMP | GFP_ATOMIC,
> > > +					   efx->rx_buffer_order);
> > 
> > I don't think we should be using the open-coded __GFP_COMP here.  That's
> > more an mm-internal thing.
> 
> What's the alternative?

Just remove the __GFP_COMP, I expect.  __GFP_COMP will ask the page
allocator to add extra book-keeping info to the pageframe (via
prep_compound_page()).  I doubt if the driver uses that information.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-01 19:08 ` New driver "sfc" for Solarstorm SFC4000 controller Andrew Morton
  2008-05-02 16:05   ` Ben Hutchings
@ 2008-05-02 19:10   ` Michael Brown
  2008-05-03 20:01     ` Ben Hutchings
  1 sibling, 1 reply; 11+ messages in thread
From: Michael Brown @ 2008-05-02 19:10 UTC (permalink / raw)
  To: netdev

On Thu, 1 May 2008, Andrew Morton wrote:
> >
> > ...
> >
> > --- /dev/null
> > +++ b/drivers/net/sfc/i2c-direct.h
> 
> There is no linkage with the kernel's own i2c layer?  Should there be?

Last time I checked (i.e. when I originally wrote this bit of the code), 
the kernel's own i2c layer didn't provide any clean way for kernel code 
(rather than user code) to access i2c devices.

As originally written, there was also a link to the kernel's i2c layer so 
that the NIC's onboard i2c bus could be exposed to e.g. lm_sensors for 
temperature monitoring.  I believe that this part of the driver was 
expunged since it made the patch "too large", but I may be wrong.

Michael

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-02 19:10   ` Michael Brown
@ 2008-05-03 20:01     ` Ben Hutchings
  0 siblings, 0 replies; 11+ messages in thread
From: Ben Hutchings @ 2008-05-03 20:01 UTC (permalink / raw)
  To: Michael Brown; +Cc: netdev

Michael Brown wrote:
> On Thu, 1 May 2008, Andrew Morton wrote:
> > >
> > > ...
> > >
> > > --- /dev/null
> > > +++ b/drivers/net/sfc/i2c-direct.h
> > 
> > There is no linkage with the kernel's own i2c layer?  Should there be?
> 
> Last time I checked (i.e. when I originally wrote this bit of the code), 
> the kernel's own i2c layer didn't provide any clean way for kernel code 
> (rather than user code) to access i2c devices.

You may be thinking of the lm87 sensor driver, which exposes its
configuration through sysfs (iirc) and not through specific kernel
functions.  There was an I2C module for EF1 boards that worked with lm87
and the I2C framework, but it was removed along with all EF1 support.
Perhaps I should look at adapting that to the Falcon boards.  We would
still want to do at least the initial programming of the sensors from the
sfc driver though.

> As originally written, there was also a link to the kernel's i2c layer so 
> that the NIC's onboard i2c bus could be exposed to e.g. lm_sensors for 
> temperature monitoring.  I believe that this part of the driver was 
> expunged since it made the patch "too large", but I may be wrong.

The temperature and voltage monitoring was not included.  In the submitted
driver, the I2C code is needed for power control and setting the over-
temperature cut-out value on SFE4001 boards.  These use a MAX6647, not an
LM87.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-02 18:09     ` Andrew Morton
@ 2008-05-06 12:50       ` Ben Hutchings
  2008-05-06 14:24         ` Andrew Morton
  2008-05-07 17:55         ` Ben Hutchings
  0 siblings, 2 replies; 11+ messages in thread
From: Ben Hutchings @ 2008-05-06 12:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: netdev, linux-net-drivers

Andrew Morton wrote:
> On Fri, 2 May 2008 17:05:35 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote:
> 
> > > Oh dear, it found
> > > 
> > > #5617: FILE: drivers/net/sfc/falcon.c:1877:
> > > +               if (*(volatile u32 *)dma_done == FALCON_STATS_DONE)
> > > 
> > > which was naughty of you.  Perhaps this was already discussed in review
> > > with the people who actually know what they're talking about.
> > 
> > There wasn't any specific discussion of this.  Is it wrong?  We want to
> > prevent the compiler from caching *dma_done, which is itself written by DMA.
> 
> Documentation/volatile-considered-harmful.txt has some dicussion.

Looks like we should be using something like:

	while (*dma_done != FALCON_STATS_DONE)
		cpu_relax();

But then how do we time-out?

> > > > +static inline int efx_init_rx_buffer_page(struct efx_rx_queue *rx_queue,
> > > > +					  struct efx_rx_buffer *rx_buf)
> > > > +{
> > > > +	struct efx_nic *efx = rx_queue->efx;
> > > > +	int bytes, space, offset;
> > > > +
> > > > +	bytes = efx->rx_buffer_len - EFX_PAGE_IP_ALIGN;
> > > > +
> > > > +	/* If there is space left in the previously allocated page,
> > > > +	 * then use it. Otherwise allocate a new one */
> > > > +	rx_buf->page = rx_queue->buf_page;
> > > > +	if (rx_buf->page == NULL) {
> > > > +		dma_addr_t dma_addr;
> > > > +
> > > > +		rx_buf->page = alloc_pages(__GFP_COLD | __GFP_COMP | GFP_ATOMIC,
> > > > +					   efx->rx_buffer_order);
> > > 
> > > I don't think we should be using the open-coded __GFP_COMP here.  That's
> > > more an mm-internal thing.
> > 
> > What's the alternative?
> 
> Just remove the __GFP_COMP, I expect.  __GFP_COMP will ask the page
> allocator to add extra book-keeping info to the pageframe (via
> prep_compound_page()).  I doubt if the driver uses that information.

It looks like this flag was used by mistake, based on past experience with
a driver that needed buffers to be mapped into user space.  The original
author of the RX page-allocation code is away at the moment, so I will
have to wait to ask why he used it.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-06 12:50       ` Ben Hutchings
@ 2008-05-06 14:24         ` Andrew Morton
  2008-05-06 15:04           ` Ben Hutchings
  2008-05-07 17:55         ` Ben Hutchings
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2008-05-06 14:24 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, linux-net-drivers

On Tue, 6 May 2008 13:50:09 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote:

> Andrew Morton wrote:
> > On Fri, 2 May 2008 17:05:35 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote:
> > 
> > > > Oh dear, it found
> > > > 
> > > > #5617: FILE: drivers/net/sfc/falcon.c:1877:
> > > > +               if (*(volatile u32 *)dma_done == FALCON_STATS_DONE)
> > > > 
> > > > which was naughty of you.  Perhaps this was already discussed in review
> > > > with the people who actually know what they're talking about.
> > > 
> > > There wasn't any specific discussion of this.  Is it wrong?  We want to
> > > prevent the compiler from caching *dma_done, which is itself written by DMA.
> > 
> > Documentation/volatile-considered-harmful.txt has some dicussion.
> 
> Looks like we should be using something like:
> 
> 	while (*dma_done != FALCON_STATS_DONE)
> 		cpu_relax();
> 
> But then how do we time-out?

Dunno, something like

 	while (*dma_done != FALCON_STATS_DONE) {
		if (time_after(jiffies, when_to_timeout))
			goto timeout;
 		cpu_relax();
	}

or

 	while (*dma_done != FALCON_STATS_DONE) {
		udelay(1);
		if (++timeout > TIMEOUT)
			goto timeout;
	}

?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-06 14:24         ` Andrew Morton
@ 2008-05-06 15:04           ` Ben Hutchings
  2008-05-06 15:12             ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Ben Hutchings @ 2008-05-06 15:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: netdev, linux-net-drivers

Andrew Morton wrote:
> On Tue, 6 May 2008 13:50:09 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote:
> 
> > Andrew Morton wrote:
> > > On Fri, 2 May 2008 17:05:35 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote:
> > > 
> > > > > Oh dear, it found
> > > > > 
> > > > > #5617: FILE: drivers/net/sfc/falcon.c:1877:
> > > > > +               if (*(volatile u32 *)dma_done == FALCON_STATS_DONE)
> > > > > 
> > > > > which was naughty of you.  Perhaps this was already discussed in review
> > > > > with the people who actually know what they're talking about.
> > > > 
> > > > There wasn't any specific discussion of this.  Is it wrong?  We want to
> > > > prevent the compiler from caching *dma_done, which is itself written by DMA.
> > > 
> > > Documentation/volatile-considered-harmful.txt has some dicussion.
> > 
> > Looks like we should be using something like:
> > 
> > 	while (*dma_done != FALCON_STATS_DONE)
> > 		cpu_relax();
> > 
> > But then how do we time-out?
> 
> Dunno, something like
> 
>  	while (*dma_done != FALCON_STATS_DONE) {
> 		if (time_after(jiffies, when_to_timeout))
> 			goto timeout;
>  		cpu_relax();
> 	}
 
This isn't good for short timeouts.

> or
> 
>  	while (*dma_done != FALCON_STATS_DONE) {
> 		udelay(1);
> 		if (++timeout > TIMEOUT)
> 			goto timeout;
> 	}
 
On some architectures udelay() is defined using inline assembly that is not
specified as clobbering memory.  So the compiler need not reload *dma_done
after calling it, and this doesn't work.

> ?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-06 15:04           ` Ben Hutchings
@ 2008-05-06 15:12             ` Andrew Morton
  2008-05-06 15:40               ` Ben Hutchings
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2008-05-06 15:12 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, linux-net-drivers

On Tue, 6 May 2008 16:04:53 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote:

> > or
> > 
> >  	while (*dma_done != FALCON_STATS_DONE) {
> > 		udelay(1);
> > 		if (++timeout > TIMEOUT)
> > 			goto timeout;
> > 	}
>  
> On some architectures udelay() is defined using inline assembly that is not
> specified as clobbering memory.  So the compiler need not reload *dma_done
> after calling it, and this doesn't work.

So pop a cpu_relax() in there as the document describes?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-06 15:12             ` Andrew Morton
@ 2008-05-06 15:40               ` Ben Hutchings
  0 siblings, 0 replies; 11+ messages in thread
From: Ben Hutchings @ 2008-05-06 15:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: netdev, linux-net-drivers

Andrew Morton wrote:
> On Tue, 6 May 2008 16:04:53 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote:
> 
> > > or
> > > 
> > >  	while (*dma_done != FALCON_STATS_DONE) {
> > > 		udelay(1);
> > > 		if (++timeout > TIMEOUT)
> > > 			goto timeout;
> > > 	}
> >  
> > On some architectures udelay() is defined using inline assembly that is not
> > specified as clobbering memory.  So the compiler need not reload *dma_done
> > after calling it, and this doesn't work.
> 
> So pop a cpu_relax() in there as the document describes?

That changes the timing, though it may not matter that much.

Seems to me that this is actually in the last set of "situations where
volatile makes sense":

  - Pointers to data structures in coherent memory which might be modified
    by I/O devices can, sometimes, legitimately be volatile.  A ring buffer
    used by a network adapter, where that adapter changes pointers to
    indicate which descriptors have been processed, is an example of this
    type of situation.

So we could add declare dma_done as volatile u32 * and not do anything
clever in the loop.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: New driver "sfc" for Solarstorm SFC4000 controller.
  2008-05-06 12:50       ` Ben Hutchings
  2008-05-06 14:24         ` Andrew Morton
@ 2008-05-07 17:55         ` Ben Hutchings
  1 sibling, 0 replies; 11+ messages in thread
From: Ben Hutchings @ 2008-05-07 17:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: netdev, linux-net-drivers

Ben Hutchings wrote:
> Andrew Morton wrote:
> > On Fri, 2 May 2008 17:05:35 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote:
<snip>
> > > > > +		rx_buf->page = alloc_pages(__GFP_COLD | __GFP_COMP | GFP_ATOMIC,
> > > > > +					   efx->rx_buffer_order);
> > > > 
> > > > I don't think we should be using the open-coded __GFP_COMP here.  That's
> > > > more an mm-internal thing.
> > > 
> > > What's the alternative?
> > 
> > Just remove the __GFP_COMP, I expect.  __GFP_COMP will ask the page
> > allocator to add extra book-keeping info to the pageframe (via
> > prep_compound_page()).  I doubt if the driver uses that information.
> 
> It looks like this flag was used by mistake, based on past experience with
> a driver that needed buffers to be mapped into user space.  The original
> author of the RX page-allocation code is away at the moment, so I will
> have to wait to ask why he used it.

This is not a mistake.   We need multi-page buffers to be treated as
compound pages because once they are attached to an skbuff they will
be freed by skb_release_data() calling put_page(), not __free_pages().

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2008-05-07 17:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <200804301925.m3UJPc72001651@hera.kernel.org>
2008-05-01 19:08 ` New driver "sfc" for Solarstorm SFC4000 controller Andrew Morton
2008-05-02 16:05   ` Ben Hutchings
2008-05-02 18:09     ` Andrew Morton
2008-05-06 12:50       ` Ben Hutchings
2008-05-06 14:24         ` Andrew Morton
2008-05-06 15:04           ` Ben Hutchings
2008-05-06 15:12             ` Andrew Morton
2008-05-06 15:40               ` Ben Hutchings
2008-05-07 17:55         ` Ben Hutchings
2008-05-02 19:10   ` Michael Brown
2008-05-03 20:01     ` Ben Hutchings

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).