Netdev List
 help / color / mirror / Atom feed
* [PATCH 3/4] dccp: Refine the wait-for-ccid mechanism
From: Gerrit Renker @ 2010-10-28  5:16 UTC (permalink / raw)
  To: davem; +Cc: dccp, netdev, Gerrit Renker
In-Reply-To: <1288242988-5333-3-git-send-email-gerrit@erg.abdn.ac.uk>

This extends the existing wait-for-ccid routine so that it may be used with
different types of CCID, addressing the following problems:

 1) The queue-drain mechanism only works with rate-based CCIDs. If CCID-2 for
    example has a full TX queue and becomes network-limited just as the
    application wants to close, then waiting for CCID-2 to become unblocked
    could lead to an indefinite  delay (i.e., application "hangs").
 2) Since each TX CCID in turn uses a feedback mechanism, there may be changes
    in its sending policy while the queue is being drained. This can lead to
    further delays during which the application will not be able to terminate.
 3) The minimum wait time for CCID-3/4 can be expected to be the queue length
    times the current inter-packet delay. For example if tx_qlen=100 and a delay
    of 15 ms is used for each packet, then the application would have to wait
    for a minimum of 1.5 seconds before being allowed to exit.
 4) There is no way for the user/application to control this behaviour. It would
    be good to use the timeout argument of dccp_close() as an upper bound. Then
    the maximum time that an application is willing to wait for its CCIDs to can
    be set via the SO_LINGER option.

These problems are addressed by giving the CCID a grace period of up to the
`timeout' value.

The wait-for-ccid function is, as before, used when the application
 (a) has read all the data in its receive buffer and
 (b) if SO_LINGER was set with a non-zero linger time, or
 (c) the socket is either in the OPEN (active close) or in the PASSIVE_CLOSEREQ
     state (client application closes after receiving CloseReq).

In addition, there is a catch-all case of __skb_queue_purge() after waiting for
the CCID. This is necessary since the write queue may still have data when
 (a) the host has been passively-closed,
 (b) abnormal termination (unread data, zero linger time),
 (c) wait-for-ccid could not finish within the given time limit.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
---
 net/dccp/dccp.h   |    5 +-
 net/dccp/output.c |  115 ++++++++++++++++++++++++++++++-----------------------
 net/dccp/proto.c  |   21 +++++++++-
 net/dccp/timer.c  |    2 +-
 4 files changed, 89 insertions(+), 54 deletions(-)

--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -243,8 +243,9 @@ extern void dccp_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
 extern void dccp_send_sync(struct sock *sk, const u64 seq,
 			   const enum dccp_pkt_type pkt_type);
 
-extern void dccp_write_xmit(struct sock *sk, int block);
-extern void dccp_write_space(struct sock *sk);
+extern void   dccp_write_xmit(struct sock *sk);
+extern void   dccp_write_space(struct sock *sk);
+extern void   dccp_flush_write_queue(struct sock *sk, long *time_budget);
 
 extern void dccp_init_xmit_timers(struct sock *sk);
 static inline void dccp_clear_xmit_timers(struct sock *sk)
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -209,49 +209,29 @@ void dccp_write_space(struct sock *sk)
 }
 
 /**
- * dccp_wait_for_ccid - Wait for ccid to tell us we can send a packet
+ * dccp_wait_for_ccid  -  Await CCID send permission
  * @sk:    socket to wait for
- * @skb:   current skb to pass on for waiting
- * @delay: sleep timeout in milliseconds (> 0)
- * This function is called by default when the socket is closed, and
- * when a non-zero linger time is set on the socket. For consistency
+ * @delay: timeout in jiffies
+ * This is used by CCIDs which need to delay the send time in process context.
  */
-static int dccp_wait_for_ccid(struct sock *sk, struct sk_buff *skb, int delay)
+static int dccp_wait_for_ccid(struct sock *sk, unsigned long delay)
 {
-	struct dccp_sock *dp = dccp_sk(sk);
 	DEFINE_WAIT(wait);
-	unsigned long jiffdelay;
-	int rc;
-
-	do {
-		dccp_pr_debug("delayed send by %d msec\n", delay);
-		jiffdelay = msecs_to_jiffies(delay);
-
-		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+	long remaining;
 
-		sk->sk_write_pending++;
-		release_sock(sk);
-		schedule_timeout(jiffdelay);
-		lock_sock(sk);
-		sk->sk_write_pending--;
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+	sk->sk_write_pending++;
+	release_sock(sk);
 
-		if (sk->sk_err)
-			goto do_error;
-		if (signal_pending(current))
-			goto do_interrupted;
+	remaining = schedule_timeout(delay);
 
-		rc = ccid_hc_tx_send_packet(dp->dccps_hc_tx_ccid, sk, skb);
-	} while ((delay = rc) > 0);
-out:
+	lock_sock(sk);
+	sk->sk_write_pending--;
 	finish_wait(sk_sleep(sk), &wait);
-	return rc;
-
-do_error:
-	rc = -EPIPE;
-	goto out;
-do_interrupted:
-	rc = -EINTR;
-	goto out;
+
+	if (signal_pending(current) || sk->sk_err)
+		return -1;
+	return remaining;
 }
 
 /**
@@ -305,7 +285,53 @@ static void dccp_xmit_packet(struct sock *sk)
 	ccid_hc_tx_packet_sent(dp->dccps_hc_tx_ccid, sk, len);
 }
 
-void dccp_write_xmit(struct sock *sk, int block)
+/**
+ * dccp_flush_write_queue  -  Drain queue at end of connection
+ * Since dccp_sendmsg queues packets without waiting for them to be sent, it may
+ * happen that the TX queue is not empty at the end of a connection. We give the
+ * HC-sender CCID a grace period of up to @time_budget jiffies. If this function
+ * returns with a non-empty write queue, it will be purged later.
+ */
+void dccp_flush_write_queue(struct sock *sk, long *time_budget)
+{
+	struct dccp_sock *dp = dccp_sk(sk);
+	struct sk_buff *skb;
+	long delay, rc;
+
+	while (*time_budget > 0 && (skb = skb_peek(&sk->sk_write_queue))) {
+		rc = ccid_hc_tx_send_packet(dp->dccps_hc_tx_ccid, sk, skb);
+
+		switch (ccid_packet_dequeue_eval(rc)) {
+		case CCID_PACKET_WILL_DEQUEUE_LATER:
+			/*
+			 * If the CCID determines when to send, the next sending
+			 * time is unknown or the CCID may not even send again
+			 * (e.g. remote host crashes or lost Ack packets).
+			 */
+			DCCP_WARN("CCID did not manage to send all packets\n");
+			return;
+		case CCID_PACKET_DELAY:
+			delay = msecs_to_jiffies(rc);
+			if (delay > *time_budget)
+				return;
+			rc = dccp_wait_for_ccid(sk, delay);
+			if (rc < 0)
+				return;
+			*time_budget -= (delay - rc);
+			/* check again if we can send now */
+			break;
+		case CCID_PACKET_SEND_AT_ONCE:
+			dccp_xmit_packet(sk);
+			break;
+		case CCID_PACKET_ERR:
+			skb_dequeue(&sk->sk_write_queue);
+			kfree_skb(skb);
+			dccp_pr_debug("packet discarded due to err=%ld\n", rc);
+		}
+	}
+}
+
+void dccp_write_xmit(struct sock *sk)
 {
 	struct dccp_sock *dp = dccp_sk(sk);
 	struct sk_buff *skb;
@@ -317,19 +343,9 @@ void dccp_write_xmit(struct sock *sk, int block)
 		case CCID_PACKET_WILL_DEQUEUE_LATER:
 			return;
 		case CCID_PACKET_DELAY:
-			if (!block) {
-				sk_reset_timer(sk, &dp->dccps_xmit_timer,
-						msecs_to_jiffies(rc)+jiffies);
-				return;
-			}
-			rc = dccp_wait_for_ccid(sk, skb, rc);
-			if (rc && rc != -EINTR) {
-				DCCP_BUG("err=%d after dccp_wait_for_ccid", rc);
-				skb_dequeue(&sk->sk_write_queue);
-				kfree_skb(skb);
-				break;
-			}
-			/* fall through */
+			sk_reset_timer(sk, &dp->dccps_xmit_timer,
+				       jiffies + msecs_to_jiffies(rc));
+			return;
 		case CCID_PACKET_SEND_AT_ONCE:
 			dccp_xmit_packet(sk);
 			break;
@@ -648,7 +664,6 @@ void dccp_send_close(struct sock *sk, const int active)
 		DCCP_SKB_CB(skb)->dccpd_type = DCCP_PKT_CLOSE;
 
 	if (active) {
-		dccp_write_xmit(sk, 1);
 		dccp_skb_entail(sk, skb);
 		dccp_transmit_skb(sk, skb_clone(skb, prio));
 		/*
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -726,7 +726,13 @@ int dccp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		goto out_discard;
 
 	skb_queue_tail(&sk->sk_write_queue, skb);
-	dccp_write_xmit(sk,0);
+	/*
+	 * The xmit_timer is set if the TX CCID is rate-based and will expire
+	 * when congestion control permits to release further packets into the
+	 * network. Window-based CCIDs do not use this timer.
+	 */
+	if (!timer_pending(&dp->dccps_xmit_timer))
+		dccp_write_xmit(sk);
 out_release:
 	release_sock(sk);
 	return rc ? : len;
@@ -951,9 +957,22 @@ void dccp_close(struct sock *sk, long timeout)
 		/* Check zero linger _after_ checking for unread data. */
 		sk->sk_prot->disconnect(sk, 0);
 	} else if (sk->sk_state != DCCP_CLOSED) {
+		/*
+		 * Normal connection termination. May need to wait if there are
+		 * still packets in the TX queue that are delayed by the CCID.
+		 */
+		dccp_flush_write_queue(sk, &timeout);
 		dccp_terminate_connection(sk);
 	}
 
+	/*
+	 * Flush write queue. This may be necessary in several cases:
+	 * - we have been closed by the peer but still have application data;
+	 * - abortive termination (unread data or zero linger time),
+	 * - normal termination but queue could not be flushed within time limit
+	 */
+	__skb_queue_purge(&sk->sk_write_queue);
+
 	sk_stream_wait_close(sk, timeout);
 
 adjudge_to_death:
--- a/net/dccp/timer.c
+++ b/net/dccp/timer.c
@@ -249,7 +249,7 @@ static void dccp_write_xmitlet(unsigned long data)
 	if (sock_owned_by_user(sk))
 		sk_reset_timer(sk, &dccp_sk(sk)->dccps_xmit_timer, jiffies + 1);
 	else
-		dccp_write_xmit(sk, 0);
+		dccp_write_xmit(sk);
 	bh_unlock_sock(sk);
 }
 

^ permalink raw reply

* Re: [PATCH] cxgb3: fix crash due to manipulating queues before registration
From: Eric Dumazet @ 2010-10-28  5:18 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: sonnyrao, Divy Le Ray, Dimitris Michailidis, netdev, linux-kernel
In-Reply-To: <1288242390-28574-1-git-send-email-nacc@us.ibm.com>

Le mercredi 27 octobre 2010 à 22:06 -0700, Nishanth Aravamudan a écrit :
> Hi Eric,
> 
> Something like the following?:
> 
> Thanks,
> Nish

Yes, probably, but I dont have the hardware so cannot test.



^ permalink raw reply

* Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation
From: Michael S. Tsirkin @ 2010-10-28  5:20 UTC (permalink / raw)
  To: Shirley Ma; +Cc: David Miller, netdev, kvm, linux-kernel
In-Reply-To: <1288240804.14342.1.camel@localhost.localdomain>

On Wed, Oct 27, 2010 at 09:40:04PM -0700, Shirley Ma wrote:
> Resubmit this patch for fixing some minor error (white space, typo).
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>

My concern is this can delay signalling for unlimited time.
Could you pls test this with guests that do not have
2b5bbe3b8bee8b38bdc27dd9c0270829b6eb7eeb
b0c39dbdc204006ef3558a66716ff09797619778
that is 2.6.31 and older?

This seems to be slighltly out of spec, even though
for TX, signals are less important.

Two ideas:
1. How about writing out used, just delaying the signal?
   This way we don't have to queue separately.
2. How about flushing out queued stuff before we exit
   the handle_tx loop? That would address most of
   the spec issue.

> ---
>  drivers/vhost/net.c   |   20 +++++++++++++++++++-
>  drivers/vhost/vhost.c |   32 ++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |    3 +++
>  3 files changed, 54 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 4b4da5b..3eb8016 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -128,6 +128,7 @@ static void handle_tx(struct vhost_net *net)
>  	int err, wmem;
>  	size_t hdr_size;
>  	struct socket *sock;
> +	int max_pend = vq->num - (vq->num >> 2);
>  
>  	sock = rcu_dereference_check(vq->private_data,
>  				     lockdep_is_held(&vq->mutex));
> @@ -198,7 +199,24 @@ static void handle_tx(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		/*
> +		 * if no pending buffer size allocate, signal used buffer
> +		 * one by one, otherwise, signal used buffer when reaching
> +		 * 3/4 ring size to reduce CPU utilization.
> +		 */
> +		if (unlikely(vq->pend))
> +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		else {
> +			vq->pend[vq->num_pend].id = head;
> +			vq->pend[vq->num_pend].len = 0;
> +			++vq->num_pend;
> +			if (vq->num_pend == max_pend) {
> +				vhost_add_used_and_signal_n(&net->dev, vq,
> +							    vq->pend,
> +							    vq->num_pend);
> +				vq->num_pend = 0;
> +			}
> +		}
>  		total_len += len;
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 94701ff..f2f3288 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -170,6 +170,16 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> +	/* signal pending used buffers */
> +	if (vq->pend) {
> +		if (vq->num_pend != 0) {
> +			vhost_add_used_and_signal_n(dev, vq, vq->pend,
> +						    vq->num_pend);
> +			vq->num_pend = 0;
> +		}
> +		kfree(vq->pend);
> +	}
> +	vq->pend = NULL;
>  }
>  
>  static int vhost_worker(void *data)
> @@ -273,7 +283,14 @@ long vhost_dev_init(struct vhost_dev *dev,
>  		dev->vqs[i].heads = NULL;
>  		dev->vqs[i].dev = dev;
>  		mutex_init(&dev->vqs[i].mutex);
> +		dev->vqs[i].num_pend = 0;
> +		dev->vqs[i].pend = NULL;
>  		vhost_vq_reset(dev, dev->vqs + i);
> +		/* signal 3/4 of ring size used buffers */
> +		dev->vqs[i].pend = kmalloc((dev->vqs[i].num -
> +					   (dev->vqs[i].num >> 2)) *
> +					   sizeof *dev->vqs[i].pend,
> +					   GFP_KERNEL);
>  		if (dev->vqs[i].handle_kick)
>  			vhost_poll_init(&dev->vqs[i].poll,
>  					dev->vqs[i].handle_kick, POLLIN, dev);
> @@ -599,6 +616,21 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>  			r = -EINVAL;
>  			break;
>  		}
> +		if (vq->num != s.num) {
> +			/* signal used buffers first */
> +			if (vq->pend) {
> +				if (vq->num_pend != 0) {
> +					vhost_add_used_and_signal_n(vq->dev, vq,
> +								    vq->pend,
> +								    vq->num_pend);
> +					vq->num_pend = 0;
> +				}
> +				kfree(vq->pend);
> +			}
> +			/* realloc pending used buffers size */
> +			vq->pend = kmalloc((s.num - (s.num >> 2)) *
> +					   sizeof *vq->pend, GFP_KERNEL);
> +		}
>  		vq->num = s.num;
>  		break;
>  	case VHOST_SET_VRING_BASE:
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 073d06a..78949c0 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -108,6 +108,9 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	/* delay multiple used buffers to signal once */
> +	int num_pend;
> +	struct vring_used_elem *pend;
>  };
>  
>  struct vhost_dev {
> 

^ permalink raw reply

* Re: [PATCH] cxgb3: fix crash due to manipulating queues before registration
From: Nishanth Aravamudan @ 2010-10-28  5:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: sonnyrao, Divy Le Ray, Dimitris Michailidis, netdev, linux-kernel
In-Reply-To: <1288243110.2658.126.camel@edumazet-laptop>

On 28.10.2010 [07:18:30 +0200], Eric Dumazet wrote:
> Le mercredi 27 octobre 2010 à 22:06 -0700, Nishanth Aravamudan a écrit :
> > Hi Eric,
> > 
> > Something like the following?:
> > 
> > Thanks,
> > Nish
> 
> Yes, probably, but I dont have the hardware so cannot test.

Sorry, should have been clearer. I do have the hardware, tested with the
patch, it does work. Wanted to make sure that it made sense to the
maintainers, of course, but:

Tested-by: Nishanth Aravamudan <nacc@us.ibm.com>

Can be added, if needed.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

^ permalink raw reply

* Re: [PATCH] via-velocity: Codestyle fixes
From: Alexander Wigen @ 2010-10-28  5:02 UTC (permalink / raw)
  To: Joe Perches; +Cc: netdev, linux-kernel
In-Reply-To: <1288239318.1836.7.camel@Joe-Laptop>

Hi Joe,

Thanks for that very quick review, this is my first kernel patch.

On Thu, 28 Oct 2010, you wrote:
> On Thu, 2010-10-28 at 13:48 +1000, alex@wigen.net wrote:
> > From: Alexander Wigen <alex@wigen.net>
> > @@ -1542,7 +1542,7 @@ static int velocity_alloc_rx_buf(struct
> > velocity_info *vptr, int idx)
> > 
> >  	 *	Fill in the descriptor to match
> >  	 */
> > 
> > -	*((u32 *) & (rd->rdesc0)) = 0;
> > +	*((u32 *) &(rd->rdesc0)) = 0;
> 
> I think this style change isn't really that good.
> rdesc0 is:
> 
> struct rdesc0 {
> 	__le16 RSR;		/* Receive status */
> 	__le16 len;		/* bits 0--13; bit 15 - owner */
> };
> 
> I think it more sensible to either set the fields
> in the struct to 0 or to memset the struct to 0.

Not sure what is best so I reverted the change.

> > @@ -1681,7 +1681,7 @@ static int velocity_init_td_ring(struct
> > velocity_info *vptr)
> > 
> >  static void velocity_free_dma_rings(struct velocity_info *vptr)
> >  {
> >  
> >  	const int size = vptr->options.numrx * sizeof(struct rx_desc) +
> > 
> > -		vptr->options.numtx * sizeof(struct tx_desc) * vptr->tx.numq;
> > +		vptr->options.numtx * sizeof(struct tx_desc) *vptr->tx.numq;
> 
> If checkpatch warns about this, it's wrong.

It does produce a error acctually, but I have reverted the change:

drivers/net/via-velocity.c:1684: ERROR: space prohibited after that '*' 
(ctx:WxW)


> 
> > @@ -2477,24 +2477,24 @@ static struct net_device_stats
> > *velocity_get_stats(struct net_device *dev)
> > 
> >  	dev->stats.rx_errors = vptr->mib_counter[HW_MIB_ifRxErrorPkts];
> >  	dev->stats.rx_length_errors =
> >  	vptr->mib_counter[HW_MIB_ifInRangeLengthErrors];
> > 
> > -//  unsigned long   rx_dropped;     /* no space in linux buffers    */
> > +/*  unsigned long   rx_dropped;     /* no space in linux buffers    */
> 
> why not just remove all these commented-out lines?

Done.

> > @@ -2957,7 +2957,7 @@ static int velocity_set_wol(struct velocity_info
> > *vptr)
> > 
> >  		memcpy(arp->ar_tip, vptr->ip_addr, 4);
> >  		
> >  		crc = wol_calc_crc((sizeof(struct arp_packet) + 7) / 8, buf,
> > 
> > -				(u8 *) & mask_pattern[0][0]);
> > +				(u8 *) &mask_pattern[0][0]);
> 
> perhaps just (u8 *)mask_pattern?
> 
> > @@ -3454,7 +3453,7 @@ static const struct ethtool_ops
> > velocity_ethtool_ops = {
> > 
> >  	.set_wol	=	velocity_ethtool_set_wol,
> >  	.get_msglevel	=	velocity_get_msglevel,
> >  	.set_msglevel	=	velocity_set_msglevel,
> > 
> > -	.set_sg 	=	ethtool_op_set_sg,
> > +	.set_sg	=	ethtool_op_set_sg,
> 
> bad alignment?

Sorry, wrong editor tab width.

Cheers,
Alexander Wigen

^ permalink raw reply

* Re: [PATCH net-next-2.6 1/2] be2net: Adding an option to use INTx instead of MSI-X
From: Grant Grundler @ 2010-10-28  5:37 UTC (permalink / raw)
  To: David Miller; +Cc: michael, bhutchings, somnath.kotur, netdev, linux-pci
In-Reply-To: <20101027.084604.226761472.davem@davemloft.net>

On Wed, Oct 27, 2010 at 08:46:04AM -0700, David Miller wrote:
> From: Michael Ellerman <michael@ellerman.id.au>
> Date: Wed, 27 Oct 2010 10:20:35 +1100
> 
> > On Tue, 2010-10-26 at 14:32 +0100, Ben Hutchings wrote:
> >> Michael Ellerman wrote:
> >> > On Mon, 2010-10-25 at 16:25 -0700, David Miller wrote:
> >> > > From: Ben Hutchings <bhutchings@solarflare.com>
> >> > > Date: Mon, 25 Oct 2010 23:38:53 +0100
> > 
> >> > Ethtool would be nice, but only for network drivers. Is there a generic
> >> > solution, quirks are obviously not keeping people happy.
> >>  
> >> Since this is (normally) a property of the system, pci=nomsi is the
> >> generic solution.
> > 
> > Sort of, it's a big hammer. Did all these driver writers not know about
> > pci=nomsi or did they prefer to add a parameter to their driver for some
> > reason?
> 
> Every time I've actually done the work to try and track down the
> true issue, it always turned out to be a PCI chipset problem rather
> than a device specific issue.

I agree with your generalization. I can think of only one exception:
ISTR pre-5705 tg3 chips would send both MSI and assert IRQ line at the same time.

My guess is driver writers just want knob to "work around" any issues
*their* driver might see with chipset. Disabling MSI for all drivers
doesn't leave opportunity for experimenting with individual drivers.

hth,
grant

> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-10-28  5:50 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <OFC29C4491.59069AD1-ON652577CA.00170F0D-652577CA.001C76C8@in.ibm.com>

On Thu, Oct 28, 2010 at 10:44:14AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 10/26/2010 04:39:13 PM:
> 
> (merging two posts into one)
> 
> > I think we discussed the need for external to guest testing
> > over 10G. For large messages we should not see any change
> > but you should be able to get better numbers for small messages
> > assuming a MQ NIC card.
> 
> For external host, there is a contention among different
> queues (vhosts) when packets are processed in tun/bridge,
> unless I implement MQ TX for macvtap (tun/bridge?).  So
> my testing shows a small improvement (1 to 1.5% average)
> in BW and a rise in SD (between 10-15%).  For remote host,
> I think tun/macvtap needs MQ TX support?

Confused. I thought this *is* with a multiqueue tun/macvtap?
bridge does not do any queueing AFAIK ...
I think we need to fix the contention. With migration what was guest to
host a minute ago might become guest to external now ...

> > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > >
> > > > Is binding vhost threads to CPUs really required?
> > > > What happens if we let the scheduler do its job?
> > >
> > > Nothing drastic, I remember BW% and SD% both improved a
> > > bit as a result of binding.
> >
> > If there's a significant improvement this would mean that
> > we need to rethink the vhost-net interaction with the scheduler.
> 
> I will get a test run with and without binding and post the
> results later today.
> 
> Thanks,
> 
> - KK

^ permalink raw reply

* Re: [patch v2] fix stack overflow in pktgen_if_write()
From: Dan Carpenter @ 2010-10-28  6:05 UTC (permalink / raw)
  To: Nelson Elhage
  Cc: Eric Dumazet, David S. Miller, Robert Olsson, Andy Shevchenko,
	netdev
In-Reply-To: <20101027230657.GT16803@ksplice.com>

On Wed, Oct 27, 2010 at 07:06:57PM -0400, Nelson Elhage wrote:
> You want to add a trailing NUL, or else printk will read off the end of the
> buffer.
> 
> Also, by memdup()ing count + 1 bytes, you're technically reading one more byte
> than userspace asked for, which could in principle lead to a spurious EFAULT.
> 

That's a lot of bugs per line.  :(  I'm eating humble pie today...

regards,
dan carpenter



^ permalink raw reply

* [patch v3] fix stack overflow in pktgen_if_write()
From: Dan Carpenter @ 2010-10-28  6:05 UTC (permalink / raw)
  To: Nelson Elhage
  Cc: Eric Dumazet, David S. Miller, Robert Olsson, Andy Shevchenko,
	netdev
In-Reply-To: <20101027230657.GT16803@ksplice.com>

Nelson Elhage says he was able to oops both amd64 and i386 test 
machines with 8k writes to the pktgen file.  Let's just allocate the
buffer on the heap instead of on the stack.

This can only be triggered by root so there are no security issues here.

Reported-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: Dan Carpenter <error27@gmail.com>
---
v3:  just use kmalloc()

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 2c0df0f..c8d3620 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -887,12 +887,17 @@ static ssize_t pktgen_if_write(struct file *file,
 	i += len;
 
 	if (debug) {
-		char tb[count + 1];
+		char *tb;
+
+		tb = kmalloc(count + 1, GFP_KERNEL);
+		if (!tb)
+			return -ENOMEM;
 		if (copy_from_user(tb, user_buffer, count))
 			return -EFAULT;
 		tb[count] = 0;
 		printk(KERN_DEBUG "pktgen: %s,%lu  buffer -:%s:-\n", name,
 		       (unsigned long)count, tb);
+		kfree(tb);
 	}
 
 	if (!strcmp(name, "min_pkt_size")) {

^ permalink raw reply related

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-10-28  6:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty
In-Reply-To: <20101028055054.GG5599@redhat.com>

> "Michael S. Tsirkin" <mst@redhat.com>

> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > For external host, there is a contention among different
> > queues (vhosts) when packets are processed in tun/bridge,
> > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > my testing shows a small improvement (1 to 1.5% average)
> > in BW and a rise in SD (between 10-15%).  For remote host,
> > I think tun/macvtap needs MQ TX support?
>
> Confused. I thought this *is* with a multiqueue tun/macvtap?
> bridge does not do any queueing AFAIK ...
> I think we need to fix the contention. With migration what was guest to
> host a minute ago might become guest to external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is
required for macvtap, though. Is it enough for existing
macvtap sendmsg to work, since it calls dev_queue_xmit
which selects the txq for the outgoing device?

Thanks,

- KK


^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-10-28  6:18 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty
In-Reply-To: <OF03CB03BB.27A87B85-ON652577CA.002069A5-652577CA.0021C2E2@in.ibm.com>

On Thu, Oct 28, 2010 at 11:42:05AM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" <mst@redhat.com>
> 
> > > > I think we discussed the need for external to guest testing
> > > > over 10G. For large messages we should not see any change
> > > > but you should be able to get better numbers for small messages
> > > > assuming a MQ NIC card.
> > >
> > > For external host, there is a contention among different
> > > queues (vhosts) when packets are processed in tun/bridge,
> > > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > > my testing shows a small improvement (1 to 1.5% average)
> > > in BW and a rise in SD (between 10-15%).  For remote host,
> > > I think tun/macvtap needs MQ TX support?
> >
> > Confused. I thought this *is* with a multiqueue tun/macvtap?
> > bridge does not do any queueing AFAIK ...
> > I think we need to fix the contention. With migration what was guest to
> > host a minute ago might become guest to external now ...
> 
> Macvtap RX is MQ but not TX. I don't think MQ TX support is
> required for macvtap, though. Is it enough for existing
> macvtap sendmsg to work, since it calls dev_queue_xmit
> which selects the txq for the outgoing device?
> 
> Thanks,
> 
> - KK

I think there would be an issue with using a single poll notifier and
contention on send buffer atomic variable.
Is tun different than macvtap? We need to support both long term ...

-- 
MST

^ permalink raw reply

* Re: [PATCH] ipv6: addrconf: don't remove address state on ifdown if the address is being kept
From: Lorenzo Colitti @ 2010-10-28  6:45 UTC (permalink / raw)
  To: netdev; +Cc: Maciej Żenczykowski, David Miller, Brian Haley
In-Reply-To: <AANLkTimyqFoUvjQM_qLS2YeVmJQ_6-LJc6KfLVE-mzH3@mail.gmail.com>

On Wed, Oct 27, 2010 at 9:16 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
> Fix it so that none of this state is updated if the address is being kept on the interface.

Or, since the logic of the patched code is a little hard to follow,
here's an alternative patch against 2.6.36. I think it does exactly
the same as the previous patch, but it moves things around to make the
result more readable.

Tested: Added a statically configured IPv6 address to an interface,
started ping, brought link down, brought link up again. When link came
up ping kept on going and "ip -6 maddr" showed that the host was still
subscribed to there

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>

--- 2.6.36-orig/net/ipv6/addrconf.c	2010-10-20 13:30:22.000000000 -0700
+++ linux-2.6.36/net/ipv6/addrconf.c	2010-10-27 23:26:57.000000000 -0700
@@ -2737,10 +2737,6 @@ static int addrconf_ifdown(struct net_de
 			/* Flag it for later restoration when link comes up */
 			ifa->flags |= IFA_F_TENTATIVE;
 			ifa->state = INET6_IFADDR_STATE_DAD;
-
-			write_unlock_bh(&idev->lock);
-
-			in6_ifa_hold(ifa);
 		} else {
 			list_del(&ifa->if_list);

@@ -2755,19 +2751,15 @@ static int addrconf_ifdown(struct net_de
 			ifa->state = INET6_IFADDR_STATE_DEAD;
 			spin_unlock_bh(&ifa->state_lock);

-			if (state == INET6_IFADDR_STATE_DEAD)
-				goto put_ifa;
+			if (state == INET6_IFADDR_STATE_DEAD) {
+				in6_ifa_put(ifa);
+			} else {
+				__ipv6_ifa_notify(RTM_DELADDR, ifa);
+				atomic_notifier_call_chain(&inet6addr_chain,
+							   NETDEV_DOWN, ifa);
+			}
+			write_lock_bh(&idev->lock);
 		}
-
-		__ipv6_ifa_notify(RTM_DELADDR, ifa);
-		if (ifa->state == INET6_IFADDR_STATE_DEAD)
-			atomic_notifier_call_chain(&inet6addr_chain,
-						   NETDEV_DOWN, ifa);
-
-put_ifa:
-		in6_ifa_put(ifa);
-
-		write_lock_bh(&idev->lock);
 	}

 	list_splice(&keep_list, &idev->addr_list);

^ permalink raw reply

* About the Davicom PHY in drivers/net/phy in Linux kernel
From: macpaul @ 2010-10-28  6:33 UTC (permalink / raw)
  To: netdev; +Cc: afleming, jeff, f.rodo, joseph_chang

Hi all,

According to the source code of Davicom PHY in Linux,
We should do "ISOLATE" command to PHY before setting "auto negotiation" or "MII/RMII".

However, I've found that with Faraday's MAC/GMAC controller (ftmac100/ftgmac100), setting ISOLATE for multiple PHY configuration will lead MDC become stop because the flaw inside the MAC controller. Faraday's MAC/GMAC will leverage the TX_CLK as the MDC source. When FTMAC100 send "ISOLATE" to Davicom's PHY, the TX_CLK send out from PHY will be stopped, then MDC will also become stop.

However, this mail wasn't meant to discuss about the design flaw inside the IC.

We've done two test to the following codes.

1st: if we just skip the " BMCR_ISOLATE" setting command, we will get PHY sometimes become unstable, for example, could not do DHCP request successfully.

2nd: if we replace "BMCR_ISOLATE" to "BMCR_RESET", we could get rid of the problem occurred by Faraday GMAC. And the PHY works well still.

I've found that in some other PHY implementation, for example, in "marvell.c", there are seems no ISOLATE commands. There are only RESET commands.
If we could replace BMCR_ISOLATE to BMCR_RESET in current kernel source, will there be any unpredictable behavior happened?

Please give us suggestion according to your experiences.
Thanks a lot.

+static int dm9161_config_aneg(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Configure the new settings */
+	err = genphy_config_aneg(phydev);
+
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int dm9161_config_init(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Do not bypass the scrambler/descrambler */
+	err = phy_write(phydev, MII_DM9161_SCR, MII_DM9161_SCR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Clear 10BTCSR to default */
+	err = phy_write(phydev, MII_DM9161_10BTCSR, MII_DM9161_10BTCSR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Reconnect the PHY, and enable Autonegotiation */
+	err = phy_write(phydev, MII_BMCR, BMCR_ANENABLE);
+
+	if (err < 0)
+		return err;
+
+	return 0;

Best regards,
Macpaul Lin

^ permalink raw reply

* Re: [PATCH] cxgb3: fix crash due to manipulating queues before registration
From: Divy Le Ray @ 2010-10-28  6:35 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: eric.dumazet, sonnyrao, Dimitrios Michailidis, netdev,
	linux-kernel
In-Reply-To: <1288242390-28574-1-git-send-email-nacc@us.ibm.com>

On 10/27/2010 10:06 PM, Nishanth Aravamudan wrote:
> Hi Eric,
>
> Something like the following?:
>
> Thanks,
> Nish
>
>
> Along the same lines as "cxgb4: fix crash due to manipulating queues
> before registration" (8f6d9f40476895571df039b6f1f5230ec7faebad), before
> commit "net: allocate tx queues in register_netdevice"
> netif_tx_stop_all_queues and related functions could be used between
> device allocation and registration but now only after registration.
> cxgb4 has such a call before registration and crashes now.  Move it
> after register_netdev.
>
> Signed-off-by: Nishanth Aravamudan<nacc@us.ibm.com>

Acked-by: Divy Le Ray <divy@chelsio.com>

> Cc: eric.dumazet@gmail.com
> Cc: sonnyrao@us.ibm.com
> Cc: Divy Le Ray<divy@chelsio.com>
> Cc: Dimitris Michailidis<dm@chelsio.com>
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>   drivers/net/cxgb3/cxgb3_main.c |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
> index 4e3c123..96c70a5 100644
> --- a/drivers/net/cxgb3/cxgb3_main.c
> +++ b/drivers/net/cxgb3/cxgb3_main.c
> @@ -3301,7 +3301,6 @@ static int __devinit init_one(struct pci_dev *pdev,
>   		pi->rx_offload = T3_RX_CSUM | T3_LRO;
>   		pi->port_id = i;
>   		netif_carrier_off(netdev);
> -		netif_tx_stop_all_queues(netdev);
>   		netdev->irq = pdev->irq;
>   		netdev->mem_start = mmio_start;
>   		netdev->mem_end = mmio_start + mmio_len - 1;
> @@ -3342,6 +3341,7 @@ static int __devinit init_one(struct pci_dev *pdev,
>   				adapter->name = adapter->port[i]->name;
>
>   			__set_bit(i,&adapter->registered_device_map);
> +			 netif_tx_stop_all_queues(adapter->port[i]);
>   		}
>   	}
>   	if (!adapter->registered_device_map) {



^ permalink raw reply

* Re: 2.6.35->2.6.36 regression, vanilla kernel panic, ppp or hrtimers crashing
From: Jarek Poplawski @ 2010-10-28  7:05 UTC (permalink / raw)
  To: Denys Fedoryshchenko
  Cc: Thomas Gleixner, Paul Mackerras, linux-kernel, netdev
In-Reply-To: <201010251222.37191.nuclearcat@nuclearcat.com>

On 2010-10-25 11:22, Denys Fedoryshchenko wrote:
> Hi
> 
> Here is what i got from netconsole
>  [  259.238755] BUG: unable to handle kernel 
>  paging request
>  at f8ba001c
>  [  259.238953] IP:
>  [<c0199ebe>] do_select+0x2cc/0x502
...
> It is not easy to do full git bisect(it is semi-embedded distro), but i can 
> try reversing particular commits, if someone can give idea which one, and can 
> try debug patches.

Hi,
Nothing concrete, but you might try reverting this one:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.36.y.git;a=commitdiff;h=15fd0cd9a2ad24a78fbee369dec8ca660979d57e

Jarek P.

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-10-28  7:18 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, Michael S. Tsirkin,
	netdev, rusty
In-Reply-To: <OFC29C4491.59069AD1-ON652577CA.00170F0D-652577CA.001C76C8@LocalDomain>

> Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
>
> > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > >
> > > > Is binding vhost threads to CPUs really required?
> > > > What happens if we let the scheduler do its job?
> > >
> > > Nothing drastic, I remember BW% and SD% both improved a
> > > bit as a result of binding.
> >
> > If there's a significant improvement this would mean that
> > we need to rethink the vhost-net interaction with the scheduler.
>
> I will get a test run with and without binding and post the
> results later today.

Correction: The result with binding is is much better for
SD/CPU compared to without-binding:

_____________________________________________________
     numtxqs=8,vhosts=5, Bind vs No-bind
#     BW%     CPU%     RCPU%     SD%       RSD%
_____________________________________________________
1     11.25     10.77    1.89     0        -6.06
2     18.66     7.20     7.20    -14.28    -7.40
4     4.24     -1.27     1.56    -2.70     -.98
8     14.91    -3.79     5.46    -12.19    -3.76
16    12.32    -8.67     4.63    -35.97    -26.66
24    11.68    -7.83     5.10    -40.73    -32.37
32    13.09    -10.51    6.57    -51.52    -42.28
40    11.04    -4.12     11.23   -50.69    -42.81
48    8.61     -10.30    6.04    -62.38    -55.54
64    7.55     -6.05     6.41    -61.20    -56.04
80    8.74     -11.45    6.29    -72.65    -67.17
96    9.84     -6.01     9.87    -69.89    -64.78
128   5.57     -6.23     8.99    -75.03    -70.97
_____________________________________________________
BW: 10.4%,  CPU/RCPU: -7.4%,7.7%,  SD: -70.5%,-65.7%

Notes:
    1.  All my test results earlier was binding vhost
        to cpus 0-3 for both org and new kernel.
    2.  I am not using MST's use_mq patch, only mainline
        kernel. However, I reported earlier that I got
        better results with that patch. The result for
        MQ vs MQ+use_mm patch (from my earlier mail):

BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6

Thanks,

- KK


^ permalink raw reply

* Re: [Bugme-new] [Bug 20772] New: ip= nfsaddrs= stopped working
From: Simon Horman @ 2010-10-27 22:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: netdev, bugzilla-daemon, bugme-daemon, mpetersen
In-Reply-To: <20101019120938.93622e82.akpm@linux-foundation.org>

On Tue, Oct 19, 2010 at 12:09:38PM -0700, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Tue, 19 Oct 2010 18:20:15 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=20772
> > 
> >            Summary: ip= nfsaddrs= stopped working
> >            Product: Networking
> >            Version: 2.5
> >     Kernel Version: 2.6.33, 2.6.32, 2.6.26
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: high
> >           Priority: P1
> >          Component: IPV4
> >         AssignedTo: shemminger@linux-foundation.org
> >         ReportedBy: mpetersen@peak6.com
> >         Regression: Yes
> > 
> > 
> > I can no longer set a static IP address using either ip= or nfsaddr=,
> > ie ip=192.168.0.42:192.168.0.69:255.255.255.0:testhost:eth0:none has no
> > effect, and DHCP is still attempted.
> > 
> > Even if I undefine IP_PNP_DHCP IP_PNP_BOOTP and IP_PNP_RARP in the
> > config (but leave IP_PNP,) the kernel seems to ignore the static IP
> > settings and try to get a an IP address with DHCP somehow.

Hi,

I believe that the problem is that there is a minor syntax error in the
configuration parameter. Specifically there should be two colons between
192.168.0.69 and 255.255.255.0 or between 192.168.0.42 and 192.168.0.69.

As it stands the settings are:

	my address	192.168.0.42
	server address	192.168.0.69
	gateway		255.255.255.0
	netmask		testhost
	hostname	eth0
	interface	none

I believe it is the interface=none that is killing any IP configuration
by the kernel as no such interface exists.

^ permalink raw reply

* RE: About the Davicom PHY in drivers/net/phy in Linux kernel
From: Joseph Chang @ 2010-10-28  7:59 UTC (permalink / raw)
  To: macpaul, netdev; +Cc: afleming, jeff, f.rodo
In-Reply-To: <50FD90C65C53FB45BADEEBCD84FF07F2029CB176@ATCPCS06.andestech.com>

Dear Mac Paul,

For you are using DAVICOM PHY in this case.

Our suggestion, please check below red comment text:

 * == > Would you tell us your:
	CPU = ?
	Linux Kernel version= ?
      I will like to download the same source code from LXR.
      And can check more detail for you. 

+static int dm9161_config_aneg(struct phy_device *phydev) {
+	int err;
+
+	/* Isolate the PHY */
     err = 0;
+	//err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Configure the new settings */
+	err = genphy_config_aneg(phydev);
    err = phy_write(phydev, MII_BMCR, BMCR_ANENABLE | 0x200 );
+
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int dm9161_config_init(struct phy_device *phydev) {
+	int err;
+
+	/* Isolate the PHY */
     err = 0;
+	//err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Do not bypass the scrambler/descrambler */
+	err = phy_write(phydev, MII_DM9161_SCR, MII_DM9161_SCR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Clear 10BTCSR to default */
+	err = phy_write(phydev, MII_DM9161_10BTCSR, MII_DM9161_10BTCSR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Reconnect the PHY, and enable Autonegotiation */
+	err = phy_write(phydev, MII_BMCR, BMCR_ANENABLE | 0x200 );
+
+	if (err < 0)
+		return err;
+
+	return 0;

Best Regards,
Joseph CHANG 
System Application Engineering Division 
Davicom Semiconductor, Inc. 
No. 6 Li-Hsin Rd. VI, Science-Based Park, 
Hsin-Chu, Taiwan. 
Tel: 886-3-5798797 Ex 8534
Fax: 886-3-5646929 
Web: http://www.davicom.com.tw 


-----Original Message-----
From: macpaul@andestech.com [mailto:macpaul@andestech.com] 
Sent: Thursday, October 28, 2010 2:34 PM
To: netdev@vger.kernel.org
Cc: afleming@freescale.com; jeff@garzik.org; f.rodo@til-technologies.fr; joseph_chang@mail.davicom.com.tw
Subject: About the Davicom PHY in drivers/net/phy in Linux kernel

Hi all,

According to the source code of Davicom PHY in Linux,
We should do "ISOLATE" command to PHY before setting "auto negotiation" or "MII/RMII".

However, I've found that with Faraday's MAC/GMAC controller (ftmac100/ftgmac100), setting ISOLATE for multiple PHY configuration will lead MDC become stop because the flaw inside the MAC controller. Faraday's MAC/GMAC will leverage the TX_CLK as the MDC source. When FTMAC100 send "ISOLATE" to Davicom's PHY, the TX_CLK send out from PHY will be stopped, then MDC will also become stop.

However, this mail wasn't meant to discuss about the design flaw inside the IC.

We've done two test to the following codes.

1st: if we just skip the " BMCR_ISOLATE" setting command, we will get PHY sometimes become unstable, for example, could not do DHCP request successfully.

2nd: if we replace "BMCR_ISOLATE" to "BMCR_RESET", we could get rid of the problem occurred by Faraday GMAC. And the PHY works well still.

I've found that in some other PHY implementation, for example, in "marvell.c", there are seems no ISOLATE commands. There are only RESET commands.
If we could replace BMCR_ISOLATE to BMCR_RESET in current kernel source, will there be any unpredictable behavior happened?

Please give us suggestion according to your experiences.
Thanks a lot.

+static int dm9161_config_aneg(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Configure the new settings */
+	err = genphy_config_aneg(phydev);
+
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int dm9161_config_init(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Do not bypass the scrambler/descrambler */
+	err = phy_write(phydev, MII_DM9161_SCR, MII_DM9161_SCR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Clear 10BTCSR to default */
+	err = phy_write(phydev, MII_DM9161_10BTCSR, MII_DM9161_10BTCSR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Reconnect the PHY, and enable Autonegotiation */
+	err = phy_write(phydev, MII_BMCR, BMCR_ANENABLE);
+
+	if (err < 0)
+		return err;
+
+	return 0;

Best regards,
Macpaul Lin

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



^ permalink raw reply

* Re: IPV6 raw socket denies bind(2)
From: Rémi Denis-Courmont @ 2010-10-28  8:25 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: netdev
In-Reply-To: <alpine.LNX.2.01.1010280000260.7820@obet.zrqbmnf.qr>


On Thu, 28 Oct 2010 00:01:57 +0200 (CEST), Jan Engelhardt
<jengelh@medozas.de> wrote:
> #include <sys/socket.h>
> #include <stdio.h>
> #include <string.h>
> #include <netinet/udp.h>
> #include <netinet/in.h>
> #include <netinet/ip6.h>
> #include <arpa/inet.h>
> #include <stdlib.h>
> 
> int main(void)
> {
> 	struct sockaddr_in6 src = {};
> 	int sk;
> 
> 	sk = socket(AF_INET6, SOCK_RAW, IPPROTO_UDP);
> 	memset(&src, 0, sizeof(src));
> 	inet_pton(AF_INET6, "::1", &src);

That should be:
inet_pton(AF_INET6, "::1", &src.sin6_addr);
No wonder it does not work.


-- 
Rémi Denis-Courmont
http://www.remlab.net
http://fi.linkedin.com/in/remidenis


^ permalink raw reply

* Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance
From: Stefan Hajnoczi @ 2010-10-28  8:57 UTC (permalink / raw)
  To: Shirley Ma; +Cc: mst@redhat.com, David Miller, netdev, kvm, linux-kernel
In-Reply-To: <1288213557.17571.28.camel@localhost.localdomain>

On Wed, Oct 27, 2010 at 10:05 PM, Shirley Ma <mashirle@us.ibm.com> wrote:
> This patch changes vhost TX used buffer signal to guest from one by
> one to up to 3/4 of vring size. This change improves vhost TX message
> size from 256 to 8K performance for both bandwidth and CPU utilization
> without inducing any regression.

Any concerns about introducing latency or does the guest not care when
TX completions come in?

> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
>
>  drivers/vhost/net.c   |   19 ++++++++++++++++++-
>  drivers/vhost/vhost.c |   31 +++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |    3 +++
>  3 files changed, 52 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 4b4da5b..bd1ba71 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -198,7 +198,24 @@ static void handle_tx(struct vhost_net *net)
>                if (err != len)
>                        pr_debug("Truncated TX packet: "
>                                 " len %d != %zd\n", err, len);
> -               vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +               /*
> +                * if no pending buffer size allocate, signal used buffer
> +                * one by one, otherwise, signal used buffer when reaching
> +                * 3/4 ring size to reduce CPU utilization.
> +                */
> +               if (unlikely(vq->pend))
> +                       vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +               else {
> +                       vq->pend[vq->num_pend].id = head;

I don't understand the logic here: if !vq->pend then we assign to
vq->pend[vq->num_pend].

> +                       vq->pend[vq->num_pend].len = 0;
> +                       ++vq->num_pend;
> +                       if (vq->num_pend == (vq->num - (vq->num >> 2))) {
> +                               vhost_add_used_and_signal_n(&net->dev, vq,
> +                                                           vq->pend,
> +                                                           vq->num_pend);
> +                               vq->num_pend = 0;
> +                       }
> +               }
>                total_len += len;
>                if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>                        vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 94701ff..47696d2 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -170,6 +170,16 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>        vq->call_ctx = NULL;
>        vq->call = NULL;
>        vq->log_ctx = NULL;
> +       /* signal pending used buffers */
> +       if (vq->pend) {
> +               if (vq->num_pend != 0) {
> +                       vhost_add_used_and_signal_n(dev, vq, vq->pend,
> +                                                   vq->num_pend);
> +                       vq->num_pend = 0;
> +               }
> +               kfree(vq->pend);
> +       }
> +       vq->pend = NULL;
>  }
>
>  static int vhost_worker(void *data)
> @@ -273,7 +283,13 @@ long vhost_dev_init(struct vhost_dev *dev,
>                dev->vqs[i].heads = NULL;
>                dev->vqs[i].dev = dev;
>                mutex_init(&dev->vqs[i].mutex);
> +               dev->vqs[i].num_pend = 0;
> +               dev->vqs[i].pend = NULL;
>                vhost_vq_reset(dev, dev->vqs + i);
> +               /* signal 3/4 of ring size used buffers */
> +               dev->vqs[i].pend = kmalloc((dev->vqs[i].num -
> +                                          (dev->vqs[i].num >> 2)) *
> +                                          sizeof *vq->peed, GFP_KERNEL);

Has this patch been compile tested?  vq->peed?

Stefan

^ permalink raw reply

* Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance
From: Stefan Hajnoczi @ 2010-10-28  8:59 UTC (permalink / raw)
  To: Shirley Ma; +Cc: mst@redhat.com, David Miller, netdev, kvm, linux-kernel
In-Reply-To: <AANLkTinqG_G2pevoYJyiHiBda1ZZS-53Uqz9WEspbMOy@mail.gmail.com>

On Thu, Oct 28, 2010 at 9:57 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
Just read the patch 1/1 discussion and it looks like you're already on
it.  Sorry for the noise.

Stefan

^ permalink raw reply

* [PATCH] 8390: Don't oops on starting dev queue
From: Pavel Emelyanov @ 2010-10-28  9:01 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: David Miller, Linux Netdev List

The __NS8390_init tries to start the device queue before the
device is registered. This results in an oops (snipped):

[    2.865493] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[    2.866106] IP: [<ffffffffa000602a>] netif_start_queue+0xb/0x12 [8390]
[    2.881267] Call Trace:
[    2.881437]  [<ffffffffa000624d>] __NS8390_init+0x102/0x15a [8390]
[    2.881999]  [<ffffffffa00062ae>] NS8390_init+0x9/0xb [8390]
[    2.882237]  [<ffffffffa000d820>] ne2k_pci_init_one+0x297/0x354 [ne2k_pci]
[    2.882955]  [<ffffffff811c7a0e>] local_pci_probe+0x12/0x16
[    2.883308]  [<ffffffff811c85ad>] pci_device_probe+0xc3/0xef
[    2.884049]  [<ffffffff8129218d>] driver_probe_device+0xbe/0x14b
[    2.884937]  [<ffffffff81292260>] __driver_attach+0x46/0x62
[    2.885170]  [<ffffffff81291788>] bus_for_each_dev+0x49/0x78
[    2.885781]  [<ffffffff81291fbb>] driver_attach+0x1c/0x1e
[    2.886089]  [<ffffffff812912ab>] bus_add_driver+0xba/0x227
[    2.886330]  [<ffffffff8129259a>] driver_register+0x9e/0x115
[    2.886933]  [<ffffffff811c8815>] __pci_register_driver+0x50/0xac
[    2.887785]  [<ffffffffa001102c>] ne2k_pci_init+0x2c/0x2e [ne2k_pci]
[    2.888093]  [<ffffffff81000212>] do_one_initcall+0x7c/0x130
[    2.888693]  [<ffffffff8106d74f>] sys_init_module+0x99/0x1da
[    2.888946]  [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b

This happens because the netif_start_queue sets respective bit on the dev->_tx
array which is not yet allocated.

As far as I understand the code removing the netif_start_queue from __NS8390_init
is OK, since queue will be started later on device open. Plz, correct me if I'm wrong.

Found in the Dave's current tree, so he's in Cc.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

diff --git a/drivers/net/lib8390.c b/drivers/net/lib8390.c
index 316bb70..e7030ce 100644
--- a/drivers/net/lib8390.c
+++ b/drivers/net/lib8390.c
@@ -1077,7 +1077,6 @@ static void __NS8390_init(struct net_device *dev, int startp)
 	ei_outb_p(ei_local->rx_start_page, e8390_base + EN1_CURPAG);
 	ei_outb_p(E8390_NODMA+E8390_PAGE0+E8390_STOP, e8390_base+E8390_CMD);
 
-	netif_start_queue(dev);
 	ei_local->tx1 = ei_local->tx2 = 0;
 	ei_local->txing = 0;
 


^ permalink raw reply related

* Re: [PATCH] conntrack: allow nf_ct_alloc_hashtable() to get highmem pages
From: Patrick McHardy @ 2010-10-28 10:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Netfilter Development Mailinglist, stable
In-Reply-To: <1288170555.2709.63.camel@edumazet-laptop>

On 27.10.2010 11:09, Eric Dumazet wrote:
> commit ea781f197d6a8 (use SLAB_DESTROY_BY_RCU and get rid of call_rcu())
> did a mistake in __vmalloc() call in nf_ct_alloc_hashtable().
> 
> I forgot to add __GFP_HIGHMEM, so pages were taken from LOWMEM only.

Applied, thanks. I'll also pass this to -stable.

^ permalink raw reply

* Does ip does support labels for IPv6 addresses?
From: Heiko Gerstung @ 2010-10-28 10:41 UTC (permalink / raw)
  To: netdev

Hi !

Not sure if this is where I should ask, please feel free to send me to
someone else.

I just tried to use "ip" (from the iproute2 project) to assign multiple
IPv6 addresses to a physical network interface. In order to be able to
distinguish between them, I wanted to use the "label" feature as
described on the ip man page, but it seems that this is not supported
for IPv6 addresses.

Does anybody know if this is a (documentation-)bug, a feature or a
misconfiguration of ~/brain ?

Example:

# ip addr add dev eth0 2001::23/64 label eth0:0

at least adds the address correctly. But when I want to check with
# ip addr show
the label is not listed anywhere and using
# ip addr show label "eth0:0" does not work either (does not come up
with any output at all).

The description of the "label" option on the man page does not mention
that this is not supported on IPv6.

Best Regards,
 Heiko


-- 

Heiko Gerstung

*MEINBERG Funkuhren* GmbH & Co. KG
Lange Wand 9
D-31812 Bad Pyrmont, Germany
Phone: +49 (0)5281 9309-25
Fax: +49 (0)5281 9309-30
Amtsgericht Hannover 17HRA 100322
Geschäftsführer/Managing Directors: Günter Meinberg, Werner Meinberg,
Andre Hartmann, Heiko Gerstung
Email: heiko.gerstung@meinberg.de <mailto:heiko.gerstung@meinberg.de>
Web: www.meinberg.de <http://www.meinberg.de>

------------------------------------------------------------------------
*MEINBERG - Accurate Time Worldwide*


^ permalink raw reply

* [net-2.6 PATCH] ixgb: call pci_disable_device in ixgb_remove
From: Jeff Kirsher @ 2010-10-28 10:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, Ben Hutchings, Emil Tantilov,
	Jeff Kirsher

From: Emil Tantilov <emil.s.tantilov@intel.com>

ixgb fails to work after reload on recent kernels:

rmmod ixgb (dev->current_state = PCI_UNKNOWN)
modprobe ixgb (pci_enable_device will bail leaving current_state to PCI_UNKNOWN)
ifup eth0
do_IRQ: 2.82 No irq handler for vector (irq -1)

The issue was exposed by commit fcd097f31a6ee207cc0c3da9cccd2a86d4334785
PCI: MSI: Remove unsafe and unnecessary hardware access

which avoids HW writes for power states != PCI_D0

CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgb/ixgb_main.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index 666207a..caa8192 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -533,6 +533,7 @@ ixgb_remove(struct pci_dev *pdev)
 	pci_release_regions(pdev);
 
 	free_netdev(netdev);
+	pci_disable_device(pdev);
 }
 
 /**


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox