Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
From: Jason Wang @ 2014-10-13  6:22 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api
In-Reply-To: <20141012092744.GA9567@redhat.com>

On 10/12/2014 05:27 PM, Michael S. Tsirkin wrote:
> On Sat, Oct 11, 2014 at 03:16:44PM +0800, Jason Wang wrote:
>> Below should be useful for some experiments Jason is doing.
>> I thought I'd send it out for early review/feedback.
>>
>> event idx feature allows us to defer interrupts until
>> a specific # of descriptors were used.
>> Sometimes it might be useful to get an interrupt after
>> a specific descriptor, regardless.
>> This adds a descriptor flag for this, and an API
>> to create an urgent output descriptor.
>> This is still an RFC:
>> we'll need a feature bit for drivers to detect this,
>> but we've run out of feature bits for virtio 0.X.
>> For experimentation purposes, drivers can assume
>> this is set, or add a driver-specific feature bit.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> I see that as compared to my original patch, you have
> added a new flag: VRING_AVAIL_F_NO_URGENT_INTERRUPT
> I don't think it's necessary, see below.
>
> As such, I think this patch should be split:
> - original patch adding support for urgent descriptors
> - a patch adding virtqueue_enable/disable_cb_urgent(_prepare)?

Not sure this is a good idea, since the api of first patch is in-completed.
>
>> ---
>>  drivers/virtio/virtio_ring.c     | 75 +++++++++++++++++++++++++++++++++++++---
>>  include/linux/virtio.h           | 14 ++++++++
>>  include/uapi/linux/virtio_ring.h |  5 ++-
>>  3 files changed, 89 insertions(+), 5 deletions(-)
>>
[...]
>>  
>> +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
>> +{
>> +	struct vring_virtqueue *vq = to_vvq(_vq);
>> +	u16 last_used_idx;
>> +
>> +	START_USE(vq);
>> +	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
>> +	last_used_idx = vq->last_used_idx;
>> +	END_USE(vq);
>> +	return last_used_idx;
>> +}
>> +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
>> +
> You can implement virtqueue_enable_cb_prepare_urgent
> simply by clearing ~VRING_AVAIL_F_NO_INTERRUPT.
>
> The effect is same: host sends interrupts only if there
> is an urgent descriptor.

Seems not, consider the case when event index was disabled. This will
turn on all interrupts.

^ permalink raw reply

* Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
From: Jason Wang @ 2014-10-13  6:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: kvm, mst, netdev, linux-kernel, virtualization, linux-api
In-Reply-To: <1413038899.9362.43.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/11/2014 10:48 PM, Eric Dumazet wrote:
> On Sat, 2014-10-11 at 15:16 +0800, Jason Wang wrote:
>> We free transmitted packets in ndo_start_xmit() in the past to get better
>> performance in the past. One side effect is that skb_orphan() needs to be
>> called in ndo_start_xmit() which makes sk_wmem_alloc not accurate in
>> fact. For TCP protocol, this means several optimization could not work well
>> such as TCP small queue and auto corking. This can lead extra low
>> throughput of small packets stream.
>>
>> Thanks to the urgent descriptor support. This patch tries to solve this
>> issue by enable the tx interrupt selectively for stream packets. This means
>> we don't need to orphan TCP stream packets in ndo_start_xmit() but enable
>> tx interrupt for those packets. After we get tx interrupt, a tx napi was
>> scheduled to free those packets.
>>
>> With this method, sk_wmem_alloc of TCP socket were more accurate than in
>> the past which let TCP can batch more through TSQ and auto corking.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>  drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++-----------
>>  1 file changed, 128 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 5810841..b450fc4 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -72,6 +72,8 @@ struct send_queue {
>>  
>>  	/* Name of the send queue: output.$index */
>>  	char name[40];
>> +
>> +	struct napi_struct napi;
>>  };
>>  
>>  /* Internal representation of a receive virtqueue */
>> @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>>  	return p;
>>  }
>>  
>> +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
>> +{
>> +	struct sk_buff *skb;
>> +	unsigned int len;
>> +	struct virtnet_info *vi = sq->vq->vdev->priv;
>> +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>> +	int sent = 0;
>> +
>> +	while (sent < budget &&
>> +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
>> +		pr_debug("Sent skb %p\n", skb);
>> +
>> +		u64_stats_update_begin(&stats->tx_syncp);
>> +		stats->tx_bytes += skb->len;
>> +		stats->tx_packets++;
>> +		u64_stats_update_end(&stats->tx_syncp);
>> +
>> +		dev_kfree_skb_any(skb);
>> +		sent++;
>> +	}
>> +
> You could accumulate skb->len in a totlen var, and perform a single
>
> 	u64_stats_update_begin(&stats->tx_syncp);
> 	stats->tx_bytes += totlen;
> 	stats->tx_packets += sent;
> 	u64_stats_update_end(&stats->tx_syncp);
>
> after the loop.
>

Yes, will do this in a separated patch.
>> +	return sent;
>> +}
>> +
> ...
>
>> +
>> +static bool virtnet_skb_needs_intr(struct sk_buff *skb)
>> +{
>> +	union {
>> +		unsigned char *network;
>> +		struct iphdr *ipv4;
>> +		struct ipv6hdr *ipv6;
>> +	} hdr;
>> +	struct tcphdr *th = tcp_hdr(skb);
>> +	u16 payload_len;
>> +
>> +	hdr.network = skb_network_header(skb);
>> +
>> +	/* Only IPv4/IPv6 with TCP is supported */
> 	Oh well, yet another packet flow dissector :)
>
> 	If most packets were caught by your implementation, you could use it
> for fast patj and fallback to skb_flow_dissect() for encapsulated
> traffic.
>
> 	struct flow_keys keys;   
>
> 	if (!skb_flow_dissect(skb, &keys)) 
> 		return false;
>
> 	if (keys.ip_proto != IPPROTO_TCP)
> 		return false;
>
> 	then check __skb_get_poff() how to get th, and check if there is some
> payload...
>
>

Yes, but we don't know if most of packets were TCP or encapsulated TCP,
it depends on userspace application. If not, looks like
skb_flow_dissect() can bring some overhead, or it could be ignored?

skb_flow_dissect

^ permalink raw reply

* Re: [RFC PATCH 3/3] i2c: show and change bus frequency via sysfs
From: Guenter Roeck @ 2014-10-12 17:06 UTC (permalink / raw)
  To: Octavian Purdila
  Cc: Mark Roszko, linux-i2c, linux-api, lkml, Johan Hovold,
	Wolfram Sang
In-Reply-To: <CAE1zotL4Pb5etHzakCxeDLvZR112XGGaHomZ_dbh=AQVc0wvPg@mail.gmail.com>

On Sun, Oct 12, 2014 at 12:32:56PM +0300, Octavian Purdila wrote:
> On Sat, Oct 11, 2014 at 11:14 PM, Mark Roszko <mark.roszko@gmail.com> wrote:
> 
> > This seems limiting to arches with peripherals that can support a range of
> > frequencies rather than fixed numbers.
> > Also it creates some portability quirkiness between platforms when all the
> > i2c bus drivers have different supported freq lists and you have to match
> > exactly the right frequency. I.e. one guy does 60khz but another only has
> > 80khz.
> 
> Sorry, I don't understand your points here. If this limitations exists
> they are not introduced by this patch. This patch just exposes the
> frequency so that it can be read or changed in userspace.
> 
> > Another issue is in systems where you have i2c devices on the same bus as
> > the sysfs user space driver. User space could set a bus frequency that
> > prevents operation with a system i2c device.
> 
> Changing the frequency is limited to root. Also, bus drivers do not
> have to implement set_freq if it is thought not to be safe.
> 
> On a different not, I have noticed that a fixed set of frequencies
> might not be the best API, since multiple drivers rather support a
> rather large set of frequencies in a range. A better API might be to
> expose a min-max range and let the bus driver adjust the requested
> frequency. I will follow up with a second version that does that.

As two separate sysfs attributes, maybe ? sysfs is supposed to provide
one value per attribute.

For the patch itself, I would find it better if you used is_visible to
determine if the new attributes should be visible (and/or writable) instead
of returning -EOPNOTSUPP.

Guenter

^ permalink raw reply

* Re: [RFC PATCH 3/3] i2c: show and change bus frequency via sysfs
From: Octavian Purdila @ 2014-10-12  9:32 UTC (permalink / raw)
  To: Mark Roszko
  Cc: linux-i2c, linux-api-u79uwXL29TY76Z2rM5mHXA, lkml, Johan Hovold,
	Wolfram Sang
In-Reply-To: <CAJjB1qKQ0ND2CeF=Npahn0r19WjdTwc7D3Th7PLnn0FJxHa_KQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Sat, Oct 11, 2014 at 11:14 PM, Mark Roszko <mark.roszko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> This seems limiting to arches with peripherals that can support a range of
> frequencies rather than fixed numbers.
> Also it creates some portability quirkiness between platforms when all the
> i2c bus drivers have different supported freq lists and you have to match
> exactly the right frequency. I.e. one guy does 60khz but another only has
> 80khz.

Sorry, I don't understand your points here. If this limitations exists
they are not introduced by this patch. This patch just exposes the
frequency so that it can be read or changed in userspace.

> Another issue is in systems where you have i2c devices on the same bus as
> the sysfs user space driver. User space could set a bus frequency that
> prevents operation with a system i2c device.

Changing the frequency is limited to root. Also, bus drivers do not
have to implement set_freq if it is thought not to be safe.

On a different not, I have noticed that a fixed set of frequencies
might not be the best API, since multiple drivers rather support a
rather large set of frequencies in a range. A better API might be to
expose a min-max range and let the bus driver adjust the requested
frequency. I will follow up with a second version that does that.

^ permalink raw reply

* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
From: Michael S. Tsirkin @ 2014-10-12  9:27 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, netdev, linux-kernel, virtualization, linux-api
In-Reply-To: <1413011806-3813-2-git-send-email-jasowang@redhat.com>

On Sat, Oct 11, 2014 at 03:16:44PM +0800, Jason Wang wrote:
> Below should be useful for some experiments Jason is doing.
> I thought I'd send it out for early review/feedback.
> 
> event idx feature allows us to defer interrupts until
> a specific # of descriptors were used.
> Sometimes it might be useful to get an interrupt after
> a specific descriptor, regardless.
> This adds a descriptor flag for this, and an API
> to create an urgent output descriptor.
> This is still an RFC:
> we'll need a feature bit for drivers to detect this,
> but we've run out of feature bits for virtio 0.X.
> For experimentation purposes, drivers can assume
> this is set, or add a driver-specific feature bit.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

I see that as compared to my original patch, you have
added a new flag: VRING_AVAIL_F_NO_URGENT_INTERRUPT
I don't think it's necessary, see below.

As such, I think this patch should be split:
- original patch adding support for urgent descriptors
- a patch adding virtqueue_enable/disable_cb_urgent(_prepare)?




> ---
>  drivers/virtio/virtio_ring.c     | 75 +++++++++++++++++++++++++++++++++++++---
>  include/linux/virtio.h           | 14 ++++++++
>  include/uapi/linux/virtio_ring.h |  5 ++-
>  3 files changed, 89 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 4d08f45a..a5188c6 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -115,6 +115,7 @@ static inline struct scatterlist *sg_next_arr(struct scatterlist *sg,
>  
>  /* Set up an indirect table of descriptors and add it to the queue. */
>  static inline int vring_add_indirect(struct vring_virtqueue *vq,
> +				     bool urgent,
>  				     struct scatterlist *sgs[],
>  				     struct scatterlist *(*next)
>  				       (struct scatterlist *, unsigned int *),
> @@ -173,6 +174,8 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
>  	/* Use a single buffer which doesn't continue */
>  	head = vq->free_head;
>  	vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
> +	if (urgent)
> +		vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
>  	vq->vring.desc[head].addr = virt_to_phys(desc);
>  	/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
>  	kmemleak_ignore(desc);
> @@ -185,6 +188,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
>  }
>  
>  static inline int virtqueue_add(struct virtqueue *_vq,
> +			        bool urgent,
>  				struct scatterlist *sgs[],
>  				struct scatterlist *(*next)
>  				  (struct scatterlist *, unsigned int *),
> @@ -227,7 +231,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	/* If the host supports indirect descriptor tables, and we have multiple
>  	 * buffers, then go indirect. FIXME: tune this threshold */
>  	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
> -		head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
> +		head = vring_add_indirect(vq, urgent, sgs, next, total_sg, total_out,
>  					  total_in,
>  					  out_sgs, in_sgs, gfp);
>  		if (likely(head >= 0))
> @@ -256,6 +260,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	for (n = 0; n < out_sgs; n++) {
>  		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
>  			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
> +			if (urgent) {
> +				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
> +				urgent = false;
> +			}
>  			vq->vring.desc[i].addr = sg_phys(sg);
>  			vq->vring.desc[i].len = sg->length;
>  			prev = i;
> @@ -265,6 +273,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	for (; n < (out_sgs + in_sgs); n++) {
>  		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
>  			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
> +			if (urgent) {
> +				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
> +				urgent = false;
> +			}
>  			vq->vring.desc[i].addr = sg_phys(sg);
>  			vq->vring.desc[i].len = sg->length;
>  			prev = i;
> @@ -305,6 +317,8 @@ add_head:
>  
>  /**
>   * virtqueue_add_sgs - expose buffers to other end
> + * @urgent: in case virtqueue_enable_cb_delayed was called, cause an interrupt
> + *          after this descriptor was completed
>   * @vq: the struct virtqueue we're talking about.
>   * @sgs: array of terminated scatterlists.
>   * @out_num: the number of scatterlists readable by other side
> @@ -337,7 +351,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
>  		for (sg = sgs[i]; sg; sg = sg_next(sg))
>  			total_in++;
>  	}
> -	return virtqueue_add(_vq, sgs, sg_next_chained,
> +	return virtqueue_add(_vq, false, sgs, sg_next_chained,
>  			     total_out, total_in, out_sgs, in_sgs, data, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
> @@ -360,11 +374,35 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
>  			 void *data,
>  			 gfp_t gfp)
>  {
> -	return virtqueue_add(vq, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
> +	return virtqueue_add(vq, false, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
>  
>  /**
> + * virtqueue_add_outbuf - expose output buffers to other end
> + *          in case virtqueue_enable_cb_delayed was called, cause an interrupt
> + *          after this descriptor was completed
> + * @vq: the struct virtqueue we're talking about.
> + * @sgs: array of scatterlists (need not be terminated!)
> + * @num: the number of scatterlists readable by other side
> + * @data: the token identifying the buffer.
> + * @gfp: how to do memory allocations (if necessary).
> + *
> + * Caller must ensure we don't call this with other virtqueue operations
> + * at the same time (except where noted).
> + *
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> + */
> +int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
> +			 struct scatterlist sg[], unsigned int num,
> +			 void *data,
> +			 gfp_t gfp)
> +{
> +	return virtqueue_add(vq, true, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_urgent);
> +
> +/**
>   * virtqueue_add_inbuf - expose input buffers to other end
>   * @vq: the struct virtqueue we're talking about.
>   * @sgs: array of scatterlists (need not be terminated!)
> @@ -382,7 +420,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq,
>  			void *data,
>  			gfp_t gfp)
>  {
> -	return virtqueue_add(vq, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
> +	return virtqueue_add(vq, false, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
>  
> @@ -595,6 +633,14 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
>  
> +void virtqueue_disable_cb_urgent(struct virtqueue *_vq)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +
> +	vq->vring.avail->flags |= VRING_AVAIL_F_NO_URGENT_INTERRUPT;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_disable_cb_urgent);
> +
>  /**
>   * virtqueue_enable_cb_prepare - restart callbacks after disable_cb
>   * @vq: the struct virtqueue we're talking about.
> @@ -626,6 +672,19 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
>  
> +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	u16 last_used_idx;
> +
> +	START_USE(vq);
> +	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
> +	last_used_idx = vq->last_used_idx;
> +	END_USE(vq);
> +	return last_used_idx;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
> +

You can implement virtqueue_enable_cb_prepare_urgent
simply by clearing ~VRING_AVAIL_F_NO_INTERRUPT.

The effect is same: host sends interrupts only if there
is an urgent descriptor.

>  /**
>   * virtqueue_poll - query pending used buffers
>   * @vq: the struct virtqueue we're talking about.
> @@ -662,6 +721,14 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
>  
> +bool virtqueue_enable_cb_urgent(struct virtqueue *_vq)
> +{
> +	unsigned last_used_idx = virtqueue_enable_cb_prepare_urgent(_vq);
> +
> +	return !virtqueue_poll(_vq, last_used_idx);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_enable_cb_urgent);
> +
>  /**
>   * virtqueue_enable_cb_delayed - restart callbacks after disable_cb.
>   * @vq: the struct virtqueue we're talking about.
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index b46671e..68be5f2 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -39,6 +39,12 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
>  			 void *data,
>  			 gfp_t gfp);
>  
> +int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
> +				struct scatterlist sg[], unsigned int num,
> +				void *data,
> +				gfp_t gfp);
> +
> +
>  int virtqueue_add_inbuf(struct virtqueue *vq,
>  			struct scatterlist sg[], unsigned int num,
>  			void *data,
> @@ -61,10 +67,18 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
>  
>  void virtqueue_disable_cb(struct virtqueue *vq);
>  
> +void virtqueue_disable_cb_urgent(struct virtqueue *vq);
> +
>  bool virtqueue_enable_cb(struct virtqueue *vq);
>  
> +bool virtqueue_enable_cb_urgent(struct virtqueue *vq);
> +
> +bool virtqueue_enable_cb_urgent(struct virtqueue *vq);
> +
>  unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq);
>  
> +unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *vq);
> +
>  bool virtqueue_poll(struct virtqueue *vq, unsigned);
>  
>  bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
> diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> index a99f9b7..daf5bb0 100644
> --- a/include/uapi/linux/virtio_ring.h
> +++ b/include/uapi/linux/virtio_ring.h
> @@ -39,6 +39,9 @@
>  #define VRING_DESC_F_WRITE	2
>  /* This means the buffer contains a list of buffer descriptors. */
>  #define VRING_DESC_F_INDIRECT	4
> +/* This means the descriptor should cause an interrupt
> + * ignoring avail event idx. */
> +#define VRING_DESC_F_URGENT	8
>  
>  /* The Host uses this in used->flags to advise the Guest: don't kick me when
>   * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
> @@ -48,7 +51,7 @@
>   * when you consume a buffer.  It's unreliable, so it's simply an
>   * optimization.  */
>  #define VRING_AVAIL_F_NO_INTERRUPT	1
> -
> +#define VRING_AVAIL_F_NO_URGENT_INTERRUPT	2
>  /* We support indirect buffer descriptors */
>  #define VIRTIO_RING_F_INDIRECT_DESC	28
>  
> -- 
> 1.8.3.1

^ permalink raw reply

* [PATCH] kvm: drop unsupported capabilities, fix documentation
From: Michael S. Tsirkin @ 2014-10-12  8:34 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Gleb Natapov, Paolo Bonzini, Jiri Kosina,
	kvm-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

No kernel ever reported KVM_CAP_DEVICE_MSIX, KVM_CAP_DEVICE_MSI,
KVM_CAP_DEVICE_ASSIGNMENT, KVM_CAP_DEVICE_DEASSIGNMENT.

This makes the documentation wrong, and no application ever
written to use these capabilities has a chance to work correctly.
The only way to detect support is to try, and test errno for ENOTTY.
That's unfortunate, but we can't fix the past.

Document the actual semantics, and drop the definitions from
the exported header to make it easier for application
developers to note and fix the bug.

Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/uapi/linux/kvm.h          |  8 --------
 Documentation/virtual/kvm/api.txt | 40 ++++++++++++++++++++++++++++++++-------
 2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index cf3a2ff..98d72f7 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -647,11 +647,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_MP_STATE 14
 #define KVM_CAP_COALESCED_MMIO 15
 #define KVM_CAP_SYNC_MMU 16  /* Changes to host mmap are reflected in guest */
-#define KVM_CAP_DEVICE_ASSIGNMENT 17
 #define KVM_CAP_IOMMU 18
-#ifdef __KVM_HAVE_MSI
-#define KVM_CAP_DEVICE_MSI 20
-#endif
 /* Bug in KVM_SET_USER_MEMORY_REGION fixed: */
 #define KVM_CAP_DESTROY_MEMORY_REGION_WORKS 21
 #ifdef __KVM_HAVE_USER_NMI
@@ -665,10 +661,6 @@ struct kvm_ppc_smmu_info {
 #endif
 #define KVM_CAP_IRQ_ROUTING 25
 #define KVM_CAP_IRQ_INJECT_STATUS 26
-#define KVM_CAP_DEVICE_DEASSIGNMENT 27
-#ifdef __KVM_HAVE_MSIX
-#define KVM_CAP_DEVICE_MSIX 28
-#endif
 #define KVM_CAP_ASSIGN_DEV_IRQ 29
 /* Another bug in KVM_SET_USER_MEMORY_REGION fixed: */
 #define KVM_CAP_JOIN_MEMORY_REGIONS_WORKS 30
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index beae3fd..0fe195c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -68,9 +68,12 @@ description:
 
   Capability: which KVM extension provides this ioctl.  Can be 'basic',
       which means that is will be provided by any kernel that supports
-      API version 12 (see section 4.1), or a KVM_CAP_xyz constant, which
+      API version 12 (see section 4.1), a KVM_CAP_xyz constant, which
       means availability needs to be checked with KVM_CHECK_EXTENSION
-      (see section 4.4).
+      (see section 4.4), or 'none' which means that while not all kernels
+      support this ioctl, there's no capability bit to check its
+      availability: for kernels that don't support the ioctl,
+      the ioctl returns -ENOTTY.
 
   Architectures: which instruction set architectures provide this ioctl.
       x86 includes both i386 and x86_64.
@@ -1257,7 +1260,7 @@ The flags bitmap is defined as:
 
 4.48 KVM_ASSIGN_PCI_DEVICE
 
-Capability: KVM_CAP_DEVICE_ASSIGNMENT
+Capability: none
 Architectures: x86 ia64
 Type: vm ioctl
 Parameters: struct kvm_assigned_pci_dev (in)
@@ -1298,10 +1301,16 @@ Only PCI header type 0 devices with PCI BAR resources are supported by
 device assignment.  The user requesting this ioctl must have read/write
 access to the PCI sysfs resource files associated with the device.
 
+Errors:
+  ENOTTY: kernel does not support this ioctl
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
 
 4.49 KVM_DEASSIGN_PCI_DEVICE
 
-Capability: KVM_CAP_DEVICE_DEASSIGNMENT
+Capability: none
 Architectures: x86 ia64
 Type: vm ioctl
 Parameters: struct kvm_assigned_pci_dev (in)
@@ -1309,9 +1318,14 @@ Returns: 0 on success, -1 on error
 
 Ends PCI device assignment, releasing all associated resources.
 
-See KVM_CAP_DEVICE_ASSIGNMENT for the data structure. Only assigned_dev_id is
+See KVM_ASSIGN_PCI_DEVICE for the data structure. Only assigned_dev_id is
 used in kvm_assigned_pci_dev to identify the device.
 
+Errors:
+  ENOTTY: kernel does not support this ioctl
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
 
 4.50 KVM_ASSIGN_DEV_IRQ
 
@@ -1346,6 +1360,12 @@ The following flags are defined:
 It is not valid to specify multiple types per host or guest IRQ. However, the
 IRQ type of host and guest can differ or can even be null.
 
+Errors:
+  ENOTTY: kernel does not support this ioctl
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
 
 4.51 KVM_DEASSIGN_DEV_IRQ
 
@@ -1423,7 +1443,7 @@ struct kvm_irq_routing_s390_adapter {
 
 4.53 KVM_ASSIGN_SET_MSIX_NR
 
-Capability: KVM_CAP_DEVICE_MSIX
+Capability: none
 Architectures: x86 ia64
 Type: vm ioctl
 Parameters: struct kvm_assigned_msix_nr (in)
@@ -1445,7 +1465,7 @@ struct kvm_assigned_msix_nr {
 
 4.54 KVM_ASSIGN_SET_MSIX_ENTRY
 
-Capability: KVM_CAP_DEVICE_MSIX
+Capability: none
 Architectures: x86 ia64
 Type: vm ioctl
 Parameters: struct kvm_assigned_msix_entry (in)
@@ -1461,6 +1481,12 @@ struct kvm_assigned_msix_entry {
 	__u16 padding[3];
 };
 
+Errors:
+  ENOTTY: kernel does not support this ioctl
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
 
 4.55 KVM_SET_TSC_KHZ
 
-- 
MST

^ permalink raw reply related

* Re: [PATCH 3/8] iio: Add support for DA9150 GPADC
From: Hartmut Knaack @ 2014-10-11 22:47 UTC (permalink / raw)
  To: Adam Thomson, Lee Jones, Samuel Ortiz, Jonathan Cameron,
	linux-iio-u79uwXL29TY76Z2rM5mHXA, Sebastian Reichel,
	Dmitry Eremin-Solenikov, David Woodhouse,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, Grant Likely,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Joe Perches,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	support.opensource-WBD+wuPFNBhBDgjK7y7TUQ
In-Reply-To: <ee4c56931f326d8ee9c1313c44456bce2dec8b8c.1411396719.git.Adam.Thomson.Opensource-WBD+wuPFNBhBDgjK7y7TUQ@public.gmane.org>

Hi,
I have put a few comments inline.

Adam Thomson schrieb am 23.09.2014 12:53:
> This patch adds support for DA9150 Charger & Fuel-Gauge IC GPADC.
> 
> Signed-off-by: Adam Thomson <Adam.Thomson.Opensource-WBD+wuPFNBhBDgjK7y7TUQ@public.gmane.org>
> ---
>  drivers/iio/adc/Kconfig        |   9 +
>  drivers/iio/adc/Makefile       |   1 +
>  drivers/iio/adc/da9150-gpadc.c | 406 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 416 insertions(+)
>  create mode 100644 drivers/iio/adc/da9150-gpadc.c
> 
> diff --git a/drivers/iio/adc/Kconfig b/drivers/iio/adc/Kconfig
> index 11b048a..8041347 100644
> --- a/drivers/iio/adc/Kconfig
> +++ b/drivers/iio/adc/Kconfig
> @@ -127,6 +127,15 @@ config AT91_ADC
>  	help
>  	  Say yes here to build support for Atmel AT91 ADC.
> 
> +config DA9150_GPADC
> +	tristate "Dialog DA9150 GPADC driver support"
> +	depends on MFD_DA9150
> +	help
> +	  Say yes here to build support for Dialog DA9150 GPADC.
> +
> +	  This driver can also be built as a module. If chosen, the module name
> +	  will be da9150-gpadc.
> +
>  config EXYNOS_ADC
>  	tristate "Exynos ADC driver support"
>  	depends on ARCH_EXYNOS || (OF && COMPILE_TEST)
> diff --git a/drivers/iio/adc/Makefile b/drivers/iio/adc/Makefile
> index ad81b51..48413d2 100644
> --- a/drivers/iio/adc/Makefile
> +++ b/drivers/iio/adc/Makefile
> @@ -14,6 +14,7 @@ obj-$(CONFIG_AD7793) += ad7793.o
>  obj-$(CONFIG_AD7887) += ad7887.o
>  obj-$(CONFIG_AD799X) += ad799x.o
>  obj-$(CONFIG_AT91_ADC) += at91_adc.o
> +obj-$(CONFIG_DA9150_GPADC) += da9150-gpadc.o
>  obj-$(CONFIG_EXYNOS_ADC) += exynos_adc.o
>  obj-$(CONFIG_LP8788_ADC) += lp8788_adc.o
>  obj-$(CONFIG_MAX1027) += max1027.o
> diff --git a/drivers/iio/adc/da9150-gpadc.c b/drivers/iio/adc/da9150-gpadc.c
> new file mode 100644
> index 0000000..2b83ee0
> --- /dev/null
> +++ b/drivers/iio/adc/da9150-gpadc.c
> @@ -0,0 +1,406 @@
> +/*
> + * DA9150 GPADC Driver
> + *
> + * Copyright (c) 2014 Dialog Semiconductor
> + *
> + * Author: Adam Thomson <Adam.Thomson.Opensource-WBD+wuPFNBhBDgjK7y7TUQ@public.gmane.org>
> + *
> + * This program is free software; you can redistribute  it and/or modify it
> + * under  the terms of  the GNU General  Public License as published by the
> + * Free Software Foundation;  either version 2 of the  License, or (at your
> + * option) any later version.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/platform_device.h>
> +#include <linux/interrupt.h>
> +#include <linux/mutex.h>
> +#include <linux/completion.h>
> +#include <linux/iio/iio.h>
> +#include <linux/iio/machine.h>
> +#include <linux/iio/driver.h>
> +#include <linux/mfd/da9150/core.h>
> +#include <linux/mfd/da9150/registers.h>
> +
> +/* Channels */
> +enum da9150_gpadc_hw_channel {
> +	DA9150_GPADC_HW_CHAN_GPIOA_2V = 0,
> +	DA9150_GPADC_HW_CHAN_GPIOA_2V_,
> +	DA9150_GPADC_HW_CHAN_GPIOB_2V,
> +	DA9150_GPADC_HW_CHAN_GPIOB_2V_,
> +	DA9150_GPADC_HW_CHAN_GPIOC_2V,
> +	DA9150_GPADC_HW_CHAN_GPIOC_2V_,
> +	DA9150_GPADC_HW_CHAN_GPIOD_2V,
> +	DA9150_GPADC_HW_CHAN_GPIOD_2V_,
> +	DA9150_GPADC_HW_CHAN_IBUS_SENSE,
> +	DA9150_GPADC_HW_CHAN_IBUS_SENSE_,
> +	DA9150_GPADC_HW_CHAN_VBUS_DIV,
> +	DA9150_GPADC_HW_CHAN_VBUS_DIV_,
> +	DA9150_GPADC_HW_CHAN_ID,
> +	DA9150_GPADC_HW_CHAN_ID_,
> +	DA9150_GPADC_HW_CHAN_VSYS,
> +	DA9150_GPADC_HW_CHAN_VSYS_,
> +	DA9150_GPADC_HW_CHAN_GPIOA_6V,
> +	DA9150_GPADC_HW_CHAN_GPIOA_6V_,
> +	DA9150_GPADC_HW_CHAN_GPIOB_6V,
> +	DA9150_GPADC_HW_CHAN_GPIOB_6V_,
> +	DA9150_GPADC_HW_CHAN_GPIOC_6V,
> +	DA9150_GPADC_HW_CHAN_GPIOC_6V_,
> +	DA9150_GPADC_HW_CHAN_GPIOD_6V,
> +	DA9150_GPADC_HW_CHAN_GPIOD_6V_,
> +	DA9150_GPADC_HW_CHAN_VBAT,
> +	DA9150_GPADC_HW_CHAN_VBAT_,
> +	DA9150_GPADC_HW_CHAN_TBAT,
> +	DA9150_GPADC_HW_CHAN_TBAT_,
> +	DA9150_GPADC_HW_CHAN_TJUNC_CORE,
> +	DA9150_GPADC_HW_CHAN_TJUNC_CORE_,
> +	DA9150_GPADC_HW_CHAN_TJUNC_OVP,
> +	DA9150_GPADC_HW_CHAN_TJUNC_OVP_,
> +};
> +
> +enum da9150_gpadc_channel {
> +	DA9150_GPADC_CHAN_GPIOA = 0,
> +	DA9150_GPADC_CHAN_GPIOB,
> +	DA9150_GPADC_CHAN_GPIOC,
> +	DA9150_GPADC_CHAN_GPIOD,
> +	DA9150_GPADC_CHAN_IBUS,
> +	DA9150_GPADC_CHAN_VBUS,
> +	DA9150_GPADC_CHAN_ID,
> +	DA9150_GPADC_CHAN_VSYS,
> +	DA9150_GPADC_CHAN_VBAT,
> +	DA9150_GPADC_CHAN_TBAT,
> +	DA9150_GPADC_CHAN_TJUNC_CORE,
> +	DA9150_GPADC_CHAN_TJUNC_OVP,
> +};
> +
> +/* Private data */
> +struct da9150_gpadc {
> +	struct da9150 *da9150;
> +	struct device *dev;
> +
> +	struct mutex lock;
> +	struct completion complete;
> +};
> +
> +
> +static irqreturn_t da9150_gpadc_irq(int irq, void *data)
> +{
> +
> +	struct da9150_gpadc *gpadc = data;
> +
> +	complete(&gpadc->complete);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +static int da9150_gpadc_read_adc(struct da9150_gpadc *gpadc, int hw_chan)
> +{
> +	u8 result_regs[2];
> +	int result;
> +
> +	mutex_lock(&gpadc->lock);
> +
> +	/* Set channel & enable measurement */
> +	da9150_reg_write(gpadc->da9150, DA9150_GPADC_MAN,
> +			 (DA9150_GPADC_EN_MASK |
> +			  hw_chan << DA9150_GPADC_MUX_SHIFT));
> +
> +	/* Consume left-over completion from a previous timeout */
> +	try_wait_for_completion(&gpadc->complete);
> +
> +	/* Check for actual completion */
> +	wait_for_completion_timeout(&gpadc->complete, msecs_to_jiffies(5));
> +
> +	/* Read result and status from device */
> +	da9150_bulk_read(gpadc->da9150, DA9150_GPADC_RES_A, 2, result_regs);
> +
> +	mutex_unlock(&gpadc->lock);
> +
> +	/* Check to make sure device really has completed reading */
> +	if (result_regs[1] & DA9150_GPADC_RUN_MASK) {
> +		dev_err(gpadc->dev, "Timeout on channel %d of GPADC\n",
> +			hw_chan);
> +		return -ETIMEDOUT;
> +	}
> +
> +	/* LSBs - 2 bits */
> +	result = (result_regs[1] & DA9150_GPADC_RES_L_MASK) >>
> +		 DA9150_GPADC_RES_L_SHIFT;
> +	/* MSBs - 8 bits */
> +	result |= result_regs[0] << DA9150_GPADC_RES_L_BITS;
> +
> +	return result;
> +}
> +
> +static inline int da9150_gpadc_gpio_6v_voltage_now(int raw_val)
> +{
> +	/* Convert to mV */
> +	return (6 * ((raw_val * 1000) + 500)) / 1024;
> +}
> +
> +static inline int da9150_gpadc_ibus_current_avg(int raw_val)
> +{
> +	/* Convert to mA */
> +	return (4 * ((raw_val * 1000) + 500)) / 2048;
> +}
> +
> +static inline int da9150_gpadc_vbus_21v_voltage_now(int raw_val)
> +{
> +	/* Convert to mV */
> +	return (21 * ((raw_val * 1000) + 500)) / 1024;
> +}
> +
> +static inline int da9150_gpadc_vsys_6v_voltage_now(int raw_val)
> +{
> +	/* Convert to mV */
> +	return (3 * ((raw_val * 1000) + 500)) / 512;
> +}
> +
> +static int da9150_gpadc_read_processed(struct da9150_gpadc *gpadc, int channel,
> +				       int hw_chan, int *val)
> +{
> +	int raw_val;
> +
> +	raw_val = da9150_gpadc_read_adc(gpadc, hw_chan);
> +	if (raw_val < 0)
> +		return raw_val;
> +
> +	switch (channel) {
> +	case DA9150_GPADC_CHAN_GPIOA:
> +	case DA9150_GPADC_CHAN_GPIOB:
> +	case DA9150_GPADC_CHAN_GPIOC:
> +	case DA9150_GPADC_CHAN_GPIOD:
> +		*val = da9150_gpadc_gpio_6v_voltage_now(raw_val);
> +		break;
> +	case DA9150_GPADC_CHAN_IBUS:
> +		*val = da9150_gpadc_ibus_current_avg(raw_val);
> +		break;
> +	case DA9150_GPADC_CHAN_VBUS:
> +		*val = da9150_gpadc_vbus_21v_voltage_now(raw_val);
> +		break;
> +	case DA9150_GPADC_CHAN_VSYS:
> +		*val = da9150_gpadc_vsys_6v_voltage_now(raw_val);
> +		break;
> +	default:
> +		/* No processing for other channels so return raw value */
> +		*val = raw_val;
> +		break;
> +	}
> +
> +	return IIO_VAL_INT;
> +}
> +
> +static int da9150_gpadc_read_scale(int channel, int *val, int *val2)
> +{
> +	switch (channel) {
> +	case DA9150_GPADC_CHAN_VBAT:
> +		*val = 2932;
> +		*val2 = 1000;
> +		return IIO_VAL_FRACTIONAL;
> +	case DA9150_GPADC_CHAN_TJUNC_CORE:
> +	case DA9150_GPADC_CHAN_TJUNC_OVP:
> +		*val = 1000000;
> +		*val2 = 4420;
> +		return IIO_VAL_FRACTIONAL;
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +static int da9150_gpadc_read_offset(int channel, int *val)
> +{
> +	switch (channel) {
> +	case DA9150_GPADC_CHAN_VBAT:
> +		*val = 1500000 / 2932;
> +		return IIO_VAL_INT;
> +	case DA9150_GPADC_CHAN_TJUNC_CORE:
> +	case DA9150_GPADC_CHAN_TJUNC_OVP:
> +		*val = -144;
> +		return IIO_VAL_INT;
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +static int da9150_gpadc_read_raw(struct iio_dev *indio_dev,
> +				 struct iio_chan_spec const *chan,
> +				 int *val, int *val2, long mask)
> +{
> +	struct da9150_gpadc *gpadc = iio_priv(indio_dev);
> +
> +	if ((chan->channel < DA9150_GPADC_CHAN_GPIOA) ||
> +	    (chan->channel > DA9150_GPADC_CHAN_TJUNC_OVP))
> +		return -EINVAL;
> +
> +	switch (mask) {
> +	case IIO_CHAN_INFO_RAW:
> +	case IIO_CHAN_INFO_PROCESSED:
> +		return da9150_gpadc_read_processed(gpadc, chan->channel,
> +						   chan->address, val);
> +	case IIO_CHAN_INFO_SCALE:
> +		return da9150_gpadc_read_scale(chan->channel, val, val2);
> +	case IIO_CHAN_INFO_OFFSET:
> +		return da9150_gpadc_read_offset(chan->channel, val);
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +static const struct iio_info da9150_gpadc_info = {
> +	.read_raw = &da9150_gpadc_read_raw,
> +	.driver_module = THIS_MODULE,
> +};
> +
> +#define DA9150_GPADC_CHANNEL(_id, _hw_id, _type, chan_info,	\
> +			     _ext_name) {			\
> +	.type = _type,						\
> +	.indexed = 1,						\
> +	.channel = DA9150_GPADC_CHAN_##_id,			\
> +	.address = DA9150_GPADC_HW_CHAN_##_hw_id,		\
> +	.info_mask_separate = chan_info,			\
> +	.extend_name = _ext_name,				\
> +	.datasheet_name = #_id,					\
> +}
> +
> +#define DA9150_GPADC_CHANNEL_RAW(_id, _hw_id, _type, _ext_name)	\
> +	DA9150_GPADC_CHANNEL(_id, _hw_id, _type,		\
> +			     BIT(IIO_CHAN_INFO_RAW), _ext_name)
> +
> +#define DA9150_GPADC_CHANNEL_SCALED(_id, _hw_id, _type, _ext_name)	\
> +	DA9150_GPADC_CHANNEL(_id, _hw_id, _type,			\
> +			     BIT(IIO_CHAN_INFO_RAW) |			\
> +			     BIT(IIO_CHAN_INFO_SCALE) |			\
> +			     BIT(IIO_CHAN_INFO_OFFSET),			\
> +			     _ext_name)
> +
> +#define DA9150_GPADC_CHANNEL_PROCESSED(_id, _hw_id, _type, _ext_name)	\
> +	DA9150_GPADC_CHANNEL(_id, _hw_id, _type,			\
> +			     BIT(IIO_CHAN_INFO_PROCESSED), _ext_name)
> +
> +/* Supported channels */
> +static const struct iio_chan_spec da9150_gpadc_channels[] = {
> +	DA9150_GPADC_CHANNEL_PROCESSED(GPIOA, GPIOA_6V, IIO_VOLTAGE, "GPIOA"),
> +	DA9150_GPADC_CHANNEL_PROCESSED(GPIOB, GPIOB_6V, IIO_VOLTAGE, "GPIOB"),
> +	DA9150_GPADC_CHANNEL_PROCESSED(GPIOC, GPIOC_6V, IIO_VOLTAGE, "GPIOC"),
> +	DA9150_GPADC_CHANNEL_PROCESSED(GPIOD, GPIOD_6V, IIO_VOLTAGE, "GPIOD"),
> +	DA9150_GPADC_CHANNEL_PROCESSED(IBUS, IBUS_SENSE, IIO_CURRENT, "IBUS"),
> +	DA9150_GPADC_CHANNEL_PROCESSED(VBUS, VBUS_DIV_, IIO_VOLTAGE, "VBUS"),
> +	DA9150_GPADC_CHANNEL_RAW(ID, ID, IIO_VOLTAGE, "ID"),
> +	DA9150_GPADC_CHANNEL_PROCESSED(VSYS, VSYS, IIO_VOLTAGE, "VSYS"),
> +	DA9150_GPADC_CHANNEL_SCALED(VBAT, VBAT, IIO_VOLTAGE, "VBAT"),
> +	DA9150_GPADC_CHANNEL_RAW(TBAT, TBAT, IIO_VOLTAGE, "TBAT"),
> +	DA9150_GPADC_CHANNEL_SCALED(TJUNC_CORE, TJUNC_CORE, IIO_TEMP,
> +				    "TJUNC_CORE"),
> +	DA9150_GPADC_CHANNEL_SCALED(TJUNC_OVP, TJUNC_OVP, IIO_TEMP,
> +				    "TJUNC_OVP"),
> +};
> +
> +/* Default maps used by da9150-charger */
> +static struct iio_map da9150_gpadc_default_maps[] = {
> +	{
> +		.consumer_dev_name = "da9150-charger",
> +		.consumer_channel = "CHAN_IBUS",
> +		.adc_channel_label = "IBUS",
> +	},
> +	{
> +		.consumer_dev_name = "da9150-charger",
> +		.consumer_channel = "CHAN_VBUS",
> +		.adc_channel_label = "VBUS",
> +	},
> +	{
> +		.consumer_dev_name = "da9150-charger",
> +		.consumer_channel = "CHAN_TJUNC",
> +		.adc_channel_label = "TJUNC_CORE",
> +	},
> +	{
> +		.consumer_dev_name = "da9150-charger",
> +		.consumer_channel = "CHAN_VBAT",
> +		.adc_channel_label = "VBAT",
> +	},
> +	{},
> +};
> +
> +static int da9150_gpadc_probe(struct platform_device *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	struct da9150 *da9150 = dev_get_drvdata(dev->parent);
Maybe you could find a slightly different name for the instance of this da9150 than its type name.
> +	struct da9150_gpadc *gpadc;
> +	struct iio_dev *indio_dev;
> +	int irq, ret;
> +
> +	indio_dev = devm_iio_device_alloc(&pdev->dev,
> +					  sizeof(struct da9150_gpadc));
> +	if (!indio_dev) {
> +		dev_err(&pdev->dev, "Failed to allocate IIO device\n");
> +		return -ENOMEM;
> +	}
> +	gpadc = iio_priv(indio_dev);
> +
> +	platform_set_drvdata(pdev, indio_dev);
> +	gpadc->da9150 = da9150;
> +	gpadc->dev = dev;
> +
> +	ret = iio_map_array_register(indio_dev, da9150_gpadc_default_maps);
> +	if (ret) {
> +		dev_err(dev, "Failed to register IIO maps: %d\n", ret);
> +		return ret;
> +	}
> +
> +	mutex_init(&gpadc->lock);
> +	init_completion(&gpadc->complete);
> +
> +	irq = platform_get_irq_byname(pdev, "GPADC");
Shouldn't you check irq for possible error codes? devm_request_threaded_irq() expects irq to be unsigned.
> +	ret = devm_request_threaded_irq(dev, irq, NULL, da9150_gpadc_irq,
> +					IRQF_ONESHOT, "GPADC", gpadc);
> +	if (ret) {
> +		dev_err(dev, "Failed to request IRQ %d: %d\n", irq, ret);
> +		goto iio_map_unreg;
> +	}
> +
> +	indio_dev->name = dev_name(dev);
> +	indio_dev->dev.parent = dev;
> +	indio_dev->dev.of_node = pdev->dev.of_node;
> +	indio_dev->info = &da9150_gpadc_info;
> +	indio_dev->modes = INDIO_DIRECT_MODE;
> +	indio_dev->channels = da9150_gpadc_channels;
> +	indio_dev->num_channels = ARRAY_SIZE(da9150_gpadc_channels);
> +
> +	ret = iio_device_register(indio_dev);
> +	if (ret) {
> +		dev_err(dev, "Failed to register IIO device: %d\n", ret);
> +		goto iio_map_unreg;
> +	}
> +
> +	return 0;
> +
> +iio_map_unreg:
> +	iio_map_array_unregister(indio_dev);
> +
> +	return ret;
> +}
> +
> +static int da9150_gpadc_remove(struct platform_device *pdev)
> +{
> +	struct iio_dev *indio_dev = platform_get_drvdata(pdev);
> +
> +	iio_map_array_unregister(indio_dev);
> +	iio_device_unregister(indio_dev);
Not sure if that was the intention of Jonathans comment, but these unregister calls should be switched to be in proper reverse order.
> +
> +	return 0;
> +}
> +
> +static struct platform_driver da9150_gpadc_driver = {
> +	.driver = {
> +		.name = "da9150-gpadc",
> +	},
> +	.probe = da9150_gpadc_probe,
> +	.remove = da9150_gpadc_remove,
> +};
> +
> +module_platform_driver(da9150_gpadc_driver);
> +
> +MODULE_DESCRIPTION("GPADC Driver for DA9150");
> +MODULE_AUTHOR("Adam Thomson <Adam.Thomson.Opensource-WBD+wuPFNBhBDgjK7y7TUQ@public.gmane.org");
You are missing a > at the end of your email-address.
> +MODULE_LICENSE("GPL");
> --
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-iio" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
From: Eric Dumazet @ 2014-10-11 14:48 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, mst, netdev, linux-kernel, virtualization, linux-api
In-Reply-To: <1413011806-3813-4-git-send-email-jasowang@redhat.com>

On Sat, 2014-10-11 at 15:16 +0800, Jason Wang wrote:
> We free transmitted packets in ndo_start_xmit() in the past to get better
> performance in the past. One side effect is that skb_orphan() needs to be
> called in ndo_start_xmit() which makes sk_wmem_alloc not accurate in
> fact. For TCP protocol, this means several optimization could not work well
> such as TCP small queue and auto corking. This can lead extra low
> throughput of small packets stream.
> 
> Thanks to the urgent descriptor support. This patch tries to solve this
> issue by enable the tx interrupt selectively for stream packets. This means
> we don't need to orphan TCP stream packets in ndo_start_xmit() but enable
> tx interrupt for those packets. After we get tx interrupt, a tx napi was
> scheduled to free those packets.
> 
> With this method, sk_wmem_alloc of TCP socket were more accurate than in
> the past which let TCP can batch more through TSQ and auto corking.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 128 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5810841..b450fc4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -72,6 +72,8 @@ struct send_queue {
>  
>  	/* Name of the send queue: output.$index */
>  	char name[40];
> +
> +	struct napi_struct napi;
>  };
>  
>  /* Internal representation of a receive virtqueue */
> @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  	return p;
>  }
>  
> +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
> +{
> +	struct sk_buff *skb;
> +	unsigned int len;
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> +	int sent = 0;
> +
> +	while (sent < budget &&
> +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		pr_debug("Sent skb %p\n", skb);
> +
> +		u64_stats_update_begin(&stats->tx_syncp);
> +		stats->tx_bytes += skb->len;
> +		stats->tx_packets++;
> +		u64_stats_update_end(&stats->tx_syncp);
> +
> +		dev_kfree_skb_any(skb);
> +		sent++;
> +	}
> +

You could accumulate skb->len in a totlen var, and perform a single

	u64_stats_update_begin(&stats->tx_syncp);
	stats->tx_bytes += totlen;
	stats->tx_packets += sent;
	u64_stats_update_end(&stats->tx_syncp);

after the loop.


> +	return sent;
> +}
> +

...

> +
> +static bool virtnet_skb_needs_intr(struct sk_buff *skb)
> +{
> +	union {
> +		unsigned char *network;
> +		struct iphdr *ipv4;
> +		struct ipv6hdr *ipv6;
> +	} hdr;
> +	struct tcphdr *th = tcp_hdr(skb);
> +	u16 payload_len;
> +
> +	hdr.network = skb_network_header(skb);
> +
> +	/* Only IPv4/IPv6 with TCP is supported */

	Oh well, yet another packet flow dissector :)

	If most packets were caught by your implementation, you could use it
for fast patj and fallback to skb_flow_dissect() for encapsulated
traffic.

	struct flow_keys keys;   

	if (!skb_flow_dissect(skb, &keys)) 
		return false;

	if (keys.ip_proto != IPPROTO_TCP)
		return false;

	then check __skb_get_poff() how to get th, and check if there is some
payload...


> +	if ((skb->protocol == htons(ETH_P_IP)) &&
> +	    hdr.ipv4->protocol == IPPROTO_TCP) {
> +		payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 -
> +			      th->doff * 4;
> +	} else if ((skb->protocol == htons(ETH_P_IPV6) ||
> +		   hdr.ipv6->nexthdr == IPPROTO_TCP)) {
> +		payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4;
> +	} else {
> +		return false;
> +	}
> +
> +	/* We don't want to dealy packet with PUSH bit and pure ACK packet */
> +	if (!th->psh && payload_len)
> +		return true;
> +
> +	return false;
>  }

^ permalink raw reply

* [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
From: Jason Wang @ 2014-10-11  7:16 UTC (permalink / raw)
  To: rusty, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm
In-Reply-To: <1413011806-3813-1-git-send-email-jasowang@redhat.com>

We free transmitted packets in ndo_start_xmit() in the past to get better
performance in the past. One side effect is that skb_orphan() needs to be
called in ndo_start_xmit() which makes sk_wmem_alloc not accurate in
fact. For TCP protocol, this means several optimization could not work well
such as TCP small queue and auto corking. This can lead extra low
throughput of small packets stream.

Thanks to the urgent descriptor support. This patch tries to solve this
issue by enable the tx interrupt selectively for stream packets. This means
we don't need to orphan TCP stream packets in ndo_start_xmit() but enable
tx interrupt for those packets. After we get tx interrupt, a tx napi was
scheduled to free those packets.

With this method, sk_wmem_alloc of TCP socket were more accurate than in
the past which let TCP can batch more through TSQ and auto corking.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 128 insertions(+), 36 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5810841..b450fc4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -72,6 +72,8 @@ struct send_queue {
 
 	/* Name of the send queue: output.$index */
 	char name[40];
+
+	struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 	return p;
 }
 
+static int free_old_xmit_skbs(struct send_queue *sq, int budget)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+	int sent = 0;
+
+	while (sent < budget &&
+	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+
+		u64_stats_update_begin(&stats->tx_syncp);
+		stats->tx_bytes += skb->len;
+		stats->tx_packets++;
+		u64_stats_update_end(&stats->tx_syncp);
+
+		dev_kfree_skb_any(skb);
+		sent++;
+	}
+
+	return sent;
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
 	struct virtnet_info *vi = vq->vdev->priv;
+	struct send_queue *sq = &vi->sq[vq2txq(vq)];
 
-	/* Suppress further interrupts. */
-	virtqueue_disable_cb(vq);
-
-	/* We were probably waiting for more output buffers. */
-	netif_wake_subqueue(vi->dev, vq2txq(vq));
+	if (napi_schedule_prep(&sq->napi)) {
+		virtqueue_disable_cb(vq);
+		virtqueue_disable_cb_urgent(vq);
+		__napi_schedule(&sq->napi);
+	}
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -772,7 +799,38 @@ again:
 	return received;
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct send_queue *sq =
+		container_of(napi, struct send_queue, napi);
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+	unsigned int r, sent = 0;
+
+again:
+	__netif_tx_lock(txq, smp_processor_id());
+	sent += free_old_xmit_skbs(sq, budget - sent);
+
+	if (sent < budget) {
+		r = virtqueue_enable_cb_prepare_urgent(sq->vq);
+		napi_complete(napi);
+		__netif_tx_unlock(txq);
+		if (unlikely(virtqueue_poll(sq->vq, r)) &&
+		    napi_schedule_prep(napi)) {
+			virtqueue_disable_cb_urgent(sq->vq);
+			__napi_schedule(napi);
+			goto again;
+		}
+	} else {
+		__netif_tx_unlock(txq);
+	}
+
+	netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+	return sent;
+}
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
+
 /* must be called with local_bh_disable()d */
 static int virtnet_busy_poll(struct napi_struct *napi)
 {
@@ -820,31 +878,13 @@ static int virtnet_open(struct net_device *dev)
 			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 		virtnet_napi_enable(&vi->rq[i]);
+		napi_enable(&vi->sq[i].napi);
 	}
 
 	return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
-{
-	struct sk_buff *skb;
-	unsigned int len;
-	struct virtnet_info *vi = sq->vq->vdev->priv;
-	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-
-	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		pr_debug("Sent skb %p\n", skb);
-
-		u64_stats_update_begin(&stats->tx_syncp);
-		stats->tx_bytes += skb->len;
-		stats->tx_packets++;
-		u64_stats_update_end(&stats->tx_syncp);
-
-		dev_kfree_skb_any(skb);
-	}
-}
-
-static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
+static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool urgent)
 {
 	struct skb_vnet_hdr *hdr;
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -908,7 +948,43 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 		sg_set_buf(sq->sg, hdr, hdr_len);
 		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
 	}
-	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
+	if (urgent)
+		return virtqueue_add_outbuf_urgent(sq->vq, sq->sg, num_sg,
+						   skb, GFP_ATOMIC);
+	else
+		return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
+					    GFP_ATOMIC);
+}
+
+static bool virtnet_skb_needs_intr(struct sk_buff *skb)
+{
+	union {
+		unsigned char *network;
+		struct iphdr *ipv4;
+		struct ipv6hdr *ipv6;
+	} hdr;
+	struct tcphdr *th = tcp_hdr(skb);
+	u16 payload_len;
+
+	hdr.network = skb_network_header(skb);
+
+	/* Only IPv4/IPv6 with TCP is supported */
+	if ((skb->protocol == htons(ETH_P_IP)) &&
+	    hdr.ipv4->protocol == IPPROTO_TCP) {
+		payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 -
+			      th->doff * 4;
+	} else if ((skb->protocol == htons(ETH_P_IPV6) ||
+		   hdr.ipv6->nexthdr == IPPROTO_TCP)) {
+		payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4;
+	} else {
+		return false;
+	}
+
+	/* We don't want to dealy packet with PUSH bit and pure ACK packet */
+	if (!th->psh && payload_len)
+		return true;
+
+	return false;
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -916,13 +992,15 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int qnum = skb_get_queue_mapping(skb);
 	struct send_queue *sq = &vi->sq[qnum];
-	int err;
+	bool urgent = virtnet_skb_needs_intr(skb);
+	int err, qsize = virtqueue_get_vring_size(sq->vq);
 
+	virtqueue_disable_cb_urgent(sq->vq);
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(sq);
+	free_old_xmit_skbs(sq, qsize);
 
 	/* Try to transmit */
-	err = xmit_skb(sq, skb);
+	err = xmit_skb(sq, skb, urgent);
 
 	/* This should not happen! */
 	if (unlikely(err)) {
@@ -935,22 +1013,26 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		return NETDEV_TX_OK;
 	}
 
-	/* Don't wait up for transmitted skbs to be freed. */
-	skb_orphan(skb);
-	nf_reset(skb);
+	if (!urgent) {
+		skb_orphan(skb);
+		nf_reset(skb);
+	}
 
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
 		netif_stop_subqueue(dev, qnum);
+		virtqueue_disable_cb_urgent(sq->vq);
 		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
-			free_old_xmit_skbs(sq);
+			free_old_xmit_skbs(sq, qsize);
 			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 				netif_start_subqueue(dev, qnum);
 				virtqueue_disable_cb(sq->vq);
 			}
 		}
+	} else if (virtqueue_enable_cb_urgent(sq->vq)) {
+		free_old_xmit_skbs(sq, qsize);
 	}
 
 	if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more)
@@ -1132,8 +1214,10 @@ static int virtnet_close(struct net_device *dev)
 	/* Make sure refill_work doesn't re-enable napi! */
 	cancel_delayed_work_sync(&vi->refill);
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		napi_disable(&vi->rq[i].napi);
+		napi_disable(&vi->sq[i].napi);
+	}
 
 	return 0;
 }
@@ -1452,8 +1536,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
 {
 	int i;
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		netif_napi_del(&vi->rq[i].napi);
+		netif_napi_del(&vi->sq[i].napi);
+	}
 
 	kfree(vi->rq);
 	kfree(vi->sq);
@@ -1607,6 +1693,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
 			       napi_weight);
 		napi_hash_add(&vi->rq[i].napi);
+		netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
+			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
 		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
@@ -1912,8 +2000,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
 	if (netif_running(vi->dev)) {
 		for (i = 0; i < vi->max_queue_pairs; i++) {
 			napi_disable(&vi->rq[i].napi);
+			napi_disable(&vi->sq[i].napi);
 			napi_hash_del(&vi->rq[i].napi);
 			netif_napi_del(&vi->rq[i].napi);
+			netif_napi_del(&vi->sq[i].napi);
 		}
 	}
 
@@ -1938,8 +2028,10 @@ static int virtnet_restore(struct virtio_device *vdev)
 			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 
-		for (i = 0; i < vi->max_queue_pairs; i++)
+		for (i = 0; i < vi->max_queue_pairs; i++) {
 			virtnet_napi_enable(&vi->rq[i]);
+			napi_enable(&vi->sq[i].napi);
+		}
 	}
 
 	netif_device_attach(vi->dev);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next RFC 2/3] vhost: support urgent descriptors
From: Jason Wang @ 2014-10-11  7:16 UTC (permalink / raw)
  To: rusty, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm
In-Reply-To: <1413011806-3813-1-git-send-email-jasowang@redhat.com>

This patches let vhost-net support urgent descriptors. For zerocopy case,
two new types of length was introduced to make it work.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c   | 43 +++++++++++++++++++++++++++++++------------
 drivers/vhost/scsi.c  | 23 +++++++++++++++--------
 drivers/vhost/test.c  |  5 +++--
 drivers/vhost/vhost.c | 44 +++++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h | 19 +++++++++++++------
 5 files changed, 91 insertions(+), 43 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f7..37b0bb5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -48,9 +48,13 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
  * status internally; used for zerocopy tx only.
  */
 /* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN	3
+#define VHOST_DMA_FAILED_LEN	5
+/* Lower device DMA doen, urgent bit set */
+#define VHOST_DMA_DONE_LEN_URGENT	4
 /* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN	2
+#define VHOST_DMA_DONE_LEN	3
+/* Lower device DMA in progress, urgent bit set */
+#define VHOST_DMA_URGENT	2
 /* Lower device DMA in progress */
 #define VHOST_DMA_IN_PROGRESS	1
 /* Buffer unused */
@@ -284,11 +288,13 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 		container_of(vq, struct vhost_net_virtqueue, vq);
 	int i, add;
 	int j = 0;
+	bool urgent = false;
 
 	for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
 		if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
 			vhost_net_tx_err(net);
 		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
+			urgent = urgent || vq->heads[i].len == VHOST_DMA_DONE_LEN_URGENT;
 			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
 			++j;
 		} else
@@ -296,7 +302,7 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 	}
 	while (j) {
 		add = min(UIO_MAXIOV - nvq->done_idx, j);
-		vhost_add_used_and_signal_n(vq->dev, vq,
+		vhost_add_used_and_signal_n(vq->dev, vq, urgent,
 					    &vq->heads[nvq->done_idx], add);
 		nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
 		j -= add;
@@ -311,9 +317,14 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 
 	rcu_read_lock_bh();
 
-	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = success ?
-		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
+	if (success) {
+		if (vq->heads[ubuf->desc].len == VHOST_DMA_IN_PROGRESS)
+			vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
+		else
+			vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN_URGENT;
+	} else {
+		vq->heads[ubuf->desc].len = VHOST_DMA_FAILED_LEN;
+	}
 	cnt = vhost_net_ubuf_put(ubufs);
 
 	/*
@@ -363,6 +374,7 @@ static void handle_tx(struct vhost_net *net)
 	zcopy = nvq->ubufs;
 
 	for (;;) {
+		bool urgent;
 		/* Release DMAs done buffers first */
 		if (zcopy)
 			vhost_zerocopy_signal_used(net, vq);
@@ -374,7 +386,7 @@ static void handle_tx(struct vhost_net *net)
 			      % UIO_MAXIOV == nvq->done_idx))
 			break;
 
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
 					 NULL, NULL);
@@ -417,7 +429,8 @@ static void handle_tx(struct vhost_net *net)
 			ubuf = nvq->ubuf_info + nvq->upend_idx;
 
 			vq->heads[nvq->upend_idx].id = head;
-			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+			vq->heads[nvq->upend_idx].len = urgent ?
+				VHOST_DMA_URGENT : VHOST_DMA_IN_PROGRESS;
 			ubuf->callback = vhost_zerocopy_callback;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
@@ -445,7 +458,7 @@ static void handle_tx(struct vhost_net *net)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
 		if (!zcopy_used)
-			vhost_add_used_and_signal(&net->dev, vq, head, 0);
+			vhost_add_used_and_signal(&net->dev, vq, urgent, head, 0);
 		else
 			vhost_zerocopy_signal_used(net, vq);
 		total_len += len;
@@ -488,6 +501,7 @@ static int peek_head_len(struct sock *sk)
  *	returns number of buffer heads allocated, negative on error
  */
 static int get_rx_bufs(struct vhost_virtqueue *vq,
+		       bool *urgentp,
 		       struct vring_used_elem *heads,
 		       int datalen,
 		       unsigned *iovcount,
@@ -502,11 +516,13 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 	int r, nlogs = 0;
 
 	while (datalen > 0 && headcount < quota) {
+		bool urgent = false;
+
 		if (unlikely(seg >= UIO_MAXIOV)) {
 			r = -ENOBUFS;
 			goto err;
 		}
-		r = vhost_get_vq_desc(vq, vq->iov + seg,
+		r = vhost_get_vq_desc(vq, &urgent, vq->iov + seg,
 				      ARRAY_SIZE(vq->iov) - seg, &out,
 				      &in, log, log_num);
 		if (unlikely(r < 0))
@@ -527,6 +543,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 			nlogs += *log_num;
 			log += *log_num;
 		}
+		*urgentp = *urgentp || urgent;
 		heads[headcount].id = d;
 		heads[headcount].len = iov_length(vq->iov + seg, in);
 		datalen -= heads[headcount].len;
@@ -590,9 +607,11 @@ static void handle_rx(struct vhost_net *net)
 	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
 	while ((sock_len = peek_head_len(sock->sk))) {
+		bool urgent = false;
+
 		sock_len += sock_hlen;
 		vhost_len = sock_len + vhost_hlen;
-		headcount = get_rx_bufs(vq, vq->heads, vhost_len,
+		headcount = get_rx_bufs(vq, &urgent, vq->heads, vhost_len,
 					&in, vq_log, &log,
 					likely(mergeable) ? UIO_MAXIOV : 1);
 		/* On error, stop handling until the next kick. */
@@ -654,7 +673,7 @@ static void handle_rx(struct vhost_net *net)
 			vhost_discard_vq_desc(vq, headcount);
 			break;
 		}
-		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
+		vhost_add_used_and_signal_n(&net->dev, vq, urgent, vq->heads,
 					    headcount);
 		if (unlikely(vq_log))
 			vhost_log_write(vq, vq_log, log, vhost_len);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 69906ca..0a7e5bc 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -72,6 +72,8 @@ struct tcm_vhost_cmd {
 	int tvc_vq_desc;
 	/* virtio-scsi initiator task attribute */
 	int tvc_task_attr;
+	/* Descriptor urgent? */
+	bool tvc_vq_desc_urgent;
 	/* virtio-scsi initiator data direction */
 	enum dma_data_direction tvc_data_direction;
 	/* Expected data transfer length from virtio-scsi header */
@@ -606,6 +608,7 @@ tcm_vhost_do_evt_work(struct vhost_scsi *vs, struct tcm_vhost_evt *evt)
 	struct virtio_scsi_event __user *eventp;
 	unsigned out, in;
 	int head, ret;
+	bool urgent;
 
 	if (!vq->private_data) {
 		vs->vs_events_missed = true;
@@ -614,7 +617,7 @@ tcm_vhost_do_evt_work(struct vhost_scsi *vs, struct tcm_vhost_evt *evt)
 
 again:
 	vhost_disable_notify(&vs->dev, vq);
-	head = vhost_get_vq_desc(vq, vq->iov,
+	head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 			ARRAY_SIZE(vq->iov), &out, &in,
 			NULL, NULL);
 	if (head < 0) {
@@ -643,7 +646,7 @@ again:
 	eventp = vq->iov[out].iov_base;
 	ret = __copy_to_user(eventp, event, sizeof(*event));
 	if (!ret)
-		vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+		vhost_add_used_and_signal(&vs->dev, vq, urgent, head, 0);
 	else
 		vq_err(vq, "Faulted on tcm_vhost_send_event\n");
 }
@@ -704,7 +707,8 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 		ret = copy_to_user(cmd->tvc_resp, &v_rsp, sizeof(v_rsp));
 		if (likely(ret == 0)) {
 			struct vhost_scsi_virtqueue *q;
-			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
+			vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc_urgent,
+				       cmd->tvc_vq_desc, 0);
 			q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq);
 			vq = q - vs->vqs;
 			__set_bit(vq, signal);
@@ -947,6 +951,7 @@ static void tcm_vhost_submission_work(struct work_struct *work)
 static void
 vhost_scsi_send_bad_target(struct vhost_scsi *vs,
 			   struct vhost_virtqueue *vq,
+			   bool urgent,
 			   int head, unsigned out)
 {
 	struct virtio_scsi_cmd_resp __user *resp;
@@ -958,7 +963,7 @@ vhost_scsi_send_bad_target(struct vhost_scsi *vs,
 	resp = vq->iov[out].iov_base;
 	ret = __copy_to_user(resp, &rsp, sizeof(rsp));
 	if (!ret)
-		vhost_add_used_and_signal(&vs->dev, vq, head, 0);
+		vhost_add_used_and_signal(&vs->dev, vq, urgent, head, 0);
 	else
 		pr_err("Faulted on virtio_scsi_cmd_resp\n");
 }
@@ -980,6 +985,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 	u8 *target, *lunp, task_attr;
 	bool hdr_pi;
 	void *req, *cdb;
+	bool urgent;
 
 	mutex_lock(&vq->mutex);
 	/*
@@ -993,7 +999,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 	vhost_disable_notify(&vs->dev, vq);
 
 	for (;;) {
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 					ARRAY_SIZE(vq->iov), &out, &in,
 					NULL, NULL);
 		pr_debug("vhost_get_vq_desc: head: %d, out: %u in: %u\n",
@@ -1067,7 +1073,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 
 		/* virtio-scsi spec requires byte 0 of the lun to be 1 */
 		if (unlikely(*lunp != 1)) {
-			vhost_scsi_send_bad_target(vs, vq, head, out);
+			vhost_scsi_send_bad_target(vs, vq, urgent, head, out);
 			continue;
 		}
 
@@ -1075,7 +1081,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 
 		/* Target does not exist, fail the request */
 		if (unlikely(!tpg)) {
-			vhost_scsi_send_bad_target(vs, vq, head, out);
+			vhost_scsi_send_bad_target(vs, vq, urgent, head, out);
 			continue;
 		}
 
@@ -1187,6 +1193,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 		 * tcm_vhost_queue_data_in() and tcm_vhost_queue_status()
 		 */
 		cmd->tvc_vq_desc = head;
+		cmd->tvc_vq_desc_urgent = urgent;
 		/*
 		 * Dispatch tv_cmd descriptor for cmwq execution in process
 		 * context provided by tcm_vhost_workqueue.  This also ensures
@@ -1203,7 +1210,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 err_free:
 	vhost_scsi_free_cmd(cmd);
 err_cmd:
-	vhost_scsi_send_bad_target(vs, vq, head, out);
+	vhost_scsi_send_bad_target(vs, vq, urgent, head, out);
 out:
 	mutex_unlock(&vq->mutex);
 }
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index d9c501e..757f3a2 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -42,6 +42,7 @@ static void handle_vq(struct vhost_test *n)
 	int head;
 	size_t len, total_len = 0;
 	void *private;
+	bool urgent;
 
 	mutex_lock(&vq->mutex);
 	private = vq->private_data;
@@ -53,7 +54,7 @@ static void handle_vq(struct vhost_test *n)
 	vhost_disable_notify(&n->dev, vq);
 
 	for (;;) {
-		head = vhost_get_vq_desc(vq, vq->iov,
+		head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
 					 NULL, NULL);
@@ -79,7 +80,7 @@ static void handle_vq(struct vhost_test *n)
 			vq_err(vq, "Unexpected 0 len for TX\n");
 			break;
 		}
-		vhost_add_used_and_signal(&n->dev, vq, head, 0);
+		vhost_add_used_and_signal(&n->dev, vq, urgent,  head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_TEST_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..8a35e14 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -186,6 +186,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->last_used_idx = 0;
 	vq->signalled_used = 0;
 	vq->signalled_used_valid = false;
+	vq->urgent = false;
 	vq->used_flags = 0;
 	vq->log_used = false;
 	vq->log_addr = -1ull;
@@ -1201,7 +1202,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
  * This function returns the descriptor number found, or vq->num (which is
  * never a valid descriptor number) if none was found.  A negative code is
  * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
+int vhost_get_vq_desc(struct vhost_virtqueue *vq, bool *urgent,
 		      struct iovec iov[], unsigned int iov_size,
 		      unsigned int *out_num, unsigned int *in_num,
 		      struct vhost_log *log, unsigned int *log_num)
@@ -1211,6 +1212,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 	u16 last_avail_idx;
 	int ret;
 
+	*urgent = false;
+
 	/* Check it isn't doing very strange things with descriptor numbers. */
 	last_avail_idx = vq->last_avail_idx;
 	if (unlikely(__get_user(vq->avail_idx, &vq->avail->idx))) {
@@ -1274,6 +1277,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->desc + i);
 			return -EFAULT;
 		}
+		if (desc.flags & VRING_DESC_F_URGENT)
+			*urgent = true;
 		if (desc.flags & VRING_DESC_F_INDIRECT) {
 			ret = get_indirect(vq, iov, iov_size,
 					   out_num, in_num,
@@ -1333,11 +1338,11 @@ EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
 
 /* After we've used one of their buffers, we tell them about it.  We'll then
  * want to notify the guest, using eventfd. */
-int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
+int vhost_add_used(struct vhost_virtqueue *vq, bool urgent, unsigned int head, int len)
 {
 	struct vring_used_elem heads = { head, len };
 
-	return vhost_add_used_n(vq, &heads, 1);
+	return vhost_add_used_n(vq, urgent, &heads, 1);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used);
 
@@ -1386,7 +1391,8 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 /* After we've used one of their buffers, we tell them about it.  We'll then
  * want to notify the guest, using eventfd. */
-int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
+int vhost_add_used_n(struct vhost_virtqueue *vq, bool urgent,
+		     struct vring_used_elem *heads,
 		     unsigned count)
 {
 	int start, n, r;
@@ -1416,13 +1422,14 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 		if (vq->log_ctx)
 			eventfd_signal(vq->log_ctx, 1);
 	}
+	vq->urgent = vq->urgent || urgent;
 	return r;
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_n);
 
 static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-	__u16 old, new, event;
+	__u16 old, new, event, flags;
 	bool v;
 	/* Flush out used index updates. This is paired
 	 * with the barrier that the Guest executes when enabling
@@ -1433,14 +1440,17 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	    unlikely(vq->avail_idx == vq->last_avail_idx))
 		return true;
 
-	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
-		__u16 flags;
-		if (__get_user(flags, &vq->avail->flags)) {
-			vq_err(vq, "Failed to get flags");
-			return true;
-		}
-		return !(flags & VRING_AVAIL_F_NO_INTERRUPT);
+	if (__get_user(flags, &vq->avail->flags)) {
+		vq_err(vq, "Failed to get flags");
+		return true;
 	}
+
+	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX))
+		return !(flags & VRING_AVAIL_F_NO_INTERRUPT);
+
+	if (vq->urgent && !(flags & VRING_AVAIL_F_NO_URGENT_INTERRUPT))
+		return true;
+
 	old = vq->signalled_used;
 	v = vq->signalled_used_valid;
 	new = vq->signalled_used = vq->last_used_idx;
@@ -1460,17 +1470,20 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	/* Signal the Guest tell them we used something up. */
-	if (vq->call_ctx && vhost_notify(dev, vq))
+	if (vq->call_ctx && vhost_notify(dev, vq)) {
 		eventfd_signal(vq->call_ctx, 1);
+		vq->urgent = false;
+	}
 }
 EXPORT_SYMBOL_GPL(vhost_signal);
 
 /* And here's the combo meal deal.  Supersize me! */
 void vhost_add_used_and_signal(struct vhost_dev *dev,
 			       struct vhost_virtqueue *vq,
+			       bool urgent,
 			       unsigned int head, int len)
 {
-	vhost_add_used(vq, head, len);
+	vhost_add_used(vq, urgent, head, len);
 	vhost_signal(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
@@ -1478,9 +1491,10 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
 /* multi-buffer version of vhost_add_used_and_signal */
 void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 				 struct vhost_virtqueue *vq,
+				 bool urgent,
 				 struct vring_used_elem *heads, unsigned count)
 {
-	vhost_add_used_n(vq, heads, count);
+	vhost_add_used_n(vq, urgent, heads, count);
 	vhost_signal(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..61ca542 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -96,6 +96,9 @@ struct vhost_virtqueue {
 	/* Last used index value we have signalled on */
 	bool signalled_used_valid;
 
+	/* Urgent descriptor was used */
+	bool urgent;
+
 	/* Log writes to used structure. */
 	bool log_used;
 	u64 log_addr;
@@ -138,20 +141,24 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);
 
-int vhost_get_vq_desc(struct vhost_virtqueue *,
+int vhost_get_vq_desc(struct vhost_virtqueue *, bool *urgent,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
 		      struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_init_used(struct vhost_virtqueue *);
-int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
-		     unsigned count);
-void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
+int vhost_add_used(struct vhost_virtqueue *, bool urgent,
+		   unsigned int head, int len);
+int vhost_add_used_n(struct vhost_virtqueue *, bool urgent,
+		     struct vring_used_elem *heads, unsigned count);
+void vhost_add_used_and_signal(struct vhost_dev *,
+			       struct vhost_virtqueue *,
+			       bool urgent,
 			       unsigned int id, int len);
 void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
-			       struct vring_used_elem *heads, unsigned count);
+				 bool urgent,
+				 struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
From: Jason Wang @ 2014-10-11  7:16 UTC (permalink / raw)
  To: rusty, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm
In-Reply-To: <1413011806-3813-1-git-send-email-jasowang@redhat.com>

Below should be useful for some experiments Jason is doing.
I thought I'd send it out for early review/feedback.

event idx feature allows us to defer interrupts until
a specific # of descriptors were used.
Sometimes it might be useful to get an interrupt after
a specific descriptor, regardless.
This adds a descriptor flag for this, and an API
to create an urgent output descriptor.
This is still an RFC:
we'll need a feature bit for drivers to detect this,
but we've run out of feature bits for virtio 0.X.
For experimentation purposes, drivers can assume
this is set, or add a driver-specific feature bit.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/virtio/virtio_ring.c     | 75 +++++++++++++++++++++++++++++++++++++---
 include/linux/virtio.h           | 14 ++++++++
 include/uapi/linux/virtio_ring.h |  5 ++-
 3 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 4d08f45a..a5188c6 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -115,6 +115,7 @@ static inline struct scatterlist *sg_next_arr(struct scatterlist *sg,
 
 /* Set up an indirect table of descriptors and add it to the queue. */
 static inline int vring_add_indirect(struct vring_virtqueue *vq,
+				     bool urgent,
 				     struct scatterlist *sgs[],
 				     struct scatterlist *(*next)
 				       (struct scatterlist *, unsigned int *),
@@ -173,6 +174,8 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	/* Use a single buffer which doesn't continue */
 	head = vq->free_head;
 	vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
+	if (urgent)
+		vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
 	vq->vring.desc[head].addr = virt_to_phys(desc);
 	/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
 	kmemleak_ignore(desc);
@@ -185,6 +188,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 }
 
 static inline int virtqueue_add(struct virtqueue *_vq,
+			        bool urgent,
 				struct scatterlist *sgs[],
 				struct scatterlist *(*next)
 				  (struct scatterlist *, unsigned int *),
@@ -227,7 +231,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
-		head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
+		head = vring_add_indirect(vq, urgent, sgs, next, total_sg, total_out,
 					  total_in,
 					  out_sgs, in_sgs, gfp);
 		if (likely(head >= 0))
@@ -256,6 +260,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
+			if (urgent) {
+				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
+				urgent = false;
+			}
 			vq->vring.desc[i].addr = sg_phys(sg);
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
@@ -265,6 +273,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
+			if (urgent) {
+				vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
+				urgent = false;
+			}
 			vq->vring.desc[i].addr = sg_phys(sg);
 			vq->vring.desc[i].len = sg->length;
 			prev = i;
@@ -305,6 +317,8 @@ add_head:
 
 /**
  * virtqueue_add_sgs - expose buffers to other end
+ * @urgent: in case virtqueue_enable_cb_delayed was called, cause an interrupt
+ *          after this descriptor was completed
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of terminated scatterlists.
  * @out_num: the number of scatterlists readable by other side
@@ -337,7 +351,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
 		for (sg = sgs[i]; sg; sg = sg_next(sg))
 			total_in++;
 	}
-	return virtqueue_add(_vq, sgs, sg_next_chained,
+	return virtqueue_add(_vq, false, sgs, sg_next_chained,
 			     total_out, total_in, out_sgs, in_sgs, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
@@ -360,11 +374,35 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
 			 void *data,
 			 gfp_t gfp)
 {
-	return virtqueue_add(vq, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
+	return virtqueue_add(vq, false, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
 
 /**
+ * virtqueue_add_outbuf - expose output buffers to other end
+ *          in case virtqueue_enable_cb_delayed was called, cause an interrupt
+ *          after this descriptor was completed
+ * @vq: the struct virtqueue we're talking about.
+ * @sgs: array of scatterlists (need not be terminated!)
+ * @num: the number of scatterlists readable by other side
+ * @data: the token identifying the buffer.
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
+ */
+int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
+			 struct scatterlist sg[], unsigned int num,
+			 void *data,
+			 gfp_t gfp)
+{
+	return virtqueue_add(vq, true, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
+}
+EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_urgent);
+
+/**
  * virtqueue_add_inbuf - expose input buffers to other end
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of scatterlists (need not be terminated!)
@@ -382,7 +420,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq,
 			void *data,
 			gfp_t gfp)
 {
-	return virtqueue_add(vq, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
+	return virtqueue_add(vq, false, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
 
@@ -595,6 +633,14 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
 
+void virtqueue_disable_cb_urgent(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+
+	vq->vring.avail->flags |= VRING_AVAIL_F_NO_URGENT_INTERRUPT;
+}
+EXPORT_SYMBOL_GPL(virtqueue_disable_cb_urgent);
+
 /**
  * virtqueue_enable_cb_prepare - restart callbacks after disable_cb
  * @vq: the struct virtqueue we're talking about.
@@ -626,6 +672,19 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
 
+unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	u16 last_used_idx;
+
+	START_USE(vq);
+	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_URGENT_INTERRUPT;
+	last_used_idx = vq->last_used_idx;
+	END_USE(vq);
+	return last_used_idx;
+}
+EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare_urgent);
+
 /**
  * virtqueue_poll - query pending used buffers
  * @vq: the struct virtqueue we're talking about.
@@ -662,6 +721,14 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb);
 
+bool virtqueue_enable_cb_urgent(struct virtqueue *_vq)
+{
+	unsigned last_used_idx = virtqueue_enable_cb_prepare_urgent(_vq);
+
+	return !virtqueue_poll(_vq, last_used_idx);
+}
+EXPORT_SYMBOL_GPL(virtqueue_enable_cb_urgent);
+
 /**
  * virtqueue_enable_cb_delayed - restart callbacks after disable_cb.
  * @vq: the struct virtqueue we're talking about.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b46671e..68be5f2 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -39,6 +39,12 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
 			 void *data,
 			 gfp_t gfp);
 
+int virtqueue_add_outbuf_urgent(struct virtqueue *vq,
+				struct scatterlist sg[], unsigned int num,
+				void *data,
+				gfp_t gfp);
+
+
 int virtqueue_add_inbuf(struct virtqueue *vq,
 			struct scatterlist sg[], unsigned int num,
 			void *data,
@@ -61,10 +67,18 @@ void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 
 void virtqueue_disable_cb(struct virtqueue *vq);
 
+void virtqueue_disable_cb_urgent(struct virtqueue *vq);
+
 bool virtqueue_enable_cb(struct virtqueue *vq);
 
+bool virtqueue_enable_cb_urgent(struct virtqueue *vq);
+
+bool virtqueue_enable_cb_urgent(struct virtqueue *vq);
+
 unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq);
 
+unsigned virtqueue_enable_cb_prepare_urgent(struct virtqueue *vq);
+
 bool virtqueue_poll(struct virtqueue *vq, unsigned);
 
 bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index a99f9b7..daf5bb0 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -39,6 +39,9 @@
 #define VRING_DESC_F_WRITE	2
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT	4
+/* This means the descriptor should cause an interrupt
+ * ignoring avail event idx. */
+#define VRING_DESC_F_URGENT	8
 
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
@@ -48,7 +51,7 @@
  * when you consume a buffer.  It's unreliable, so it's simply an
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT	1
-
+#define VRING_AVAIL_F_NO_URGENT_INTERRUPT	2
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC	28
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
From: Jason Wang @ 2014-10-11  7:16 UTC (permalink / raw)
  To: rusty-8n+1lVoiYb80n/F98K4Iww, mst-H+wXaHxf7aLQT0dZR+AlfA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA,
	Jason Wang

Hello all:

We free old transmitted packets in ndo_start_xmit() currently, so any
packet must be orphaned also there. This was used to reduce the overhead of
tx interrupt to achieve better performance. But this may not work for some
protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
implement various optimization for small packets stream such as TCP small
queue and auto corking. But orphaning packets early in ndo_start_xmit()
disable such things more or less since sk_wmem_alloc was not accurate. This
lead extra low throughput for TCP stream of small writes.

This series tries to solve this issue by enable tx interrupts for all TCP
packets other than the ones with push bit or pure ACK. This is done through
the support of urgent descriptor which can force an interrupt for a
specified packet. If tx interrupt was enabled for a packet, there's no need
to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
can batch more for small write. More larger skb was produced by TCP in this
case to improve both throughput and cpu utilization.

Test shows great improvements on small write tcp streams. For most of the
other cases, the throughput and cpu utilization are the same in the
past. Only few cases, more cpu utilization was noticed which needs more
investigation.

Review and comments are welcomed.

Thanks

Test result:

- Two Intel Corporation Xeon 5600s (8 cores) with back to back connected
  82599ES:
- netperf test between guest and remote host
- 1 queue 2 vcpus with zercopy enabled vhost_net
- both host and guest are net-next.git with the patches.
- Value with '[]' means obvious difference (the significance is greater
  than 95%).
- he significance of the differences between the two averages is calculated
  using unpaired T-test that takes into account the SD of the averages.

Guest RX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/+3.7872%/+3.2307%/+0.5390%/
64/2/-0.2325%/+2.9552%/-3.0962%/
64/4/[-2.0296%]/+2.2955%/[-4.2280%]/
64/8/+0.0944%/[+2.2654%]/-2.4662%/
256/1/+1.1947%/-2.5462%/+3.8386%/
256/2/-1.6477%/+3.4421%/-4.9301%/
256/4/[-5.9526%]/[+6.8861%]/[-11.9951%]/
256/8/-3.6470%/-1.5887%/-2.0916%/
1024/1/-4.2225%/-1.3238%/-2.9376%/
1024/2/+0.3568%/+1.8439%/-1.4601%/
1024/4/-0.7065%/-0.0099%/-2.3483%/
1024/8/-1.8620%/-2.4774%/+0.6310%/
4096/1/+0.0115%/-0.3693%/+0.3823%/
4096/2/-0.0209%/+0.8730%/-0.8862%/
4096/4/+0.0729%/-7.0303%/+7.6403%/
4096/8/-2.3720%/+0.0507%/-2.4214%/
16384/1/+0.0222%/-1.8672%/+1.9254%/
16384/2/+0.0986%/+3.2968%/-3.0961%/
16384/4/-1.2059%/+7.4291%/-8.0379%/
16384/8/-1.4893%/+0.3403%/-1.8234%/
65535/1/-0.0445%/-1.4060%/+1.3808%/
65535/2/-0.0311%/+0.9610%/-0.9827%/
65535/4/-0.7015%/+0.3660%/-1.0637%/
65535/8/-3.1585%/+11.1302%/[-12.8576%]/

Guest TX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/[+75.2622%]/[-14.3928%]/[+104.7283%]/
64/2/[+68.9596%]/[-12.6655%]/[+93.4625%]/
64/4/[+68.0126%]/[-12.7982%]/[+92.6710%]/
64/8/[+67.9870%]/[-12.6297%]/[+92.2703%]/
256/1/[+160.4177%]/[-26.9643%]/[+256.5624%]/
256/2/[+48.4357%]/[-24.3380%]/[+96.1825%]/
256/4/[+48.3663%]/[-24.1127%]/[+95.5087%]/
256/8/[+47.9722%]/[-24.2516%]/[+95.3469%]/
1024/1/[+54.4474%]/[-52.9223%]/[+228.0694%]/
1024/2/+0.0742%/[-12.7444%]/[+14.6908%]/
1024/4/[+0.5524%]/-0.0327%/+0.5853%/
1024/8/[-1.2783%]/[+6.2902%]/[-7.1206%]/
4096/1/+0.0778%/-13.1121%/+15.1804%/
4096/2/+0.0189%/[-11.3176%]/[+12.7832%]/
4096/4/+0.0218%/-1.0389%/+1.0718%/
4096/8/-1.3774%/[+12.7396%]/[-12.5218%]/
16384/1/+0.0136%/-2.5043%/+2.5826%/
16384/2/+0.0509%/[-15.3846%]/[+18.2420%]/
16384/4/-0.0163%/[-4.8808%]/[+5.1141%]/
16384/8/[-1.7249%]/[+13.9174%]/[-13.7313%]/
65535/1/+0.0686%/-5.4942%/+5.8862%/
65535/2/+0.0043%/[-7.5816%]/[+8.2082%]/
65535/4/+0.0080%/[-7.2993%]/[+7.8827%]/
65535/8/[-1.3669%]/[+16.6536%]/[-15.4479%]/

Guest TCP_RR
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
256/1/-0.2914%/+12.6457%/-11.4848%/
256/25/-0.5968%/-5.0531%/+4.6935%/
256/50/+0.0262%/+0.2079%/-0.1813%/
4096/1/+2.6965%/[+16.1248%]/[-11.5636%]/
4096/25/-0.5002%/+0.5449%/-1.0395%/
4096/50/[-2.0987%]/-0.0330%/[-2.0664%]/

Tests on mlx4 was ongoing, will post the result in next week.

Jason Wang (3):
  virtio: support for urgent descriptors
  vhost: support urgent descriptors
  virtio-net: conditionally enable tx interrupt

 drivers/net/virtio_net.c         | 164 ++++++++++++++++++++++++++++++---------
 drivers/vhost/net.c              |  43 +++++++---
 drivers/vhost/scsi.c             |  23 ++++--
 drivers/vhost/test.c             |   5 +-
 drivers/vhost/vhost.c            |  44 +++++++----
 drivers/vhost/vhost.h            |  19 +++--
 drivers/virtio/virtio_ring.c     |  75 +++++++++++++++++-
 include/linux/virtio.h           |  14 ++++
 include/uapi/linux/virtio_ring.h |   5 +-
 9 files changed, 308 insertions(+), 84 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [RFC PATCH 3/3] i2c: show and change bus frequency via sysfs
From: Octavian Purdila @ 2014-10-09 20:07 UTC (permalink / raw)
  To: wsa-z923LK4zBo2bacvFa/9K2g
  Cc: johan-DgEjT+Ai2ygdnm+yROfE0A, linux-i2c-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Octavian Purdila
In-Reply-To: <1412885235-14026-1-git-send-email-octavian.purdila-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

This patch adds two new sysfs files, bus_frequency and
bus_supported_frequencies which allows the user to view or change the
bus frequency on a per bus level.

This is required for e.g. USB I2C buses where we can have multiple
device plugged in a system and where a module parameter does not allow
controlling the bus speed individually.

Signed-off-by: Octavian Purdila <octavian.purdila-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 Documentation/ABI/testing/sysfs-bus-i2c | 24 ++++++++++
 drivers/i2c/i2c-core.c                  | 77 +++++++++++++++++++++++++++++++++
 include/linux/i2c.h                     | 10 +++++
 3 files changed, 111 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-i2c b/Documentation/ABI/testing/sysfs-bus-i2c
index 22c621a..058ac68 100644
--- a/Documentation/ABI/testing/sysfs-bus-i2c
+++ b/Documentation/ABI/testing/sysfs-bus-i2c
@@ -43,3 +43,27 @@ Contact:	linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
 Description:
 		An i2c device attached to bus X that is enumerated via
 		ACPI. Y is the ACPI device name and its format is "%s".
+
+What:		/sys/bus/i2c/devices/i2c-X/bus_frequency
+KernelVersion:	3.19
+Contact:	linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+Description:
+		The current bus frequency for bus X. Can be updated if
+		the bus supports it. The unit is HZ and format is
+		"%d\n".
+		If the bus does not support showing/changing the
+		frequency then reading/writting to this entry will
+		fail with -EOPNOTSUPP.
+		When updating the bus frequency that value must be one
+		of the values in bus_supported_frequencies otherwise
+		writting will fail with -EINVAL.
+
+What:		/sys/bus/i2c/devices/i2c-X/bus_supported_frequencies
+KernelVersion:	3.19
+Contact:	linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+Description:
+		Supported frequencies for bus X. The unit is HZ and
+		format is "%d %d %d ... %d\n".
+		If the bus does not support showing/changing the
+		frequency then reading to this entry will fail with
+		-EOPNOTSUPP.
diff --git a/drivers/i2c/i2c-core.c b/drivers/i2c/i2c-core.c
index 632057a..32b918a 100644
--- a/drivers/i2c/i2c-core.c
+++ b/drivers/i2c/i2c-core.c
@@ -940,10 +940,87 @@ static DEVICE_ATTR(new_device, S_IWUSR, NULL, i2c_sysfs_new_device);
 static DEVICE_ATTR_IGNORE_LOCKDEP(delete_device, S_IWUSR, NULL,
 				   i2c_sysfs_delete_device);
 
+static ssize_t
+i2c_sysfs_freq_show(struct device *dev, struct device_attribute *attr,
+		    char *buf)
+{
+	struct i2c_adapter *adap = to_i2c_adapter(dev);
+
+	if (!adap->freq)
+		return -EOPNOTSUPP;
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", adap->freq);
+}
+
+static ssize_t
+i2c_sysfs_freq_store(struct device *dev, struct device_attribute *attr,
+		     const char *buf, size_t count)
+{
+	struct i2c_adapter *adap = to_i2c_adapter(dev);
+	unsigned int freq;
+	int ret;
+	int i;
+
+	if (!adap->set_freq)
+		return -EOPNOTSUPP;
+
+	if (kstrtouint(buf, 0, &freq))
+		return -EINVAL;
+
+	if (freq == 0)
+		return -EINVAL;
+
+	for (i = 0; i < I2C_MAX_FREQS && adap->supported_freqs[i]; i++)
+		if (adap->supported_freqs[i] >= freq)
+			break;
+
+	if (adap->supported_freqs[i] != freq)
+		return -EINVAL;
+
+	i2c_lock_adapter(adap);
+	ret = adap->set_freq(adap, freq);
+	i2c_unlock_adapter(adap);
+
+	if (ret)
+		return ret;
+
+	adap->freq = freq;
+
+	return count;
+}
+
+static DEVICE_ATTR(bus_frequency, S_IWUSR | S_IRUGO, i2c_sysfs_freq_show,
+		   i2c_sysfs_freq_store);
+
+
+static ssize_t
+i2c_sysfs_supp_freqs_show(struct device *dev, struct device_attribute *attr,
+			  char *buf)
+{
+	struct i2c_adapter *adap = to_i2c_adapter(dev);
+	ssize_t count = 0;
+	int i;
+
+	if (!adap->supported_freqs[0])
+		return -EOPNOTSUPP;
+
+	for (i = 0; i < I2C_MAX_FREQS && adap->supported_freqs[i]; i++)
+		count += snprintf(buf + count, PAGE_SIZE - count, "%d ",
+				  adap->supported_freqs[i]);
+	buf[count - 1] = '\n';
+
+	return count;
+}
+
+static DEVICE_ATTR(bus_supported_frequencies, S_IRUGO,
+		   i2c_sysfs_supp_freqs_show, NULL);
+
 static struct attribute *i2c_adapter_attrs[] = {
 	&dev_attr_name.attr,
 	&dev_attr_new_device.attr,
 	&dev_attr_delete_device.attr,
+	&dev_attr_bus_frequency.attr,
+	&dev_attr_bus_supported_frequencies.attr,
 	NULL
 };
 
diff --git a/include/linux/i2c.h b/include/linux/i2c.h
index 36041e2..e410637 100644
--- a/include/linux/i2c.h
+++ b/include/linux/i2c.h
@@ -418,6 +418,8 @@ int i2c_recover_bus(struct i2c_adapter *adap);
 int i2c_generic_gpio_recovery(struct i2c_adapter *adap);
 int i2c_generic_scl_recovery(struct i2c_adapter *adap);
 
+#define I2C_MAX_FREQS 16
+
 /**
  * struct i2c_adapter - represents an I2C physical bus
  *
@@ -431,6 +433,10 @@ int i2c_generic_scl_recovery(struct i2c_adapter *adap);
  *	usb->interface->dev, platform_device->dev etc.)
  * @name: name of this i2c bus
  * @bus_recovery_info: see struct i2c_bus_recovery_info. Optional.
+ * @supported_freqs: supported bus frequencies (in HZ). Must be sorted
+ *	in ascending order. Optional.
+ * @freq: initial bus frequency. Optional.
+ * @set_bus_freq: set the bus frequency (in HZ). Optional.
  */
 struct i2c_adapter {
 	struct module *owner;
@@ -438,6 +444,10 @@ struct i2c_adapter {
 	const struct i2c_algorithm *algo; /* the algorithm to access the bus */
 	void *algo_data;
 
+	unsigned int supported_freqs[I2C_MAX_FREQS];
+	unsigned int freq;
+	int (*set_freq)(struct i2c_adapter *a, unsigned int freq);
+
 	/* data fields that are valid for all devices	*/
 	struct rt_mutex bus_lock;
 
-- 
1.9.1

^ permalink raw reply related

* [RFC PATCH 2/3] i2c: document struct i2c_adapter
From: Octavian Purdila @ 2014-10-09 20:07 UTC (permalink / raw)
  To: wsa-z923LK4zBo2bacvFa/9K2g
  Cc: johan-DgEjT+Ai2ygdnm+yROfE0A, linux-i2c-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Octavian Purdila
In-Reply-To: <1412885235-14026-1-git-send-email-octavian.purdila-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Document the i2c_adapter fields that must be initialized before
calling i2c_add_adapter().

Signed-off-by: Octavian Purdila <octavian.purdila-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 include/linux/i2c.h | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/i2c.h b/include/linux/i2c.h
index a95efeb..36041e2 100644
--- a/include/linux/i2c.h
+++ b/include/linux/i2c.h
@@ -418,9 +418,19 @@ int i2c_recover_bus(struct i2c_adapter *adap);
 int i2c_generic_gpio_recovery(struct i2c_adapter *adap);
 int i2c_generic_scl_recovery(struct i2c_adapter *adap);
 
-/*
- * i2c_adapter is the structure used to identify a physical i2c bus along
- * with the access algorithms necessary to access it.
+/**
+ * struct i2c_adapter - represents an I2C physical bus
+ *
+ * The following members must be set by the adapter driver before
+ * calling i2c_add_adapter():
+ *
+ * @owner: the module owner, usually THIS_MODULE
+ * @class: bitmask of I2C_CLASS_*
+ * @algo: see struct i2c_algorithm
+ * @dev.parent: set this to the struct device of the driver (e.g. pci_dev->dev,
+ *	usb->interface->dev, platform_device->dev etc.)
+ * @name: name of this i2c bus
+ * @bus_recovery_info: see struct i2c_bus_recovery_info. Optional.
  */
 struct i2c_adapter {
 	struct module *owner;
-- 
1.9.1

^ permalink raw reply related

* [RFC PATCH 1/3] i2c: document the existing i2c sysfs ABI
From: Octavian Purdila @ 2014-10-09 20:07 UTC (permalink / raw)
  To: wsa-z923LK4zBo2bacvFa/9K2g
  Cc: johan-DgEjT+Ai2ygdnm+yROfE0A, linux-i2c-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Octavian Purdila
In-Reply-To: <1412885235-14026-1-git-send-email-octavian.purdila-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

This patch adds Documentation/ABI/testing/sysfs-bus-i2c which
documents the existing i2c sysfs ABI.

Signed-off-by: Octavian Purdila <octavian.purdila-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 Documentation/ABI/testing/sysfs-bus-i2c | 45 +++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-i2c

diff --git a/Documentation/ABI/testing/sysfs-bus-i2c b/Documentation/ABI/testing/sysfs-bus-i2c
new file mode 100644
index 0000000..22c621a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-i2c
@@ -0,0 +1,45 @@
+What:		/sys/bus/i2c/devices/i2c-X
+KernelVersion:	since at least 2.6.12
+Contact:	linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+Description:
+		This entry represents a registered i2c bus. X is the
+		bus number and its format is "%d".
+
+What:		/sys/bus/i2c/devices/i2c-X/Y
+What:		/sys/bus/i2c/devices/Y
+KernelVersion:	since at least 2.6.12
+Contact:	linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+Description:
+		An i2c device attached to bus X. Format of Y is
+		"%d-%04x" where the first number is the bus number (X)
+		and the second number is the device i2c address.
+
+What:		/sys/bus/i2c/devices/i2c-X/new_device
+KernelVersion:	2.6.31
+Contact:	linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+Description:
+		Write only entry that allows the instantiation of a
+		new i2c device to bus X. This is to be used when
+		enumeration mechanism such as ACPI or DT are not
+		present or not used for this device.
+		Format: "%s %hi\n" where the first argument is the
+		device name (no spaces allowed) and the second is the
+		i2c address of the device.
+
+What:		/sys/bus/i2c/devices/i2c-X/delete_device
+KernelVersion:	2.6.31
+Contact:	linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+Description:
+		Write only entry that allows the removal of an i2c
+		device from bus X.
+		Format: "%s %hi\n" where the first argument is the
+		device name (no spaces allowed) and the second is the
+		i2c address of the device.
+
+What:		/sys/bus/i2c/devices/i2c-X/i2c-Y
+What:		/sys/bus/i2c/devices/i2c-Y
+KernelVersion:	3.13
+Contact:	linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+Description:
+		An i2c device attached to bus X that is enumerated via
+		ACPI. Y is the ACPI device name and its format is "%s".
-- 
1.9.1

^ permalink raw reply related

* [RFC PATCH 0/3] i2c: show and change bus frequency via sysfs
From: Octavian Purdila @ 2014-10-09 20:07 UTC (permalink / raw)
  To: wsa-z923LK4zBo2bacvFa/9K2g
  Cc: johan-DgEjT+Ai2ygdnm+yROfE0A, linux-i2c-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Octavian Purdila

The first two patches are just general cleanup the actual
implementation is in patch 3. If the general direction looks
acceptable, I will send a follow-up series that adds sysfs bus
frequency support to the existing drivers.

Octavian Purdila (3):
  i2c: document the existing i2c sysfs ABI
  i2c: document struct i2c_adapter
  i2c: show and change bus frequency via sysfs

 Documentation/ABI/testing/sysfs-bus-i2c | 69 +++++++++++++++++++++++++++++
 drivers/i2c/i2c-core.c                  | 77 +++++++++++++++++++++++++++++++++
 include/linux/i2c.h                     | 26 +++++++++--
 3 files changed, 169 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-i2c

-- 
1.9.1

^ permalink raw reply

* RE: [PATCH 3/8] iio: Add support for DA9150 GPADC
From: Opensource [Adam Thomson] @ 2014-10-09 14:29 UTC (permalink / raw)
  To: Jonathan Cameron, Opensource [Adam Thomson], Lee Jones,
	Samuel Ortiz, linux-iio-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Sebastian Reichel, Dmitry Eremin-Solenikov, David Woodhouse,
	linux-pm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Rob Herring,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, Grant Likely,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Andrew Morton,
	Joe Perches, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Support Opensource
In-Reply-To: <3B049606-083E-4220-ADBF-71D74C0A9D9F-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

On October 7, 2014 20:37, Jonathan Cameron wrote:

> On October 7, 2014 3:55:55 PM GMT+01:00, "Opensource [Adam Thomson]"
> <Adam.Thomson.Opensource@diasemi.com> wrote:
> >On September 27, 2014 11:50, Jonathan Cameron wrote:
> >
> >> On 23/09/14 11:53, Adam Thomson wrote:
> >> > This patch adds support for DA9150 Charger & Fuel-Gauge IC GPADC.
> >
> >> > +
> >> > +static inline int da9150_gpadc_gpio_6v_voltage_now(int raw_val)
> >> > +{
> >> > +	/* Convert to mV */
> >> > +	return (6 * ((raw_val * 1000) + 500)) / 1024;
> >> These could all be expressed as raw values with offsets
> >> and scales (and that would be preferred).
> >> E.g. This one has offset 500000 and scale 6000/1024 or even
> >> better use IIO_VAL_FRACTIONAL_LOG2 for scale with val1 = 6000
> >> and val2 = (log_2 1024) = 10.
> >>
> >
> >What you've suggested isn't correct. The problem here is that the
> >offset is
> >added first to the raw ADC reading, without factoring the ADC value
> >accordingly
> >to match the factor of the offset. If we take the original equation
> >provided for
> >this channel of the ADC, the offset is actually 0.5 which should be
> >added to the
> >raw ADC value. This doesn't fit into the implementation in the kernel
> >as we
> >can't use floating point. If we multiply the offset but not the raw ADC
> >value,
> >then add them before applying the scale factor, then the result is
> >wrong at the
> >end. Basically you need a scale for the raw ADC value to match the
> >offset scale
> >so you can achieve the correct results, which is what my calculation
> >does.
> >But that seems impossible with the current raw|offset|scale method.
> Oops got that wrong.  The fixed point maths to fix the in kernel interface isn't exactly
>  difficult but indeed it does not handle this currently.

I did have a quick look when I had a spare moment, and I guess you could do
something like having an offset scale/factor which can be applied to the raw
reading, prior to adding the offset. May be an option but I think this would
also have to be exposed to user-space as I believe the same problem would reside
there as well.

> >
> >> > +	ret = iio_map_array_register(indio_dev,
> >da9150_gpadc_default_maps);
> >> > +	if (ret) {
> >> > +		dev_err(dev, "Failed to register IIO maps: %d\n", ret);
> >> > +		return ret;
> >> > +	}
> >> I'd suggest doing the devm_request_thread_irq before the
> >iio_map_array
> >> stuff.  This is purely to avoid the order during remove not being
> >> obviously correct as it isn't the reverse of during probe.
> >
> >Ok, should still work ok that way so can update.
> >
> >> > +static int da9150_gpadc_remove(struct platform_device *pdev)
> >> > +{
> >> > +	struct iio_dev *indio_dev = platform_get_drvdata(pdev);
> >> > +
> >> > +	iio_map_array_unregister(indio_dev);
> >> Twice in one day.  I'm definitely thinking we should add a
> >> devm version of iio_map_array_register...
> >
> >I assume you mean here that iio_device_unregister() should come first?
> >Will
> >update.
> Nope just that such a new function might be useful.

:) Ok fair enough.

> 
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andrew Vagin @ 2014-10-09 10:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Andrey Vagin, Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Alexander Viro, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Serge Hallyn, Rob Landley
In-Reply-To: <87vbnue56f.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Wed, Oct 08, 2014 at 12:23:52PM -0700, Eric W. Biederman wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
> 
> > On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin <avagin-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
> >> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
> >>> Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> writes:
> >>>
> >>> > From: Andrey Vagin <avagin-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >>> >
> >>> > Currently when we create a new container with a separate root,
> >>> > we need to clone the current mount namespace with all mounts and then
> >>> > clean up it by using pivot_root(). A big part of mountpoints are cloned
> >>> > only to be umounted.
> >>>
> >>> Is the motivation performance?  Because if that is the motivation we
> >>> need numbers.
> >>
> >> The major motivation to create a clean mount namespace which contains
> >> only required mounts.
> >>
> >> Now you want to convince us that there is nothing wrong if we use
> >> userns, because all inherited mounts are locked. My point is that all
> >> useless mounts should be umounted.  If the current root isn't on rootfs,
> >> pivot_root() allows us to umount all useless points. But pivot_root()
> >> doesn't work, if the current root is on rootfs. How can we umount
> >> useless points in this case?
> 
> One of your justifications for a new system call was so you could do
> less.  Doing less to get to where you want to go is only justified when
> your doing less to get better performance.
> 
> >> Maybe we want to say that rootfs should not be used if we are going to
> >> create containers...
> 
> Today it is an assumption of the vfs that rootfs is mounted.  With
> rootfs mounted and pivot_root at the base of the mount stack you can
> make as minimal of a set of mounts as the vfs allows.

You have misunderstood me.
For most system /proc/self/mountinfo looks like this:
[root@dhcp-10-30-23-214 ~]# cat /proc/self/mountinfo 
17 22 0:3 / /proc rw,relatime - proc proc rw
18 22 0:0 / /sys rw,relatime - sysfs sysfs rw
19 22 0:5 / /dev rw,relatime - devtmpfs devtmpfs rw,size=502324k,nr_inodes=125581,mode=755
20 19 0:11 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
21 19 0:17 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw
22 1 253:2 / / rw,relatime - ext4 /dev/vda2 rw,barrier=1,data=ordered
24 22 253:1 / /boot rw,relatime - ext3 /dev/vda1 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered

/ isn't a rootfs mount here and pivot_root() works fine in this case. Here is
no problem for such system.

Now look at the second case:
hell@android:/ $ cat /proc/self/mountinfo
1 1 0:1 / / ro,relatime - rootfs rootfs ro
11 1 0:11 / /dev rw,nosuid,relatime - tmpfs tmpfs rw,mode=755
12 11 0:9 / /dev/pts rw,relatime - devpts devpts rw,mode=600
13 1 0:3 / /proc rw,relatime - proc proc rw
14 1 0:12 / /sys rw,relatime - sysfs sysfs rw

Now / is an rootfs mount. pivot_root() doesn't work in this case and we
need to do some tricks to get a minimal set of mounts.

Thanks,
Andrew

> 
> Removing rootfs from the vfs requires an audit of everything that
> manipulates mounts.  It is not remotely a local excercise.
> 
> One of the things that needs to be considered is that if you really want
> to audit mounts is the code that needs manipulates them needs to be
> audited every bit as much as the mounts themselves.
> 
> > Could we have an extra rootfs-like fs that is always completely empty,
> > doesn't allow any writes, and can sit at the bottom of container
> > namespace hierarchies?  If so, and if we add a new syscall that's like
> > pivot_root (or unshare) but prunes the hierarchy, then we could switch
> > to that rootfs then.
> 
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted.  That is
> certainly a much more localized change if we want to go there.
> 
> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs).  That gets us into figuring out if we break userspace
> which is a big challenge.
> 
> Eric

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v2 2/7] tpm: two-phase chip management functions
From: Jarkko Sakkinen @ 2014-10-09  9:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141007223442.GA3014-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Tue, Oct 07, 2014 at 04:34:42PM -0600, Jason Gunthorpe wrote:
> On Wed, Oct 08, 2014 at 01:28:14AM +0300, Jarkko Sakkinen wrote:
> 
> > > > @@ -714,15 +709,10 @@ static int tpm_tis_i2c_remove(struct i2c_client *client)
> > > >  	struct tpm_chip *chip = tpm_dev.chip;
> > > >  	release_locality(chip, chip->vendor.locality, 1);
> > > >  
> > > > -	/* close file handles */
> > > > -	tpm_dev_vendor_release(chip);
> > > > -
> > > >  	/* remove hardware */
> > > >  	tpm_remove_hardware(chip->dev);
> > > 
> > > Wrong ordering here, tpm_remove_hardware should always be first -
> > > drivers should not tear down internal state before calling it, so
> > > release_locality should be second.
> > > 
> > > Noting that since we use devm the kfree will not happen until
> > > remove returns, so the chip pointer is still valid.
> > 
> > Should I fix this ordering? I was thinking to focus putting proper
> > patterns in place only in tpm_tis and tpm_crb because they are the
> > that I'm able to test easily and then they can work as guideline for
> > other drivers.
> 
> I think since this patch is already touching this function there is
> no reason not to make it be correct (especially since it was noticed)
> 
> The rest can wait till we globally replace tpm_remove_hardware with
> tpm_unregister - at that time the ordering can be audited and
> checked.
> 
> Then the drivers will be clean and the core can finally be fixed.

This makes sense. I'll also document this. And I decided to completely
wipe old tpm_register/remove_hardware() completely from v3 because they
only cause confusion.

I pushed patch that should implement fix for the ordering into tpm2-v2
branch:

https://github.com/jsakkine/linux-tpm2/commit/63ab650fa6f8dddd95100869e50275801d7d9360

> Jason

/Jarkko

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andy Lutomirski @ 2014-10-08 23:41 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Rob Landley, Eric W. Biederman, Andrew Vagin, Andrey Vagin,
	Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Alexander Viro, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Serge Hallyn
In-Reply-To: <20141008233854.GG31366@ubuntumail>

On Wed, Oct 8, 2014 at 4:38 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
>> On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley <rob-VoJi6FS/r0vR7s880joybQ@public.gmane.org> wrote:
>> > On 10/08/14 14:31, Andy Lutomirski wrote:
>> >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
>> >> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> >>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>> >>>>> Maybe we want to say that rootfs should not be used if we are going to
>> >>>>> create containers...
>> >>>
>> >>> Today it is an assumption of the vfs that rootfs is mounted.  With
>> >>> rootfs mounted and pivot_root at the base of the mount stack you can
>> >>> make as minimal of a set of mounts as the vfs allows.
>> >>>
>> >>> Removing rootfs from the vfs requires an audit of everything that
>> >>> manipulates mounts.  It is not remotely a local excercise.
>> >>
>> >> Would it be a less invasive audit to allow different mount namespaces
>> >> to have different rootfses?
>> >
>> > I.E. The same way different namespaces have different init tasks?
>> >
>> > The abstraction containers has implemented here should be logically
>> > consistent.
>> >
>> >>>> Could we have an extra rootfs-like fs that is always completely empty,
>> >>>> doesn't allow any writes, and can sit at the bottom of container
>> >>>> namespace hierarchies?  If so, and if we add a new syscall that's like
>> >>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>> >>>> to that rootfs then.
>> >>>
>> >>> Or equally have something that guarantees that rootfs is empty and
>> >>> read-only at the time the normal root filesystem is mounted.  That is
>> >>> certainly a much more localized change if we want to go there.
>> >>>
>> >>> I am half tempted to suggest that mount --move /some/path / be updated
>> >>> to make the old / just go away (perhaps to be replaced with a read-only
>> >>> empty rootfs).  That gets us into figuring out if we break userspace
>> >>> which is a big challenge.
>> >>
>> >> Hence my argument for a new syscall or entirely new operation.
>> >
>> > I'm still waiting for somebody to explain to my why chroot() shouldn't
>> > be changed to do this instead of adding a new syscall. (At least when
>> > mount namespace support is enabled.)
>>
>> Because chroot has no effect on the namespace at all.  If you fork and
>> the child chroots, the parent isn't chrooted.  And, more importantly
>> for my example, is a process has it's cwd as /foo, and then it forks
>> and the child chroots, then parent's ".." isn't changed as a result of
>> the chroot.
>>
>> >
>> >> mount(2) and friends are way too multiplexed right now.  I just found
>> >> yet another security bug due to the insanely complicated semantics of
>> >> the vfs syscalls.  (Yes, a different one from the one yesterday.)
>> >
>> > As the guy who rewrote busybox mount 3 times, and who just implemented a
>> > brand new one (toybox) from scratch:
>> >
>> > It's a bit fiddly, yes.
>> >
>> >> A new operation kills several birds with one stone.  It could look like:
>> >>
>> >> int mntns_change_root(int dfd, const char *path, int flags);
>> >>
>> >> return -EPERM if chrooted.
>> >
>> > Really?
>>
>> Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
>> to do if the caller is chrooted?  The current behavior is obviously
>> incorrect (it leaks memory), but it's not entirely clear to me what
>> should happen.  I think it should either be disallowed or should have
>> well-defined semantics.
>>
>> For simplicity, if a new syscall for this is added, then I think that
>> the caller-is-chrooted case should be disallowed.  If someone needs it
>> and can articulate what the semantics should be, then I have no
>> problem with allowing it going forward.
>
> It's not that I'd have a need for that, but rather if for some
> reason I started out chrooted due to some bogus initramfs, I'd
> prefer to not have to feel like a criminial and escape the chroot
> first.

You already can't create a userns if you're chrooted (even if you have
global privilege).

--Andy

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Serge Hallyn @ 2014-10-08 23:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rob Landley, Eric W. Biederman, Andrew Vagin, Andrey Vagin,
	Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Alexander Viro, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Serge Hallyn
In-Reply-To: <CALCETrXapWTiFw2CC1m43fs9yuHuesXxXtmHh-5F3J_bUYeRxg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley <rob-VoJi6FS/r0vR7s880joybQ@public.gmane.org> wrote:
> > On 10/08/14 14:31, Andy Lutomirski wrote:
> >> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
> >> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> >>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
> >>>>> Maybe we want to say that rootfs should not be used if we are going to
> >>>>> create containers...
> >>>
> >>> Today it is an assumption of the vfs that rootfs is mounted.  With
> >>> rootfs mounted and pivot_root at the base of the mount stack you can
> >>> make as minimal of a set of mounts as the vfs allows.
> >>>
> >>> Removing rootfs from the vfs requires an audit of everything that
> >>> manipulates mounts.  It is not remotely a local excercise.
> >>
> >> Would it be a less invasive audit to allow different mount namespaces
> >> to have different rootfses?
> >
> > I.E. The same way different namespaces have different init tasks?
> >
> > The abstraction containers has implemented here should be logically
> > consistent.
> >
> >>>> Could we have an extra rootfs-like fs that is always completely empty,
> >>>> doesn't allow any writes, and can sit at the bottom of container
> >>>> namespace hierarchies?  If so, and if we add a new syscall that's like
> >>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
> >>>> to that rootfs then.
> >>>
> >>> Or equally have something that guarantees that rootfs is empty and
> >>> read-only at the time the normal root filesystem is mounted.  That is
> >>> certainly a much more localized change if we want to go there.
> >>>
> >>> I am half tempted to suggest that mount --move /some/path / be updated
> >>> to make the old / just go away (perhaps to be replaced with a read-only
> >>> empty rootfs).  That gets us into figuring out if we break userspace
> >>> which is a big challenge.
> >>
> >> Hence my argument for a new syscall or entirely new operation.
> >
> > I'm still waiting for somebody to explain to my why chroot() shouldn't
> > be changed to do this instead of adding a new syscall. (At least when
> > mount namespace support is enabled.)
> 
> Because chroot has no effect on the namespace at all.  If you fork and
> the child chroots, the parent isn't chrooted.  And, more importantly
> for my example, is a process has it's cwd as /foo, and then it forks
> and the child chroots, then parent's ".." isn't changed as a result of
> the chroot.
> 
> >
> >> mount(2) and friends are way too multiplexed right now.  I just found
> >> yet another security bug due to the insanely complicated semantics of
> >> the vfs syscalls.  (Yes, a different one from the one yesterday.)
> >
> > As the guy who rewrote busybox mount 3 times, and who just implemented a
> > brand new one (toybox) from scratch:
> >
> > It's a bit fiddly, yes.
> >
> >> A new operation kills several birds with one stone.  It could look like:
> >>
> >> int mntns_change_root(int dfd, const char *path, int flags);
> >>
> >> return -EPERM if chrooted.
> >
> > Really?
> 
> Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
> to do if the caller is chrooted?  The current behavior is obviously
> incorrect (it leaks memory), but it's not entirely clear to me what
> should happen.  I think it should either be disallowed or should have
> well-defined semantics.
> 
> For simplicity, if a new syscall for this is added, then I think that
> the caller-is-chrooted case should be disallowed.  If someone needs it
> and can articulate what the semantics should be, then I have no
> problem with allowing it going forward.

It's not that I'd have a need for that, but rather if for some
reason I started out chrooted due to some bogus initramfs, I'd
prefer to not have to feel like a criminial and escape the chroot
first.

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andy Lutomirski @ 2014-10-08 22:01 UTC (permalink / raw)
  To: Rob Landley
  Cc: Eric W. Biederman, Andrew Vagin, Andrey Vagin, Linux FS Devel,
	linux-kernel@vger.kernel.org, Linux API, Andrey Vagin,
	Alexander Viro, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov,
	Serge Hallyn
In-Reply-To: <5435AE41.20105@landley.net>

On Wed, Oct 8, 2014 at 2:36 PM, Rob Landley <rob@landley.net> wrote:
> On 10/08/14 14:31, Andy Lutomirski wrote:
>> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Andy Lutomirski <luto@amacapital.net> writes:
>>>>> Maybe we want to say that rootfs should not be used if we are going to
>>>>> create containers...
>>>
>>> Today it is an assumption of the vfs that rootfs is mounted.  With
>>> rootfs mounted and pivot_root at the base of the mount stack you can
>>> make as minimal of a set of mounts as the vfs allows.
>>>
>>> Removing rootfs from the vfs requires an audit of everything that
>>> manipulates mounts.  It is not remotely a local excercise.
>>
>> Would it be a less invasive audit to allow different mount namespaces
>> to have different rootfses?
>
> I.E. The same way different namespaces have different init tasks?
>
> The abstraction containers has implemented here should be logically
> consistent.
>
>>>> Could we have an extra rootfs-like fs that is always completely empty,
>>>> doesn't allow any writes, and can sit at the bottom of container
>>>> namespace hierarchies?  If so, and if we add a new syscall that's like
>>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>>>> to that rootfs then.
>>>
>>> Or equally have something that guarantees that rootfs is empty and
>>> read-only at the time the normal root filesystem is mounted.  That is
>>> certainly a much more localized change if we want to go there.
>>>
>>> I am half tempted to suggest that mount --move /some/path / be updated
>>> to make the old / just go away (perhaps to be replaced with a read-only
>>> empty rootfs).  That gets us into figuring out if we break userspace
>>> which is a big challenge.
>>
>> Hence my argument for a new syscall or entirely new operation.
>
> I'm still waiting for somebody to explain to my why chroot() shouldn't
> be changed to do this instead of adding a new syscall. (At least when
> mount namespace support is enabled.)

Because chroot has no effect on the namespace at all.  If you fork and
the child chroots, the parent isn't chrooted.  And, more importantly
for my example, is a process has it's cwd as /foo, and then it forks
and the child chroots, then parent's ".." isn't changed as a result of
the chroot.

>
>> mount(2) and friends are way too multiplexed right now.  I just found
>> yet another security bug due to the insanely complicated semantics of
>> the vfs syscalls.  (Yes, a different one from the one yesterday.)
>
> As the guy who rewrote busybox mount 3 times, and who just implemented a
> brand new one (toybox) from scratch:
>
> It's a bit fiddly, yes.
>
>> A new operation kills several birds with one stone.  It could look like:
>>
>> int mntns_change_root(int dfd, const char *path, int flags);
>>
>> return -EPERM if chrooted.
>
> Really?

Now that CVE-2014-7970 is public: what the heck is pivot_root supposed
to do if the caller is chrooted?  The current behavior is obviously
incorrect (it leaks memory), but it's not entirely clear to me what
should happen.  I think it should either be disallowed or should have
well-defined semantics.

For simplicity, if a new syscall for this is added, then I think that
the caller-is-chrooted case should be disallowed.  If someone needs it
and can articulate what the semantics should be, then I have no
problem with allowing it going forward.

>
>>  Returns -EINVAL if path (relative to dfd) isn't a mountmount.
>
> Requiring that chroot() only be called on mountpoints would break
> existing semantics, which gets us back to new systemcall instead of
> changing behavior of existing one.

But we're talking about a pivot_root replacement.  You can always
create a bindmount.  Alternatively, the syscall could create a fresh
bind-mount and reattach all children to it if needed.

>
> If I recall, the first line of pushback against merging the openvz code
> as is was "buckets of new syscalls". Pushback against adding a new
> system call is understandable. Why can't we fix chroot() now that we
> have the tools to do so?

Because chroot already does something else.  In particularly, the new
"root"'s parents are *still there*, which is why it's so easy to
escape from a chroot.  Sensible container systems (and initramfs
code!) wants to clean up all the mounts that should be unreachable.

>
>>  Otherwise it disconnects path from the existing
>> hierarchy, attaches a permanently-empty read-only rootfs under it,
>> makes it the root of the mntns, and does the root refs fixup.  The old
>> hierarchy gets thrown out.
>
> We have a chroot() syscall. We don't use it for containers because it
> doesn't do what we want. Does it currently do what _anybody_ wants?

Irrelevant.  It's POSIX and we'll break all kinds of things if we change it.

>
>> Systemd could use this, too.
>
> While that's a strong argument against it, I'm willing to overlook it.

:)

Feel free to read that as "an initramfs /init that prepares to hand
off to /sbin/init could use this."  busybox's switch_root could do
this, too, and even my virtme tool would indirectly benefit slightly.
(virtme uses an init system that is a few tens of lines of
busybox-compatible shell script that runs in a highly controlled
environment.)

--Andy

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Rob Landley @ 2014-10-08 21:36 UTC (permalink / raw)
  To: Andy Lutomirski, Eric W. Biederman
  Cc: Andrew Vagin, Andrey Vagin, Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Alexander Viro, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Serge Hallyn
In-Reply-To: <CALCETrVSxYr=Oa29qHNL-GoifS26U8TfpreGY+KN7g926YgHUw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 10/08/14 14:31, Andy Lutomirski wrote:
> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>>> Maybe we want to say that rootfs should not be used if we are going to
>>>> create containers...
>>
>> Today it is an assumption of the vfs that rootfs is mounted.  With
>> rootfs mounted and pivot_root at the base of the mount stack you can
>> make as minimal of a set of mounts as the vfs allows.
>>
>> Removing rootfs from the vfs requires an audit of everything that
>> manipulates mounts.  It is not remotely a local excercise.
> 
> Would it be a less invasive audit to allow different mount namespaces
> to have different rootfses?

I.E. The same way different namespaces have different init tasks?

The abstraction containers has implemented here should be logically
consistent.

>>> Could we have an extra rootfs-like fs that is always completely empty,
>>> doesn't allow any writes, and can sit at the bottom of container
>>> namespace hierarchies?  If so, and if we add a new syscall that's like
>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>>> to that rootfs then.
>>
>> Or equally have something that guarantees that rootfs is empty and
>> read-only at the time the normal root filesystem is mounted.  That is
>> certainly a much more localized change if we want to go there.
>>
>> I am half tempted to suggest that mount --move /some/path / be updated
>> to make the old / just go away (perhaps to be replaced with a read-only
>> empty rootfs).  That gets us into figuring out if we break userspace
>> which is a big challenge.
> 
> Hence my argument for a new syscall or entirely new operation.

I'm still waiting for somebody to explain to my why chroot() shouldn't
be changed to do this instead of adding a new syscall. (At least when
mount namespace support is enabled.)

> mount(2) and friends are way too multiplexed right now.  I just found
> yet another security bug due to the insanely complicated semantics of
> the vfs syscalls.  (Yes, a different one from the one yesterday.)

As the guy who rewrote busybox mount 3 times, and who just implemented a
brand new one (toybox) from scratch:

It's a bit fiddly, yes.

> A new operation kills several birds with one stone.  It could look like:
> 
> int mntns_change_root(int dfd, const char *path, int flags);
> 
> return -EPERM if chrooted.

Really?

>  Returns -EINVAL if path (relative to dfd) isn't a mountmount.

Requiring that chroot() only be called on mountpoints would break
existing semantics, which gets us back to new systemcall instead of
changing behavior of existing one.

If I recall, the first line of pushback against merging the openvz code
as is was "buckets of new syscalls". Pushback against adding a new
system call is understandable. Why can't we fix chroot() now that we
have the tools to do so?

>  Otherwise it disconnects path from the existing
> hierarchy, attaches a permanently-empty read-only rootfs under it,
> makes it the root of the mntns, and does the root refs fixup.  The old
> hierarchy gets thrown out.

We have a chroot() syscall. We don't use it for containers because it
doesn't do what we want. Does it currently do what _anybody_ wants?

> Systemd could use this, too.

While that's a strong argument against it, I'm willing to overlook it.

Rob

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Rob Landley @ 2014-10-08 21:23 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski
  Cc: Andrew Vagin, Andrey Vagin, Linux FS Devel,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Andrey Vagin, Alexander Viro, Andrew Morton, Cyrill Gorcunov,
	Pavel Emelyanov, Serge Hallyn
In-Reply-To: <87vbnue56f.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On 10/08/14 14:23, Eric W. Biederman wrote:
>> Could we have an extra rootfs-like fs that is always completely empty,
>> doesn't allow any writes, and can sit at the bottom of container
>> namespace hierarchies?  If so, and if we add a new syscall that's like
>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>> to that rootfs then.
> 
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted.  That is
> certainly a much more localized change if we want to go there.

What do you mean "normal" root filesystem? It is entirely possible (and
in fact common in the embedded world) to run from rootfs. I pushed my
old inittmpfs patches (at the request of cray) last year because being
able to take down the system with "cat /dev/zero > /blah" (as rootfs
allows and tmpfs doesn't) is a bad thing.

Rootfs is about as special as PID 1 is. We don't filter out PID 1 from
"ps" to avoid confusing people, but for some reason whoever did
/proc/$PID/mountinfo decided that rootfs shouldn't show up because magic
magic specialness.

We show /run, which is a tmpfs instance. If I mount two different
filesystems on top of each other on /mnt, it shows both. (Overmounts
were not invented by rootfs.) But no, mountinfo filters out rootfs
because magic magic specialness.

It makes me sad that this kind of special-case thinking is allowed in
the kernel.

> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs).  That gets us into figuring out if we break userspace
> which is a big challenge.

My concern was that chroot() moving a magic "/" pointer that you can
trivially escape from with x=open("."); chroot("sub"); fdchdir(".");
chdir("../../../../../../../../.."); is having extra code in the kernel
to do it _wrong_.

We have per-process namespaces now. We can actually adjust the mount
tree (inserting a new bind mount if the directory we're changing to is
not already a mount point). If a per-process namespace needs to be
anchored by a tmpfs, fine. But requiring that to be teh SAME instance
globally for the entire system is not what containers is _about_. It's
not true for PID 1 and it shouldn't be true for rootfs.

By all means, if a filesystem is no longer accessable in a namespace,
decrement its reference count. (Keeping in mind that a bind mount should
count as a reference, and rootfs should always have a nonzero reference
count.) But "/" is not special in this regard. If you want to make all
overmounts vanish (which seems like a really bad idea and breaks 40
years of unix semantics), argue for that. Please stop treating rootfs
like it isn't potentialy usable as a full-fledged filesystem.

(Pet peeve of mine.)

> Eric

Rob

^ permalink raw reply

* Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the current root
From: Andy Lutomirski @ 2014-10-08 19:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Vagin, Andrey Vagin, Linux FS Devel,
	linux-kernel@vger.kernel.org, Linux API, Andrey Vagin,
	Alexander Viro, Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov,
	Serge Hallyn, Rob Landley
In-Reply-To: <87vbnue56f.fsf@x220.int.ebiederm.org>

On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> On Wed, Oct 8, 2014 at 4:08 AM, Andrew Vagin <avagin@parallels.com> wrote:
>>> On Tue, Oct 07, 2014 at 01:45:22PM -0700, Eric W. Biederman wrote:
>>>> Andrey Vagin <avagin@openvz.org> writes:
>>>>
>>>> > From: Andrey Vagin <avagin@gmail.com>
>>>> >
>>>> > Currently when we create a new container with a separate root,
>>>> > we need to clone the current mount namespace with all mounts and then
>>>> > clean up it by using pivot_root(). A big part of mountpoints are cloned
>>>> > only to be umounted.
>>>>
>>>> Is the motivation performance?  Because if that is the motivation we
>>>> need numbers.
>>>
>>> The major motivation to create a clean mount namespace which contains
>>> only required mounts.
>>>
>>> Now you want to convince us that there is nothing wrong if we use
>>> userns, because all inherited mounts are locked. My point is that all
>>> useless mounts should be umounted.  If the current root isn't on rootfs,
>>> pivot_root() allows us to umount all useless points. But pivot_root()
>>> doesn't work, if the current root is on rootfs. How can we umount
>>> useless points in this case?
>
> One of your justifications for a new system call was so you could do
> less.  Doing less to get to where you want to go is only justified when
> your doing less to get better performance.
>
> It sounds like your actual concern is about sandboxing and security
> audits.  That is a very legitimate concern.  That isn't however the core
> concern of containers, so it was not clear that is what you meant.
>
>>> Maybe we want to say that rootfs should not be used if we are going to
>>> create containers...
>
> Today it is an assumption of the vfs that rootfs is mounted.  With
> rootfs mounted and pivot_root at the base of the mount stack you can
> make as minimal of a set of mounts as the vfs allows.
>
> Removing rootfs from the vfs requires an audit of everything that
> manipulates mounts.  It is not remotely a local excercise.

Would it be a less invasive audit to allow different mount namespaces
to have different rootfses?

>
> One of the things that needs to be considered is that if you really want
> to audit mounts is the code that needs manipulates them needs to be
> audited every bit as much as the mounts themselves.
>
>> Could we have an extra rootfs-like fs that is always completely empty,
>> doesn't allow any writes, and can sit at the bottom of container
>> namespace hierarchies?  If so, and if we add a new syscall that's like
>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>> to that rootfs then.
>
> Or equally have something that guarantees that rootfs is empty and
> read-only at the time the normal root filesystem is mounted.  That is
> certainly a much more localized change if we want to go there.
>
> I am half tempted to suggest that mount --move /some/path / be updated
> to make the old / just go away (perhaps to be replaced with a read-only
> empty rootfs).  That gets us into figuring out if we break userspace
> which is a big challenge.

Hence my argument for a new syscall or entirely new operation.
mount(2) and friends are way too multiplexed right now.  I just found
yet another security bug due to the insanely complicated semantics of
the vfs syscalls.  (Yes, a different one from the one yesterday.)

A new operation kills several birds with one stone.  It could look like:

int mntns_change_root(int dfd, const char *path, int flags);

return -EPERM if chrooted.  Returns -EINVAL if path (relative to dfd)
isn't a mountmount.  Otherwise it disconnects path from the existing
hierarchy, attaches a permanently-empty read-only rootfs under it,
makes it the root of the mntns, and does the root refs fixup.  The old
hierarchy gets thrown out.

Benefits:
 - Plausibly slightly faster.  (Especially if we add the trivial
optimization that, if the caller holds the only reference to the
mntns, then changing root refs can avoid the big loop, but maybe
pivot_root could do that, too.)
 - Sidesteps the whole rootfs mess.
 - Much easier to use than pivot_root.
 - No userspace breakage.

Heck, it could be even simpler: it could just unconditionally skip the
fs struct walk.  Just require callers to make sure that everyone else
chroots into the new root.

Systemd could use this, too.  If it wants to keep a reference to the
initramfs, it can use a file descriptor.

--Andy

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox