Linux virtualization list
 help / color / mirror / Atom feed
* Re: PROPOSAL: Extend inline asm syntax with size spec
From: Joe Perches @ 2018-11-01  9:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kate Stewart, Christopher Li, virtualization, Masahiro Yamada,
	Nadav Amit, Jan Beulich, H. Peter Anvin, Sam Ravnborg,
	Ingo Molnar, x86, linux-sparse, Ingo Molnar, linux-xtensa,
	Kees Cook, Segher Boessenkool, Chris Zankel, Michael Matz,
	Borislav Petkov, Josh Poimboeuf, Alok Kataria, Juergen Gross, gcc,
	Richard Biener, Max Filippov, Greg Kroah-Hartman, linux-kerne
In-Reply-To: <20181101090114.GE3159@hirez.programming.kicks-ass.net>

On Thu, 2018-11-01 at 10:01 +0100, Peter Zijlstra wrote:
> On Wed, Oct 31, 2018 at 10:20:00PM -0700, Joe Perches wrote:
> > On Wed, 2018-10-31 at 13:55 +0100, Peter Zijlstra wrote:
> > > Anyway, with the below patch, I get:
> > > 
> > >    text    data     bss     dec     hex filename
> > > 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> > > 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux
> > > 
> > > Which shows we inline more (look for asm_volatile for the actual
> > > changes).
> > []
> > >  scripts/checkpatch.pl                              |  8 ++---
> > >  scripts/genksyms/keywords.c                        |  4 +--
> > >  scripts/kernel-doc                                 |  4 +--
> > 
> > I believe these should be excluded from the conversions.
> 
> Probably, yes. It compiled, which was all I cared about :-)
> 
> BTW, if we do that conversion, we should upgrade the checkpatch warn to
> an error I suppose.

More like remove altogether as __inline and __inline__
will no longer be #defined

$ git grep -P 'define\s+__inline'
include/linux/compiler_types.h:#define __inline__ inline
include/linux/compiler_types.h:#define __inline   inline

^ permalink raw reply

* Re: PROPOSAL: Extend inline asm syntax with size spec
From: Peter Zijlstra @ 2018-11-01 11:15 UTC (permalink / raw)
  To: Joe Perches
  Cc: Kate Stewart, Christopher Li, virtualization, Masahiro Yamada,
	Nadav Amit, Jan Beulich, H. Peter Anvin, Sam Ravnborg,
	Ingo Molnar, x86, linux-sparse, Ingo Molnar, linux-xtensa,
	Kees Cook, Segher Boessenkool, Chris Zankel, Michael Matz,
	Borislav Petkov, Josh Poimboeuf, Alok Kataria, Juergen Gross, gcc,
	Richard Biener, Max Filippov, Greg Kroah-Hartman, linux-kerne
In-Reply-To: <02085d90af8a81b69cb38e0fdd6a018f151a58d8.camel@perches.com>

On Thu, Nov 01, 2018 at 02:20:40AM -0700, Joe Perches wrote:
> On Thu, 2018-11-01 at 10:01 +0100, Peter Zijlstra wrote:
> > On Wed, Oct 31, 2018 at 10:20:00PM -0700, Joe Perches wrote:
> > > On Wed, 2018-10-31 at 13:55 +0100, Peter Zijlstra wrote:
> > > > Anyway, with the below patch, I get:
> > > > 
> > > >    text    data     bss     dec     hex filename
> > > > 17385183        5064780 1953892 24403855        1745f8f defconfig-build/vmlinux
> > > > 17385678        5064780 1953892 24404350        174617e defconfig-build/vmlinux
> > > > 
> > > > Which shows we inline more (look for asm_volatile for the actual
> > > > changes).
> > > []
> > > >  scripts/checkpatch.pl                              |  8 ++---
> > > >  scripts/genksyms/keywords.c                        |  4 +--
> > > >  scripts/kernel-doc                                 |  4 +--
> > > 
> > > I believe these should be excluded from the conversions.
> > 
> > Probably, yes. It compiled, which was all I cared about :-)
> > 
> > BTW, if we do that conversion, we should upgrade the checkpatch warn to
> > an error I suppose.
> 
> More like remove altogether as __inline and __inline__
> will no longer be #defined

That's the point, therefore checkpatch should error when it sees them.
Otherwise we'll grow new instances, because it will compile just file.

^ permalink raw reply

* [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-01 21:19 UTC (permalink / raw)
  Cc: lenaic, mhocko, bijan.mottahedeh, kvm, mst, netdev, liang.z.li,
	linux-kernel, virtualization, stefanha, joe, akpm, mhocko,
	torvalds

The following changes since commit 84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d:

  Linux 4.19 (2018-10-22 07:37:37 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 79f800b2e76923cd8ce0aa659cb5c019d9643bc9:

  MAINTAINERS: remove reference to bogus vsock file (2018-10-24 21:16:14 -0400)

----------------------------------------------------------------
virtio, vhost: fixes, tweaks

virtio balloon page hinting support
vhost scsi control queue

misc fixes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Bijan Mottahedeh (3):
      vhost/scsi: Respond to control queue operations
      vhost/scsi: Extract common handling code from control queue handler
      vhost/scsi: Use common handling code in request queue handler

Greg Edwards (1):
      vhost/scsi: truncate T10 PI iov_iter to prot_bytes

Lénaïc Huard (1):
      kvm_config: add CONFIG_VIRTIO_MENU

Stefan Hajnoczi (1):
      MAINTAINERS: remove reference to bogus vsock file

Wei Wang (3):
      virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
      mm/page_poison: expose page_poisoning_enabled to kernel modules
      virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 MAINTAINERS                         |   1 -
 drivers/vhost/scsi.c                | 426 ++++++++++++++++++++++++++++--------
 drivers/virtio/virtio_balloon.c     | 380 +++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   8 +
 kernel/configs/kvm_guest.config     |   1 +
 mm/page_poison.c                    |   6 +
 6 files changed, 688 insertions(+), 134 deletions(-)
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v8] virtio_blk: add discard and write zeroes support
From: Michael S. Tsirkin @ 2018-11-01 21:25 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Jens Axboe, virtualization, linux-block, Changpeng Liu
In-Reply-To: <20181029050521.GB31012@stefanha-x1.localdomain>

On Mon, Oct 29, 2018 at 05:05:21AM +0000, Stefan Hajnoczi wrote:
> On Fri, Oct 26, 2018 at 10:47:16AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Oct 26, 2018 at 09:08:38AM +0100, Stefan Hajnoczi wrote:
> > > On Fri, Oct 12, 2018 at 02:06:28PM -0700, Daniel Verkamp wrote:
> > > > +		range[n].flags = cpu_to_le32(flags);
> > > > +		range[n].num_sectors = cpu_to_le32(num_sectors);
> > > > +		range[n].sector = cpu_to_le64(sector);
> > > ...
> > > > +/* Discard/write zeroes range for each request. */
> > > > +struct virtio_blk_discard_write_zeroes {
> > > > +	/* discard/write zeroes start sector */
> > > > +	__virtio64 sector;
> > > > +	/* number of discard/write zeroes sectors */
> > > > +	__virtio32 num_sectors;
> > > > +	/* flags for this range */
> > > > +	__virtio32 flags;
> > > 
> > > cpu_to_le32() is being used on __virtio32 fields instead of cpu_to_virtio32().
> > > 
> > > From include/uapi/linux/virtio_types.h:
> > > 
> > >   /*
> > >    * __virtio{16,32,64} have the following meaning:
> > >    * - __u{16,32,64} for virtio devices in legacy mode, accessed in native endian
> > >    * - __le{16,32,64} for standard-compliant virtio devices
> > >    */
> > > 
> > > From the VIRTIO specification:
> > > 
> > >   struct virtio_blk_discard_write_zeroes {
> > >          le64 sector;
> > >          le32 num_sectors;
> > >          struct {
> > >                  le32 unmap:1;
> > >                  le32 reserved:31;
> > >          } flags;
> > >   };
> > > 
> > > 
> > > Since the VIRTIO spec says these fields are little-endian, I think these
> > > fields should be declared just __u32 and __u64 instead of __virtio32 and
> > > __virtio64.
> > > 
> > > Stefan
> > 
> > 
> > __le32/__le64 rather?
> 
> Yes.
> 
> Stefan

I agree. And further using bitfields for this is questionable -
it is preferable to set bits in a full 32 bit field using "|".


-- 
MST

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-11-01 21:44 UTC (permalink / raw)
  To: mst
  Cc: lenaic, mhocko, bijan.mottahedeh, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko
In-Reply-To: <20181101171938-mutt-send-email-mst@kernel.org>

On Thu, Nov 1, 2018 at 2:19 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> virtio, vhost: fixes, tweaks

Pulled.

                   Linus

^ permalink raw reply

* Re: [PATCH v8] virtio_blk: add discard and write zeroes support
From: Daniel Verkamp @ 2018-11-01 22:18 UTC (permalink / raw)
  To: mst; +Cc: axboe, virtualization, linux-block, stefanha, Changpeng Liu
In-Reply-To: <20181101172153-mutt-send-email-mst@kernel.org>

On Thu, Nov 1, 2018 at 2:25 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Oct 29, 2018 at 05:05:21AM +0000, Stefan Hajnoczi wrote:
> > On Fri, Oct 26, 2018 at 10:47:16AM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Oct 26, 2018 at 09:08:38AM +0100, Stefan Hajnoczi wrote:
> > > > On Fri, Oct 12, 2018 at 02:06:28PM -0700, Daniel Verkamp wrote:
> > > > > +               range[n].flags = cpu_to_le32(flags);
> > > > > +               range[n].num_sectors = cpu_to_le32(num_sectors);
> > > > > +               range[n].sector = cpu_to_le64(sector);
> > > > ...
> > > > > +/* Discard/write zeroes range for each request. */
> > > > > +struct virtio_blk_discard_write_zeroes {
> > > > > +       /* discard/write zeroes start sector */
> > > > > +       __virtio64 sector;
> > > > > +       /* number of discard/write zeroes sectors */
> > > > > +       __virtio32 num_sectors;
> > > > > +       /* flags for this range */
> > > > > +       __virtio32 flags;
> > > >
> > > > cpu_to_le32() is being used on __virtio32 fields instead of cpu_to_virtio32().
> > > >
> > > > From include/uapi/linux/virtio_types.h:
> > > >
> > > >   /*
> > > >    * __virtio{16,32,64} have the following meaning:
> > > >    * - __u{16,32,64} for virtio devices in legacy mode, accessed in native endian
> > > >    * - __le{16,32,64} for standard-compliant virtio devices
> > > >    */
> > > >
> > > > From the VIRTIO specification:
> > > >
> > > >   struct virtio_blk_discard_write_zeroes {
> > > >          le64 sector;
> > > >          le32 num_sectors;
> > > >          struct {
> > > >                  le32 unmap:1;
> > > >                  le32 reserved:31;
> > > >          } flags;
> > > >   };
> > > >
> > > >
> > > > Since the VIRTIO spec says these fields are little-endian, I think these
> > > > fields should be declared just __u32 and __u64 instead of __virtio32 and
> > > > __virtio64.
> > > >
> > > > Stefan
> > >
> > >
> > > __le32/__le64 rather?
> >
> > Yes.
> >
> > Stefan
>
> I agree. And further using bitfields for this is questionable -
> it is preferable to set bits in a full 32 bit field using "|".

The bitfield is only in the specification, not the code - the actual
implementation in this patch (quoted above from earlier in the thread)
uses a 32-bit field with a flag #define.

I did misunderstand the meaning of __virtio32 vs __le32 - I'll fix
that up (I think the spec definition and code for these is already
correct; the structure definition just needs to change to match).

Thanks,
-- Daniel

^ permalink raw reply

* [PATCH v9] virtio_blk: add discard and write zeroes support
From: Daniel Verkamp @ 2018-11-01 22:40 UTC (permalink / raw)
  To: virtualization, linux-block
  Cc: Jens Axboe, Michael S. Tsirkin, Christoph Hellwig,
	Stefan Hajnoczi, Paolo Bonzini, Changpeng Liu
In-Reply-To: <1528258740-6581-1-git-send-email-changpeng.liu@intel.com>

From: Changpeng Liu <changpeng.liu@intel.com>

In commit 88c85538, "virtio-blk: add discard and write zeroes features
to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
block specification has been extended to add VIRTIO_BLK_T_DISCARD and
VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
discard and write zeroes in the virtio-blk driver when the device
advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
VIRTIO_BLK_F_WRITE_ZEROES.

Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
---
dverkamp: I've picked up this patch and made a few minor changes (as
listed below); most notably, I changed the kmalloc back to GFP_ATOMIC,
since it can be called from a context where sleeping is not allowed.
To prevent large allocations, I've also clamped the maximum number of
discard segments to 256; this results in a 4K allocation and should be
plenty of descriptors for most use cases.

I also removed most of the description from the commit message, since it
was duplicating the comments from virtio_blk.h and quoting parts of the
spec without adding any extra information.  I have tested this iteration
of the patch using crosvm with modifications to enable the new features:
https://chromium.googlesource.com/chromiumos/platform/crosvm/

v9 fixes a number of review issues; I didn't attempt to optimize the
single-element write zeroes case, so it still does an allocation per
request (I did not see any easy place to put the payload that would
avoid the allocation).

CHANGELOG:
v9: [dverkamp] fix LE types in discard struct; cleanups from Ming Lei
v8: [dverkamp] replace shifts by 9 with SECTOR_SHIFT constant
v7: [dverkamp] use GFP_ATOMIC for allocation that may not sleep; clarify
descriptor flags field; comment wording cleanups.
v6: don't set T_OUT bit to discard and write zeroes commands.
v5: use new block layer API: blk_queue_flag_set.
v4: several optimizations based on MST's comments, remove bit field
usage for command descriptor.
v3: define the virtio-blk protocol to add discard and write zeroes
support, first version implementation based on proposed specification.
v2: add write zeroes command support.
v1: initial proposal implementation for discard command.
---
 drivers/block/virtio_blk.c      | 83 ++++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_blk.h | 54 +++++++++++++++++++++
 2 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 086c6bb12baa..0f39efb4b3aa 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -18,6 +18,7 @@
 
 #define PART_BITS 4
 #define VQ_NAME_LEN 16
+#define MAX_DISCARD_SEGMENTS 256u
 
 static int major;
 static DEFINE_IDA(vd_index_ida);
@@ -172,10 +173,48 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
 	return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
 }
 
+static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
+{
+	unsigned short segments = blk_rq_nr_discard_segments(req);
+	unsigned short n = 0;
+	struct virtio_blk_discard_write_zeroes *range;
+	struct bio *bio;
+	u32 flags = 0;
+
+	if (unmap)
+		flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
+
+	range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
+	if (!range)
+		return -ENOMEM;
+
+	__rq_for_each_bio(bio, req) {
+		u64 sector = bio->bi_iter.bi_sector;
+		u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+
+		range[n].flags = cpu_to_le32(flags);
+		range[n].num_sectors = cpu_to_le32(num_sectors);
+		range[n].sector = cpu_to_le64(sector);
+		n++;
+	}
+
+	req->special_vec.bv_page = virt_to_page(range);
+	req->special_vec.bv_offset = offset_in_page(range);
+	req->special_vec.bv_len = sizeof(*range) * segments;
+	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+	return 0;
+}
+
 static inline void virtblk_request_done(struct request *req)
 {
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 
+	if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+		kfree(page_address(req->special_vec.bv_page) +
+		      req->special_vec.bv_offset);
+	}
+
 	switch (req_op(req)) {
 	case REQ_OP_SCSI_IN:
 	case REQ_OP_SCSI_OUT:
@@ -225,6 +264,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 	int qid = hctx->queue_num;
 	int err;
 	bool notify = false;
+	bool unmap = false;
 	u32 type;
 
 	BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
@@ -237,6 +277,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case REQ_OP_FLUSH:
 		type = VIRTIO_BLK_T_FLUSH;
 		break;
+	case REQ_OP_DISCARD:
+		type = VIRTIO_BLK_T_DISCARD;
+		break;
+	case REQ_OP_WRITE_ZEROES:
+		type = VIRTIO_BLK_T_WRITE_ZEROES;
+		unmap = !(req->cmd_flags & REQ_NOUNMAP);
+		break;
 	case REQ_OP_SCSI_IN:
 	case REQ_OP_SCSI_OUT:
 		type = VIRTIO_BLK_T_SCSI_CMD;
@@ -256,6 +303,12 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(req);
 
+	if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
+		err = virtblk_setup_discard_write_zeroes(req, unmap);
+		if (err)
+			return BLK_STS_RESOURCE;
+	}
+
 	num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
 	if (num) {
 		if (rq_data_dir(req) == WRITE)
@@ -802,6 +855,32 @@ static int virtblk_probe(struct virtio_device *vdev)
 	if (!err && opt_io_size)
 		blk_queue_io_opt(q, blk_size * opt_io_size);
 
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
+		q->limits.discard_granularity = blk_size;
+
+		virtio_cread(vdev, struct virtio_blk_config,
+			     discard_sector_alignment, &v);
+		q->limits.discard_alignment = v ? v << SECTOR_SHIFT : 0;
+
+		virtio_cread(vdev, struct virtio_blk_config,
+			     max_discard_sectors, &v);
+		blk_queue_max_discard_sectors(q, v ? v : UINT_MAX);
+
+		virtio_cread(vdev, struct virtio_blk_config, max_discard_seg,
+			     &v);
+		blk_queue_max_discard_segments(q,
+					       min_not_zero(v,
+							    MAX_DISCARD_SEGMENTS));
+
+		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+	}
+
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
+		virtio_cread(vdev, struct virtio_blk_config,
+			     max_write_zeroes_sectors, &v);
+		blk_queue_max_write_zeroes_sectors(q, v ? v : UINT_MAX);
+	}
+
 	virtblk_update_capacity(vblk, false);
 	virtio_device_ready(vdev);
 
@@ -895,14 +974,14 @@ static unsigned int features_legacy[] = {
 	VIRTIO_BLK_F_SCSI,
 #endif
 	VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
-	VIRTIO_BLK_F_MQ,
+	VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
 }
 ;
 static unsigned int features[] = {
 	VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
 	VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
 	VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
-	VIRTIO_BLK_F_MQ,
+	VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
 };
 
 static struct virtio_driver virtio_blk = {
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ebe4d968dd5..0f99d7b49ede 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -38,6 +38,8 @@
 #define VIRTIO_BLK_F_BLK_SIZE	6	/* Block size of disk is available*/
 #define VIRTIO_BLK_F_TOPOLOGY	10	/* Topology information is available */
 #define VIRTIO_BLK_F_MQ		12	/* support more than one vq */
+#define VIRTIO_BLK_F_DISCARD	13	/* DISCARD is supported */
+#define VIRTIO_BLK_F_WRITE_ZEROES	14	/* WRITE ZEROES is supported */
 
 /* Legacy feature bits */
 #ifndef VIRTIO_BLK_NO_LEGACY
@@ -86,6 +88,39 @@ struct virtio_blk_config {
 
 	/* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
 	__u16 num_queues;
+
+	/* the next 3 entries are guarded by VIRTIO_BLK_F_DISCARD */
+	/*
+	 * The maximum discard sectors (in 512-byte sectors) for
+	 * one segment.
+	 */
+	__u32 max_discard_sectors;
+	/*
+	 * The maximum number of discard segments in a
+	 * discard command.
+	 */
+	__u32 max_discard_seg;
+	/* Discard commands must be aligned to this number of sectors. */
+	__u32 discard_sector_alignment;
+
+	/* the next 3 entries are guarded by VIRTIO_BLK_F_WRITE_ZEROES */
+	/*
+	 * The maximum number of write zeroes sectors (in 512-byte sectors) in
+	 * one segment.
+	 */
+	__u32 max_write_zeroes_sectors;
+	/*
+	 * The maximum number of segments in a write zeroes
+	 * command.
+	 */
+	__u32 max_write_zeroes_seg;
+	/*
+	 * Set if a VIRTIO_BLK_T_WRITE_ZEROES request may result in the
+	 * deallocation of one or more of the sectors.
+	 */
+	__u8 write_zeroes_may_unmap;
+
+	__u8 unused1[3];
 } __attribute__((packed));
 
 /*
@@ -114,6 +149,12 @@ struct virtio_blk_config {
 /* Get device ID command */
 #define VIRTIO_BLK_T_GET_ID    8
 
+/* Discard command */
+#define VIRTIO_BLK_T_DISCARD	11
+
+/* Write zeroes command */
+#define VIRTIO_BLK_T_WRITE_ZEROES	13
+
 #ifndef VIRTIO_BLK_NO_LEGACY
 /* Barrier before this op. */
 #define VIRTIO_BLK_T_BARRIER	0x80000000
@@ -133,6 +174,19 @@ struct virtio_blk_outhdr {
 	__virtio64 sector;
 };
 
+/* Unmap this range (only valid for write zeroes command) */
+#define VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP	0x00000001
+
+/* Discard/write zeroes range for each request. */
+struct virtio_blk_discard_write_zeroes {
+	/* discard/write zeroes start sector */
+	__le64 sector;
+	/* number of discard/write zeroes sectors */
+	__le32 num_sectors;
+	/* flags for this range */
+	__le32 flags;
+};
+
 #ifndef VIRTIO_BLK_NO_LEGACY
 struct virtio_scsi_inhdr {
 	__virtio32 errors;
-- 
2.19.1.568.g152ad8e336-goog

^ permalink raw reply related

* Re: [PULL] vhost: cleanups and fixes
From: Kees Cook @ 2018-11-01 23:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: lenaic, Michal Hocko, bijan.mottahedeh, KVM, Network Development,
	liang.z.li, LKML, virtualization, Stefan Hajnoczi, Joe Perches,
	Andrew Morton, Michal Hocko, Linus Torvalds
In-Reply-To: <20181101171938-mutt-send-email-mst@kernel.org>

On Thu, Nov 1, 2018 at 2:19 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> The following changes since commit 84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d:
>
>   Linux 4.19 (2018-10-22 07:37:37 +0100)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
>
> for you to fetch changes up to 79f800b2e76923cd8ce0aa659cb5c019d9643bc9:
>
>   MAINTAINERS: remove reference to bogus vsock file (2018-10-24 21:16:14 -0400)
>
> ----------------------------------------------------------------
> virtio, vhost: fixes, tweaks
>
> virtio balloon page hinting support
> vhost scsi control queue
>
> misc fixes.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> ----------------------------------------------------------------
> Bijan Mottahedeh (3):
>       vhost/scsi: Respond to control queue operations

+static void
+vhost_scsi_send_tmf_resp(struct vhost_scsi *vs,
+                          struct vhost_virtqueue *vq,
+                          int head, unsigned int out)
+{
+       struct virtio_scsi_ctrl_tmf_resp __user *resp;
+       struct virtio_scsi_ctrl_tmf_resp rsp;
+       int ret;
+
+       pr_debug("%s\n", __func__);
+       memset(&rsp, 0, sizeof(rsp));
+       rsp.response = VIRTIO_SCSI_S_FUNCTION_REJECTED;
+       resp = vq->iov[out].iov_base;
+       ret = __copy_to_user(resp, &rsp, sizeof(rsp));

Is it actually safe to trust that iov_base has passed an earlier
access_ok() check here? Why not just use copy_to_user() instead?

-Kees

>       vhost/scsi: Extract common handling code from control queue handler
>       vhost/scsi: Use common handling code in request queue handler
>
> Greg Edwards (1):
>       vhost/scsi: truncate T10 PI iov_iter to prot_bytes
>
> Lénaïc Huard (1):
>       kvm_config: add CONFIG_VIRTIO_MENU
>
> Stefan Hajnoczi (1):
>       MAINTAINERS: remove reference to bogus vsock file
>
> Wei Wang (3):
>       virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
>       mm/page_poison: expose page_poisoning_enabled to kernel modules
>       virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
>
>  MAINTAINERS                         |   1 -
>  drivers/vhost/scsi.c                | 426 ++++++++++++++++++++++++++++--------
>  drivers/virtio/virtio_balloon.c     | 380 +++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |   8 +
>  kernel/configs/kvm_guest.config     |   1 +
>  mm/page_poison.c                    |   6 +
>  6 files changed, 688 insertions(+), 134 deletions(-)



-- 
Kees Cook
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-11-01 23:06 UTC (permalink / raw)
  To: Kees Cook
  Cc: lenaic, mhocko, bijan.mottahedeh, kvm, mst, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko
In-Reply-To: <CAGXu5jJ0HgV2qN=wohEgro6ixqXHOHBTsvS5a9Dcpz8gxVo3bA@mail.gmail.com>

On Thu, Nov 1, 2018 at 4:00 PM Kees Cook <keescook@chromium.org> wrote:
>
> +       memset(&rsp, 0, sizeof(rsp));
> +       rsp.response = VIRTIO_SCSI_S_FUNCTION_REJECTED;
> +       resp = vq->iov[out].iov_base;
> +       ret = __copy_to_user(resp, &rsp, sizeof(rsp));
>
> Is it actually safe to trust that iov_base has passed an earlier
> access_ok() check here? Why not just use copy_to_user() instead?

Good point.

We really should have removed those double-underscore things ages ago.

Also, apart from the address, what about the size? Wouldn't it be
better to use copy_to_iter() rather than implement it badly by hand?

               Linus

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-01 23:38 UTC (permalink / raw)
  To: Kees Cook
  Cc: lenaic, Michal Hocko, bijan.mottahedeh, KVM, Network Development,
	liang.z.li, LKML, virtualization, Stefan Hajnoczi, Joe Perches,
	Andrew Morton, Michal Hocko, Linus Torvalds
In-Reply-To: <CAGXu5jJ0HgV2qN=wohEgro6ixqXHOHBTsvS5a9Dcpz8gxVo3bA@mail.gmail.com>

On Thu, Nov 01, 2018 at 04:00:23PM -0700, Kees Cook wrote:
> On Thu, Nov 1, 2018 at 2:19 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > The following changes since commit 84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d:
> >
> >   Linux 4.19 (2018-10-22 07:37:37 +0100)
> >
> > are available in the Git repository at:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
> >
> > for you to fetch changes up to 79f800b2e76923cd8ce0aa659cb5c019d9643bc9:
> >
> >   MAINTAINERS: remove reference to bogus vsock file (2018-10-24 21:16:14 -0400)
> >
> > ----------------------------------------------------------------
> > virtio, vhost: fixes, tweaks
> >
> > virtio balloon page hinting support
> > vhost scsi control queue
> >
> > misc fixes.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > ----------------------------------------------------------------
> > Bijan Mottahedeh (3):
> >       vhost/scsi: Respond to control queue operations
> 
> +static void
> +vhost_scsi_send_tmf_resp(struct vhost_scsi *vs,
> +                          struct vhost_virtqueue *vq,
> +                          int head, unsigned int out)
> +{
> +       struct virtio_scsi_ctrl_tmf_resp __user *resp;
> +       struct virtio_scsi_ctrl_tmf_resp rsp;
> +       int ret;
> +
> +       pr_debug("%s\n", __func__);
> +       memset(&rsp, 0, sizeof(rsp));
> +       rsp.response = VIRTIO_SCSI_S_FUNCTION_REJECTED;
> +       resp = vq->iov[out].iov_base;
> +       ret = __copy_to_user(resp, &rsp, sizeof(rsp));
> 
> Is it actually safe to trust that iov_base has passed an earlier
> access_ok() check here? Why not just use copy_to_user() instead?
> 
> -Kees

I am not sure copy_to_user will do the right thing here, because all
this runs in context of a kernel thread.  We do need access_ok which
takes place way earlier in context of the task.

Another reason it is safe is because the address is not
coming from userspace at all.




> >       vhost/scsi: Extract common handling code from control queue handler
> >       vhost/scsi: Use common handling code in request queue handler
> >
> > Greg Edwards (1):
> >       vhost/scsi: truncate T10 PI iov_iter to prot_bytes
> >
> > Lénaïc Huard (1):
> >       kvm_config: add CONFIG_VIRTIO_MENU
> >
> > Stefan Hajnoczi (1):
> >       MAINTAINERS: remove reference to bogus vsock file
> >
> > Wei Wang (3):
> >       virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
> >       mm/page_poison: expose page_poisoning_enabled to kernel modules
> >       virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
> >
> >  MAINTAINERS                         |   1 -
> >  drivers/vhost/scsi.c                | 426 ++++++++++++++++++++++++++++--------
> >  drivers/virtio/virtio_balloon.c     | 380 +++++++++++++++++++++++++++++---
> >  include/uapi/linux/virtio_balloon.h |   8 +
> >  kernel/configs/kvm_guest.config     |   1 +
> >  mm/page_poison.c                    |   6 +
> >  6 files changed, 688 insertions(+), 134 deletions(-)
> 
> 
> 
> -- 
> Kees Cook

^ permalink raw reply

* Re: [PATCH v9] virtio_blk: add discard and write zeroes support
From: Dongli Zhang @ 2018-11-01 23:43 UTC (permalink / raw)
  To: Daniel Verkamp
  Cc: Jens Axboe, Christoph Hellwig, Michael S. Tsirkin, virtualization,
	linux-block, Stefan Hajnoczi, Paolo Bonzini, Changpeng Liu
In-Reply-To: <20181101224035.215548-1-dverkamp@chromium.org>

Hi Daniel,

Other than crosvm, is there any version of qemu (e.g., repositories developed in
progress on github) where I can try with this feature?

Thank you very much!

Dongli Zhang

On 11/02/2018 06:40 AM, Daniel Verkamp wrote:
> From: Changpeng Liu <changpeng.liu@intel.com>
> 
> In commit 88c85538, "virtio-blk: add discard and write zeroes features
> to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
> block specification has been extended to add VIRTIO_BLK_T_DISCARD and
> VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
> discard and write zeroes in the virtio-blk driver when the device
> advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
> VIRTIO_BLK_F_WRITE_ZEROES.
> 
> Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
> ---
> dverkamp: I've picked up this patch and made a few minor changes (as
> listed below); most notably, I changed the kmalloc back to GFP_ATOMIC,
> since it can be called from a context where sleeping is not allowed.
> To prevent large allocations, I've also clamped the maximum number of
> discard segments to 256; this results in a 4K allocation and should be
> plenty of descriptors for most use cases.
> 
> I also removed most of the description from the commit message, since it
> was duplicating the comments from virtio_blk.h and quoting parts of the
> spec without adding any extra information.  I have tested this iteration
> of the patch using crosvm with modifications to enable the new features:
> https://chromium.googlesource.com/chromiumos/platform/crosvm/
> 
> v9 fixes a number of review issues; I didn't attempt to optimize the
> single-element write zeroes case, so it still does an allocation per
> request (I did not see any easy place to put the payload that would
> avoid the allocation).
> 
> CHANGELOG:
> v9: [dverkamp] fix LE types in discard struct; cleanups from Ming Lei
> v8: [dverkamp] replace shifts by 9 with SECTOR_SHIFT constant
> v7: [dverkamp] use GFP_ATOMIC for allocation that may not sleep; clarify
> descriptor flags field; comment wording cleanups.
> v6: don't set T_OUT bit to discard and write zeroes commands.
> v5: use new block layer API: blk_queue_flag_set.
> v4: several optimizations based on MST's comments, remove bit field
> usage for command descriptor.
> v3: define the virtio-blk protocol to add discard and write zeroes
> support, first version implementation based on proposed specification.
> v2: add write zeroes command support.
> v1: initial proposal implementation for discard command.
> ---
>  drivers/block/virtio_blk.c      | 83 ++++++++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_blk.h | 54 +++++++++++++++++++++
>  2 files changed, 135 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 086c6bb12baa..0f39efb4b3aa 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -18,6 +18,7 @@
>  
>  #define PART_BITS 4
>  #define VQ_NAME_LEN 16
> +#define MAX_DISCARD_SEGMENTS 256u
>  
>  static int major;
>  static DEFINE_IDA(vd_index_ida);
> @@ -172,10 +173,48 @@ static int virtblk_add_req(struct virtqueue *vq, struct virtblk_req *vbr,
>  	return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
>  }
>  
> +static int virtblk_setup_discard_write_zeroes(struct request *req, bool unmap)
> +{
> +	unsigned short segments = blk_rq_nr_discard_segments(req);
> +	unsigned short n = 0;
> +	struct virtio_blk_discard_write_zeroes *range;
> +	struct bio *bio;
> +	u32 flags = 0;
> +
> +	if (unmap)
> +		flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
> +
> +	range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
> +	if (!range)
> +		return -ENOMEM;
> +
> +	__rq_for_each_bio(bio, req) {
> +		u64 sector = bio->bi_iter.bi_sector;
> +		u32 num_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> +
> +		range[n].flags = cpu_to_le32(flags);
> +		range[n].num_sectors = cpu_to_le32(num_sectors);
> +		range[n].sector = cpu_to_le64(sector);
> +		n++;
> +	}
> +
> +	req->special_vec.bv_page = virt_to_page(range);
> +	req->special_vec.bv_offset = offset_in_page(range);
> +	req->special_vec.bv_len = sizeof(*range) * segments;
> +	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +
> +	return 0;
> +}
> +
>  static inline void virtblk_request_done(struct request *req)
>  {
>  	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
>  
> +	if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
> +		kfree(page_address(req->special_vec.bv_page) +
> +		      req->special_vec.bv_offset);
> +	}
> +
>  	switch (req_op(req)) {
>  	case REQ_OP_SCSI_IN:
>  	case REQ_OP_SCSI_OUT:
> @@ -225,6 +264,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	int qid = hctx->queue_num;
>  	int err;
>  	bool notify = false;
> +	bool unmap = false;
>  	u32 type;
>  
>  	BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> @@ -237,6 +277,13 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	case REQ_OP_FLUSH:
>  		type = VIRTIO_BLK_T_FLUSH;
>  		break;
> +	case REQ_OP_DISCARD:
> +		type = VIRTIO_BLK_T_DISCARD;
> +		break;
> +	case REQ_OP_WRITE_ZEROES:
> +		type = VIRTIO_BLK_T_WRITE_ZEROES;
> +		unmap = !(req->cmd_flags & REQ_NOUNMAP);
> +		break;
>  	case REQ_OP_SCSI_IN:
>  	case REQ_OP_SCSI_OUT:
>  		type = VIRTIO_BLK_T_SCSI_CMD;
> @@ -256,6 +303,12 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>  
>  	blk_mq_start_request(req);
>  
> +	if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
> +		err = virtblk_setup_discard_write_zeroes(req, unmap);
> +		if (err)
> +			return BLK_STS_RESOURCE;
> +	}
> +
>  	num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
>  	if (num) {
>  		if (rq_data_dir(req) == WRITE)
> @@ -802,6 +855,32 @@ static int virtblk_probe(struct virtio_device *vdev)
>  	if (!err && opt_io_size)
>  		blk_queue_io_opt(q, blk_size * opt_io_size);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
> +		q->limits.discard_granularity = blk_size;
> +
> +		virtio_cread(vdev, struct virtio_blk_config,
> +			     discard_sector_alignment, &v);
> +		q->limits.discard_alignment = v ? v << SECTOR_SHIFT : 0;
> +
> +		virtio_cread(vdev, struct virtio_blk_config,
> +			     max_discard_sectors, &v);
> +		blk_queue_max_discard_sectors(q, v ? v : UINT_MAX);
> +
> +		virtio_cread(vdev, struct virtio_blk_config, max_discard_seg,
> +			     &v);
> +		blk_queue_max_discard_segments(q,
> +					       min_not_zero(v,
> +							    MAX_DISCARD_SEGMENTS));
> +
> +		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> +	}
> +
> +	if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
> +		virtio_cread(vdev, struct virtio_blk_config,
> +			     max_write_zeroes_sectors, &v);
> +		blk_queue_max_write_zeroes_sectors(q, v ? v : UINT_MAX);
> +	}
> +
>  	virtblk_update_capacity(vblk, false);
>  	virtio_device_ready(vdev);
>  
> @@ -895,14 +974,14 @@ static unsigned int features_legacy[] = {
>  	VIRTIO_BLK_F_SCSI,
>  #endif
>  	VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> -	VIRTIO_BLK_F_MQ,
> +	VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
>  }
>  ;
>  static unsigned int features[] = {
>  	VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
>  	VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
>  	VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
> -	VIRTIO_BLK_F_MQ,
> +	VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
>  };
>  
>  static struct virtio_driver virtio_blk = {
> diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
> index 9ebe4d968dd5..0f99d7b49ede 100644
> --- a/include/uapi/linux/virtio_blk.h
> +++ b/include/uapi/linux/virtio_blk.h
> @@ -38,6 +38,8 @@
>  #define VIRTIO_BLK_F_BLK_SIZE	6	/* Block size of disk is available*/
>  #define VIRTIO_BLK_F_TOPOLOGY	10	/* Topology information is available */
>  #define VIRTIO_BLK_F_MQ		12	/* support more than one vq */
> +#define VIRTIO_BLK_F_DISCARD	13	/* DISCARD is supported */
> +#define VIRTIO_BLK_F_WRITE_ZEROES	14	/* WRITE ZEROES is supported */
>  
>  /* Legacy feature bits */
>  #ifndef VIRTIO_BLK_NO_LEGACY
> @@ -86,6 +88,39 @@ struct virtio_blk_config {
>  
>  	/* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
>  	__u16 num_queues;
> +
> +	/* the next 3 entries are guarded by VIRTIO_BLK_F_DISCARD */
> +	/*
> +	 * The maximum discard sectors (in 512-byte sectors) for
> +	 * one segment.
> +	 */
> +	__u32 max_discard_sectors;
> +	/*
> +	 * The maximum number of discard segments in a
> +	 * discard command.
> +	 */
> +	__u32 max_discard_seg;
> +	/* Discard commands must be aligned to this number of sectors. */
> +	__u32 discard_sector_alignment;
> +
> +	/* the next 3 entries are guarded by VIRTIO_BLK_F_WRITE_ZEROES */
> +	/*
> +	 * The maximum number of write zeroes sectors (in 512-byte sectors) in
> +	 * one segment.
> +	 */
> +	__u32 max_write_zeroes_sectors;
> +	/*
> +	 * The maximum number of segments in a write zeroes
> +	 * command.
> +	 */
> +	__u32 max_write_zeroes_seg;
> +	/*
> +	 * Set if a VIRTIO_BLK_T_WRITE_ZEROES request may result in the
> +	 * deallocation of one or more of the sectors.
> +	 */
> +	__u8 write_zeroes_may_unmap;
> +
> +	__u8 unused1[3];
>  } __attribute__((packed));
>  
>  /*
> @@ -114,6 +149,12 @@ struct virtio_blk_config {
>  /* Get device ID command */
>  #define VIRTIO_BLK_T_GET_ID    8
>  
> +/* Discard command */
> +#define VIRTIO_BLK_T_DISCARD	11
> +
> +/* Write zeroes command */
> +#define VIRTIO_BLK_T_WRITE_ZEROES	13
> +
>  #ifndef VIRTIO_BLK_NO_LEGACY
>  /* Barrier before this op. */
>  #define VIRTIO_BLK_T_BARRIER	0x80000000
> @@ -133,6 +174,19 @@ struct virtio_blk_outhdr {
>  	__virtio64 sector;
>  };
>  
> +/* Unmap this range (only valid for write zeroes command) */
> +#define VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP	0x00000001
> +
> +/* Discard/write zeroes range for each request. */
> +struct virtio_blk_discard_write_zeroes {
> +	/* discard/write zeroes start sector */
> +	__le64 sector;
> +	/* number of discard/write zeroes sectors */
> +	__le32 num_sectors;
> +	/* flags for this range */
> +	__le32 flags;
> +};
> +
>  #ifndef VIRTIO_BLK_NO_LEGACY
>  struct virtio_scsi_inhdr {
>  	__virtio32 errors;
> 

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-01 23:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <CAHk-=wicvws38JqzVF6oNEZ0jGzP6RecR6yAGtyzX3AkoJ321g@mail.gmail.com>

On Thu, Nov 01, 2018 at 04:06:19PM -0700, Linus Torvalds wrote:
> On Thu, Nov 1, 2018 at 4:00 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > +       memset(&rsp, 0, sizeof(rsp));
> > +       rsp.response = VIRTIO_SCSI_S_FUNCTION_REJECTED;
> > +       resp = vq->iov[out].iov_base;
> > +       ret = __copy_to_user(resp, &rsp, sizeof(rsp));
> >
> > Is it actually safe to trust that iov_base has passed an earlier
> > access_ok() check here? Why not just use copy_to_user() instead?
> 
> Good point.
> 
> We really should have removed those double-underscore things ages ago.

Well in case of vhost there are a bunch of reasons to keep them.

One is that all access_ok checks take place
way earlier in context of the owner task. Result is saved
and then used for access repeatedly. Skipping reding access_ok twice
did seem to give a small speedup in the past.
In fact I am looking
into switching some of the uses to unsafe_put_user/unsafe_get_user
after doing something like barrier_nospec after the
access_ok checks. Seems to give a measureable speedup.


Another is that the double underscore things actually can be done
in softirq context if you do preempt_disable/preempt_enable.
IIUC Jason's looking into using that to cut down the latency
for when the access is very small.

> Also, apart from the address, what about the size? Wouldn't it be
> better to use copy_to_iter() rather than implement it badly by hand?
> 
>                Linus

Generally size is checked when we retrieve the iov but I will recheck
this case and reply here.

^ permalink raw reply

* Re: [PATCH v9] virtio_blk: add discard and write zeroes support
From: Stefan Hajnoczi @ 2018-11-02  4:17 UTC (permalink / raw)
  To: Daniel Verkamp
  Cc: Jens Axboe, Christoph Hellwig, Michael S. Tsirkin, virtualization,
	linux-block, Paolo Bonzini, Changpeng Liu
In-Reply-To: <20181101224035.215548-1-dverkamp@chromium.org>


[-- Attachment #1.1: Type: text/plain, Size: 2736 bytes --]

On Thu, Nov 01, 2018 at 03:40:35PM -0700, Daniel Verkamp wrote:
> From: Changpeng Liu <changpeng.liu@intel.com>
> 
> In commit 88c85538, "virtio-blk: add discard and write zeroes features
> to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
> block specification has been extended to add VIRTIO_BLK_T_DISCARD and
> VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
> discard and write zeroes in the virtio-blk driver when the device
> advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
> VIRTIO_BLK_F_WRITE_ZEROES.
> 
> Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
> ---
> dverkamp: I've picked up this patch and made a few minor changes (as
> listed below); most notably, I changed the kmalloc back to GFP_ATOMIC,
> since it can be called from a context where sleeping is not allowed.
> To prevent large allocations, I've also clamped the maximum number of
> discard segments to 256; this results in a 4K allocation and should be
> plenty of descriptors for most use cases.
> 
> I also removed most of the description from the commit message, since it
> was duplicating the comments from virtio_blk.h and quoting parts of the
> spec without adding any extra information.  I have tested this iteration
> of the patch using crosvm with modifications to enable the new features:
> https://chromium.googlesource.com/chromiumos/platform/crosvm/
> 
> v9 fixes a number of review issues; I didn't attempt to optimize the
> single-element write zeroes case, so it still does an allocation per
> request (I did not see any easy place to put the payload that would
> avoid the allocation).
> 
> CHANGELOG:
> v9: [dverkamp] fix LE types in discard struct; cleanups from Ming Lei
> v8: [dverkamp] replace shifts by 9 with SECTOR_SHIFT constant
> v7: [dverkamp] use GFP_ATOMIC for allocation that may not sleep; clarify
> descriptor flags field; comment wording cleanups.
> v6: don't set T_OUT bit to discard and write zeroes commands.
> v5: use new block layer API: blk_queue_flag_set.
> v4: several optimizations based on MST's comments, remove bit field
> usage for command descriptor.
> v3: define the virtio-blk protocol to add discard and write zeroes
> support, first version implementation based on proposed specification.
> v2: add write zeroes command support.
> v1: initial proposal implementation for discard command.
> ---
>  drivers/block/virtio_blk.c      | 83 ++++++++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_blk.h | 54 +++++++++++++++++++++
>  2 files changed, 135 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 0/2] i8253: Fix PIT shutdown quirk on Hyper-V
From: Juergen Gross @ 2018-11-02  7:40 UTC (permalink / raw)
  To: Michael Kelley, virtualization@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org, tglx@linutronix.de,
	daniel.lezcano@linaro.org, akataria@vmware.com, olaf@aepfle.de,
	apw@canonical.com, vkuznets, jasowang@redhat.com,
	marcelo.cerri@canonical.com, KY Srinivasan
In-Reply-To: <1541093374-10496-1-git-send-email-mikelley@microsoft.com>

On 01/11/2018 18:30, Michael Kelley wrote:
> pit_shutdown() doesn't work on Hyper-V because of a quirk in the
> PIT emulation. This problem exists in all versions of Hyper-V and
> had not been noticed previously. When the counter register is set
> to zero, the emulated PIT continues to interrupt @18.2 HZ.
> 
> So add a test for running on Hyper-V, and use that test to skip
> setting the counter register when running on Hyper-V.
> 
> This patch replaces a previously proposed patch with a different
> approach. This new approach follows comments from Thomas Gleixner.

Did you consider using a static_key instead? You could set it in
ms_hyperv_init_platform(). This would enable you to support future
Hyper-V versions which don't require avoiding to set the count to zero.


Juergen

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Mark Rutland @ 2018-11-02 11:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: lenaic, mhocko, Kees Cook, kvm, mst, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <CAHk-=wicvws38JqzVF6oNEZ0jGzP6RecR6yAGtyzX3AkoJ321g@mail.gmail.com>

On Thu, Nov 01, 2018 at 04:06:19PM -0700, Linus Torvalds wrote:
> On Thu, Nov 1, 2018 at 4:00 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > +       memset(&rsp, 0, sizeof(rsp));
> > +       rsp.response = VIRTIO_SCSI_S_FUNCTION_REJECTED;
> > +       resp = vq->iov[out].iov_base;
> > +       ret = __copy_to_user(resp, &rsp, sizeof(rsp));
> >
> > Is it actually safe to trust that iov_base has passed an earlier
> > access_ok() check here? Why not just use copy_to_user() instead?
> 
> Good point.
> 
> We really should have removed those double-underscore things ages ago.

FWIW, on arm64 we always check/sanitize the user address as a result of
our sanitization of speculated values. Almost all of our uaccess
routines have an explicit access_ok().

All our uaccess routines mask the user pointer based on addr_limit,
which prevents speculative or architectural uaccess to kernel addresses
when addr_limit it USER_DS:

    4d8efc2d5ee4c9cc ("arm64: Use pointer masking to limit uaccess speculation")

We also inhibit speculative stores to addr_limit being forwarded under
speculation:

    c2f0ad4fc089cff8 ("arm64: uaccess: Prevent speculative use of the current addr_limit")

... and given all that, we folded explicit access_ok() checks into
__{get,put}_user():

    84624087dd7e3b48 ("arm64: uaccess: Don't bother eliding access_ok checks in __{get, put}_user")

IMO we could/should do the same for __copy_{to,from}_user().

Thanks,
Mark.

^ permalink raw reply

* Call for Papers - WorldCIST'19 - La Toja Island, Spain
From: ML @ 2018-11-02 12:27 UTC (permalink / raw)
  To: virtualization


[-- Attachment #1.1: Type: text/plain, Size: 9265 bytes --]

*** Best papers published in JCR/SCI/SSCI journals



------------------------------       -----------------------------
WorldCIST'19 - 7th World Conference on Information Systems and Technologies

16 - 19 April 2019 | La Toja Island, Spain

http://www.worldcist.org/ <http://www.worldcist.org/>

------------------------------       ------------------------------       ------------------------------       ------


Scope

The WorldCist'19 - 7th World Conference on Information Systems and Technologies, to be held at La Toja Island, Galicia, Spain, 16 - 19 April 2019, is a global forum for researchers and practitioners to present and discuss the most recent innovations, trends, results, experiences and concerns in the several perspectives of Information Systems and Technologies.

We are pleased to invite you to submit your papers to WorldCist'18. All submissions will be reviewed on the basis of relevance, originality, importance and clarity.



Themes

Submitted papers should be related with one or more of the main themes proposed for the Conference:

A) Information and Knowledge Management (IKM);

B) Organizational Models and Information Systems (OMIS);

C) Software and Systems Modeling (SSM);

D) Software Systems, Architectures, Applications and Tools (SSAAT);

E) Multimedia Systems and Applications (MSA);

F) Computer Networks, Mobility and Pervasive Systems (CNMPS);

G) Intelligent and Decision Support Systems (IDSS);

H) Big Data Analytics and Applications (BDAA);

I) Human-Computer Interaction (HCI);

J) Ethics, Computers and Security (ECS)

K) Health Informatics (HIS);

L) Information Technologies in Education (ITE);

M) Information Technologies in Radiocommunications (ITR);

N) Technologies for Biomedical Applications (TBA)



Types of Submissions and Decisions

Four types of papers can be submitted:

Full paper: Finished or consolidated R&D works, to be included in one of the Conference themes. These papers are assigned a 10-page limit.

Short paper: Ongoing works with relevant preliminary results, open to discussion. These papers are assigned a 7-page limit.

Poster paper: Initial work with relevant ideas, open to discussion. These papers are assigned to a 4-page limit.

Company paper: Companies' papers that show practical experience, R & D, tools, etc., focused on some topics of the conference. These papers are assigned to a 4-page limit.

Submitted papers must comply with the format of Advances in Intelligent Systems and Computing Series (see Instructions for Authors at Springer Website <https://www.springer.com/us/authors-editors/conference-proceedings/conference-proceedings-guidelines> or download a Word Template <ftp://ftp.springernature.com/cs-proceeding/llncs/word/splnproc1703.zip>or Latex Package <ftp://ftp.springernature.com/cs-proceeding/svproc/templates/ProcSci_TeX.zip>) be written in English, must not have been published before, not be under review for any other conference or publication and not include any information leading to the authors’ identification. Therefore, the authors’ names, affiliations and bibliographic references should not be included in the version for evaluation by the Program Committee. This information should only be included in the camera-ready version, saved in Word or Latex format and also in PDF format. These files must be accompanied by the Consent to Publish form <http://worldcist.org/copyright.pdf> filled out, in a ZIP file, and uploaded at the conference management system.


All papers will be subjected to a “double-blind review” by at least two members of the Program Committee.

Based on Program Committee evaluation, a paper can be rejected or accepted by the Conference Chairs. In the later case, it can be accepted as the type originally submitted or as another type. Thus, full papers can be accepted as short papers or poster papers only. Similarly, short papers can be accepted as poster papers only. In these cases, the authors will be allowed to maintain the original number of pages in the camera-ready version.

The authors of accepted poster papers must also build and print a poster to be exhibited during the Conference. This poster must follow an A1 or A2 vertical format. The Conference can includes Work Sessions where these posters are presented and orally discussed, with a 5 minute limit per poster.

The authors of accepted full papers will have 15 minutes to present their work in a Conference Work Session; approximately 5 minutes of discussion will follow each presentation. The authors of accepted short papers and company papers will have 11 minutes to present their work in a Conference Work Session; approximately 4 minutes of discussion will follow each presentation.



Publication and Indexing

To ensure that a full paper, short paper, poster paper or company paper is published, at least one of the authors must be fully registered by the 13th of January 2019, and the paper must comply with the suggested layout and page-limit. Additionally, all recommended changes must be addressed by the authors before they submit the camera-ready version.

No more than one paper per registration will be published. An extra fee must be paid for publication of additional papers, with a maximum of one additional paper per registration. One registration permits only the participation of one author in the conference.

Full and short papers will be published in Proceedings by Springer, in Advances in Intelligent Systems and Computing <http://www.springer.com/series/11156> series. Poster and company papers will not be published, just presented in the conference.

Published full and short papers will be submitted for indexation by ISI, EI-Compendex, SCOPUS, DBLP and Google Scholar, among others, and will be available in the SpringerLink Digital Library.

The authors of the best selected papers will be invited to extend them for publication in international journals indexed by ISI/SCI, SCOPUS and DBLP, among others, such as:

International Journal of Neural Systems <https://www.worldscientific.com/worldscinet/ijns> (IF: 4.58 / Q1)

Integrated Computer-Aided Engineering <http://www.iospress.nl/journal/integrated-computer-aided-engineering/> (IF: 3.667 / Q1)

Telecommunications Policy <https://www.journals.elsevier.com/telecommunications-policy> (IF: 2.087 / Q1)

Group Decision and Negotiation <https://www.springer.com/business+%26+management/operations+research/journal/10726> (IF: 1.869 / Q1)

Computers in Industry <https://www.journals.elsevier.com/computers-in-industry> (IF: 2.850 / Q2)

Journal of Medical Systems <http://www.springer.com/public+health/journal/10916> (IF: 2.098 / Q2)

Computer Languages, Systems & Structures <https://www.journals.elsevier.com/computer-languages-systems-and-structures/> (IF: 1.840 / Q2)

Cluster Computing <https://www.springer.com/computer/communication+networks/journal/10586> (IF: 1.601 / Q2)

Expert Systems - Journal of Knowledge Engineering <http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1468-0394> (IF: 1.43 / Q2)

Informatica - An International Journal <https://www.mii.lt/informatica/> (IF: 1.386 / Q2)

Journal of Intelligent & Fuzzy Systems <https://www.iospress.nl/journal/journal-of-intelligent-fuzzy-systems/> (IF: 1.426 / Q3)

Enterprise Information Systems <https://www.tandfonline.com/toc/teis20/current> (IF: 1.683 / Q3)

Data Technologies and Applications <http://www.emeraldgrouppublishing.com/products/journals/journals.htm?id=dta> (IF: 1.170 / Q3)

Innovations in Education and Teaching International <https://www.tandfonline.com/toc/riie20/current> (IF: 1.106 / Q3)

Intelligent Data Analysis <http://www.iospress.nl/journal/intelligent-data-analysis> (IF: 0.691 / Q4)

Computational and Mathematical Organization Theory <http://www.springer.com/business+%26+management/journal/10588> (IF: 0.641 / Q4)

AI Communications <https://www.iospress.nl/journal/ai-communications/> (IF: 0.461 / Q4)

Journal of Web Engineering <http://www.riverpublishers.com/journal.php?j=JWE/17/4> (IF: 0.311 / Q4)

Journal of Database Management <https://www.igi-global.com/journal/journal-database-management/1072> (IF: 0.231 / Q4)

Journal of Hospitality and Tourism Technology <http://www.emeraldgrouppublishing.com/products/journals/journals.htm?id=jhtt> (ISI - Emerging Sources Citation Index)

Computer Methods in Biomechanics and Biomedical Engineering - Imaging & Visualization <http://www.tandfonline.com/action/journalInformation?show=aimsScope&journalCode=tciv20#.V4KqBzUmMQs> (ISI - Emerging Sources Citation Index)

Journal of Information Systems Engineering & Management <http://www.lectitojournals.com/journal-for-information-systems-engineering-management>



Important Dates

Paper Submission: November 18, 2018

Notification of Acceptance: December 30, 2018

Payment of Registration, to ensure the inclusion of an accepted paper in the conference proceedings: January 13, 2019.

Camera-ready Submission: January 13, 2019



Website of WorldCIST'19: http://www.worldcist.org/ <http://www.worldcist.org/>




---
This email has been checked for viruses by AVG.
https://www.avg.com

[-- Attachment #1.2: Type: text/html, Size: 20457 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-02 13:04 UTC (permalink / raw)
  To: Mark Rutland
  Cc: lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, Linus Torvalds, bijan.mottahedeh
In-Reply-To: <20181102114635.hi3q53kzmz4qljsf@lakrids.cambridge.arm.com>

On Fri, Nov 02, 2018 at 11:46:36AM +0000, Mark Rutland wrote:
> On Thu, Nov 01, 2018 at 04:06:19PM -0700, Linus Torvalds wrote:
> > On Thu, Nov 1, 2018 at 4:00 PM Kees Cook <keescook@chromium.org> wrote:
> > >
> > > +       memset(&rsp, 0, sizeof(rsp));
> > > +       rsp.response = VIRTIO_SCSI_S_FUNCTION_REJECTED;
> > > +       resp = vq->iov[out].iov_base;
> > > +       ret = __copy_to_user(resp, &rsp, sizeof(rsp));
> > >
> > > Is it actually safe to trust that iov_base has passed an earlier
> > > access_ok() check here? Why not just use copy_to_user() instead?
> > 
> > Good point.
> > 
> > We really should have removed those double-underscore things ages ago.
> 
> FWIW, on arm64 we always check/sanitize the user address as a result of
> our sanitization of speculated values. Almost all of our uaccess
> routines have an explicit access_ok().
> 
> All our uaccess routines mask the user pointer based on addr_limit,
> which prevents speculative or architectural uaccess to kernel addresses
> when addr_limit it USER_DS:
> 
>     4d8efc2d5ee4c9cc ("arm64: Use pointer masking to limit uaccess speculation")
> 
> We also inhibit speculative stores to addr_limit being forwarded under
> speculation:
> 
>     c2f0ad4fc089cff8 ("arm64: uaccess: Prevent speculative use of the current addr_limit")
> 
> ... and given all that, we folded explicit access_ok() checks into
> __{get,put}_user():
> 
>     84624087dd7e3b48 ("arm64: uaccess: Don't bother eliding access_ok checks in __{get, put}_user")
> 
> IMO we could/should do the same for __copy_{to,from}_user().
> 
> Thanks,
> Mark.

I've tried making access_ok mask the parameter it gets.  Works because
access_ok is a macro. Most users pass in a variable so that will block
attempts to use speculation to bypass the access_ok checks.

Not 100% as someone can copy the value before access_ok, but
then it's all mitigation anyway.

Places which call access_ok on a non-lvalue need to be fixed then but
there are not too many of these.

The advantage here is that a code like this:

access_ok
for(...)
	__get_user

isn't slowed down as the masking is outside the loop.

OTOH macros changing their arguments are kind of ugly.
What do others think?

Just to show what I mean:

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index aae77eb8491c..c4d12c8f47d7 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -7,6 +7,7 @@
 #include <linux/compiler.h>
 #include <linux/kasan-checks.h>
 #include <linux/string.h>
+#include <linux/nospec.h>
 #include <asm/asm.h>
 #include <asm/page.h>
 #include <asm/smap.h>
@@ -69,6 +70,33 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
 	__chk_range_not_ok((unsigned long __force)(addr), size, limit); \
 })
 
+/*
+ * Test whether a block of memory is a valid user space address.
+ * Returns 0 if the range is valid, address itself otherwise.
+ */
+static inline unsigned long __verify_range_nospec(unsigned long addr,
+						  unsigned long size,
+						  unsigned long limit)
+{
+	/* Be careful about overflow */
+	limit = array_index_nospec(limit, size);
+
+	/*
+	 * If we have used "sizeof()" for the size,
+	 * we know it won't overflow the limit (but
+	 * it might overflow the 'addr', so it's
+	 * important to subtract the size from the
+	 * limit, not add it to the address).
+	 */
+	if (__builtin_constant_p(size)) {
+		return array_index_nospec(addr, limit - size + 1);
+	}
+
+	/* Arbitrary sizes? Be careful about overflow */
+	return array_index_mask_nospec(limit, size) &
+		array_index_nospec(addr, limit - size + 1);
+}
+
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 # define WARN_ON_IN_IRQ()	WARN_ON_ONCE(!in_task())
 #else
@@ -95,12 +123,46 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
  * checks that the pointer is in the user space range - after calling
  * this function, memory access functions may still return -EFAULT.
  */
-#define access_ok(type, addr, size)					\
+#define unsafe_access_ok(type, addr, size)					\
 ({									\
 	WARN_ON_IN_IRQ();						\
 	likely(!__range_not_ok(addr, size, user_addr_max()));		\
 })
 
+/**
+ * access_ok_nospec: - Checks if a user space pointer is valid
+ * @type: Type of access: %VERIFY_READ or %VERIFY_WRITE.  Note that
+ *        %VERIFY_WRITE is a superset of %VERIFY_READ - if it is safe
+ *        to write to a block, it is always safe to read from it.
+ * @addr: User space pointer to start of block to check
+ * @size: Size of block to check
+ *
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
+ *
+ * Checks if a pointer to a block of memory in user space is valid.
+ *
+ * Returns address itself (nonzero) if the memory block may be valid,
+ * zero if it is definitely invalid.
+ *
+ * To prevent speculation, the returned value must then be used
+ * for accesses.
+ *
+ * Note that, depending on architecture, this function probably just
+ * checks that the pointer is in the user space range - after calling
+ * this function, memory access functions may still return -EFAULT.
+ */
+#define access_ok_nospec(type, addr, size)			\
+({								\
+	WARN_ON_IN_IRQ();					\
+	__chk_user_ptr(addr);					\
+	addr = (typeof(addr) __force)				\
+	__verify_range_nospec((unsigned long __force)(addr),	\
+			       size, user_addr_max());		\
+})
+
+#define access_ok(type, addr, size) access_ok_nospec(type, addr, size)
+
 /*
  * These are the main single-value transfer routines.  They automatically
  * use the right size if we just have the right pointer type.
-- 
MST

^ permalink raw reply related

* Re: [PATCH 2/5] drm/virtio: add uapi for in and out explicit fences
From: Emil Velikov @ 2018-11-02 13:34 UTC (permalink / raw)
  To: Robert Foss
  Cc: Rob Herring, David Airlie, Linux-Kernel@Vger. Kernel. Org,
	ML dri-devel, open list:VIRTIO GPU DRIVER, Gustavo Padovan,
	Emil Velikov
In-Reply-To: <3a8a03d6-a544-b9f3-15a2-f6a2411bdb28@collabora.com>

On Thu, 1 Nov 2018 at 12:56, Robert Foss <robert.foss@collabora.com> wrote:
> On 2018-10-31 10:38, Emil Velikov wrote:
> > Hi Rob,
> >
> > On Thu, 25 Oct 2018 at 19:38, Robert Foss <robert.foss@collabora.com> wrote:
> >>
> >> Add a new field called fence_fd that will be used by userspace to send
> >> in-fences to the kernel and receive out-fences created by the kernel.
> >>
> >> This uapi enables virtio to take advantage of explicit synchronization of
> >> dma-bufs.
> >>
> >> There are two new flags:
> >>
> >> * VIRTGPU_EXECBUF_FENCE_FD_IN to be used when passing an in-fence fd.
> >> * VIRTGPU_EXECBUF_FENCE_FD_OUT to be used when requesting an out-fence fd
> >>
> >> The execbuffer IOCTL is now read-write to allow the userspace to read the
> >> out-fence.
> >>
> >> On error -1 should be returned in the fence_fd field.
> >>
> >> Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.com>
> >> Signed-off-by: Robert Foss <robert.foss@collabora.com>
> >> ---
> >> Changes since v2:
> >>   - Since exbuf-flags is a new flag, check that unsupported
> >>     flags aren't set.
> >>
> >>   drivers/gpu/drm/virtio/virtgpu_ioctl.c |  5 +++++
> >>   include/uapi/drm/virtgpu_drm.h         | 13 ++++++++++---
> >>   2 files changed, 15 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
> >> index d01a9ed100d1..1af289b28fc4 100644
> >> --- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
> >> +++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
> >> @@ -116,9 +116,14 @@ static int virtio_gpu_execbuffer_ioctl(struct drm_device *dev, void *data,
> >>          struct ww_acquire_ctx ticket;
> >>          void *buf;
> >>
> >> +       exbuf->fence_fd = -1;
> >> +
> > Move this after the sanity checking.
>
> Agreed. Fixed in v4
>
> >
> >>          if (vgdev->has_virgl_3d == false)
> >>                  return -ENOSYS;
> >>
> >> +       if ((exbuf->flags & ~VIRTGPU_EXECBUF_FLAGS))
> >> +               return -EINVAL;
> >> +
> > I assume this did this trigger when using old userspace?
>
> No, not as far as I'm aware. This check is there to prevent userspace from
> polluting the bitspace of flag, so that all free bits can be used for new flags.
>
> As far as I understand this is pointed out by a drm driver development document
> written by danvet, which I unfortunately can't seem to find the link for at the
> moment.
>
Yes that is correct. What I was asking is:

Does a kernel with this patch, work with mesa lacking the corresponding updates?
I'd imagine things work just fine.

-Emil

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-11-02 16:14 UTC (permalink / raw)
  To: mst
  Cc: mark.rutland, lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <20181102083018-mutt-send-email-mst@kernel.org>

On Fri, Nov 2, 2018 at 6:04 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> I've tried making access_ok mask the parameter it gets.

PLEASE don't do this.

Just use "copy_to/from_user()".

We have had lots of bugs because code bitrots.

And no, the access_ok() checks aren't expensive, not even in a loop.
They *used* to be somewhat expensive compared to the access, but that
simply isn't true any more. The real expense in copy_to_user and
friends are in the user access bit setting (STAC and CLAC on x86),
which easily an order of magnitude more expensive than access_ok().

So just get rid of the double-underscore version. It's basically
always a mis-optimization due to entirely historical reasons. I can
pretty much guarantee that it's not visible in profiles.

                  Linus

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-02 16:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mark.rutland, lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <CAHk-=wh_bQK5zs+CwQ5eyodq4sQT0eOPp60qzvVL2_EtgETP-g@mail.gmail.com>

On Fri, Nov 02, 2018 at 09:14:51AM -0700, Linus Torvalds wrote:
> On Fri, Nov 2, 2018 at 6:04 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > I've tried making access_ok mask the parameter it gets.
> 
> PLEASE don't do this.

Okay.

> Just use "copy_to/from_user()".

Just for completeness I'd like to point out for vhost the copies are
done from the kernel thread.  So yes we can switch to copy_to/from_user
but for e.g. 32-bit userspace running on top of a 64 bit kernel it is
IIUC not sufficient - we must *also* do access_ok checks on control path
when addresses are passed to the kernel and when current points to the
correct task struct.

> We have had lots of bugs because code bitrots.

Yes, I wish we did not need these access_ok checks and could just rely
on copy_to/from_user.

> And no, the access_ok() checks aren't expensive, not even in a loop.
> They *used* to be somewhat expensive compared to the access, but that
> simply isn't true any more. The real expense in copy_to_user and
> friends are in the user access bit setting (STAC and CLAC on x86),
> which easily an order of magnitude more expensive than access_ok().
> 
> So just get rid of the double-underscore version. It's basically
> always a mis-optimization due to entirely historical reasons. I can
> pretty much guarantee that it's not visible in profiles.
> 
>                   Linus

OK. So maybe we should focus on switching to user_access_begin/end +
unsafe_get_user/unsafe_put_user in a loop which does seem to be
measureable.  That moves the barrier out of the loop, which seems to be
consistent with what you would expect.

-- 
MST

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-11-02 17:10 UTC (permalink / raw)
  To: mst
  Cc: mark.rutland, lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <20181102122937-mutt-send-email-mst@kernel.org>

On Fri, Nov 2, 2018 at 9:59 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Just for completeness I'd like to point out for vhost the copies are
> done from the kernel thread.  So yes we can switch to copy_to/from_user
> but for e.g. 32-bit userspace running on top of a 64 bit kernel it is
> IIUC not sufficient - we must *also* do access_ok checks on control path
> when addresses are passed to the kernel and when current points to the
> correct task struct.

Don't you take over the VM with "use_mm()" when you do the copies? So
yes, it's a kernel thread, but it has a user VM, and though that
should have the user limits.

No?

          Linus

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-11-02 17:15 UTC (permalink / raw)
  To: mst, Al Viro
  Cc: mark.rutland, lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <CAHk-=wibzqo7C+mS+BgZxRbgdWe2w5F39EhuFUhZUxvotoGLuA@mail.gmail.com>

On Fri, Nov 2, 2018 at 10:10 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Don't you take over the VM with "use_mm()" when you do the copies? So
> yes, it's a kernel thread, but it has a user VM, and though that
> should have the user limits.

Oooh. *Just* as I sent this, I realized that "use_mm()" doesn't update
the thread addr_limit.

That actually looks like a bug to me - although one that you've
apparently been aware of and worked around.

Wouldn't it be nicer to just make "use_mm()" do

        set_fs(USER_DS);

instead? And undo it on unuse_mm()?

And, in fact, maybe we should default kernel threads to have a zero
address limit, so that they can't do any user accesses at all without
doing this?

Adding Al to the cc, because I think he's been looking at set_fs() in general.

                  Linus

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-02 17:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mark.rutland, lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <CAHk-=wibzqo7C+mS+BgZxRbgdWe2w5F39EhuFUhZUxvotoGLuA@mail.gmail.com>

On Fri, Nov 02, 2018 at 10:10:45AM -0700, Linus Torvalds wrote:
> On Fri, Nov 2, 2018 at 9:59 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > Just for completeness I'd like to point out for vhost the copies are
> > done from the kernel thread.  So yes we can switch to copy_to/from_user
> > but for e.g. 32-bit userspace running on top of a 64 bit kernel it is
> > IIUC not sufficient - we must *also* do access_ok checks on control path
> > when addresses are passed to the kernel and when current points to the
> > correct task struct.
> 
> Don't you take over the VM with "use_mm()" when you do the copies?

Yes we do.

> So
> yes, it's a kernel thread, but it has a user VM, and though that
> should have the user limits.
> 
> No?
> 
>           Linus

Here's what I meant: we have

#define access_ok(type, addr, size)                                     \
({                                                                      \
        WARN_ON_IN_IRQ();                                               \
        likely(!__range_not_ok(addr, size, user_addr_max()));           \
})

and

#define user_addr_max() (current->thread.addr_limit.seg)

it seems that it depends on current not on the active mm.

get_user and friends are similar:

ENTRY(__get_user_1)
        mov PER_CPU_VAR(current_task), %_ASM_DX
        cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
        jae bad_get_user
        sbb %_ASM_DX, %_ASM_DX          /* array_index_mask_nospec() */
        and %_ASM_DX, %_ASM_AX
        ASM_STAC
1:      movzbl (%_ASM_AX),%edx
        xor %eax,%eax
        ASM_CLAC
        ret
ENDPROC(__get_user_1)
EXPORT_SYMBOL(__get_user_1)

-- 
MST

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-11-02 18:02 UTC (permalink / raw)
  To: mst
  Cc: mark.rutland, lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <20181102131334-mutt-send-email-mst@kernel.org>

On Fri, Nov 2, 2018 at 10:21 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> it seems that it depends on current not on the active mm.

Yes, see my other suggestion to just fix "use_mm()" to update the address range.

Would you mind testing that?

Because that would seem to be the *much* less error-prone model..

           Linus

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-02 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mark.rutland, lenaic, mhocko, Kees Cook, kvm, netdev, liang.z.li,
	Linux Kernel Mailing List, virtualization, stefanha, joe,
	Andrew Morton, mhocko, bijan.mottahedeh
In-Reply-To: <CAHk-=wjAvx_rWKct5RtqRrVEgNxrFF=1qQwkz+AegDvd4yXW=g@mail.gmail.com>

On Fri, Nov 02, 2018 at 11:02:35AM -0700, Linus Torvalds wrote:
> On Fri, Nov 2, 2018 at 10:21 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > it seems that it depends on current not on the active mm.
> 
> Yes, see my other suggestion to just fix "use_mm()" to update the address range.
> 
> Would you mind testing that?

Sure, I'll test it.

> Because that would seem to be the *much* less error-prone model..
> 
>            Linus

I agree, it's always been bothering me.

-- 
MST

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox