* (unknown), 
       [not found] <[PATCH 0/2] ceph osd: initial VMware VAAI support>
@ 2016-03-10  6:34 ` Mike Christie
  2016-03-10  6:34   ` [PATCH 1/2] ceph osd: add support for new op writesame Mike Christie
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Mike Christie @ 2016-03-10  6:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: ddiss

The following patches made over the ceph master branch
implement OSD side support for VMware VAAI's Atomic Test
and Set (ATS) and Write Same (Zero) requests.

ATS is used for operations like locking and heartbeats. It
is implemented as the SCSI COMPARE_AND_WRITE command, which
requires the device to read N blocks, compare them to data
sent with the command, and, if equal, write N blocks.
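As a rough illustration of the semantics (not the OSD wire format; the names and in-memory "device" are hypothetical), COMPARE_AND_WRITE carries a verify half and a write half in one payload:

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch of COMPARE_AND_WRITE semantics: compare the first
// half of `payload` against the current device contents at `offset`; only
// if they match, write the second half there atomically.
int compare_and_write(std::vector<unsigned char>& device, std::size_t offset,
                      const std::vector<unsigned char>& payload)
{
  std::size_t half = payload.size() / 2;   // N blocks verify + N blocks write
  if (offset + half > device.size())
    return -EINVAL;
  if (std::memcmp(device.data() + offset, payload.data(), half) != 0)
    return -EILSEQ;                        // miscompare: leave data untouched
  std::memcpy(device.data() + offset, payload.data() + half, half);
  return 0;
}
```

On a miscompare the data is left untouched, which is what makes the operation usable as an atomic test-and-set primitive for locking.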

Zero is used to initialize blocks to zero. It is implemented
as the SCSI WRITE_SAME command which passes the device a
block's worth of data and has it write it multiple times.
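Sketched out (again illustrative only, not the OSD interface), WRITE SAME replicates one block of payload across the whole target range; a Zero request is just this with an all-zero block:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: expand a single block of data so it covers
// `length` bytes. Returns an empty vector if `length` is not a whole
// number of blocks, mirroring the usual validity check.
std::vector<unsigned char> write_same(const std::vector<unsigned char>& block,
                                      std::size_t length)
{
  std::vector<unsigned char> out;
  if (block.empty() || length % block.size())
    return out;                 // length must be a multiple of the block size
  out.reserve(length);
  while (out.size() < length)
    out.insert(out.end(), block.begin(), block.end());
  return out;
}
```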

This does not include support for XCOPY/extended copy. I
am still looking into this, but it seems it might be
difficult to support due to rbd being more tuned to cloning
entire devices. When we implement VASA, the cloneVirtualVolume
might be something we can support though.

More info on VAAI can be found here:
http://www.vmware.com/resources/techresources/10337

The krbd patches which use these requests are in the vaai branch of
this tree:
https://github.com/mikechristie/linux-kernel

I did not submit them in this thread, because they depend on other
patches that are still being reviewed upstream and I did not want
to waste people's time reviewing them if they change. These OSD side
patches should be OK to review and merge, because the op format
and implementation should not change.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] ceph osd: add support for new op writesame
  2016-03-10  6:34 ` (unknown), Mike Christie
@ 2016-03-10  6:34   ` Mike Christie
  2016-03-10 12:03     ` David Disseldorp
  2016-03-10  6:34   ` [PATCH 2/2] ceph osd: add support for new op cmpext Mike Christie
  2016-03-10  6:36   ` [PATCH 0/2] ceph osd: initial VMware VAAI support Mike Christie
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Christie @ 2016-03-10  6:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: ddiss, Mike Christie

This adds a new ceph request, writesame, which repeatedly writes a
buffer of writesame.data_length bytes starting at writesame.offset
until writesame.length bytes are covered.

This command maps to SCSI's WRITE SAME request, so users like LIO+rbd
can pass this to the OSD. Right now, it only saves having to transfer
writesame.length bytes over the network, but future versions will be
able to fully offload it by passing it directly to the FS/devices if
they support it.

v2:
- Merge David's tracing fixes.

Signed-off-by: Mike Christie <mchristi@redhat.com>
---
 src/include/rados.h     |  8 ++++++++
 src/osd/ReplicatedPG.cc | 38 ++++++++++++++++++++++++++++++++++++++
 src/osd/ReplicatedPG.h  |  2 ++
 src/tracing/osd.tp      | 18 ++++++++++++++++++
 4 files changed, 66 insertions(+)

diff --git a/src/include/rados.h b/src/include/rados.h
index f14d677..4d508c0 100644
--- a/src/include/rados.h
+++ b/src/include/rados.h
@@ -256,6 +256,9 @@ extern const char *ceph_osd_state_name(int s);
 	f(CACHE_PIN,	__CEPH_OSD_OP(WR, DATA, 36),	"cache-pin")        \
 	f(CACHE_UNPIN,	__CEPH_OSD_OP(WR, DATA, 37),	"cache-unpin")      \
 									    \
+	/* ESX/SCSI */							    \
+	f(WRITESAME,	__CEPH_OSD_OP(WR, DATA, 38),	"write-same")	    \
+									    \
 	/** multi **/							    \
 	f(CLONERANGE,	__CEPH_OSD_OP(WR, MULTI, 1),	"clonerange")	    \
 	f(ASSERT_SRC_VERSION, __CEPH_OSD_OP(RD, MULTI, 2), "assert-src-version") \
@@ -538,6 +541,11 @@ struct ceph_osd_op {
 			__le64 expected_object_size;
 			__le64 expected_write_size;
 		} __attribute__ ((packed)) alloc_hint;
+		struct {
+			__le64 offset;
+			__le64 length;
+			__le64 data_length;
+		} __attribute__ ((packed)) writesame;
 	};
 	__le32 payload_len;
 } __attribute__ ((packed));
diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 5231e49..6a6112e 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -3650,6 +3650,37 @@ int ReplicatedPG::do_xattr_cmp_str(int op, string& v1s, bufferlist& xattr)
   }
 }
 
+int ReplicatedPG::do_writesame(OpContext *ctx, OSDOp& osd_op)
+{
+  ceph_osd_op& op = osd_op.op;
+  vector<OSDOp> write_ops(1);
+  OSDOp& write_op = write_ops[0];
+  uint64_t write_length = op.writesame.length;
+  int result = 0;
+
+  if (write_length % op.writesame.data_length)
+    return -EINVAL;
+
+  if (op.writesame.data_length != osd_op.indata.length()) {
+    derr << "invalid length ws data length " << op.writesame.data_length << " actual len " << osd_op.indata.length() << dendl;
+    return -EINVAL;
+  }
+
+  while (write_length) {
+    write_op.indata.append(osd_op.indata.c_str(), op.writesame.data_length);
+    write_length -= op.writesame.data_length;
+  }
+
+  write_op.op.op = CEPH_OSD_OP_WRITE;
+  write_op.op.extent.offset = op.writesame.offset;
+  write_op.op.extent.length = op.writesame.length;
+  result = do_osd_ops(ctx, write_ops);
+  if (result < 0)
+    derr << "do_writesame do_osd_ops failed " << result << dendl;
+
+  return result;
+}
+
 // ========================================================================
 // low level osd ops
 
@@ -5038,6 +5069,13 @@ int ReplicatedPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
       }
       break;
 
+    case CEPH_OSD_OP_WRITESAME:
+      ++ctx->num_write;
+      tracepoint(osd, do_osd_op_pre_writesame, soid.oid.name.c_str(), soid.snap.val, oi.size, op.writesame.offset, op.writesame.length, op.writesame.data_length);
+
+      result = do_writesame(ctx, osd_op);
+      break;
+
     case CEPH_OSD_OP_ROLLBACK :
       ++ctx->num_write;
       tracepoint(osd, do_osd_op_pre_rollback, soid.oid.name.c_str(), soid.snap.val);
diff --git a/src/osd/ReplicatedPG.h b/src/osd/ReplicatedPG.h
index 3d24617..8004d25 100644
--- a/src/osd/ReplicatedPG.h
+++ b/src/osd/ReplicatedPG.h
@@ -1430,6 +1430,8 @@ protected:
   int do_xattr_cmp_u64(int op, __u64 v1, bufferlist& xattr);
   int do_xattr_cmp_str(int op, string& v1s, bufferlist& xattr);
 
+  int do_writesame(OpContext *ctx, OSDOp& osd_op);
+
   bool pgls_filter(PGLSFilter *filter, hobject_t& sobj, bufferlist& outdata);
   int get_pgls_filter(bufferlist::iterator& iter, PGLSFilter **pfilter);
 
diff --git a/src/tracing/osd.tp b/src/tracing/osd.tp
index 7a2ffd9..36ffa7e 100644
--- a/src/tracing/osd.tp
+++ b/src/tracing/osd.tp
@@ -381,6 +381,24 @@ TRACEPOINT_EVENT(osd, do_osd_op_pre_writefull,
     )
 )
 
+TRACEPOINT_EVENT(osd, do_osd_op_pre_writesame,
+    TP_ARGS(
+        const char*, oid,
+        uint64_t, snap,
+        uint64_t, osize,
+        uint64_t, offset,
+        uint64_t, length,
+        uint64_t, data_length),
+    TP_FIELDS(
+        ctf_string(oid, oid)
+        ctf_integer(uint64_t, snap, snap)
+        ctf_integer(uint64_t, osize, osize)
+        ctf_integer(uint64_t, offset, offset)
+        ctf_integer(uint64_t, length, length)
+        ctf_integer(uint64_t, data_length, data_length)
+    )
+)
+
 TRACEPOINT_EVENT(osd, do_osd_op_pre_rollback,
     TP_ARGS(
         const char*, oid,
-- 
2.7.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] ceph osd: add support for new op cmpext
  2016-03-10  6:34 ` (unknown), Mike Christie
  2016-03-10  6:34   ` [PATCH 1/2] ceph osd: add support for new op writesame Mike Christie
@ 2016-03-10  6:34   ` Mike Christie
  2016-03-10 12:03     ` David Disseldorp
  2016-03-10  6:36   ` [PATCH 0/2] ceph osd: initial VMware VAAI support Mike Christie
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Christie @ 2016-03-10  6:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: ddiss, Mike Christie

This adds support for a new op, cmpext. The request reads
extent.length bytes at extent.offset on disk and compares them to the
extent.length bytes sent with the request. If there is a miscompare,
the osd will return -EILSEQ along with the mismatched buffer that was
read.

rbd will use this in a multi op request to implement the
SCSI COMPARE_AND_WRITE request which is used by VMware for
its atomic test and set request.

v2:
- Merge David's tracing fixes.
- Instead of returning the mismatch offset and buffer on a compare
failure, just return the buffer. The client can figure out the offset
if it needs it.
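For illustration, the offset recovery alluded to above could be done client side with a small helper along these lines (hypothetical, not part of the patch):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical client-side helper: given the data the client expected and
// the mismatched buffer the osd returned, recover the first mismatch offset.
std::size_t first_mismatch(const std::string& expected, const std::string& got)
{
  std::size_t n = std::min(expected.size(), got.size());
  for (std::size_t i = 0; i < n; ++i)
    if (expected[i] != got[i])
      return i;
  return n; // buffers differ only in length (or are equal)
}
```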

Signed-off-by: Mike Christie <mchristi@redhat.com>
---
 src/include/rados.h     |  2 ++
 src/osd/ReplicatedPG.cc | 31 +++++++++++++++++++++++++++++++
 src/osd/ReplicatedPG.h  |  1 +
 src/tracing/osd.tp      | 22 ++++++++++++++++++++++
 4 files changed, 56 insertions(+)

diff --git a/src/include/rados.h b/src/include/rados.h
index 4d508c0..229d855 100644
--- a/src/include/rados.h
+++ b/src/include/rados.h
@@ -258,6 +258,7 @@ extern const char *ceph_osd_state_name(int s);
 									    \
 	/* ESX/SCSI */							    \
 	f(WRITESAME,	__CEPH_OSD_OP(WR, DATA, 38),	"write-same")	    \
+	f(CMPEXT,	__CEPH_OSD_OP(RD, DATA, 31),	"cmpext")	    \
 									    \
 	/** multi **/							    \
 	f(CLONERANGE,	__CEPH_OSD_OP(WR, MULTI, 1),	"clonerange")	    \
@@ -358,6 +359,7 @@ static inline int ceph_osd_op_uses_extent(int op)
 	case CEPH_OSD_OP_ZERO:
 	case CEPH_OSD_OP_APPEND:
 	case CEPH_OSD_OP_TRIMTRUNC:
+	case CEPH_OSD_OP_CMPEXT:
 		return true;
 	default:
 		return false;
diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 6a6112e..4593929 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -3650,6 +3650,32 @@ int ReplicatedPG::do_xattr_cmp_str(int op, string& v1s, bufferlist& xattr)
   }
 }
 
+int ReplicatedPG::do_extent_cmp(OpContext *ctx, OSDOp& osd_op)
+{
+  ceph_osd_op& op = osd_op.op;
+  vector<OSDOp> read_ops(1);
+  OSDOp& read_op = read_ops[0];
+  int result = 0;
+
+  read_op.op.op = CEPH_OSD_OP_SYNC_READ;
+  read_op.op.extent.offset = op.extent.offset;
+  read_op.op.extent.length = op.extent.length;
+  read_op.op.extent.truncate_seq = op.extent.truncate_seq;
+  read_op.op.extent.truncate_size = op.extent.truncate_size;
+
+  result = do_osd_ops(ctx, read_ops);
+  if (result < 0) {
+    derr << "do_extent_cmp do_osd_ops failed " << result << dendl;
+    return result;
+  }
+
+  if (osd_op.indata.contents_equal(read_op.outdata))
+    return 0;
+
+  osd_op.outdata.claim_append(read_op.outdata);
+  return -EILSEQ;
+}
+
 int ReplicatedPG::do_writesame(OpContext *ctx, OSDOp& osd_op)
 {
   ceph_osd_op& op = osd_op.op;
@@ -4154,6 +4180,11 @@ int ReplicatedPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
       
       // --- READS ---
 
+    case CEPH_OSD_OP_CMPEXT:
+      tracepoint(osd, do_osd_op_pre_extent_cmp, soid.oid.name.c_str(), soid.snap.val, oi.size, oi.truncate_seq, op.extent.offset, op.extent.length, op.extent.truncate_size, op.extent.truncate_seq);
+      result = do_extent_cmp(ctx, osd_op);
+      break;
+
     case CEPH_OSD_OP_SYNC_READ:
       if (pool.info.require_rollback()) {
 	result = -EOPNOTSUPP;
diff --git a/src/osd/ReplicatedPG.h b/src/osd/ReplicatedPG.h
index 8004d25..adaf8af 100644
--- a/src/osd/ReplicatedPG.h
+++ b/src/osd/ReplicatedPG.h
@@ -1430,6 +1430,7 @@ protected:
   int do_xattr_cmp_u64(int op, __u64 v1, bufferlist& xattr);
   int do_xattr_cmp_str(int op, string& v1s, bufferlist& xattr);
 
+  int do_extent_cmp(OpContext *ctx, OSDOp& osd_op);
   int do_writesame(OpContext *ctx, OSDOp& osd_op);
 
   bool pgls_filter(PGLSFilter *filter, hobject_t& sobj, bufferlist& outdata);
diff --git a/src/tracing/osd.tp b/src/tracing/osd.tp
index 36ffa7e..e132b61 100644
--- a/src/tracing/osd.tp
+++ b/src/tracing/osd.tp
@@ -91,6 +91,28 @@ TRACEPOINT_EVENT(osd, do_osd_op_pre,
     )
 )
 
+TRACEPOINT_EVENT(osd, do_osd_op_pre_extent_cmp,
+    TP_ARGS(
+        const char*, oid,
+        uint64_t, snap,
+        uint64_t, osize,
+        uint32_t, oseq,
+        uint64_t, offset,
+        uint64_t, length,
+        uint64_t, truncate_size,
+        uint32_t, truncate_seq),
+    TP_FIELDS(
+        ctf_string(oid, oid)
+        ctf_integer(uint64_t, snap, snap)
+        ctf_integer(uint64_t, osize, osize)
+        ctf_integer(uint32_t, oseq, oseq)
+        ctf_integer(uint64_t, offset, offset)
+        ctf_integer(uint64_t, length, length)
+        ctf_integer(uint64_t, truncate_size, truncate_size)
+        ctf_integer(uint32_t, truncate_seq, truncate_seq)
+    )
+)
+
 TRACEPOINT_EVENT(osd, do_osd_op_pre_read,
     TP_ARGS(
         const char*, oid,
-- 
2.7.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 0/2] ceph osd: initial VMware VAAI support
  2016-03-10  6:34 ` (unknown), Mike Christie
  2016-03-10  6:34   ` [PATCH 1/2] ceph osd: add support for new op writesame Mike Christie
  2016-03-10  6:34   ` [PATCH 2/2] ceph osd: add support for new op cmpext Mike Christie
@ 2016-03-10  6:36   ` Mike Christie
  2016-03-10 12:04     ` David Disseldorp
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Christie @ 2016-03-10  6:36 UTC (permalink / raw)
  To: ceph-devel; +Cc: ddiss

Sorry. I edited the wrong line in the --compose screen. Due to the lack
of a subject, I guess it is going into some spam filters. Here is the mail:

On 03/10/2016 12:34 AM, Mike Christie wrote:
> The following patches made over the ceph master branch
> implement OSD side support for VMware VAAI's Atomic Test
> and Set (ATS) and Write Same (Zero) requests.
> 
> ATS is used for operations like locking and heartbeats. It
> is implemented as the SCSI COMPARE_AND_WRITE command, which
> requires the device to read N blocks, compare them to data
> sent with the command, and if equal, write N blocks.
> 
> Zero is used to initialize blocks to zero. It is implemented
> as the SCSI WRITE_SAME command which passes the device a
> block's worth of data and has it write it multiple times.
> 
> This does not include support for XCOPY/extended copy. I
> am still looking into this, but it seems it might be
> difficult to support due to rbd being more tuned to cloning
> entire devices. When we implement VASA, the cloneVirtualVolume
> might be something we can support though.
> 
> More info on VAAI can be found here:
> http://www.vmware.com/resources/techresources/10337
> 
> The krbd patches which use these requests are in vaai branch of
> this tree:
> https://github.com/mikechristie/linux-kernel
> 
> I did not submit them in this thread, because they depend on other
> patches that are still being reviewed upstream and I did not want
> to waste people's time reviewing them if they change. These OSD side
> patches should be ok to review and merge, because the op format 
> and implementation should not change.
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] ceph osd: add support for new op cmpext
  2016-03-10  6:34   ` [PATCH 2/2] ceph osd: add support for new op cmpext Mike Christie
@ 2016-03-10 12:03     ` David Disseldorp
  2016-03-10 17:06       ` Mike Christie
  0 siblings, 1 reply; 12+ messages in thread
From: David Disseldorp @ 2016-03-10 12:03 UTC (permalink / raw)
  To: Mike Christie; +Cc: ceph-devel

Hi Mike,

On Thu, 10 Mar 2016 00:34:32 -0600, Mike Christie wrote:

> This adds support for a new op cmpext. The request will read
> extent.length bytes and compare them to extent.length bytes at
> extent.offset on disk. If there is a miscompare the osd will return
> -EILSEQ, and the mismatched buffer that was read.
> 
> rbd will use this in a multi op request to implement the
> SCSI COMPARE_AND_WRITE request which is used by VMware for
> its atomic test and set request.
> 
> v2:
> - Merge David's tracing fixes.
> - Instead of returning the mismatch offset and buffer on matching
> failure just return the buffer. The client can figure out the offset
> if it needs it.

What's your reason for dropping the mismatch offset? The osd has
to perform the comparison, so might as well do something with the
result, even if it's ignored by the client in some cases.

Cheers, David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] ceph osd: add support for new op writesame
  2016-03-10  6:34   ` [PATCH 1/2] ceph osd: add support for new op writesame Mike Christie
@ 2016-03-10 12:03     ` David Disseldorp
  0 siblings, 0 replies; 12+ messages in thread
From: David Disseldorp @ 2016-03-10 12:03 UTC (permalink / raw)
  To: Mike Christie; +Cc: ceph-devel

On Thu, 10 Mar 2016 00:34:31 -0600, Mike Christie wrote:

> This adds a new ceph request writesame that writes a buffer of length
> writesame.data_length bytes at writesame.offset over
> writesame.length bytes.
> 
> This command maps to SCSI's WRITE SAME request, so users like LIO+rbd
> can pass this to the OSD. Right now, it only saves having to transfer
> writesame.length bytes over the network, but future versions will be
> able to fully offload it by passing it directly to the FS/devices if
> they support it.
> 
> v2:
> - Merge David's tracing fixes.
> 
> Signed-off-by: Mike Christie <mchristi@redhat.com>

Looks good Mike.
Reviewed-by: David Disseldorp <ddiss@suse.de>

Cheers, David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] ceph osd: initial VMware VAAI support
  2016-03-10  6:36   ` [PATCH 0/2] ceph osd: initial VMware VAAI support Mike Christie
@ 2016-03-10 12:04     ` David Disseldorp
  2016-03-10 22:45       ` Josh Durgin
  0 siblings, 1 reply; 12+ messages in thread
From: David Disseldorp @ 2016-03-10 12:04 UTC (permalink / raw)
  To: Mike Christie; +Cc: ceph-devel

On Thu, 10 Mar 2016 00:36:55 -0600, Mike Christie wrote:

...
> > This does not include support for XCOPY/extended copy. I
> > am still looking into this, but it seems it might be
> > difficult to support due to rbd being more tuned to cloning
> > entire devices. When we implement VASA, the cloneVirtualVolume
> > might be something we can support though.

I suppose the src-and-dest-in-same-pg requirement would complicate
things quite a bit, but wouldn't clonerange be an option for XCOPY
offloads?

Cheers, David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] ceph osd: add support for new op cmpext
  2016-03-10 12:03     ` David Disseldorp
@ 2016-03-10 17:06       ` Mike Christie
  2016-03-10 17:12         ` David Disseldorp
  0 siblings, 1 reply; 12+ messages in thread
From: Mike Christie @ 2016-03-10 17:06 UTC (permalink / raw)
  To: David Disseldorp; +Cc: ceph-devel

On 03/10/2016 06:03 AM, David Disseldorp wrote:
> Hi Mike,
> 
> On Thu, 10 Mar 2016 00:34:32 -0600, Mike Christie wrote:
> 
>> This adds support for a new op cmpext. The request will read
>> extent.length bytes and compare them to extent.length bytes at
>> extent.offset on disk. If there is a miscompare the osd will return
>> -EILSEQ, and the mismatched buffer that was read.
>>
>> rbd will use this in a multi op request to implement the
>> SCSI COMPARE_AND_WRITE request which is used by VMware for
>> its atomic test and set request.
>>
>> v2:
>> - Merge David's tracing fixes.
>> - Instead of returning the mismatch offset and buffer on matching
>> failure just return the buffer. The client can figure out the offset
>> if it needs it.
> 
> What's your reason for dropping the mismatch offset?

I was not sure if anyone else was going to use it. I can add it back. It
does not matter to me.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] ceph osd: add support for new op cmpext
  2016-03-10 17:06       ` Mike Christie
@ 2016-03-10 17:12         ` David Disseldorp
  0 siblings, 0 replies; 12+ messages in thread
From: David Disseldorp @ 2016-03-10 17:12 UTC (permalink / raw)
  To: Mike Christie; +Cc: ceph-devel

On Thu, 10 Mar 2016 11:06:22 -0600, Mike Christie wrote:

> > What's your reason for dropping the mismatch offset?  
> 
> I was not sure if anyone else was going to use it. I can add it back. It
> does not matter to me.

Thanks - I'd prefer to keep it, given that it doesn't cost anything.

Cheers, David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] ceph osd: initial VMware VAAI support
  2016-03-10 12:04     ` David Disseldorp
@ 2016-03-10 22:45       ` Josh Durgin
  2016-03-11  4:46         ` Ric Wheeler
  2016-03-11 10:03         ` David Disseldorp
  0 siblings, 2 replies; 12+ messages in thread
From: Josh Durgin @ 2016-03-10 22:45 UTC (permalink / raw)
  To: David Disseldorp, Mike Christie; +Cc: ceph-devel

On 03/10/2016 04:04 AM, David Disseldorp wrote:
> On Thu, 10 Mar 2016 00:36:55 -0600, Mike Christie wrote:
>
> ...
>>> This does not include support for XCOPY/extended copy. I
>>> am still looking into this, but it seems it might be
>>> difficult to support due to rbd being more tuned to cloning
>>> entire devices. When we implement VASA, the cloneVirtualVolume
>>> might be something we can support though.
>
> I suppose the src-and-dest-in-same-pg requirement would complicate
> things quite a bit, but wouldn't clonerange be an option for XCOPY
> offloads?

It's not a good fit: with multiple clones putting data on the
same set of osds, the workload and space utilization get skewed for
that set of osds compared to the rest of the cluster.

It also won't give you fast cloning - it's a full copy on xfs, and
you'd need to do one for every object affected.

Due to these limitations, lack of existing clonerange use, and the
complications it brings to the osd as the only op affecting more than
one object, we've talked about removing the clonerange op.

Josh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] ceph osd: initial VMware VAAI support
  2016-03-10 22:45       ` Josh Durgin
@ 2016-03-11  4:46         ` Ric Wheeler
  2016-03-11 10:03         ` David Disseldorp
  1 sibling, 0 replies; 12+ messages in thread
From: Ric Wheeler @ 2016-03-11  4:46 UTC (permalink / raw)
  To: Josh Durgin, David Disseldorp, Mike Christie; +Cc: ceph-devel

On 03/11/2016 04:15 AM, Josh Durgin wrote:
> On 03/10/2016 04:04 AM, David Disseldorp wrote:
>> On Thu, 10 Mar 2016 00:36:55 -0600, Mike Christie wrote:
>>
>> ...
>>>> This does not include support for XCOPY/extended copy. I
>>>> am still looking into this, but it seems it might be
>>>> difficult to support due to rbd being more tuned to cloning
>>>> entire devices. When we implement VASA, the cloneVirtualVolume
>>>> might be something we can support though.
>>
>> I suppose the src-and-dest-in-same-pg requirement would complicate
>> things quite a bit, but wouldn't clonerange be an option for XCOPY
>> offloads?
>
> It's not a good fit, since with multiple clones putting data on the
> same set of osds, the workload and space utilization gets skewed for
> that set of osds compared to the rest of the cluster.
>
> It also won't give you fast cloning - it's a full copy on xfs, and
> you'd need to do one for every object affected.

Note that XFS is working on reflink code at the moment and that the kernel 
people are looking at new system calls that will allow copy offload generically.

Specifically, that will give XFS (and other file systems like btrfs) the
ability to do a zero-data-movement pseudo copy (a copy-on-write version)
of a file.
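For context, the generic copy-offload interface mentioned here was merged as copy_file_range(2) in Linux 4.5 (the glibc wrapper only arrived later, in glibc 2.27). A rough sketch of how a userspace copy could use it, under those assumptions:

```cpp
#include <fcntl.h>
#include <unistd.h>

// Sketch: copy `len` bytes from src to dst via copy_file_range(2), which
// lets the kernel offload the copy (e.g. as a reflink on capable
// filesystems) instead of bouncing the data through userspace buffers.
ssize_t offload_copy(const char* src, const char* dst, size_t len)
{
  int in = open(src, O_RDONLY);
  int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  ssize_t copied = -1;
  if (in >= 0 && out >= 0)
    copied = copy_file_range(in, nullptr, out, nullptr, len, 0);
  if (in >= 0) close(in);
  if (out >= 0) close(out);
  return copied;
}
```

On filesystems without reflink support the kernel falls back to an in-kernel copy, so the call still saves the round trip through the application.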

I think that would make this interesting to think about doing...

Regards,

Ric

>
> Due to these limitations, lack of existing clonerange use, and the
> complications it brings to the osd as the only op affecting more than
> one object, we've talked about removing the clonerange op.
>
> Josh


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] ceph osd: initial VMware VAAI support
  2016-03-10 22:45       ` Josh Durgin
  2016-03-11  4:46         ` Ric Wheeler
@ 2016-03-11 10:03         ` David Disseldorp
  1 sibling, 0 replies; 12+ messages in thread
From: David Disseldorp @ 2016-03-11 10:03 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Mike Christie, ceph-devel

Hi Josh,

On Thu, 10 Mar 2016 14:45:38 -0800, Josh Durgin wrote:

> On 03/10/2016 04:04 AM, David Disseldorp wrote:
> > On Thu, 10 Mar 2016 00:36:55 -0600, Mike Christie wrote:
> >
> > ...
> >>> This does not include support for XCOPY/extended copy. I
> >>> am still looking into this, but it seems it might be
> >>> difficult to support due to rbd being more tuned to cloning
> >>> entire devices. When we implement VASA, the cloneVirtualVolume
> >>> might be something we can support though.
> >
> > I suppose the src-and-dest-in-same-pg requirement would complicate
> > things quite a bit, but wouldn't clonerange be an option for XCOPY
> > offloads?
> 
> It's not a good fit, since with multiple clones putting data on the
> same set of osds, the workload and space utilization gets skewed for
> that set of osds compared to the rest of the cluster.
> 
> It also won't give you fast cloning - it's a full copy on xfs, and
> you'd need to do one for every object affected.

Currently the copy is being done on the LIO iSCSI gateway, so offloading
any of that to the OSDs would save a lot of network traffic.

Also, as Ric mentioned, XFS has clone-range support coming, so Ceph's
dedupe COW optimisations need not be limited to the Btrfs Filestore.

> Due to these limitations, lack of existing clonerange use, and the
> complications it brings to the osd as the only op affecting more than
> one object, we've talked about removing the clonerange op.

Okay, fair enough. Thanks for the details.

Cheers, David

^ permalink raw reply	[flat|nested] 12+ messages in thread
