Re: test osd on zfs


* Re: test osd on zfs
       [not found] <516E7D5C.7080309@nazarianin.com>
@ 2013-04-17 15:19 ` Sage Weil
  2013-04-17 15:57   ` Henry C Chang
  0 siblings, 1 reply; 20+ messages in thread
From: Sage Weil @ 2013-04-17 15:19 UTC (permalink / raw)
  To: Aleksey Leonov; +Cc: ceph-devel

Hey,

Can you test with the wip-debug-xattr branch?  Set debug filestore = 30 
and it will dump the xattr values to the log on set and get, so we can see 
what is going on.

Also/alternatively, strace with -f -v -x, which will (I think) include the 
full value of the get/setxattr args..

Thanks!
sage


On Wed, 17 Apr 2013, Aleksey Leonov wrote:

>      Hi all,
> 
>      I created a test VM to try running a ceph osd on zfs.
>      mkcephfs ran ok. The osd goes down 2 minutes after start.
> 
> ceph.conf
> [global]
>          max open files = 131072
>          log file = /var/log/ceph/$name.log
>          pid file = /var/run/ceph/$name.pid
> [mon]
>          mon data = /ceph/mon/$name
> [mon.alpha]
>          host = ct1
>          mon addr = 10.10.10.2:6789
> [mds]
> [mds.alpha]
>          host = ct1
> [osd]
>          debug filestore = 20
>          filestore xattr use omap = true
>          osd data = /ceph/osd/$name
>          osd journal = /ceph/osd/$name/journal
>          osd journal size = 2000 ; journal size, in megabytes
>          journal dio = false
>          journal aio = false
> 
> [osd.0]
>          host = ct1
> 
> 
> 
> 
> 


* Re: test osd on zfs
  2013-04-17 15:19 ` test osd on zfs Sage Weil
@ 2013-04-17 15:57   ` Henry C Chang
  2013-04-17 16:37     ` Jeff Mitchell
  0 siblings, 1 reply; 20+ messages in thread
From: Henry C Chang @ 2013-04-17 15:57 UTC (permalink / raw)
  To: Sage Weil; +Cc: Aleksey Leonov, ceph-devel

I looked into this problem earlier. The problem is that zfs does not
return ERANGE when the size of value buffer passed to getxattr is too
small. zfs returns with truncated xattr value.
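
To make the failure mode concrete, here is a minimal userspace sketch in C. It is not Ceph or ZFS code: `fake_attr`, `get_truncating`, `get_strict` and `read_attr` are hypothetical names made up for this illustration, and the caller pattern (try a small buffer first, grow only on ERANGE) is an assumption about how a consumer might rely on the documented behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical in-memory stand-in for one stored xattr value. */
struct fake_attr {
	const char *data;
	size_t len;
};

typedef ssize_t (*getter_fn)(const struct fake_attr *a, void *buf, size_t size);

/* Models the reported behavior: a short buffer gets a silently truncated copy. */
ssize_t get_truncating(const struct fake_attr *a, void *buf, size_t size)
{
	size_t n = size < a->len ? size : a->len;

	if (size == 0)
		return (ssize_t)a->len;		/* size probe */
	memcpy(buf, a->data, n);
	return (ssize_t)n;
}

/* Models what getxattr(2) documents: a short buffer is rejected with ERANGE. */
ssize_t get_strict(const struct fake_attr *a, void *buf, size_t size)
{
	if (size == 0)
		return (ssize_t)a->len;		/* size probe */
	if (size < a->len)
		return -ERANGE;
	memcpy(buf, a->data, a->len);
	return (ssize_t)a->len;
}

/*
 * Caller pattern: try a small stack buffer, and only on ERANGE probe the
 * real size and retry with a heap buffer. Returns the number of bytes read
 * (or a negative errno) and stores a malloc'd copy in *out.
 */
ssize_t read_attr(getter_fn get, const struct fake_attr *a, char **out)
{
	char small[8];
	ssize_t r;

	assert(out != NULL);
	*out = NULL;

	r = get(a, small, sizeof(small));
	if (r == -ERANGE) {
		ssize_t need = get(a, NULL, 0);
		char *big;

		if (need < 0)
			return need;
		big = malloc((size_t)need);
		if (big == NULL)
			return -ENOMEM;
		r = get(a, big, (size_t)need);
		if (r < 0) {
			free(big);
			return r;
		}
		*out = big;
		return r;
	}
	if (r < 0)
		return r;

	/* No error reported, so the caller trusts that the value is complete. */
	*out = malloc(r > 0 ? (size_t)r : 1);
	if (*out == NULL)
		return -ENOMEM;
	memcpy(*out, small, (size_t)r);
	return r;
}
```

With a 16-byte value, `read_attr(get_strict, ...)` ends up with all 16 bytes, while `read_attr(get_truncating, ...)` comes back with 8 bytes and no error, so the caller has no way to notice that the value was cut short.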

Regards,
Henry

2013/4/17 Sage Weil <sage@inktank.com>:
> Hey,
>
> Can you test with the wip-debug-xattr branch?  Set debug filestore = 30
> and it will dump the xattr values to the log on set and get, so we can see
> what is going on.
>
> Also/alternatively, strace with -f -v -x, which will (I think) include the
> full value of the get/setxattr args..
>
> Thanks!
> sage
>
>
> On Wed, 17 Apr 2013, Aleksey Leonov wrote:
>
>>      Hi all,
>>
>>      I created a test VM to try running a ceph osd on zfs.
>>      mkcephfs ran ok. The osd goes down 2 minutes after start.
>>
>> ceph.conf
>> [global]
>>          max open files = 131072
>>          log file = /var/log/ceph/$name.log
>>          pid file = /var/run/ceph/$name.pid
>> [mon]
>>          mon data = /ceph/mon/$name
>> [mon.alpha]
>>          host = ct1
>>          mon addr = 10.10.10.2:6789
>> [mds]
>> [mds.alpha]
>>          host = ct1
>> [osd]
>>          debug filestore = 20
>>          filestore xattr use omap = true
>>          osd data = /ceph/osd/$name
>>          osd journal = /ceph/osd/$name/journal
>>          osd journal size = 2000 ; journal size, in megabytes
>>          journal dio = false
>>          journal aio = false
>>
>> [osd.0]
>>          host = ct1
>>
>>
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: test osd on zfs
  2013-04-17 15:57   ` Henry C Chang
@ 2013-04-17 16:37     ` Jeff Mitchell
  2013-04-17 17:00       ` Henry C Chang
                         ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Jeff Mitchell @ 2013-04-17 16:37 UTC (permalink / raw)
  To: Henry C Chang; +Cc: Sage Weil, Aleksey Leonov, ceph-devel

Henry C Chang wrote:
> I looked into this problem earlier. The problem is that zfs does not
> return ERANGE when the size of value buffer passed to getxattr is too
> small. zfs returns with truncated xattr value.

Is this a bug in ZFS, or simply different behavior?

I've used ZFSonLinux quite a bit and they do seem to be very eager to 
fix bugs related to improper behavior, so if it's actually a bug 
I/someone can talk to them and try to get them to look at it soonish.

--Jeff



* Re: test osd on zfs
  2013-04-17 16:37     ` Jeff Mitchell
@ 2013-04-17 17:00       ` Henry C Chang
  2013-04-17 17:00       ` Sage Weil
  2013-04-17 17:04       ` Yehuda Sadeh
  2 siblings, 0 replies; 20+ messages in thread
From: Henry C Chang @ 2013-04-17 17:00 UTC (permalink / raw)
  To: Jeff Mitchell; +Cc: Sage Weil, Aleksey Leonov, ceph-devel

The Linux getxattr man page says ERANGE will be returned if the size of
the value buffer is too small to hold the result. Thus, I think it is
a bug in ZFS (or ZOL, at least).

2013/4/18 Jeff Mitchell <jeffrey.mitchell@gmail.com>:
> Henry C Chang wrote:
>>
>> I looked into this problem earlier. The problem is that zfs does not
>> return ERANGE when the size of value buffer passed to getxattr is too
>> small. zfs returns with truncated xattr value.
>
>
> Is this a bug in ZFS, or simply different behavior?
>
> I've used ZFSonLinux quite a bit and they do seem to be very eager to fix
> bugs related to improper behavior, so if it's actually a bug I/someone can
> talk to them and try to get them to look at it soonish.
>
> --Jeff
>


* Re: test osd on zfs
  2013-04-17 16:37     ` Jeff Mitchell
  2013-04-17 17:00       ` Henry C Chang
@ 2013-04-17 17:00       ` Sage Weil
  2013-04-17 17:04       ` Yehuda Sadeh
  2 siblings, 0 replies; 20+ messages in thread
From: Sage Weil @ 2013-04-17 17:00 UTC (permalink / raw)
  To: behlendorf1, Jeff Mitchell; +Cc: Henry C Chang, Aleksey Leonov, ceph-devel

Adding Brian Behlendorf to the CC list, as we were just talking about this 
yesterday at LUG.  :)

I suspect this is a bug; the posix docs indicate ERANGE is appropriate
here:

     [ERANGE]		value (as indicated by size) is too small to hold the
			extended attribute data.

from http://www.unix.com/man-page/all/2/GETXATTR/.

sage


On Wed, 17 Apr 2013, Jeff Mitchell wrote:
> Henry C Chang wrote:
> > I looked into this problem earlier. The problem is that zfs does not
> > return ERANGE when the size of value buffer passed to getxattr is too
> > small. zfs returns with truncated xattr value.
> 
> Is this a bug in ZFS, or simply different behavior?
> 
> I've used ZFSonLinux quite a bit and they do seem to be very eager to fix bugs
> related to improper behavior, so if it's actually a bug I/someone can talk to
> them and try to get them to look at it soonish.
> 
> --Jeff
> 
> 
> 


* Re: test osd on zfs
  2013-04-17 16:37     ` Jeff Mitchell
  2013-04-17 17:00       ` Henry C Chang
  2013-04-17 17:00       ` Sage Weil
@ 2013-04-17 17:04       ` Yehuda Sadeh
  2013-04-17 17:05         ` Sage Weil
  2 siblings, 1 reply; 20+ messages in thread
From: Yehuda Sadeh @ 2013-04-17 17:04 UTC (permalink / raw)
  To: Jeff Mitchell; +Cc: Henry C Chang, Sage Weil, Aleksey Leonov, ceph-devel

On Wed, Apr 17, 2013 at 9:37 AM, Jeff Mitchell
<jeffrey.mitchell@gmail.com> wrote:
> Henry C Chang wrote:
>>
>> I looked into this problem earlier. The problem is that zfs does not
>> return ERANGE when the size of value buffer passed to getxattr is too
>> small. zfs returns with truncated xattr value.
>
>
> Is this a bug in ZFS, or simply different behavior?

Took a brief look at the zfs code, seems like a zfs bug.

diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
index c03764f..96db7dd 100644
--- a/module/zfs/zpl_xattr.c
+++ b/module/zfs/zpl_xattr.c
@@ -263,6 +263,9 @@ zpl_xattr_get_sa(struct inode *ip, const char
*name, void *value, size_t size)
        if (!size)
                return (nv_size);

+       if (size < nv_size)
+               return (-ERANGE);
+
        memcpy(value, nv_value, MIN(size, nv_size));

        return (MIN(size, nv_size));


This should fix it. Not tested of course.

Yehuda


* Re: test osd on zfs
  2013-04-17 17:04       ` Yehuda Sadeh
@ 2013-04-17 17:05         ` Sage Weil
  2013-04-17 17:15           ` Yehuda Sadeh
  0 siblings, 1 reply; 20+ messages in thread
From: Sage Weil @ 2013-04-17 17:05 UTC (permalink / raw)
  To: Yehuda Sadeh
  Cc: Jeff Mitchell, Henry C Chang, Aleksey Leonov, ceph-devel,
	behlendorf1

[Adding Brian to CC list again :)]

On Wed, 17 Apr 2013, Yehuda Sadeh wrote:

> On Wed, Apr 17, 2013 at 9:37 AM, Jeff Mitchell
> <jeffrey.mitchell@gmail.com> wrote:
> > Henry C Chang wrote:
> >>
> >> I looked into this problem earlier. The problem is that zfs does not
> >> return ERANGE when the size of value buffer passed to getxattr is too
> >> small. zfs returns with truncated xattr value.
> >
> >
> > Is this a bug in ZFS, or simply different behavior?
> 
> Took a brief look at the zfs code, seems like a zfs bug.
> 
> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
> index c03764f..96db7dd 100644
> --- a/module/zfs/zpl_xattr.c
> +++ b/module/zfs/zpl_xattr.c
> @@ -263,6 +263,9 @@ zpl_xattr_get_sa(struct inode *ip, const char
> *name, void *value, size_t size)
>         if (!size)
>                 return (nv_size);
> 
> +       if (size < nv_size)
> +               return (-ERANGE);
> +
>         memcpy(value, nv_value, MIN(size, nv_size));
> 
>         return (MIN(size, nv_size));
> 
> 
> This should fix it. Not tested of course.
> 
> Yehuda
> 
> 


* Re: test osd on zfs
  2013-04-17 17:05         ` Sage Weil
@ 2013-04-17 17:15           ` Yehuda Sadeh
  2013-04-17 18:06             ` Brian Behlendorf
  2013-04-17 18:57             ` Brian Behlendorf
  0 siblings, 2 replies; 20+ messages in thread
From: Yehuda Sadeh @ 2013-04-17 17:15 UTC (permalink / raw)
  To: Sage Weil
  Cc: Jeff Mitchell, Henry C Chang, Aleksey Leonov, ceph-devel,
	behlendorf1

On Wed, Apr 17, 2013 at 10:05 AM, Sage Weil <sage@inktank.com> wrote:
> [Adding Brian to CC list again :)]
>
> On Wed, 17 Apr 2013, Yehuda Sadeh wrote:
>
>> On Wed, Apr 17, 2013 at 9:37 AM, Jeff Mitchell
>> <jeffrey.mitchell@gmail.com> wrote:
>> > Henry C Chang wrote:
>> >>
>> >> I looked into this problem earlier. The problem is that zfs does not
>> >> return ERANGE when the size of value buffer passed to getxattr is too
>> >> small. zfs returns with truncated xattr value.
>> >
>> >
>> > Is this a bug in ZFS, or simply different behavior?
>>
>> Took a brief look at the zfs code, seems like a zfs bug.
>>
>> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
>> index c03764f..96db7dd 100644
>> --- a/module/zfs/zpl_xattr.c
>> +++ b/module/zfs/zpl_xattr.c
>> @@ -263,6 +263,9 @@ zpl_xattr_get_sa(struct inode *ip, const char
>> *name, void *value, size_t size)
>>         if (!size)
>>                 return (nv_size);
>>
>> +       if (size < nv_size)
>> +               return (-ERANGE);
>> +
>>         memcpy(value, nv_value, MIN(size, nv_size));
>>
>>         return (MIN(size, nv_size));
>>
>>
>> This should fix it. Not tested of course.

Well, looking at the code again it's not going to work, as setxattr is
going to fail with ERANGE.

Yehuda


* Re: test osd on zfs
  2013-04-17 17:15           ` Yehuda Sadeh
@ 2013-04-17 18:06             ` Brian Behlendorf
  2013-04-17 18:57             ` Brian Behlendorf
  1 sibling, 0 replies; 20+ messages in thread
From: Brian Behlendorf @ 2013-04-17 18:06 UTC (permalink / raw)
  To: Yehuda Sadeh
  Cc: Sage Weil, Jeff Mitchell, Henry C Chang, Aleksey Leonov,
	ceph-devel

On 04/17/2013 10:15 AM, Yehuda Sadeh wrote:
> On Wed, Apr 17, 2013 at 10:05 AM, Sage Weil <sage@inktank.com> wrote:
>> [Adding Brian to CC list again :)]
>>
>> On Wed, 17 Apr 2013, Yehuda Sadeh wrote:
>>
>>> On Wed, Apr 17, 2013 at 9:37 AM, Jeff Mitchell
>>> <jeffrey.mitchell@gmail.com> wrote:
>>>> Henry C Chang wrote:
>>>>>
>>>>> I looked into this problem earlier. The problem is that zfs does not
>>>>> return ERANGE when the size of value buffer passed to getxattr is too
>>>>> small. zfs returns with truncated xattr value.
>>>>
>>>>
>>>> Is this a bug in ZFS, or simply different behavior?
>>>
>>> Took a brief look at the zfs code, seems like a zfs bug.
>>>
>>> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
>>> index c03764f..96db7dd 100644
>>> --- a/module/zfs/zpl_xattr.c
>>> +++ b/module/zfs/zpl_xattr.c
>>> @@ -263,6 +263,9 @@ zpl_xattr_get_sa(struct inode *ip, const char
>>> *name, void *value, size_t size)
>>>          if (!size)
>>>                  return (nv_size);
>>>
>>> +       if (size < nv_size)
>>> +               return (-ERANGE);
>>> +
>>>          memcpy(value, nv_value, MIN(size, nv_size));
>>>
>>>          return (MIN(size, nv_size));
>>>
>>>
>>> This should fix it. Not tested of course.
>
> Well, looking at the code again it's not going to work, as setxattr is
> going to fail with ERANGE.
>
> Yehuda
>

That does sound like a zfs bug.  According to getxattr(2) it should
return ERANGE if the buffer is too small.   I'll take a look but it's 
strange that this hasn't surfaced before.

        If the size of the value buffer is too small to hold the
        result,  errno is set to ERANGE.

Thanks,
Brian


* Re: test osd on zfs
  2013-04-17 17:15           ` Yehuda Sadeh
  2013-04-17 18:06             ` Brian Behlendorf
@ 2013-04-17 18:57             ` Brian Behlendorf
  2013-04-17 19:07               ` Yehuda Sadeh
  1 sibling, 1 reply; 20+ messages in thread
From: Brian Behlendorf @ 2013-04-17 18:57 UTC (permalink / raw)
  To: Yehuda Sadeh
  Cc: Sage Weil, Jeff Mitchell, Henry C Chang, Aleksey Leonov,
	ceph-devel

Here's a patch for the ERANGE error (lightly tested).  Sage's patch 
looks good but only covers one of two code paths for xattrs.  With zfs 
they may either be stored as a system attribute which is usually close 
to the dnode on disk (zfs set xattr=sa pool/dataset).  Or they may be 
stored in their own object which is how it's implemented on Solaris (zfs 
set xattr=on pool/dataset).  The second method is still the default for 
compatibility reasons even though it's slower.  Sage's patch only 
covered the SA case.

 > Well, looking at the code again it's not going to work, as setxattr is
 > going to fail with ERANGE.

Why?  We support an arbitrary number of maximum sized xattrs (65536). 
What am I missing here?

Incidentally, does anybody know of a good xattr test suite we could add
to our regression tests?

Thanks,
Brian

diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
index c03764f..9f4d63c 100644
--- a/module/zfs/zpl_xattr.c
+++ b/module/zfs/zpl_xattr.c
@@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char 
*name, void *value,
                 goto out;
         }

+       if (size < i_size_read(xip)) {
+               error = -ERANGE;
+               goto out;
+       }
+
         error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
  out:
         if (xip)
@@ -263,7 +268,10 @@ zpl_xattr_get_sa(struct inode *ip, const char 
*name, void *value, size_t size)
         if (!size)
                 return (nv_size);

-       memcpy(value, nv_value, MIN(size, nv_size));
+       if (size < nv_size)
+               return (-ERANGE);
+
+       memcpy(value, nv_value, size);

         return (MIN(size, nv_size));
  }


* Re: test osd on zfs
  2013-04-17 18:57             ` Brian Behlendorf
@ 2013-04-17 19:07               ` Yehuda Sadeh
  2013-04-17 19:09                 ` Stefan Priebe
  0 siblings, 1 reply; 20+ messages in thread
From: Yehuda Sadeh @ 2013-04-17 19:07 UTC (permalink / raw)
  To: Brian Behlendorf
  Cc: Sage Weil, Jeff Mitchell, Henry C Chang, Aleksey Leonov,
	ceph-devel

On Wed, Apr 17, 2013 at 11:57 AM, Brian Behlendorf <behlendorf1@llnl.gov> wrote:
>
> Here's a patch for the ERANGE error (lightly tested).  Sage's patch looks
> good but only covers one of two code paths for xattrs.  With zfs they may
> either be stored as a system attribute which is usually close to the dnode
> on disk (zfs set xattr=sa pool/dataset).  Or they may be stored in their own
> object which is how it's implemented on Solaris (zfs set xattr=on
> pool/dataset).  The second method is still the default for compatibility
> reasons even though it's slower.  Sage's patch only covered the SA case.
>
>
>> Well, looking at the code again it's not going to work, as setxattr is
>> going to fail with ERANGE.
>
> Why?  We support an arbitrary number of maximum sized xattrs (65536). What
> am I missing here?
>
> Incidentally, does anybody know of a good xattr test suite we could add to
> our regression tests?
>
> Thanks,
> Brian
>
> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
> index c03764f..9f4d63c 100644
> --- a/module/zfs/zpl_xattr.c
> +++ b/module/zfs/zpl_xattr.c
> @@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char *name,
> void *value,
>                 goto out;
>         }
>
> +       if (size < i_size_read(xip)) {
> +               error = -ERANGE;
> +               goto out;
> +       }
> +
>         error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
>  out:
>         if (xip)
> @@ -263,7 +268,10 @@ zpl_xattr_get_sa(struct inode *ip, const char *name,
> void *value, size_t size)
>         if (!size)
>                 return (nv_size);
>
> -       memcpy(value, nv_value, MIN(size, nv_size));
>
> +       if (size < nv_size)
> +               return (-ERANGE);

Note that zpl_xattr_get_sa() is called by __zpl_xattr_get(), which can
also be called by zpl_xattr_get() to test for xattr existence. So the
patch needs to make sure that zpl_xattr_set() doesn't fail when it gets
-ERANGE.

> +
> +       memcpy(value, nv_value, size);
>
>         return (MIN(size, nv_size));

No need for MIN() here.


Yehuda


* Re: test osd on zfs
  2013-04-17 19:07               ` Yehuda Sadeh
@ 2013-04-17 19:09                 ` Stefan Priebe
  2013-04-17 20:16                   ` Mark Nelson
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Priebe @ 2013-04-17 19:09 UTC (permalink / raw)
  To: Yehuda Sadeh
  Cc: Brian Behlendorf, Sage Weil, Jeff Mitchell, Henry C Chang,
	Aleksey Leonov, ceph-devel

Sorry to disturb, but what is the reason / advantage of using zfs for ceph?

Greets,
Stefan
On 17.04.2013 21:07, Yehuda Sadeh wrote:
> On Wed, Apr 17, 2013 at 11:57 AM, Brian Behlendorf <behlendorf1@llnl.gov> wrote:
>>
>> Here's a patch for the ERANGE error (lightly tested).  Sage's patch looks
>> good but only covers one of two code paths for xattrs.  With zfs they may
>> either be stored as a system attribute which is usually close to the dnode
>> on disk (zfs set xattr=sa pool/dataset).  Or they may be stored in their own
>> object which is how it's implemented on Solaris (zfs set xattr=on
>> pool/dataset).  The second method is still the default for compatibility
>> reasons even though it's slower.  Sage's patch only covered the SA case.
>>
>>
>>> Well, looking at the code again it's not going to work, as setxattr is
>>> going to fail with ERANGE.
>>
>> Why?  We support an arbitrary number of maximum sized xattrs (65536). What
>> am I missing here?
>>
>> Incidentally, does anybody know of a good xattr test suite we could add to
>> our regression tests?
>>
>> Thanks,
>> Brian
>>
>> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
>> index c03764f..9f4d63c 100644
>> --- a/module/zfs/zpl_xattr.c
>> +++ b/module/zfs/zpl_xattr.c
>> @@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char *name,
>> void *value,
>>                  goto out;
>>          }
>>
>> +       if (size < i_size_read(xip)) {
>> +               error = -ERANGE;
>> +               goto out;
>> +       }
>> +
>>          error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
>>   out:
>>          if (xip)
>> @@ -263,7 +268,10 @@ zpl_xattr_get_sa(struct inode *ip, const char *name,
>> void *value, size_t size)
>>          if (!size)
>>                  return (nv_size);
>>
>> -       memcpy(value, nv_value, MIN(size, nv_size));
>>
>> +       if (size < nv_size)
>> +               return (-ERANGE);
>
> Note, that zpl_xattr_get_sa() is called by __zpl_xattr_get() which can
> also be called by zpl_xattr_get() to test for xattr existence. So it
> needs to make sure that zpl_xattr_set() doesn't fail if getting
> -ERANGE.
>
>> +
>> +       memcpy(value, nv_value, size);
>>
>>          return (MIN(size, nv_size));
>
> No need for MIN() here.
>
>
> Yehuda
>


* Re: test osd on zfs
  2013-04-17 19:09                 ` Stefan Priebe
@ 2013-04-17 20:16                   ` Mark Nelson
  2013-04-17 20:49                     ` Jeff Mitchell
  2013-04-17 21:14                     ` Brian Behlendorf
  0 siblings, 2 replies; 20+ messages in thread
From: Mark Nelson @ 2013-04-17 20:16 UTC (permalink / raw)
  To: Stefan Priebe
  Cc: Yehuda Sadeh, Brian Behlendorf, Sage Weil, Jeff Mitchell,
	Henry C Chang, Aleksey Leonov, ceph-devel

I'll let Brian talk about the virtues of ZFS, but from my perspective 
it's an interesting option as there are a lot of folks banging on it for 
NFS servers and it has some interesting capabilities.  I have no idea 
how well it will work in practice, but if we can show that Ceph can run 
on it at least people can try it out and give us feedback.

Mark

On 04/17/2013 02:09 PM, Stefan Priebe wrote:
> Sorry to disturb, but what is the reason / advantage of using zfs for ceph?
>
> Greets,
> Stefan
> On 17.04.2013 21:07, Yehuda Sadeh wrote:
>> On Wed, Apr 17, 2013 at 11:57 AM, Brian Behlendorf
>> <behlendorf1@llnl.gov> wrote:
>>>
>>> Here's a patch for the ERANGE error (lightly tested).  Sage's patch
>>> looks
>>> good but only covers one of two code paths for xattrs.  With zfs they
>>> may
>>> either be stored as a system attribute which is usually close to the
>>> dnode
>>> on disk (zfs set xattr=sa pool/dataset).  Or they may be stored in
>>> their own
>>> object which is how it's implemented on Solaris (zfs set xattr=on
>>> pool/dataset).  The second method is still the default for compatibility
>>> reasons even though it's slower.  Sage's patch only covered the SA case.
>>>
>>>
>>>> Well, looking at the code again it's not going to work, as setxattr is
>>>> going to fail with ERANGE.
>>>
>>> Why?  We support an arbitrary number of maximum sized xattrs (65536).
>>> What
>>> am I missing here?
>>>
>>> Incidentally, does anybody know of a good xattr test suite we could
>>> add to
>>> our regression tests?
>>>
>>> Thanks,
>>> Brian
>>>
>>> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
>>> index c03764f..9f4d63c 100644
>>> --- a/module/zfs/zpl_xattr.c
>>> +++ b/module/zfs/zpl_xattr.c
>>> @@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char
>>> *name,
>>> void *value,
>>>                  goto out;
>>>          }
>>>
>>> +       if (size < i_size_read(xip)) {
>>> +               error = -ERANGE;
>>> +               goto out;
>>> +       }
>>> +
>>>          error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE,
>>> 0, cr);
>>>   out:
>>>          if (xip)
>>> @@ -263,7 +268,10 @@ zpl_xattr_get_sa(struct inode *ip, const char
>>> *name,
>>> void *value, size_t size)
>>>          if (!size)
>>>                  return (nv_size);
>>>
>>> -       memcpy(value, nv_value, MIN(size, nv_size));
>>>
>>> +       if (size < nv_size)
>>> +               return (-ERANGE);
>>
>> Note, that zpl_xattr_get_sa() is called by __zpl_xattr_get() which can
>> also be called by zpl_xattr_get() to test for xattr existence. So it
>> needs to make sure that zpl_xattr_set() doesn't fail if getting
>> -ERANGE.
>>
>>> +
>>> +       memcpy(value, nv_value, size);
>>>
>>>          return (MIN(size, nv_size));
>>
>> No need for MIN() here.
>>
>>
>> Yehuda
>>



* Re: test osd on zfs
  2013-04-17 20:16                   ` Mark Nelson
@ 2013-04-17 20:49                     ` Jeff Mitchell
  2013-04-17 21:14                     ` Brian Behlendorf
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff Mitchell @ 2013-04-17 20:49 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Stefan Priebe, Yehuda Sadeh, Brian Behlendorf, Sage Weil,
	Henry C Chang, Aleksey Leonov, ceph-devel

> On 04/17/2013 02:09 PM, Stefan Priebe wrote:
>>
>> Sorry to disturb, but what is the reason / advantage of using zfs for
>> ceph?

A few things off the top of my head:

1) Very mature filesystem with full xattr support (this bug
notwithstanding) and copy-on-write snapshots. While the port to Linux
sometimes has some rough edges (but in my experience over the past few
years is generally very good), the main code from Solaris (and now the
Illumos project) is well-tested and very well regarded. Btrfs has many
of the same features, but in my real-world experience I've had
multiple btrfs filesystems go corrupt with very innocuous usage
patterns and across a variety of kernel versions. The zfsonlinux bugs
don't tend to be data-destructive, once data is written to it.
2) Very intelligent caching; also supports external devices (like
SSDs) for a level 2 cache. This speeds up reads dramatically.
3) Very robust error-checking. There are lots of stories of ZFS
finding bad memory, bad controllers, and bad hard drives because of
its checksumming (which you can optionally turn off for speed). If you
set up the OSDs such that each OSD is based off of a ZFS mirror, you
get these benefits locally. For some people, especially when heavy on
reads (due to the intelligent caching), a solution that knocks the
remote replication level down by one but uses local mirrors for OSDs
may provide good functionality and safety compromises.

--Jeff


* Re: test osd on zfs
  2013-04-17 20:16                   ` Mark Nelson
  2013-04-17 20:49                     ` Jeff Mitchell
@ 2013-04-17 21:14                     ` Brian Behlendorf
  2013-04-18  2:20                       ` Henry C Chang
  2013-04-18  5:56                       ` Stefan Priebe - Profihost AG
  1 sibling, 2 replies; 20+ messages in thread
From: Brian Behlendorf @ 2013-04-17 21:14 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Stefan Priebe, Yehuda Sadeh, Sage Weil, Jeff Mitchell,
	Henry C Chang, Aleksey Leonov, ceph-devel

On 04/17/2013 01:16 PM, Mark Nelson wrote:
> I'll let Brian talk about the virtues of ZFS,

I think the virtues of ZFS have been discussed at length in various 
other forums.  But in short it brings some nice functionality to the 
table which may be useful to ceph and that's worth exploring.

>>>>
>>>> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
>>>> index c03764f..9f4d63c 100644
>>>> --- a/module/zfs/zpl_xattr.c
>>>> +++ b/module/zfs/zpl_xattr.c
>>>> @@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char
>>>> *name,
>>>> void *value,
>>>>                  goto out;
>>>>          }
>>>>
>>>> +       if (size < i_size_read(xip)) {
>>>> +               error = -ERANGE;
>>>> +               goto out;
>>>> +       }
>>>> +
>>>>          error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE,
>>>> 0, cr);
>>>>   out:
>>>>          if (xip)
>>>> @@ -263,7 +268,10 @@ zpl_xattr_get_sa(struct inode *ip, const char
>>>> *name,
>>>> void *value, size_t size)
>>>>          if (!size)
>>>>                  return (nv_size);
>>>>
>>>> -       memcpy(value, nv_value, MIN(size, nv_size));
>>>>
>>>> +       if (size < nv_size)
>>>> +               return (-ERANGE);
>>>
>>> Note, that zpl_xattr_get_sa() is called by __zpl_xattr_get() which can
>>> also be called by zpl_xattr_get() to test for xattr existence. So it
>>> needs to make sure that zpl_xattr_set() doesn't fail if getting
>>> -ERANGE.

This shouldn't be a problem.  The zpl_xattr_get() call from 
zpl_xattr_set() passes a NULL value and zero size which will prevent it 
from hitting the ERANGE error.  It will return instead the xattr size as 
expected.
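
The interaction described above can be sketched in userspace C as follows. This is only a model under stated assumptions, not the ZFS source: `model_attr`, `model_get`, `model_set` and the `MODEL_CREATE` / `MODEL_REPLACE` flags are hypothetical stand-ins, and the sketch copies and returns the stored length rather than the caller's buffer size, which is a choice made here for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define MODEL_CREATE  0x1	/* hypothetical: fail if the attribute exists */
#define MODEL_REPLACE 0x2	/* hypothetical: fail if it does not exist */

/* Hypothetical single-slot attribute store. */
struct model_attr {
	int present;
	char data[64];
	size_t len;
};

/* Getter with the ERANGE check placed after the zero-size probe. */
ssize_t model_get(const struct model_attr *a, void *value, size_t size)
{
	if (!a->present)
		return -ENODATA;
	if (size == 0)
		return (ssize_t)a->len;	/* size probe: never reaches ERANGE */
	if (size < a->len)
		return -ERANGE;		/* buffer too small for the value */
	assert(value != NULL);
	memcpy(value, a->data, a->len);
	return (ssize_t)a->len;
}

/* Setter that tests for existence with a NULL value and zero size. */
int model_set(struct model_attr *a, const void *value, size_t size, int flags)
{
	ssize_t r = model_get(a, NULL, 0);

	if (r < 0) {
		if (r != -ENODATA)
			return (int)r;
		if (flags & MODEL_REPLACE)
			return -ENODATA;	/* nothing to replace */
	} else if (flags & MODEL_CREATE) {
		return -EEXIST;			/* already present */
	}

	if (size > sizeof(a->data))
		return -E2BIG;
	if (size > 0)
		memcpy(a->data, value, size);
	a->len = size;
	a->present = 1;
	return 0;
}
```

Because the existence test always passes a zero size, it only ever sees the stored length or -ENODATA, so in this model the new ERANGE check cannot make a set fail.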

>>>
>>>> +
>>>> +       memcpy(value, nv_value, size);
>>>>
>>>>          return (MIN(size, nv_size));
>>>
>>> No need for MIN() here.

Thanks for catching that.

I've opened a pull request at github with the updated fix and kicked it 
off for automated testing.  It would be nice to verify this resolves the 
crash.

https://github.com/zfsonlinux/zfs/pull/1409

diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
index c03764f..42a06ad 100644
--- a/module/zfs/zpl_xattr.c
+++ b/module/zfs/zpl_xattr.c
@@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char 
*name, void
                 goto out;
         }

+       if (size < i_size_read(xip)) {
+               error = -ERANGE;
+               goto out;
+       }
+
         error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
  out:
         if (xip)
@@ -263,9 +268,12 @@ zpl_xattr_get_sa(struct inode *ip, const char 
*name, void *
         if (!size)
                 return (nv_size);

-       memcpy(value, nv_value, MIN(size, nv_size));
+       if (size < nv_size)
+               return (-ERANGE);
+
+       memcpy(value, nv_value, size);

-       return (MIN(size, nv_size));
+       return (size);
  }

  static int


* Re: test osd on zfs
  2013-04-17 21:14                     ` Brian Behlendorf
@ 2013-04-18  2:20                       ` Henry C Chang
  2013-04-18  5:56                       ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 20+ messages in thread
From: Henry C Chang @ 2013-04-18  2:20 UTC (permalink / raw)
  To: Brian Behlendorf
  Cc: Mark Nelson, Stefan Priebe, Yehuda Sadeh, Sage Weil,
	Jeff Mitchell, Aleksey Leonov, ceph-devel

Sorry, this is off topic. I am wondering: if we use zfs as the underlying
filesystem for the ceph osd and let the osd filestore do sync writes, do we
still need the osd journal?

2013/4/18 Brian Behlendorf <behlendorf1@llnl.gov>:
> On 04/17/2013 01:16 PM, Mark Nelson wrote:
>>
>> I'll let Brian talk about the virtues of ZFS,
>
>
> I think the virtues of ZFS have been discussed at length in various other
> forums.  But in short it brings some nice functionality to the table which
> may be useful to ceph and that's worth exploring.
>
>
>>>>>
>>>>> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
>>>>> index c03764f..9f4d63c 100644
>>>>> --- a/module/zfs/zpl_xattr.c
>>>>> +++ b/module/zfs/zpl_xattr.c
>>>>> @@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char *name, void *value,
>>>>>                  goto out;
>>>>>          }
>>>>>
>>>>> +       if (size < i_size_read(xip)) {
>>>>> +               error = -ERANGE;
>>>>> +               goto out;
>>>>> +       }
>>>>> +
>>>>>          error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
>>>>>   out:
>>>>>          if (xip)
>>>>> @@ -263,7 +268,10 @@ zpl_xattr_get_sa(struct inode *ip, const char *name, void *value, size_t size)
>>>>>          if (!size)
>>>>>                  return (nv_size);
>>>>>
>>>>> -       memcpy(value, nv_value, MIN(size, nv_size));
>>>>>
>>>>> +       if (size < nv_size)
>>>>> +               return (-ERANGE);
>>>>
>>>>
>>>> Note, that zpl_xattr_get_sa() is called by __zpl_xattr_get() which can
>>>> also be called by zpl_xattr_get() to test for xattr existence. So it
>>>> needs to make sure that zpl_xattr_set() doesn't fail if getting
>>>> -ERANGE.
>
>
> This shouldn't be a problem.  The zpl_xattr_get() call from zpl_xattr_set()
> passes a NULL value and zero size which will prevent it from hitting the
> ERANGE error.  It will return instead the xattr size as expected.
>
>
>>>>
>>>>> +
>>>>> +       memcpy(value, nv_value, size);
>>>>>
>>>>>          return (MIN(size, nv_size));
>>>>
>>>>
>>>> No need for MIN() here.
>
>
> Thanks for catching that.
>
> I've opened a pull request at github with the updated fix and kicked it off
> for automated testing.  It would be nice to verify this resolves the crash.
>
> https://github.com/zfsonlinux/zfs/pull/1409
>
> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
> index c03764f..42a06ad 100644
>
> --- a/module/zfs/zpl_xattr.c
> +++ b/module/zfs/zpl_xattr.c
> @@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char *name, void
>                 goto out;
>         }
>
> +       if (size < i_size_read(xip)) {
> +               error = -ERANGE;
> +               goto out;
> +       }
> +
>         error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
>  out:
>         if (xip)
> @@ -263,9 +268,12 @@ zpl_xattr_get_sa(struct inode *ip, const char *name, void *
>
>         if (!size)
>                 return (nv_size);
>
> -       memcpy(value, nv_value, MIN(size, nv_size));
> +       if (size < nv_size)
> +               return (-ERANGE);
> +
> +       memcpy(value, nv_value, size);
>
> -       return (MIN(size, nv_size));
> +       return (size);
>  }
>
>  static int
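
The reason the -ERANGE return matters shows up on the caller side. Per the getxattr(2) convention, a caller probes with size 0, allocates, and retries if the value no longer fits. The following is a minimal sketch of such a caller, not code from Ceph or ZFS: read_whole_xattr(), xattr_get_fn, struct demo_attr, demo_get() and the retry limit of 8 are all hypothetical names and values made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical getter: getxattr(2)-like, errors as negative errno. */
typedef ssize_t (*xattr_get_fn)(void *ctx, void *buf, size_t size);

/* Hypothetical in-memory attribute used to exercise the loop. */
struct demo_attr {
	const char *data;
	size_t len;
};

ssize_t
demo_get(void *ctx, void *buf, size_t size)
{
	const struct demo_attr *a = ctx;

	if (size == 0)
		return ((ssize_t)a->len);	/* size probe */
	if (size < a->len)
		return (-ERANGE);		/* never truncate silently */
	memcpy(buf, a->data, a->len);
	return ((ssize_t)a->len);
}

/*
 * Read a whole value: probe for its size, allocate, fetch, and probe
 * again if the getter reports -ERANGE. On success *out owns a
 * malloc'd buffer and the value length is returned.
 */
ssize_t
read_whole_xattr(xattr_get_fn get, void *ctx, void **out)
{
	assert(get != NULL && out != NULL);

	for (int attempt = 0; attempt < 8; attempt++) {
		ssize_t need = get(ctx, NULL, 0);
		if (need < 0)
			return (need);

		size_t cap = need > 0 ? (size_t)need : 1;
		void *buf = malloc(cap);
		if (buf == NULL)
			return (-ENOMEM);

		ssize_t got = get(ctx, buf, cap);
		if (got >= 0) {
			*out = buf;
			return (got);
		}
		free(buf);
		if (got != -ERANGE)
			return (got);
		/* The value grew between the two calls; probe again. */
	}
	return (-ERANGE);
}
```

A getter that silently truncates instead of returning -ERANGE would make the second call appear to succeed with a short value, and a loop like this would have no way to notice.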


* Re: test osd on zfs
  2013-04-17 21:14                     ` Brian Behlendorf
  2013-04-18  2:20                       ` Henry C Chang
@ 2013-04-18  5:56                       ` Stefan Priebe - Profihost AG
  2013-04-18 14:50                         ` Sage Weil
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Priebe - Profihost AG @ 2013-04-18  5:56 UTC (permalink / raw)
  To: Brian Behlendorf
  Cc: Mark Nelson, Yehuda Sadeh, Sage Weil, Jeff Mitchell,
	Henry C Chang, Aleksey Leonov, ceph-devel

On 17.04.2013 at 23:14, Brian Behlendorf <behlendorf1@llnl.gov> wrote:

> On 04/17/2013 01:16 PM, Mark Nelson wrote:
>> I'll let Brian talk about the virtues of ZFS,
> 
> I think the virtues of ZFS have been discussed at length in various other forums.  But in short it brings some nice functionality to the table which may be useful to ceph and that's worth exploring.
Sure, I know about the advantages of zfs.

I was just wondering how ceph can benefit. Right now I have no idea. The osds should be single disks, so zpool and raidz do not matter. Ceph does its own scrubbing and checksumming, and unlike with btrfs, ceph does not know how to use snapshots with zfs. That's why I'm asking.

Greets,
Stefan


* Re: test osd on zfs
  2013-04-18  5:56                       ` Stefan Priebe - Profihost AG
@ 2013-04-18 14:50                         ` Sage Weil
  2013-04-18 20:07                           ` Alex Elsayed
  0 siblings, 1 reply; 20+ messages in thread
From: Sage Weil @ 2013-04-18 14:50 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Brian Behlendorf, Mark Nelson, Yehuda Sadeh, Jeff Mitchell,
	Henry C Chang, Aleksey Leonov, ceph-devel

On Thu, 18 Apr 2013, Stefan Priebe - Profihost AG wrote:
> On 17.04.2013 at 23:14, Brian Behlendorf <behlendorf1@llnl.gov> wrote:
> 
> > On 04/17/2013 01:16 PM, Mark Nelson wrote:
> >> I'll let Brian talk about the virtues of ZFS,
> > 
> > I think the virtues of ZFS have been discussed at length in various other forums.  But in short it brings some nice functionality to the table which may be useful to ceph and that's worth exploring.
> Sure, I know about the advantages of zfs.
> 
> I was just wondering how ceph can benefit. Right now I have no idea. The 
> osds should be single disks, so zpool and raidz do not matter. Ceph does 
> its own scrubbing and checksumming, and unlike with btrfs, ceph does not 
> know how to use snapshots with zfs. That's why I'm asking.

The main things that come to mind:

- zfs checksumming
- ceph can eventually use zfs snapshots similarly to how it uses btrfs 
  snapshots to create stable checkpoints as journal reference points, 
  allowing parallel (instead of writeahead) journaling
- can use raidz beneath a single ceph-osd for better reliability (e.g.,
  2x * raidz instead of 3x replication)
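
As a rough illustration of the last point, the raw-capacity cost of the two layouts can be compared with a little arithmetic. This is a sketch under simplifying assumptions: raw_overhead() is a hypothetical helper, the 5-disk raidz1 group is an example figure that is not from this thread, and raidz padding and metadata overhead are ignored.

```c
#include <assert.h>

/*
 * Raw bytes stored per byte of user data when `copies` replicas are
 * each kept on a raidz group of `disks` drives, `parity` of which
 * hold parity. parity = 0 with disks = 1 models plain single-disk OSDs.
 */
double
raw_overhead(int copies, int disks, int parity)
{
	assert(copies > 0 && parity >= 0 && disks > parity);
	return ((double)copies * disks / (disks - parity));
}
```

Under these assumptions raw_overhead(3, 1, 0) gives 3.0 for 3x replication on single disks, while raw_overhead(2, 5, 1) gives 2.5 for 2x replication over 5-disk raidz1 groups, so the second layout tolerates a disk failure inside each copy while storing fewer raw bytes.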

ZFS doesn't have a clone function that we can use to enable efficient 
cephfs/rbd/rados snaps, but maybe this will motivate someone to implement 
one. :)

sage



* Re: test osd on zfs
  2013-04-18 14:50                         ` Sage Weil
@ 2013-04-18 20:07                           ` Alex Elsayed
  2013-04-19 10:47                             ` Jeff Mitchell
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Elsayed @ 2013-04-18 20:07 UTC (permalink / raw)
  To: ceph-devel

Sage Weil wrote:

<snip>
> The main things that come to mind:
> 
> - zfs checksumming
> - ceph can eventually use zfs snapshots similarly to how it uses btrfs
>   snapshots to create stable checkpoints as journal reference points,
>   allowing parallel (instead of writeahead) journaling
> - can use raidz beneath a single ceph-osd for better reliability (e.g., 2x
>   * raidz instead of 3x replication)
> 
> ZFS doesn't have a clone function that we can use to enable efficient
> cephfs/rbd/rados snaps, but maybe this will motivate someone to implement
> one. :)

Since Btrfs has implemented raid5/6 support (meaning raidz is only a feature 
gain if you want 3x parity, which is unlikely to be useful for an OSD[1]), 
the checksumming may be the only real benefit since it supports sha256 (in 
addition to the non-cryptographic fletcher2/fletcher4), whereas btrfs only 
has crc32c at this time.

[1] A raidz3 with 4 disks is basically raid1, at which point you may as well 
use Ceph-level replication. And a 5-or-more-disk OSD strikes me as a 
questionable way to set it up, considering Ceph's strengths.


* Re: test osd on zfs
  2013-04-18 20:07                           ` Alex Elsayed
@ 2013-04-19 10:47                             ` Jeff Mitchell
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Mitchell @ 2013-04-19 10:47 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: ceph-devel

Alex Elsayed wrote:
> Since Btrfs has implemented raid5/6 support (meaning raidz is only a feature
> gain if you want 3x parity, which is unlikely to be useful for an OSD[1]),
> the checksumming may be the only real benefit since it supports sha256 (in
> addition to the non-cryptographic fletcher2/fletcher4), whereas btrfs only
> has crc32c at this time.

Plus (in my real-world experience) *far* better robustness. If Ceph 
could use either and both had feature parity, I'd choose ZFS in a 
heartbeat. I've had too many simple Btrfs filesystems go corrupt, not 
even using any fancy RAID features.

I wasn't aware that Ceph was using btrfs' file-scope clone command. ZFS 
doesn't have that, although in theory with the new capabilities system 
it could be supported in one implementation without requiring an on-disk 
format change.

--Jeff



Thread overview: 20+ messages
     [not found] <516E7D5C.7080309@nazarianin.com>
2013-04-17 15:19 ` test osd on zfs Sage Weil
2013-04-17 15:57   ` Henry C Chang
2013-04-17 16:37     ` Jeff Mitchell
2013-04-17 17:00       ` Henry C Chang
2013-04-17 17:00       ` Sage Weil
2013-04-17 17:04       ` Yehuda Sadeh
2013-04-17 17:05         ` Sage Weil
2013-04-17 17:15           ` Yehuda Sadeh
2013-04-17 18:06             ` Brian Behlendorf
2013-04-17 18:57             ` Brian Behlendorf
2013-04-17 19:07               ` Yehuda Sadeh
2013-04-17 19:09                 ` Stefan Priebe
2013-04-17 20:16                   ` Mark Nelson
2013-04-17 20:49                     ` Jeff Mitchell
2013-04-17 21:14                     ` Brian Behlendorf
2013-04-18  2:20                       ` Henry C Chang
2013-04-18  5:56                       ` Stefan Priebe - Profihost AG
2013-04-18 14:50                         ` Sage Weil
2013-04-18 20:07                           ` Alex Elsayed
2013-04-19 10:47                             ` Jeff Mitchell
