* Running on disks that lose their head
@ 2013-11-06 0:32 Loic Dachary
[not found] ` <52798E1D.9090807-cLsNCMjd+0JAfugRpC6u6w@public.gmane.org>
2013-11-06 16:41 ` [ceph-users] " Loic Dachary
0 siblings, 2 replies; 3+ messages in thread
From: Loic Dachary @ 2013-11-06 0:32 UTC (permalink / raw)
To: Ceph Development, ceph-users
Hi Ceph,
People from Western Digital suggested ways to take better advantage of disk error reporting. They gave two examples that struck my imagination. First, there are errors that look like the disk is dying (read/write failures) when it is only a transient problem, and the driver should be able to tell the difference by properly interpreting the available information. They said that the extra life you get by not decommissioning a disk that only has a transient error is significant. The second example is when one head out of ten fails: the disk can keep working with the nine remaining heads. Losing 1/10 of the disk is likely to result in a full re-install of the Ceph OSD. But, again, the disk could keep going after that, with 9/10 of its original capacity. And Ceph is good at handling OSD failures.
All this is news to me and sounds really cool. But I'm sure there are people who already know about it and I'm eager to hear their opinion :-)
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
* Re: Running on disks that lose their head
[not found] ` <52798E1D.9090807-cLsNCMjd+0JAfugRpC6u6w@public.gmane.org>
@ 2013-11-06 9:33 ` Sage Weil
0 siblings, 0 replies; 3+ messages in thread
From: Sage Weil @ 2013-11-06 9:33 UTC (permalink / raw)
To: Loic Dachary; +Cc: Ceph Development, ceph-users
On Wed, 6 Nov 2013, Loic Dachary wrote:
> Hi Ceph,
>
> People from Western Digital suggested ways to take better advantage of
> disk error reporting. They gave two examples that struck my
> imagination. First, there are errors that look like the disk is dying
> (read/write failures) when it is only a transient problem, and the
> driver should be able to tell the difference by properly interpreting
> the available information. They said that the extra life you get by
> not decommissioning a disk that only has a transient error is
This makes me think we really need to build or integrate with some generic
SMART reporting infrastructure so that we can identify disks that are
failing or going to fail. What to do with that information is another
question; initially I would lean toward just marking the disk out, but
there may be smarter alternatives to investigate.
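
As a rough illustration of the kind of check such SMART reporting could
feed, here is a minimal Python sketch. It assumes the attribute table
layout printed by `smartctl -A` (ten whitespace-separated columns, raw
value last). The watched attribute IDs (5, 197, 198) are standard SMART
IDs, but the policy (any nonzero raw value marks the disk as suspect),
the function names, and the sample table are assumptions made up for
this example; nothing here is part of Ceph.

```python
# Standard SMART attribute IDs commonly associated with media trouble.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

# Made-up sample in the column layout of `smartctl -A`; not real disk data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   069   060   000    Old_age   Always       -       31
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} parsed from attribute-table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[int(fields[0])] = int(fields[9])
    return attrs


def suspect_attributes(attrs):
    """Names of watched attributes whose raw value is nonzero (assumed policy)."""
    return sorted(name for attr_id, name in WATCHED.items()
                  if attrs.get(attr_id, 0) > 0)


if __name__ == "__main__":
    print(suspect_attributes(parse_smart_attributes(SAMPLE)))
```

Whether a nonempty result should lead to marking the disk out, or to
something gentler, is exactly the open question raised above.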
> significant. The second example is when one head out of ten fails:
> the disk can keep working with the nine remaining heads. Losing 1/10 of
> the disk is likely to result in a full re-install of the Ceph OSD. But,
> again, the disk could keep going after that, with 9/10 of its original
> capacity. And Ceph is good at handling OSD failures.
Yeah... but if you lose 1/10 of a block device, any existing local file
system is going to blow up. I suspect this is something that newfangled
interfaces like Kinetic will be much better at. Even then, though, it is
challenging for anything sitting above to cope with losing some random
subset of its data underneath. To a first approximation, for this to be
useful, the fs and disk would need to keep, say, all the data in a
particular PG confined to a single platter, so that when a head goes the
other PGs are still fully intact and usable. It is probably a long way to
get from here to there...
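
To make the difference concrete, here is a toy Python model of the two
layouts. It is only a sketch under stated assumptions: it treats each
head as one independent failure domain (a simplification of "platter"),
uses ten heads as in the example from the thread, and the layout
functions and PG count are hypothetical. It does not describe how Ceph,
any file system, or any real disk places data.

```python
HEADS = 10      # "one head out of ten" from the thread
NUM_PGS = 100   # arbitrary number of placement groups for the illustration


def confined_layout(num_pgs=NUM_PGS, heads=HEADS):
    """Each PG keeps all of its data under exactly one head."""
    return {pg: {pg % heads} for pg in range(num_pgs)}


def striped_layout(num_pgs=NUM_PGS, heads=HEADS):
    """Each PG has some data under every head (no placement awareness)."""
    return {pg: set(range(heads)) for pg in range(num_pgs)}


def damaged_pgs(layout, failed_head):
    """PGs that lose at least part of their data when failed_head dies."""
    return {pg for pg, used in layout.items() if failed_head in used}


if __name__ == "__main__":
    for name, layout in (("confined", confined_layout()),
                         ("striped", striped_layout())):
        hit = damaged_pgs(layout, failed_head=3)
        print(name, "damaged:", len(hit), "intact:", len(layout) - len(hit))
```

In this model a head failure damages one tenth of the PGs when data is
confined per head, and every PG when data is spread across all heads,
which is why the layers above the disk would need to cooperate on
placement for the remaining 9/10 to be usable.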
> All this is news to me and sounds really cool. But I'm sure there are
> people who already know about it and I'm eager to hear their opinion :-)
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>
* Re: [ceph-users] Running on disks that lose their head
2013-11-06 0:32 Running on disks that lose their head Loic Dachary
[not found] ` <52798E1D.9090807-cLsNCMjd+0JAfugRpC6u6w@public.gmane.org>
@ 2013-11-06 16:41 ` Loic Dachary
1 sibling, 0 replies; 3+ messages in thread
From: Loic Dachary @ 2013-11-06 16:41 UTC (permalink / raw)
To: Ceph Development, ceph-users
An anonymous kernel developer sends this link:
http://en.wikipedia.org/wiki/Error_recovery_control
On 06/11/2013 08:32, Loic Dachary wrote:
> Hi Ceph,
>
> People from Western Digital suggested ways to take better advantage of disk error reporting. They gave two examples that struck my imagination. First, there are errors that look like the disk is dying (read/write failures) when it is only a transient problem, and the driver should be able to tell the difference by properly interpreting the available information. They said that the extra life you get by not decommissioning a disk that only has a transient error is significant. The second example is when one head out of ten fails: the disk can keep working with the nine remaining heads. Losing 1/10 of the disk is likely to result in a full re-install of the Ceph OSD. But, again, the disk could keep going after that, with 9/10 of its original capacity. And Ceph is good at handling OSD failures.
>
> All this is news to me and sounds really cool. But I'm sure there are people who already know about it and I'm eager to hear their opinion :-)
>
> Cheers
>
--
Loïc Dachary, Artisan Logiciel Libre