[libfdisk]: gpt_write_disklabel function robustness to sudden power off

* [libfdisk]: gpt_write_disklabel function robustness to sudden power off
@ 2015-03-20 10:17 Ronan CHAUVIN
  2015-03-20 11:18 ` Karel Zak
  2015-03-24  3:24 ` Dale R. Worley
  0 siblings, 2 replies; 8+ messages in thread
From: Ronan CHAUVIN @ 2015-03-20 10:17 UTC (permalink / raw)
  To: util-linux; +Cc: matthieu CASTET, Alexandre Dilly

Hello everyone,

I have a question regarding the fdisk library (libfdisk) provided in
util-linux 2.26. I use it to create an MBR/GPT partition scheme on an
eMMC device. I also use the partition renaming mechanism to switch from
a normal boot to an update boot (the bootloader compares partition
names to choose the one to boot from).

I was wondering whether the gpt_write_disklabel function is robust against
sudden power-off. In the source code, the write procedure is as follows
(UEFI requires writing in this specific order):

1) backup partition tables
2) backup GPT header
3) primary partition tables
4) primary GPT header
5) protective MBR

and it uses the standard Linux write() function on a file descriptor. Is
the write order guaranteed, given that the operations are not synchronous?
I know that the Linux I/O scheduler can "optimize" the order of write
operations. This can cause a problem if only the primary and backup
headers are written but not the partition tables.

Thank you,

-- 
Ronan CHAUVIN
Embedded Software Engineer
ASIC team
--------------------------------
Parrot
174, quai de Jemmapes
75010 Paris  France
--------------------------------
www.parrot.com

* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
  2015-03-20 10:17 [libfdisk]: gpt_write_disklabel function robustness to sudden power off Ronan CHAUVIN
@ 2015-03-20 11:18 ` Karel Zak
  2015-03-23 18:31   ` Peter Cordes
  2015-03-24  3:24 ` Dale R. Worley
  1 sibling, 1 reply; 8+ messages in thread
From: Karel Zak @ 2015-03-20 11:18 UTC (permalink / raw)
  To: Ronan CHAUVIN; +Cc: util-linux, matthieu CASTET, Alexandre Dilly

On Fri, Mar 20, 2015 at 11:17:13AM +0100, Ronan CHAUVIN wrote:
> I have a question regarding the fdisk library (libfdisk) provided in
> util-linux 2.26. I use it to create an MBR/GPT partition scheme on an
> eMMC device. I also use the partition renaming mechanism to switch from a
> normal boot to an update boot (the bootloader compares partition names to
> choose the one to boot from).
> 
> I was wondering whether the gpt_write_disklabel function is robust against
> sudden power-off. In the source code, the write procedure is as follows
> (UEFI requires writing in this specific order):
> 
> 1) backup partition tables
> 2) backup GPT header
> 3) primary partition tables
> 4) primary GPT header
> 5) protective MBR
> 
> and it uses the standard Linux write() function on a file descriptor. Is
> the write order guaranteed, given that the operations are not synchronous?
> I know that the Linux I/O scheduler can "optimize" the order of write
> operations. This can cause a problem if only the primary and backup
> headers are written but not the partition tables.

The order suggested by UEFI is there because the GPT header contains a CRC
of the partition array, and the header itself is validated by another
top-level CRC. If you read things in reverse order (PMBR, header,
partitions) and verify all the CRCs, then you can be sure that everything
is valid.
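
To make the read-side check concrete, here is a minimal sketch in C of validating one GPT copy against the two CRCs mentioned above. It is not libfdisk code: the names crc32_ieee() and gpt_copy_is_valid() are made up for illustration, the field offsets (header size at 12, header CRC at 16, entry count at 80, entry size at 84, array CRC at 88) are taken from the UEFI GPT header layout, and the sketch assumes the header fits in one 512-byte buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), the variant GPT uses. */
uint32_t crc32_ieee(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (n--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* GPT fields are stored little-endian on disk. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Returns 1 if the header in hdr[0..hdr_len) and the partition array in
 * entries[0..entries_len) agree with the CRCs stored in the header.
 * Assumption of this sketch: the header is at most 512 bytes long.
 */
int gpt_copy_is_valid(const uint8_t *hdr, size_t hdr_len,
                      const uint8_t *entries, size_t entries_len)
{
    uint8_t tmp[512];
    uint32_t hsize, nr, esize;

    if (hdr_len < 92 || memcmp(hdr, "EFI PART", 8) != 0)
        return 0;
    hsize = le32(hdr + 12);
    if (hsize < 92 || hsize > hdr_len || hsize > sizeof(tmp))
        return 0;

    /* The header CRC is computed with its own field (offset 16) zeroed. */
    memcpy(tmp, hdr, hsize);
    memset(tmp + 16, 0, 4);
    if (crc32_ieee(tmp, hsize) != le32(hdr + 16))
        return 0;

    /* The header also stores the CRC of the whole partition entry array. */
    nr = le32(hdr + 80);
    esize = le32(hdr + 84);
    if ((uint64_t)nr * esize != entries_len)
        return 0;
    return crc32_ieee(entries, entries_len) == le32(hdr + 88);
}
```

A reader that finds the primary copy invalid with such a check can then try the backup copy at the end of the device in the same way.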

IMHO the "right" write procedure is just a holy grail... in reality we have
no guarantee at all (due to the storage HW).

The important thing is to be able to detect inconsistent data on the device
when you *read* the GPT.

We can add fsync() between the steps, but I still doubt it will
improve anything. For example, libparted also uses only write().

We call fsync() before close() in libfdisk/src/context.c:
fdisk_deassign_device().
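
For reference, adding fsync() between the steps could look roughly like the C sketch below. This is not the libfdisk implementation: struct gpt_chunk and gpt_write_with_barriers() are hypothetical names, and the caller is assumed to have already prepared the five buffers and their byte offsets. fsync() only asks the kernel and the device to flush; whether the hardware honours that is outside the program's control.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* One contiguous piece of the label: buffer, length, byte offset on the device. */
struct gpt_chunk {
    const void *buf;
    size_t      len;
    off_t       off;
};

/* pwrite() may write fewer bytes than requested; loop until all are written. */
static int write_all_at(int fd, const void *buf, size_t len, off_t off)
{
    const uint8_t *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/* Write one copy (array, then header) and request a flush before returning. */
static int write_copy_and_sync(int fd, const struct gpt_chunk *entries,
                               const struct gpt_chunk *header)
{
    if (write_all_at(fd, entries->buf, entries->len, entries->off) < 0)
        return -1;
    if (write_all_at(fd, header->buf, header->len, header->off) < 0)
        return -1;
    return fsync(fd);
}

/* Hypothetical helper: backup copy, flush, primary copy, flush, then PMBR. */
int gpt_write_with_barriers(int fd,
                            const struct gpt_chunk *backup_entries,
                            const struct gpt_chunk *backup_header,
                            const struct gpt_chunk *primary_entries,
                            const struct gpt_chunk *primary_header,
                            const struct gpt_chunk *pmbr)
{
    if (write_copy_and_sync(fd, backup_entries, backup_header) < 0)
        return -1;
    if (write_copy_and_sync(fd, primary_entries, primary_header) < 0)
        return -1;
    if (write_all_at(fd, pmbr->buf, pmbr->len, pmbr->off) < 0)
        return -1;
    return fsync(fd);
}
```

The flush sits between the two copies rather than between an array and its header: a torn copy is caught by its CRC on the next read, and the point of the barrier is only that the other copy is not being rewritten at the same time.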

Conclusion: be pessimistic and verify everything you read from the disk, be
optimistic when you write to the disk, and when someone starts talking
about write guarantees, run far away. That's the whole story.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com

* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
  2015-03-20 11:18 ` Karel Zak
@ 2015-03-23 18:31   ` Peter Cordes
  2015-03-24 14:05     ` Ronan CHAUVIN
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Cordes @ 2015-03-23 18:31 UTC (permalink / raw)
  To: Karel Zak; +Cc: Ronan CHAUVIN, util-linux, matthieu CASTET, Alexandre Dilly

On Fri, Mar 20, 2015 at 12:18:12PM +0100, Karel Zak wrote:
> Conclusion: be pessimistic and verify everything you read from the disk, be
> optimistic when you write to the disk, and when someone starts talking
> about write guarantees, run far away. That's the whole story.

The whole GPT is what, 16kiB or so?  On most storage, you could
force data to persistent storage with a granularity of 4kiB, with
fdatasync(2) (assuming that works on block devices, not just files).  

But some SSDs lie, and will claim that data is flushed to persistent
storage when it isn't.  (According to one of Marc Merlin's BTRFS
talks).

 So I'd agree with Karel that the current method is probably
ideal.  write() everything, then fsync() so it all hits the disk in
one multi-sector write op.  Not necessarily atomic, but probably.

If we think the backup partition table / GPT header is useful, then

    write(backup); fsync();
    sleep(1sec);
    write(primary); fsync();

is potentially worthwhile.  On an SSD, there's the mapping metadata
separate from the actual data, and the write block size might be 8kiB
on some current disks.  (This is why I'm thinking that the 1sec pause
between writing the backup and primary would give a chance for
whatever write-back caching layers to actually flush for real.)

 I don't know how likely that is to help on any real storage setup;
I'm really just making that up.  I also don't know whether the backup
and primary are in separate 4kiB or 8kiB data blocks.  Even if not, it
could still be useful to always be writing blocks where one of the two
copies written matches what's already there, so there's a valid table
whether the old or new version is there when you try to read it back.

So I think there's potentially a tiny benefit to a fsync();sleep(),
but I'd wait for confirmation from a storage expert before
implementing it.  The current method probably just sends one write op
to the hardware for the whole GPT, which is nice.

-- 
#define X(x,y) x##y
Peter Cordes ;  e-mail: X(peter@cor , des.ca)

"The gods confound the man who first found out how to distinguish the hours!
 Confound him, too, who in this place set up a sundial, to cut and hack
 my day so wretchedly into small pieces!" -- Plautus, 200 BC

* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
  2015-03-20 10:17 [libfdisk]: gpt_write_disklabel function robustness to sudden power off Ronan CHAUVIN
  2015-03-20 11:18 ` Karel Zak
@ 2015-03-24  3:24 ` Dale R. Worley
  2015-03-24 13:54   ` Ronan CHAUVIN
  1 sibling, 1 reply; 8+ messages in thread
From: Dale R. Worley @ 2015-03-24  3:24 UTC (permalink / raw)
  To: Ronan CHAUVIN; +Cc: util-linux

Ronan CHAUVIN <ronan.chauvin@parrot.com> writes:
> I was wondering if the gpt_write_disklabel function was robust to sudden 
> power-off.

gpt_write_disklabel can only be "robust to sudden power-off" if there is
some expectation that the partition tables, etc. can be in some sort of
"consistent" state if power is lost.  But it seems to me that there is
no such consistent state -- what condition would you want that
information to be in?  The old information is irretrievably lost during
the operation; the only problem that can be caused by sudden power-off
is that you have to enter the new information into fdisk again before
the disk has a valid structure.

Dale

* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
  2015-03-24  3:24 ` Dale R. Worley
@ 2015-03-24 13:54   ` Ronan CHAUVIN
  0 siblings, 0 replies; 8+ messages in thread
From: Ronan CHAUVIN @ 2015-03-24 13:54 UTC (permalink / raw)
  To: Dale R. Worley; +Cc: util-linux, matthieu CASTET, Alexandre Dilly

Thank you for your answer.

If we are not resizing or adding partitions but just renaming one, we want
to be sure that at least one copy (primary or backup) of the GPT header and
partition array is not corrupted by a sudden power-off. If the write order
is not guaranteed, then the primary and backup GPT headers can be written
to the eMMC without the corresponding partition arrays, and the on-disk
state will be inconsistent.

For example, suppose the writes actually reach the disk in this order:

1) backup GPT header
2) primary GPT header
--> power-off <--
3) primary partition tables
4) backup partition tables
5) protective MBR

then the partition-array CRC stored in both GPT headers will not match the
partition arrays on disk.
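
The scenario can be reproduced with a small in-memory model in C. This is only a toy sketch: each GPT copy is reduced to a single partition name (the "array") plus the CRC of that name (the "header"), the protective MBR is left out, and the names boot_a and boot_b, struct gpt_copy and valid_copies_after_cut() are all invented for the illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAME_LEN 16

/* Toy model of one GPT copy: the "array" is a single partition name and the
 * "header" is reduced to the CRC of that array. Real headers hold more. */
struct gpt_copy {
    char     array[NAME_LEN];
    uint32_t header_crc;
};

static uint32_t crc32_ieee(const void *data, size_t n)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (n--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Zero-pad (and, if needed, truncate) a name to NAME_LEN bytes. */
static void pad_name(char out[NAME_LEN], const char *name)
{
    size_t n = strlen(name);

    memset(out, 0, NAME_LEN);
    memcpy(out, name, n < NAME_LEN ? n : NAME_LEN - 1);
}

static void write_array(struct gpt_copy *c, const char *name)
{
    pad_name(c->array, name);
}

static void write_header(struct gpt_copy *c, const char *name)
{
    char tmp[NAME_LEN];

    pad_name(tmp, name);
    c->header_crc = crc32_ieee(tmp, NAME_LEN);
}

static int copy_is_valid(const struct gpt_copy *c)
{
    return crc32_ieee(c->array, NAME_LEN) == c->header_crc;
}

/* Steps a rename performs; `order` lists them as they reach the media. */
enum step { BACKUP_ARRAY, BACKUP_HEADER, PRIMARY_ARRAY, PRIMARY_HEADER };

/* Number of copies (0..2) that still validate if power is lost after the
 * first `done` steps of `order` have reached the media. */
int valid_copies_after_cut(const enum step *order, int done)
{
    struct gpt_copy primary, backup;

    write_array(&primary, "boot_a");      /* old, consistent state */
    write_header(&primary, "boot_a");
    backup = primary;

    for (int i = 0; i < done; i++) {
        switch (order[i]) {
        case BACKUP_ARRAY:   write_array(&backup, "boot_b");   break;
        case BACKUP_HEADER:  write_header(&backup, "boot_b");  break;
        case PRIMARY_ARRAY:  write_array(&primary, "boot_b");  break;
        case PRIMARY_HEADER: write_header(&primary, "boot_b"); break;
        }
    }
    return copy_is_valid(&primary) + copy_is_valid(&backup);
}
```

With the order { BACKUP_HEADER, PRIMARY_HEADER, ... } and a cut after two steps, the function returns 0: neither copy validates. With the steps reaching the media in program order (backup array, backup header, primary array, primary header), at least one copy validates at every possible cut point in this model.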

On 03/24/2015 04:24 AM, Dale R. Worley wrote:
> But it seems to me that there is
> no such consistent state -- what condition would you want that
> information to be in?  The old information is irretrievably lost during
> the operation;

* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
  2015-03-23 18:31   ` Peter Cordes
@ 2015-03-24 14:05     ` Ronan CHAUVIN
  2015-03-24 14:25       ` Peter Cordes
  0 siblings, 1 reply; 8+ messages in thread
From: Ronan CHAUVIN @ 2015-03-24 14:05 UTC (permalink / raw)
  To: Peter Cordes, Karel Zak; +Cc: util-linux, matthieu CASTET, Alexandre Dilly

Thank you for your answer.

On 03/23/2015 07:31 PM, Peter Cordes wrote:
> On Fri, Mar 20, 2015 at 12:18:12PM +0100, Karel Zak wrote:
>> Conclusion: be pessimistic and verify everything you read from the disk, be
>> optimistic when you write to the disk, and when someone starts talking
>> about write guarantees, run far away. That's the whole story.
> The whole GPT is what, 16kiB or so?  On most storage, you could
> force data to persistent storage with a granularity of 4kiB, with
> fdatasync(2) (assuming that works on block devices, not just files).
The whole GPT is 16kiB (MBR + GPT header + partition array). There are two
copies of the GPT, one at the beginning of the disk and one at the end. The
bootloader verifies the integrity of the header and the partition array
with a CRC32.
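
As a rough illustration of where the two copies sit, the C sketch below computes the sector layout under some assumptions that are not stated in this thread: 512-byte logical sectors, 128 partition entries of 128 bytes each, and the primary entry array starting at LBA 2 (the header's own field gives the real location). struct gpt_layout and gpt_layout_for() are hypothetical names.

```c
#include <stdint.h>

/* Sector (LBA) positions of the pieces of a GPT label. */
struct gpt_layout {
    uint64_t pmbr_lba;
    uint64_t primary_header_lba;
    uint64_t primary_entries_lba;
    uint64_t backup_entries_lba;
    uint64_t backup_header_lba;
    uint64_t entries_sectors;     /* size of one entry array, in sectors */
};

struct gpt_layout gpt_layout_for(uint64_t total_sectors, uint32_t sector_size,
                                 uint32_t nr_entries, uint32_t entry_size)
{
    struct gpt_layout l;
    uint64_t array_bytes = (uint64_t)nr_entries * entry_size;

    /* Round the entry array up to whole sectors. */
    l.entries_sectors = (array_bytes + sector_size - 1) / sector_size;

    l.pmbr_lba = 0;               /* protective MBR */
    l.primary_header_lba = 1;     /* primary header right after it */
    l.primary_entries_lba = 2;    /* assumed usual placement */

    /* The backup header occupies the last sector; its array sits just before. */
    l.backup_header_lba = total_sectors - 1;
    l.backup_entries_lba = l.backup_header_lba - l.entries_sectors;
    return l;
}
```

With those assumed defaults the entry array alone is 16 KiB (32 sectors), and the header and the protective MBR take one more sector each, so the primary and backup copies are separated by nearly the whole device.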
> But some SSDs lie, and will claim that data is flushed to persistent
> storage when it isn't.  (According to one of Marc Merlin's BTRFS
> talks).
>
>   So I'd agree with Karel that the current method is probably
> ideal.  write() everything, then fsync() so it all hits the disk in
> one multi-sector write op.  Not necessarily atomic, but probably.
As the blocks are not contiguous (primary and backup), this cannot be
done in a single write operation.
> If we think the backup partition table / GPT header is useful,
> write(backup); fsync();
> sleep(1sec);
> write(primary); fsync();
> is potentially worthwhile.  On an SSD, there's the mapping metadata
> separate from the actual data, and the write block size might be 8kiB
> on some current disks.  (This is why I'm thinking that the 1sec pause
> between writing the backup and primary would give a chance for
> whatever write-back caching layers to actually flush for real.)
>
>   I don't know how likely that is to help on any real storage setup;
> I'm really just making that up.  I also don't know whether the backup
> and primary are in separate 4kiB or 8kiB data blocks.  Even if not, it
> could still be useful to always be writing blocks where one of the two
> copies written matches what's already there, so there's a valid table
> whether the old or new version is there when you try to read it back.
>
> So I think there's potentially a tiny benefit to a fsync();sleep(),
> but I'd wait for confirmation from a storage expert before
> implementing it.  The current method probably just sends one write op
> to the hardware for the whole GPT, which is nice.
I agree that we should wait for confirmation from a storage expert, but the
fsync() and sleep() combination should guarantee the operation order on
most hardware.
>

Best regards,

* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
  2015-03-24 14:05     ` Ronan CHAUVIN
@ 2015-03-24 14:25       ` Peter Cordes
  2015-03-26 13:07         ` Ronan CHAUVIN
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Cordes @ 2015-03-24 14:25 UTC (permalink / raw)
  To: Ronan CHAUVIN; +Cc: Karel Zak, util-linux, matthieu CASTET, Alexandre Dilly

On Tue, Mar 24, 2015 at 03:05:36PM +0100, Ronan CHAUVIN wrote:
>
> On 03/23/2015 07:31 PM, Peter Cordes wrote:
>> On Fri, Mar 20, 2015 at 12:18:12PM +0100, Karel Zak wrote:
>>> Conclusion: be pessimistic and verify everything you read from the disk, be
>>> optimistic when you write to the disk, and when someone starts talking
>>> about write guarantees, run far away. That's the whole story.
>> The whole GPT is what, 16kiB or so?  On most storage, you could
>> force data to persistent storage with a granularity of 4kiB, with
>> fdatasync(2) (assuming that works on block devices, not just files).

> The whole GPT is 16kiB (MBR + GPT header + partition array). There are two
> copies of the GPT, one at the beginning of the disk and one at the end. The
> bootloader verifies the integrity of the header and the partition array
> with a CRC32.

>>   So I'd agree with Karel that the current method is probably
>> ideal.  write() everything, then fsync() so it all hits the disk in
>> one multi-sector write op.  Not necessarily atomic, but probably.
> As the blocks are not contiguous (primary and backup), this cannot be
> done in a single write operation.

So at least one of the four 4kiB sectors doesn't get written at all?
Because if all the sectors are getting written, regardless of order,
Linux will merge the IOs into one write request to send over the SATA
(or whatever) wire.  Write request merging is useful even on SSDs, so
Linux does it.

 Even if there is a sector that doesn't get written, it's probably
still academic.  Sending a request in a single write OP doesn't make
it atomic.  On a magnetic disk, the data will still probably all
hit the platter on the same rotation, just by powering down the write
head as it flies over the sector you aren't writing, so the window for
a power failure to cause a problem is quite small.  I'm sure SSDs are
far more complicated.

> I agree that we should wait for confirmation from a storage expert, but the
> fsync() and sleep() combination should guarantee the operation order on
> most hardware.

 Probably 1/10th of a second is long enough, but still short enough to
not be annoying.  If you're editing the partition table of a disk
that isn't idle (in which case even 1 sec might not be long enough for
the write to hit disk after fdatasync()), and you don't have the
system on a UPS, I think we maybe don't need to waste 0.9 seconds of
everyone's time just for this hypothetical user.

* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
  2015-03-24 14:25       ` Peter Cordes
@ 2015-03-26 13:07         ` Ronan CHAUVIN
  0 siblings, 0 replies; 8+ messages in thread
From: Ronan CHAUVIN @ 2015-03-26 13:07 UTC (permalink / raw)
  To: Peter Cordes; +Cc: Karel Zak, util-linux, matthieu CASTET, Alexandre Dilly



On 03/24/2015 03:25 PM, Peter Cordes wrote:
> On Tue, Mar 24, 2015 at 03:05:36PM +0100, Ronan CHAUVIN wrote:
>> On 03/23/2015 07:31 PM, Peter Cordes wrote:
>>> On Fri, Mar 20, 2015 at 12:18:12PM +0100, Karel Zak wrote:
>>>> Conclusion: be pessimistic and verify everything you read from the disk, be
>>>> optimistic when you write to the disk, and when someone starts talking
>>>> about write guarantees, run far away. That's the whole story.
>>> The whole GPT is what, 16kiB or so?  On most storage, you could
>>> force data to persistent storage with a granularity of 4kiB, with
>>> fdatasync(2) (assuming that works on block devices, not just files).
>> The whole GPT is 16kiB (MBR + GPT header + partition array). There are two
>> copies of the GPT, one at the beginning of the disk and one at the end. The
>> bootloader verifies the integrity of the header and the partition array
>> with a CRC32.
>>>    So I'd agree with Karel that the current method is probably
>>> ideal.  write() everything, then fsync() so it all hits the disk in
>>> one multi-sector write op.  Not necessarily atomic, but probably.
>> As the blocks are not contiguous (primary and backup), this cannot be
>> done in a single write operation.
> So at least one of the four 4kiB sectors doesn't get written at all?
> Because if all the sectors are getting written, regardless of order,
> Linux will merge the IOs into one write request to send over the SATA
> (or whatever) wire.  Write request merging is useful even on SSDs, so
> Linux does it.
>
>   Even if there is a sector that doesn't get written, it's probably
> still academic.  Sending a request in a single write OP doesn't make
> it atomic.  On a magnetic disk, the data will still probably all
> hit the platter on the same rotation, just by powering down the write
> head as it flies over the sector you aren't writing, so the window for
> a power failure to cause a problem is quite small.  I'm sure SSDs are
> far more complicated.
Whether a write op completes as intended clearly depends on the hardware...
The primary/backup mechanism and the CRC checks are there to detect these
hardware failures.
>> I agree that we should wait for confirmation from a storage expert, but the
>> fsync() and sleep() combination should guarantee the operation order on
>> most hardware.
>   Probably 1/10th of a second is long enough, but still short enough to
> not be annoying.  If you're editing the partition table of a disk
> that isn't idle (in which case even 1 sec might not be long enough for
> the write to hit disk after fdatasync()), and you don't have the
> system on a UPS, I think we maybe don't need to waste 0.9 seconds of
> everyone's time just for this hypothetical user.
>
>
I agree that we don't need to waste 1 second of everyone's time.
Nevertheless, even just an fsync() between writing the backup and the
primary GPT copies makes it more likely that the data is actually
written to the disk (the disk cache will be flushed).

