* [libfdisk]: gpt_write_disklabel function robustness to sudden power off
@ 2015-03-20 10:17 Ronan CHAUVIN
2015-03-20 11:18 ` Karel Zak
2015-03-24 3:24 ` Dale R. Worley
0 siblings, 2 replies; 8+ messages in thread
From: Ronan CHAUVIN @ 2015-03-20 10:17 UTC (permalink / raw)
To: util-linux; +Cc: matthieu CASTET, Alexandre Dilly
Hello everyone,
I have a question regarding the fdisk library (libfdisk) provided in
util-linux 2.26. I use it to create an MBR/GPT partition scheme on an
eMMC device. I also use the partition renaming mechanism to switch from
a normal boot to an update boot (the bootloader compares partition names
to choose which one to boot from).
I was wondering whether the gpt_write_disklabel function is robust to
sudden power-off. In the source code, the write procedure is as follows
(UEFI requires writing in this specific order):
1) backup partition tables
2) backup GPT header
3) primary partition tables
4) primary GPT header
5) protective MBR
and it uses the standard Linux write() function on a file descriptor. Is
the write order guaranteed, given that the operations are not
synchronous? I know that the Linux I/O scheduler can "optimize" the
order of write operations. This can cause a problem if only the primary
and backup headers are written but not the partition tables.
Thank you,
--
Ronan CHAUVIN
Embedded Software Engineer
ASIC team
--------------------------------
Parrot
174, quai de Jemmapes
75010 Paris France
--------------------------------
www.parrot.com
* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
2015-03-20 10:17 [libfdisk]: gpt_write_disklabel function robustness to sudden power off Ronan CHAUVIN
@ 2015-03-20 11:18 ` Karel Zak
2015-03-23 18:31 ` Peter Cordes
2015-03-24 3:24 ` Dale R. Worley
1 sibling, 1 reply; 8+ messages in thread
From: Karel Zak @ 2015-03-20 11:18 UTC (permalink / raw)
To: Ronan CHAUVIN; +Cc: util-linux, matthieu CASTET, Alexandre Dilly
On Fri, Mar 20, 2015 at 11:17:13AM +0100, Ronan CHAUVIN wrote:
> I have a question regarding the fdisk library (libfdisk) provided in
> util-linux 2.26. I use it to create an MBR/GPT partition scheme on an
> eMMC device. I also use the partition renaming mechanism to switch from a
> normal boot to an update boot (the bootloader compares partition names to
> choose which one to boot from).
>
> I was wondering whether the gpt_write_disklabel function is robust to
> sudden power-off. In the source code, the write procedure is as follows
> (UEFI requires writing in this specific order):
>
> 1) backup partition tables
> 2) backup GPT header
> 3) primary partition tables
> 4) primary GPT header
> 5) protective MBR
>
> and it uses the standard Linux write() function on a file descriptor. Is
> the write order guaranteed, given that the operations are not
> synchronous? I know that the Linux I/O scheduler can "optimize" the
> order of write operations. This can cause a problem if only the primary
> and backup headers are written but not the partition tables.
The order suggested by UEFI is there because the GPT header contains a CRC
of the partition array, and the header itself is validated by another
top-level CRC. If you read things in reverse order (PMBR, header,
partitions) and verify all the CRCs, then you can be sure that everything
is valid.
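The read-side check described here can be sketched roughly as follows. This is a minimal illustration, not libfdisk's code: the helper names (`crc32_calc`, `le32`, `gpt_copy_is_valid`) are hypothetical, the field offsets are taken from the UEFI GPT header layout, and the caller is assumed to have already read the header sector and the partition array into memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard CRC-32 (reflected, polynomial 0xEDB88320), bitwise for brevity. */
static uint32_t crc32_calc(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Read a little-endian 32-bit value regardless of host byte order. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Byte offsets of the header fields used below (UEFI GPT header layout). */
enum {
    GPT_OFF_HDR_SIZE   = 12,
    GPT_OFF_HDR_CRC    = 16,
    GPT_OFF_NENTRIES   = 80,
    GPT_OFF_ENTRY_SIZE = 84,
    GPT_OFF_ARRAY_CRC  = 88,
    GPT_MIN_HDR_SIZE   = 92
};

/* Hypothetical helper: returns 1 if the header and array agree, else 0. */
static int gpt_copy_is_valid(const uint8_t *hdr, size_t hdr_buf_len,
                             const uint8_t *array, size_t array_buf_len)
{
    uint8_t tmp[512];
    uint32_t hdr_size, nentries, entry_size;

    if (hdr_buf_len < GPT_MIN_HDR_SIZE || memcmp(hdr, "EFI PART", 8) != 0)
        return 0;

    hdr_size = le32(hdr + GPT_OFF_HDR_SIZE);
    if (hdr_size < GPT_MIN_HDR_SIZE || hdr_size > sizeof(tmp) ||
        hdr_size > hdr_buf_len)
        return 0;

    /* Top-level CRC: computed over the header with its CRC field zeroed. */
    memcpy(tmp, hdr, hdr_size);
    memset(tmp + GPT_OFF_HDR_CRC, 0, 4);
    if (crc32_calc(tmp, hdr_size) != le32(hdr + GPT_OFF_HDR_CRC))
        return 0;

    /* Array CRC: covers nentries * entry_size bytes of the partition array. */
    nentries = le32(hdr + GPT_OFF_NENTRIES);
    entry_size = le32(hdr + GPT_OFF_ENTRY_SIZE);
    if (entry_size == 0 || nentries > array_buf_len / entry_size)
        return 0;

    return crc32_calc(array, (size_t)nentries * entry_size) ==
           le32(hdr + GPT_OFF_ARRAY_CRC);
}
```

A reader would presumably run this check on the primary copy first and fall back to the backup copy if it fails. A header that was written without its matching array is rejected by the second comparison.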
IMHO the "right" write procedure is just a holy grail... in reality we have
no guarantee at all (due to storage HW).
The important thing is to be able to detect inconsistent data on the device
when you *read* the GPT.
We can add fsync() between the steps, but I still have doubts it will
improve anything. For example libparted also uses write() only.
We call fsync() before close() in libfdisk/src/context.c:
fdisk_deassign_device().
Conclusion: be pessimistic and verify everything you read from disk, be
optimistic when you write to the disk, and when someone starts talking
about write guarantees, run far away. That's the whole story.
Karel
--
Karel Zak <kzak@redhat.com>
http://karelzak.blogspot.com
* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
2015-03-20 11:18 ` Karel Zak
@ 2015-03-23 18:31 ` Peter Cordes
2015-03-24 14:05 ` Ronan CHAUVIN
0 siblings, 1 reply; 8+ messages in thread
From: Peter Cordes @ 2015-03-23 18:31 UTC (permalink / raw)
To: Karel Zak; +Cc: Ronan CHAUVIN, util-linux, matthieu CASTET, Alexandre Dilly
On Fri, Mar 20, 2015 at 12:18:12PM +0100, Karel Zak wrote:
> Conclusion: be pessimistic and verify everything you read from disk, be
> optimistic when you write to the disk, and when someone starts talking
> about write guarantees, run far away. That's the whole story.
The whole GPT is what, 16kiB or so? On most storage, you could
force data to persistent storage with a granularity of 4kiB, with
fdatasync(2) (assuming that works on block devices, not just files).
But some SSDs lie, and will claim that data is flushed to persistent
storage when it isn't. (According to one of Marc Merlin's BTRFS
talks).
So I'd agree with Karel that the current method is probably
ideal. write() everything, then fsync() so it all hits the disk in
one multi-sector write op. Not necessarily atomic, but probably.
If we think the backup partition table / GPT header is useful,
write(backup); fsync();
sleep(1sec);
write(primary); fsync();
is potentially worthwhile. On an SSD, there's the mapping metadata
separate from the actual data, and the write block size might be 8kiB
on some current disks. (This is why I'm thinking that the 1sec pause
between writing the backup and primary would give a chance for
whatever write-back caching layers to actually flush for real.)
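For illustration, the backup-then-primary sequence suggested above might look like the sketch below. It is only a sketch under assumptions: `struct region`, `write_all_at` and `write_backup_then_primary` are hypothetical names, not libfdisk functions, and `fsync()` can only ask the kernel and the device to flush. Whether the hardware honours that request is exactly the open question in this thread.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* One contiguous piece of on-disk data (hypothetical helper type). */
struct region {
    const void *buf;
    size_t len;
    off_t off;
};

/* Write the whole buffer at the given offset, retrying on short writes. */
static int write_all_at(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/*
 * Write and flush the backup copy, optionally pause, then write and
 * flush the primary copy. pause_sec = 0 skips the pause entirely.
 */
static int write_backup_then_primary(int fd, const struct region *backup,
                                     const struct region *primary,
                                     unsigned int pause_sec)
{
    if (write_all_at(fd, backup->buf, backup->len, backup->off) < 0)
        return -1;
    if (fsync(fd) < 0)      /* request a flush before touching the primary */
        return -1;
    if (pause_sec > 0)
        sleep(pause_sec);   /* speculative grace period for device caches */
    if (write_all_at(fd, primary->buf, primary->len, primary->off) < 0)
        return -1;
    return fsync(fd);
}
```

In a real GPT writer each copy consists of two regions (partition array and header), so each phase would write both before its `fsync()`.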
I don't know how likely that is to help on any real storage setup;
I'm really just making that up. I also don't know whether the backup
and primary are in separate 4kiB or 8kiB data blocks. Even if not, it
could still be useful to always be writing blocks where one of the two
copies written matches what's already there, so there's a valid table
whether the old or new version is there when you try to read it back.
So I think there's potentially a tiny benefit to a fsync();sleep(),
but I'd wait for confirmation from a storage expert before
implementing it. The current method probably just sends one write op
to the hardware for the whole GPT, which is nice.
--
#define X(x,y) x##y
Peter Cordes ; e-mail: X(peter@cor , des.ca)
"The gods confound the man who first found out how to distinguish the hours!
Confound him, too, who in this place set up a sundial, to cut and hack
my day so wretchedly into small pieces!" -- Plautus, 200 BC
* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
2015-03-20 10:17 [libfdisk]: gpt_write_disklabel function robustness to sudden power off Ronan CHAUVIN
2015-03-20 11:18 ` Karel Zak
@ 2015-03-24 3:24 ` Dale R. Worley
2015-03-24 13:54 ` Ronan CHAUVIN
1 sibling, 1 reply; 8+ messages in thread
From: Dale R. Worley @ 2015-03-24 3:24 UTC (permalink / raw)
To: Ronan CHAUVIN; +Cc: util-linux
Ronan CHAUVIN <ronan.chauvin@parrot.com> writes:
> I was wondering if the gpt_write_disklabel function was robust to sudden
> power-off.
gpt_write_disklabel can only be "robust to sudden power-off" if there is
some expectation that the partition tables, etc. can be in some sort of
"consistent" state if power is lost. But it seems to me that there is
no such consistent state -- what condition would you want that
information to be in? The old information is irretrievably lost during
the operation; the only problem that can be caused by sudden power-off
is that you have to enter the new information into fdisk again before
the disk has a valid structure.
Dale
* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
2015-03-24 3:24 ` Dale R. Worley
@ 2015-03-24 13:54 ` Ronan CHAUVIN
0 siblings, 0 replies; 8+ messages in thread
From: Ronan CHAUVIN @ 2015-03-24 13:54 UTC (permalink / raw)
To: Dale R. Worley; +Cc: util-linux, matthieu CASTET, Alexandre Dilly
Thank you for your answer.
If we are not resizing or adding a partition but just renaming one, we
want to be sure that at least one of the two copies (primary or backup GPT
header plus partition array) will not be corrupted by a sudden power-off.
If the write order is not guaranteed, then the primary and backup GPT
headers can be written to the eMMC without the corresponding partition
arrays, and the on-disk state will be inconsistent.
For example, if we have this effective write operation order on the disk:
1) backup GPT header
2) primary GPT header
--> power-off <--
3) primary partition tables
4) backup partition tables
5) protective MBR
then the partition array CRC stored in both GPT headers will be incorrect.
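The scenario above can be reproduced with a toy model. The sketch below is purely illustrative and makes several simplifying assumptions: each "header" holds nothing but a checksum of its partition array, a simple string hash stands in for CRC32, the protective MBR is left out, and all names (`toy_disk`, `toy_apply`, ...) are made up for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* The four regions that a rename operation rewrites. */
enum { PRI_HDR, PRI_ARR, BAK_HDR, BAK_ARR };

/* Toy disk: a header is just the checksum of its partition array. */
struct toy_disk {
    uint32_t pri_hdr, bak_hdr;
    char pri_arr[16], bak_arr[16];
};

/* Stand-in for CRC32: a simple polynomial string hash. */
static uint32_t toy_sum(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h = h * 31u + (unsigned char)*s++;
    return h;
}

/* Fill both copies consistently; names must be shorter than 16 bytes. */
static void toy_init(struct toy_disk *d, const char *name)
{
    snprintf(d->pri_arr, sizeof(d->pri_arr), "%s", name);
    snprintf(d->bak_arr, sizeof(d->bak_arr), "%s", name);
    d->pri_hdr = d->bak_hdr = toy_sum(name);
}

/* Apply only the first `completed` writes of `order`, then "lose power". */
static void toy_apply(struct toy_disk *d, const char *new_name,
                      const int *order, int completed)
{
    for (int i = 0; i < completed; i++) {
        switch (order[i]) {
        case PRI_HDR: d->pri_hdr = toy_sum(new_name); break;
        case BAK_HDR: d->bak_hdr = toy_sum(new_name); break;
        case PRI_ARR:
            snprintf(d->pri_arr, sizeof(d->pri_arr), "%s", new_name);
            break;
        case BAK_ARR:
            snprintf(d->bak_arr, sizeof(d->bak_arr), "%s", new_name);
            break;
        }
    }
}

/* Number of copies (0..2) whose header checksum matches its array. */
static int toy_valid_copies(const struct toy_disk *d)
{
    return (d->pri_hdr == toy_sum(d->pri_arr)) +
           (d->bak_hdr == toy_sum(d->bak_arr));
}
```

With the reordered sequence `{ BAK_HDR, PRI_HDR, PRI_ARR, BAK_ARR }` and power lost after two writes, `toy_valid_copies()` returns 0. With the intended sequence `{ BAK_ARR, BAK_HDR, PRI_ARR, PRI_HDR }`, at least one copy stays valid wherever the sequence is cut, provided the writes really reach the medium in that order.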
On 03/24/2015 04:24 AM, Dale R. Worley wrote:
> But it seems to me that there is
> no such consistent state -- what condition would you want that
> information to be in? The old information is irretrievably lost during
> the operation;
--
Ronan CHAUVIN
* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
2015-03-23 18:31 ` Peter Cordes
@ 2015-03-24 14:05 ` Ronan CHAUVIN
2015-03-24 14:25 ` Peter Cordes
0 siblings, 1 reply; 8+ messages in thread
From: Ronan CHAUVIN @ 2015-03-24 14:05 UTC (permalink / raw)
To: Peter Cordes, Karel Zak; +Cc: util-linux, matthieu CASTET, Alexandre Dilly
Thank you for your answer.
On 03/23/2015 07:31 PM, Peter Cordes wrote:
> On Fri, Mar 20, 2015 at 12:18:12PM +0100, Karel Zak wrote:
>> Conclusion: be pessimistic and verify everything you read from disk, be
>> optimistic when you write to the disk, and when someone starts talking
>> about write guarantees, run far away. That's the whole story.
> The whole GPT is what, 16kiB or so? On most storage, you could
> force data to persistent storage with a granularity of 4kiB, with
> fdatasync(2) (assuming that works on block devices, not just files).
The whole GPT is 16kiB (MBR+GPT header+partition array). There are two
GPT copies, one at the beginning of the disk and another one at the end.
The bootloader verifies the integrity of the header and the partition
array with a CRC32.
> But some SSDs lie, and will claim that data is flushed to persistent
> storage when it isn't. (According to one of Marc Merlin's BTRFS
> talks).
>
> So I'd agree with Karel that the current method is probably
> ideal. write() everything, then fsync() so it all hits the disk in
> one multi-sector write op. Not necessarily atomic, but probably.
As the blocks are not contiguous (primary and backup), they cannot be
written in a single write operation...
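To make the non-contiguity concrete, here is a small sketch that computes where the two copies conventionally sit. It assumes the usual layout (128 entries of 128 bytes each, primary array starting at LBA 2, backup array placed immediately before the backup header in the last sector). The authoritative locations are the LBA fields stored in each GPT header, and `gpt_layout` / `gpt_default_layout` are hypothetical names.

```c
#include <stdint.h>

/* Conventional on-disk positions of the two GPT copies, in sectors. */
struct gpt_layout {
    uint64_t primary_header_lba;
    uint64_t primary_array_lba;
    uint64_t backup_array_lba;
    uint64_t backup_header_lba;
    uint64_t array_sectors;
};

/* Assumes total_sectors is large enough to hold both copies. */
static struct gpt_layout gpt_default_layout(uint64_t total_sectors,
                                            uint32_t sector_size)
{
    /* 128 entries x 128 bytes = 16 KiB for the partition array */
    const uint64_t array_bytes = 128u * 128u;
    struct gpt_layout l;

    l.array_sectors = (array_bytes + sector_size - 1) / sector_size;
    l.primary_header_lba = 1;                  /* LBA 0 is the PMBR */
    l.primary_array_lba = 2;
    l.backup_header_lba = total_sectors - 1;   /* last sector of the disk */
    l.backup_array_lba = l.backup_header_lba - l.array_sectors;
    return l;
}
```

With 512-byte sectors the array occupies 32 sectors, so the primary copy sits in the first sectors of the device and the backup copy in the last ones, with the whole data area in between.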
> If we think the backup partition table / GPT header is useful,
> write(backup); fsync();
> sleep(1sec);
> write(primary); fsync();
> is potentially worthwhile. On an SSD, there's the mapping metadata
> separate from the actual data, and the write block size might be 8kiB
> on some current disks. (This is why I'm thinking that the 1sec pause
> between writing the backup and primary would give a chance for
> whatever write-back caching layers to actually flush for real.)
>
> I don't know how likely that is to help on any real storage setup;
> I'm really just making that up. I also don't know whether the backup
> and primary are in separate 4kiB or 8kiB data blocks. Even if not, it
> could still be useful to always be writing blocks where one of the two
> copies written matches what's already there, so there's a valid table
> whether the old or new version is there when you try to read it back.
>
> So I think there's potentially a tiny benefit to a fsync();sleep(),
> but I'd wait for confirmation from a storage expert before
> implementing it. The current method probably just sends one write op
> to the hardware for the whole GPT, which is nice.
I agree that we should wait for confirmation from a storage expert, but
the fsync() and sleep() combination should guarantee the operation order
on most hardware.
>
Best regards,
--
Ronan CHAUVIN
* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
2015-03-24 14:05 ` Ronan CHAUVIN
@ 2015-03-24 14:25 ` Peter Cordes
2015-03-26 13:07 ` Ronan CHAUVIN
0 siblings, 1 reply; 8+ messages in thread
From: Peter Cordes @ 2015-03-24 14:25 UTC (permalink / raw)
To: Ronan CHAUVIN; +Cc: Karel Zak, util-linux, matthieu CASTET, Alexandre Dilly
On Tue, Mar 24, 2015 at 03:05:36PM +0100, Ronan CHAUVIN wrote:
>
> On 03/23/2015 07:31 PM, Peter Cordes wrote:
>> On Fri, Mar 20, 2015 at 12:18:12PM +0100, Karel Zak wrote:
>>> Conclusion: be pessimistic and verify everything you read from disk, be
>>> optimistic when you write to the disk, and when someone starts talking
>>> about write guarantees, run far away. That's the whole story.
>> The whole GPT is what, 16kiB or so? On most storage, you could
>> force data to persistent storage with a granularity of 4kiB, with
>> fdatasync(2) (assuming that works on block devices, not just files).
> The whole GPT is 16kiB (MBR+GPT header+partition array). There are two
> GPT copies, one at the beginning of the disk and another one at the end.
> The bootloader verifies the integrity of the header and the partition
> array with a CRC32.
>> So I'd agree with Karel that the current method is probably
>> ideal. write() everything, then fsync() so it all hits the disk in
>> one multi-sector write op. Not necessarily atomic, but probably.
> As the blocks are not contiguous (primary and backup), they cannot be
> written in a single write operation...
So at least one of the four 4kiB sectors doesn't get written at all?
Because if all the sectors are getting written, regardless of order,
Linux will merge the IOs into one write request to send over the SATA
(or whatever) wire. Write request merging is useful even on SSDs, so
Linux does it.
Even if there is a sector that doesn't get written, it's probably
still academic. Sending a request in a single write OP doesn't make
it atomic. On a magnetic disk, the data will still probably all
hit the platter on the same rotation, just by powering down the write
head as it flies over the sector you aren't writing, so the window for
a power failure to cause a problem is quite small. I'm sure SSDs are
far more complicated.
> I agree that we should wait for confirmation from a storage expert, but
> the fsync() and sleep() combination should guarantee the operation order
> on most hardware.
Probably 1/10th of a second is long enough, but still short enough to
not be annoying. If you're editing the partition table of a disk
that isn't idle (in which case even 1 sec might not be long enough for
the write to hit disk after fdatasync()), and you don't have the
system on a UPS, I think we maybe don't need to waste 0.9 seconds of
everyone's time just for this hypothetical user.
* Re: [libfdisk]: gpt_write_disklabel function robustness to sudden power off
2015-03-24 14:25 ` Peter Cordes
@ 2015-03-26 13:07 ` Ronan CHAUVIN
0 siblings, 0 replies; 8+ messages in thread
From: Ronan CHAUVIN @ 2015-03-26 13:07 UTC (permalink / raw)
To: Peter Cordes; +Cc: Karel Zak, util-linux, matthieu CASTET, Alexandre Dilly
On 03/24/2015 03:25 PM, Peter Cordes wrote:
> On Tue, Mar 24, 2015 at 03:05:36PM +0100, Ronan CHAUVIN wrote:
>> On 03/23/2015 07:31 PM, Peter Cordes wrote:
>>> On Fri, Mar 20, 2015 at 12:18:12PM +0100, Karel Zak wrote:
>>>> Conclusion: be pessimistic and verify everything you read from disk, be
>>>> optimistic when you write to the disk, and when someone starts talking
>>>> about write guarantees, run far away. That's the whole story.
>>> The whole GPT is what, 16kiB or so? On most storage, you could
>>> force data to persistent storage with a granularity of 4kiB, with
>>> fdatasync(2) (assuming that works on block devices, not just files).
>> The whole GPT is 16kiB (MBR+GPT header+partition array). There are two
>> GPT copies, one at the beginning of the disk and another one at the end.
>> The bootloader verifies the integrity of the header and the partition
>> array with a CRC32.
>>> So I'd agree with Karel that the current method is probably
>>> ideal. write() everything, then fsync() so it all hits the disk in
>>> one multi-sector write op. Not necessarily atomic, but probably.
>> As the blocks are not contiguous (primary and backup), they cannot be
>> written in a single write operation...
> So at least one of the four 4kiB sectors doesn't get written at all?
> Because if all the sectors are getting written, regardless of order,
> Linux will merge the IOs into one write request to send over the SATA
> (or whatever) wire. Write request merging is useful even on SSDs, so
> Linux does it.
>
> Even if there is a sector that doesn't get written, it's probably
> still academic. Sending a request in a single write OP doesn't make
> it atomic. On a magnetic disk, the data will still probably all
> hit the platter on the same rotation, just by powering down the write
> head as it flies over the sector you aren't writing, so the window for
> a power failure to cause a problem is quite small. I'm sure SSDs are
> far more complicated.
What a write op guarantees clearly depends on the hardware... The
primary/backup mechanism and CRC checks are implemented to detect these
hardware failures.
>> I agree that we should wait for confirmation from a storage expert, but
>> the fsync() and sleep() combination should guarantee the operation order
>> on most hardware.
> Probably 1/10th of a second is long enough, but still short enough to
> not be annoying. If you're editing the partition table of a disk
> that isn't idle (in which case even 1 sec might not be long enough for
> the write to hit disk after fdatasync()), and you don't have the
> system on a UPS, I think we maybe don't need to waste 0.9 seconds of
> everyone's time just for this hypothetical user.
>
>
I agree that we don't need to waste 1 second of everyone's time.
Nevertheless, even just an fsync() between writing the backup and the
primary GPT copies would make it more likely that the data is actually
written to the disk (the disk cache will be flushed).
--
Ronan CHAUVIN