From: Cesar Eduardo Barros <cesarb@cesarb.net>
To: Minchan Kim <minchan.kim@gmail.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH 0/3] mm: Swap checksum
Date: Wed, 26 May 2010 07:21:57 -0300
Message-ID: <4BFCF645.2050400@cesarb.net>
In-Reply-To: <AANLkTikMTwzXt7-vQf9AG2VhwFIGs1jX-1uFoYAKSco7@mail.gmail.com>
On 25-05-2010 20:52, Minchan Kim wrote:
> On Mon, May 24, 2010 at 7:50 PM, Cesar Eduardo Barros<cesarb@cesarb.net> wrote:
>> And, in fact, there is a CRC code in the block layer; it is
>> CONFIG_BLK_DEV_INTEGRITY. However, it is not a generic solution; it needs
>> some extra prerequisites (like a disk low-level formatted with sectors with
>> more than 512 bytes).
>
> You mean BLK_DEV_INTEGRITY has a dependency with block device driver?
> If you want to support checksum into suspend, At last, should we put
> the checksum on disk?
>
> I mean could we extend BLK_DEV_INTEGRITY by more generic solution?
> As you said, in case of swap, we don't need to put checksum on disk.

CONFIG_BLK_DEV_INTEGRITY writes the checksum to the same sector as the
data. For that to be possible, the sector size on the disk itself is
increased from 512 bytes to 520 bytes, and not all disks can do that, so
it is not a generic solution. Also, as far as I can see, it does nothing
against the disk silently failing to write and later returning stale
data, since the stale checksum would still match the stale data.

See the LWN article [1] and the presentations [2] for more detail.
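
A minimal sketch of the stale-data case described above, assuming a toy
disk model and CRC32 as the checksum. The `Disk` class, `swap_out`, and
the in-memory table are hypothetical names for illustration, not kernel
code or the patch's implementation:

```python
import zlib

SECTOR = 512  # payload size; the toy disk stores the checksum alongside it


class Disk:
    """Toy disk that keeps (data, checksum) together per sector and can
    be told to silently drop one write."""

    def __init__(self):
        self.sectors = {}
        self.drop_next_write = False

    def write(self, lba, data):
        if self.drop_next_write:
            self.drop_next_write = False
            return  # lost write: the old data and old checksum remain
        self.sectors[lba] = (data, zlib.crc32(data))

    def read(self, lba):
        return self.sectors[lba]


disk = Disk()
in_memory_crc = {}  # checksum kept in RAM, never written to the disk


def swap_out(lba, data):
    in_memory_crc[lba] = zlib.crc32(data)
    disk.write(lba, data)


old = b"A" * SECTOR
new = b"B" * SECTOR
swap_out(7, old)
disk.drop_next_write = True
swap_out(7, new)  # the disk silently ignores this write

data, on_disk_crc = disk.read(7)
# The checksum stored next to the data is as stale as the data itself.
on_disk_check_passes = zlib.crc32(data) == on_disk_crc
# The checksum kept in memory describes the page that should be there.
in_memory_check_passes = zlib.crc32(data) == in_memory_crc[7]
```

In this model the on-disk check passes even though the read returns the
old page, while the comparison against the in-memory checksum fails.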

For suspend, the swap checksum pages would be saved together with the
rest of the memory (they are in memory, after all), and the suspend
snapshot would have its own separate checksum (written directly to the
disk after the image).

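
A rough sketch of a snapshot followed by its own checksum, assuming a
CRC32 trailer in little-endian form. The function names and the trailer
layout are illustrative assumptions, not the format used by
kernel/power/swap.c:

```python
import struct
import zlib


def write_snapshot(memory_image: bytes) -> bytes:
    """Return the bytes written to disk: the image, then its checksum."""
    return memory_image + struct.pack("<I", zlib.crc32(memory_image))


def read_snapshot(blob: bytes) -> bytes:
    """Split off the 4-byte trailer and verify the image against it."""
    image, trailer = blob[:-4], blob[-4:]
    (expected,) = struct.unpack("<I", trailer)
    if zlib.crc32(image) != expected:
        raise ValueError("snapshot checksum mismatch")
    return image
```

Because the swap checksums are ordinary memory contents, they would be
part of `memory_image` here and need no separate handling.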
> If swap case, let it put the one on memory. If non-swap case, let it
> put checksum on disk,
> I am not sure it's possible.
>
> When we have a unreliable disk, your point is that let's solve it with
> (btrfs + swap) which both supports checksum. And my point is that
> let's solve it with (any file system + swap) which is put on block
> layer which supports checksum.

A generic "checksumming block device" would be less efficient.

For the swap case, a block-layer solution cannot exploit the fact that
the state tracking lives in the swapfile code. Avi Kivity's idea of
storing the checksum in otherwise wasted bits of the pte is an example
of how this could be exploited in the future. In fact, the reason I did
it in the swap layer (instead of interposing something in the block
layer) was precisely to make it easier to enhance the state tracking in
the future (and also because it felt like the most natural layer for
it).

It would also complicate adding checksums to the software suspend
snapshot. While normally you do not want to write the swap checksums to
the disk, you do want to write them when saving the memory snapshot,
which is written to the same block device. However, the checksums for
the rest of the swap pages are already being saved as part of the memory
snapshot (since the checksums were in memory).

For the generic ("any file system") case, it is worse, since you
actually have to write the checksum to the disk, and unlike in the
software suspend case you cannot simply write them all in one pass at
the end. In the worst case, you would have to write twice for each
sector/page: once for the data, and once for the checksum
(CONFIG_BLK_DEV_INTEGRITY avoids this issue completely, since it keeps
the checksum together with the data in the same sector). Not to mention
fun things like write amplification.
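
To put rough numbers on the write counts, here is a small sketch
assuming 4 KiB pages and a 4-byte checksum per page (both assumptions),
so that one checksum block covers 1024 pages. The function names are
hypothetical:

```python
PAGE = 4096
CRC_BYTES = 4
CRCS_PER_BLOCK = PAGE // CRC_BYTES  # 1024 checksums fit in one block


def device_writes_inline(pages_written):
    """Checksum travels in the same sector as the data: one write each."""
    return len(pages_written)


def device_writes_separate(pages_written):
    """Worst case for a separate checksum area: every data write is
    followed by a rewrite of the block holding its checksum."""
    return 2 * len(pages_written)


def device_writes_batched(pages_written):
    """Separate checksum area flushed once at the end, as is possible
    for a suspend image: one extra write per touched checksum block."""
    crc_blocks = {page // CRCS_PER_BLOCK for page in pages_written}
    return len(pages_written) + len(crc_blocks)
```

Under these assumptions, 1024 consecutive pages cost 1024 writes
inline, 1025 when the checksums can be batched, and 2048 in the worst
case where each page write forces its checksum block out as well.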

A filesystem with data checksums can write the checksum as part of its
normal metadata updates (which it already has to do anyway).

A generic "checksumming block device" could be a way of "updating" a
filesystem without checksums (or with only metadata checksums) to have
them. However, I believe it would be more productive to add them
directly to the filesystem itself. Even more so, since the only way I
can see of doing it efficiently in a generic block layer is by using
lots of filesystem-style tricks (things like a log-structured list of
CRC values, dividing the device into "block groups" to keep the checksum
close to the data, and so on).
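
One possible "block group" layout, sketched only to show the kind of
address arithmetic such a device would need. It assumes 4 KiB blocks,
4-byte checksums, and one checksum block at the start of each group;
all names and sizes are hypothetical:

```python
BLOCK = 4096
CRC_BYTES = 4
GROUP_DATA_BLOCKS = BLOCK // CRC_BYTES  # one checksum block covers 1024 data blocks
GROUP_TOTAL_BLOCKS = GROUP_DATA_BLOCKS + 1  # checksum block + its data blocks


def locate(logical_block):
    """Map a logical data block to (physical data block, physical
    checksum block, byte offset of its checksum in that block)."""
    group, index = divmod(logical_block, GROUP_DATA_BLOCKS)
    group_start = group * GROUP_TOTAL_BLOCKS
    crc_block = group_start  # first block of the group holds the checksums
    data_block = group_start + 1 + index
    return data_block, crc_block, index * CRC_BYTES
```

With this layout the checksum for a block is never more than 1024
blocks away from its data, at the cost of remapping every block address.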
[1] Block layer: integrity checking and lots of partitions
http://lwn.net/Articles/290141/
[2] http://oss.oracle.com/projects/data-integrity/documentation/
--
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com