Re: [PATCH 0/3] mm: Swap checksum - Cesar Eduardo Barros


From: Cesar Eduardo Barros <cesarb@cesarb.net>
To: Minchan Kim <minchan.kim@gmail.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH 0/3] mm: Swap checksum
Date: Wed, 26 May 2010 07:21:57 -0300
Message-ID: <4BFCF645.2050400@cesarb.net>
In-Reply-To: <AANLkTikMTwzXt7-vQf9AG2VhwFIGs1jX-1uFoYAKSco7@mail.gmail.com>

On 25-05-2010 20:52, Minchan Kim wrote:
> On Mon, May 24, 2010 at 7:50 PM, Cesar Eduardo Barros <cesarb@cesarb.net> wrote:
>> And, in fact, there is a CRC code in the block layer; it is
>> CONFIG_BLK_DEV_INTEGRITY. However, it is not a generic solution; it needs
>> some extra prerequisites (like a disk low-level formatted with sectors
>> larger than 512 bytes).
>
> You mean BLK_DEV_INTEGRITY depends on the block device driver?
> If you want to support checksums for suspend, should we, in the end,
> put the checksum on disk?
>
> I mean, could we extend BLK_DEV_INTEGRITY into a more generic solution?
> As you said, in the swap case we don't need to put the checksum on disk.

CONFIG_BLK_DEV_INTEGRITY writes the checksum to the same sector as the 
data. However, for that to be possible, the sector size is increased on 
the disk itself, from 512 bytes to 520 bytes (and not all disks can do 
that). It is not a generic solution. It also, as far as I can see, does 
nothing against the disk simply failing to write and later returning 
stale data, since the stale checksum would match the stale data.
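
A minimal sketch of the lost-write case described above. It is only a toy
model: zlib.crc32 stands in for whatever checksum is really used, and the
class and function names are hypothetical, not kernel or
CONFIG_BLK_DEV_INTEGRITY interfaces. It shows that a checksum stored next to
the data still matches after a dropped write, while a checksum kept in memory
does not.

```python
import zlib

SECTOR_SIZE = 512  # assumed data size per sector in this toy model


class InlineChecksumDisk:
    """Toy disk that stores each sector's checksum together with its data."""

    def __init__(self, nsectors):
        blank = bytes(SECTOR_SIZE)
        self.sectors = [(blank, zlib.crc32(blank))] * nsectors
        self.drop_writes = False  # when True, writes are silently lost

    def write(self, n, data):
        if self.drop_writes:
            return  # the disk claims success but keeps the old sector
        self.sectors[n] = (data, zlib.crc32(data))

    def read(self, n):
        data, stored_crc = self.sectors[n]
        return data, zlib.crc32(data) == stored_crc


def lost_write_demo():
    disk = InlineChecksumDisk(4)
    in_memory_crc = {}  # checksums kept in RAM, as in the swap case

    old = b"A" * SECTOR_SIZE
    new = b"B" * SECTOR_SIZE

    disk.write(0, old)
    in_memory_crc[0] = zlib.crc32(old)

    disk.drop_writes = True  # the next write never reaches the media
    disk.write(0, new)
    in_memory_crc[0] = zlib.crc32(new)

    data, inline_ok = disk.read(0)
    memory_ok = zlib.crc32(data) == in_memory_crc[0]
    return inline_ok, memory_ok


inline_ok, memory_ok = lost_write_demo()
print("inline checksum matches:", inline_ok)
print("in-memory checksum matches:", memory_ok)
```

The inline check passes because the stale checksum describes the stale data;
only the checksum held outside the disk notices that the sector is out of date.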

See the LWN article [1] and the presentations [2] for more detail.

For suspend, the swap checksum pages would be saved together with the 
rest of the memory (they are in the memory, after all), and the suspend 
snapshot would have its own separate checksum (written directly to the 
disk after the image).
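
As a rough illustration of that layout (not the actual kernel/power/swap.c
format), the sketch below treats the swap checksum pages as ordinary memory
pages inside the image and appends a single checksum after it. The function
names, the 4-byte little-endian trailer, and the use of zlib.crc32 are all
assumptions made for the example.

```python
import struct
import zlib

PAGE_SIZE = 4096  # assumed page size


def write_snapshot(memory_pages):
    """Concatenate the pages and append one checksum after the image."""
    image = b"".join(memory_pages)
    trailer = struct.pack("<I", zlib.crc32(image))
    return image + trailer


def read_snapshot(blob):
    """Verify the trailing checksum and return the image without it."""
    image, trailer = blob[:-4], blob[-4:]
    (expected,) = struct.unpack("<I", trailer)
    if zlib.crc32(image) != expected:
        raise ValueError("snapshot checksum mismatch")
    return image


# The swap checksum table is just another page of memory in this model.
ordinary_page = b"\x01" * PAGE_SIZE
swap_checksum_page = struct.pack("<I", zlib.crc32(b"swapped-out data")).ljust(
    PAGE_SIZE, b"\x00"
)

blob = write_snapshot([ordinary_page, swap_checksum_page])
restored = read_snapshot(blob)
```

Nothing extra has to be written for the swap checksums: they travel inside the
image, and the trailer covers them along with everything else.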

> In the swap case, let it put the checksum in memory. In the non-swap
> case, let it put the checksum on disk.
> I am not sure it's possible.
>
> When we have an unreliable disk, your point is to solve it with
> (btrfs + swap), both of which support checksums. My point is to
> solve it with (any file system + swap) placed on a block
> layer which supports checksums.

A generic "checksumming block device" would be less efficient.

For the swap case, such a device cannot exploit the fact that swap's state 
tracking lives within the swapfile code. Avi Kivity's idea of storing the checksum in 
otherwise wasted bits of the pte is an example of how this could be 
exploited in the future. In fact, the reason I did it on the swap layer 
(instead of interposing something in the block layer) was precisely to 
make it easier to enhance the state tracking in the future (and also 
because it felt the most natural layer to do it).

It would also complicate adding checksums to the software suspend 
snapshot. While normally you do not want to write the swap checksums to 
the disk, you do want to write them when saving the memory snapshot - 
which is written to the same block device. However, the checksums for 
the rest of the swap pages are already being saved as part of the memory 
snapshot (since the checksums were in the memory).

For the generic ("any file system") case, it is worse, since you 
actually have to write the checksum to the disk, and unlike in the 
software suspend case you cannot simply write them all in one pass at 
the end. In the worst case, you would have to write twice for each 
sector/page - once for the data, and once for the checksums 
(CONFIG_BLK_DEV_INTEGRITY completely avoids this issue since with it the 
checksum is together with the data in the same sector). Not to mention 
fun things like write amplification.
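
A back-of-the-envelope sketch of the write counts being compared here. The
numbers are assumptions for illustration only: one 4-byte checksum per
4096-byte page, so 1024 checksums fit in one checksum block, and the function
names are hypothetical.

```python
PAGE_SIZE = 4096
CRC_SIZE = 4
CRCS_PER_BLOCK = PAGE_SIZE // CRC_SIZE  # 1024 checksums per checksum block


def writes_inline(n_pages):
    """Checksum travels in the same sector as the data: one write per page."""
    return n_pages


def writes_separate_worst_case(n_pages):
    """Checksum lives elsewhere and must be updated with every page write."""
    return 2 * n_pages


def writes_batched_at_end(n_pages):
    """Checksums written in one pass at the end, as in the suspend case."""
    checksum_blocks = -(-n_pages // CRCS_PER_BLOCK)  # ceiling division
    return n_pages + checksum_blocks


for n in (1000, 2048):
    print(n, writes_inline(n), writes_separate_worst_case(n),
          writes_batched_at_end(n))
```

Under these assumptions, the batched case costs about 0.1% extra writes, while
the worst case for a generic checksumming device doubles them.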

A filesystem with data checksums can write the checksum as part of its 
normal metadata updates (which it already has to do anyway).

A generic "checksumming block device" could be a way of "updating" a 
filesystem without checksums (or with only metadata checksums) to have 
them. However, I believe it would be more productive to add them 
directly to the filesystem itself, all the more so since the only way I can 
see of doing it efficiently in a generic block layer is by using lots of 
filesystem-style tricks (things like a log-structured list of CRC 
values, dividing the device into "block groups" to keep the checksum close 
to the data, and so on).
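
To make the "block groups" trick concrete, here is one possible layout,
sketched under assumed parameters: 4096-byte blocks, 4-byte checksums, and
each group consisting of one checksum block followed by the 1024 data blocks
it covers. This is a hypothetical layout for illustration, not an existing
device-mapper target or filesystem format.

```python
from typing import NamedTuple

BLOCK_SIZE = 4096
CRC_SIZE = 4
DATA_PER_GROUP = BLOCK_SIZE // CRC_SIZE  # data blocks covered by one checksum block


class Location(NamedTuple):
    data_block: int       # physical block holding the data
    checksum_block: int   # physical block holding this group's checksums
    checksum_offset: int  # byte offset of the checksum inside that block


def locate(logical_block):
    """Map a logical data block to its physical data and checksum locations."""
    if logical_block < 0:
        raise ValueError("logical block must be non-negative")
    group, index = divmod(logical_block, DATA_PER_GROUP)
    group_start = group * (DATA_PER_GROUP + 1)  # +1 for the checksum block
    return Location(
        data_block=group_start + 1 + index,
        checksum_block=group_start,
        checksum_offset=index * CRC_SIZE,
    )


print(locate(0))
print(locate(1023))
print(locate(1024))
```

With this layout the checksum is never more than 1024 blocks away from its
data, but every logical-to-physical translation has to account for the
interleaved checksum blocks, which is exactly the kind of bookkeeping a
filesystem already does.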

[1] Block layer: integrity checking and lots of partitions
     http://lwn.net/Articles/290141/
[2] http://oss.oracle.com/projects/data-integrity/documentation/

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

