[PATCH 0/3] mm: Swap checksum

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] mm: Swap checksum
@ 2010-05-22 18:08 Cesar Eduardo Barros
  2010-05-22 18:08 ` [PATCH 1/3] mm/swapfile.c: better messages for swap_info_get Cesar Eduardo Barros
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-22 18:08 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Add support for checksumming the swap pages written to disk, using the
same checksum as btrfs (crc32c). Since the contents of the swap do not
matter after a shutdown, the checksum is kept in memory only.

Note that this code does not checksum the software suspend image.

Cesar Eduardo Barros (3):
       mm/swapfile.c: better messages for swap_info_get
       kernel/power/swap.c: do not use end_swap_bio_read
       mm: Swap checksum

  include/linux/swap.h |   31 +++++++-
  kernel/power/swap.c  |   21 +++++-
  mm/Kconfig           |   22 +++++
  mm/Makefile          |    1 +
  mm/page_io.c         |   92 ++++++++++++++++++--
  mm/swapcsum.c        |   94 +++++++++++++++++++++
  mm/swapfile.c        |  186 ++++++++++++++++++++++++++++++++++++++++--
  7 files changed, 429 insertions(+), 18 deletions(-)
  create mode 100644 mm/swapcsum.c

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/3] mm/swapfile.c: better messages for swap_info_get
  2010-05-22 18:08 [PATCH 0/3] mm: Swap checksum Cesar Eduardo Barros
@ 2010-05-22 18:08 ` Cesar Eduardo Barros
  2010-05-22 18:13   ` Borislav Petkov
  2010-05-22 18:08 ` [PATCH 2/3] kernel/power/swap.c: do not use end_swap_bio_read Cesar Eduardo Barros
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-22 18:08 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Cesar Eduardo Barros

swap_info_get() is used for more than swap_free().

Use "swap_info_get:" instead of "swap_free:" in the error messages.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
---
 mm/swapfile.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6cd0a8f..af7d499 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -522,16 +522,16 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return p;
 
 bad_free:
-	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
+	printk(KERN_ERR "swap_info_get: %s%08lx\n", Unused_offset, entry.val);
 	goto out;
 bad_offset:
-	printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
+	printk(KERN_ERR "swap_info_get: %s%08lx\n", Bad_offset, entry.val);
 	goto out;
 bad_device:
-	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_file, entry.val);
+	printk(KERN_ERR "swap_info_get: %s%08lx\n", Unused_file, entry.val);
 	goto out;
 bad_nofile:
-	printk(KERN_ERR "swap_free: %s%08lx\n", Bad_file, entry.val);
+	printk(KERN_ERR "swap_info_get: %s%08lx\n", Bad_file, entry.val);
 out:
 	return NULL;
 }
-- 
1.6.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/3] kernel/power/swap.c: do not use end_swap_bio_read
  2010-05-22 18:08 [PATCH 0/3] mm: Swap checksum Cesar Eduardo Barros
  2010-05-22 18:08 ` [PATCH 1/3] mm/swapfile.c: better messages for swap_info_get Cesar Eduardo Barros
@ 2010-05-22 18:08 ` Cesar Eduardo Barros
  2010-05-22 18:08 ` [PATCH 3/3] mm: Swap checksum Cesar Eduardo Barros
  2010-05-23 14:03 ` [PATCH 0/3] " Minchan Kim
  3 siblings, 0 replies; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-22 18:08 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Cesar Eduardo Barros

The swap checksum patches will change end_swap_bio_read to also verify
the page's checksum. This is not compatible with its use at submit()
from kernel/power/swap.c.

Make kernel/power/swap.c use a private copy of end_swap_bio_read, and
modify it to not say "Read-error" if the error was on a write.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
---
 include/linux/swap.h |    1 -
 kernel/power/swap.c  |   21 ++++++++++++++++++++-
 mm/page_io.c         |    2 +-
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..86a0d64 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -291,7 +291,6 @@ extern void swap_unplug_io_fn(struct backing_dev_info *, struct page *);
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct page *);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-extern void end_swap_bio_read(struct bio *bio, int err);
 
 /* linux/mm/swap_state.c */
 extern struct address_space swapper_space;
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 66824d7..7305a3f 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -147,6 +147,25 @@ int swsusp_swap_in_use(void)
 static unsigned short root_swap = 0xffff;
 static struct block_device *resume_bdev;
 
+static void end_swap_bio(struct bio *bio, int err)
+{
+	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct page *page = bio->bi_io_vec[0].bv_page;
+
+	if (!uptodate) {
+		SetPageError(page);
+		ClearPageUptodate(page);
+		printk(KERN_ALERT "%s-error on swap-device (%u:%u:%Lu)\n",
+				bio->bi_rw & BIO_RW ? "Write" : "Read",
+				imajor(bio->bi_bdev->bd_inode),
+				iminor(bio->bi_bdev->bd_inode),
+				(unsigned long long)bio->bi_sector);
+	} else {
+		SetPageUptodate(page);
+	}
+	unlock_page(page);
+	bio_put(bio);
+}
 /**
  *	submit - submit BIO request.
  *	@rw:	READ or WRITE.
@@ -167,7 +186,7 @@ static int submit(int rw, pgoff_t page_off, struct page *page,
 	bio = bio_alloc(__GFP_WAIT | __GFP_HIGH, 1);
 	bio->bi_sector = page_off * (PAGE_SIZE >> 9);
 	bio->bi_bdev = resume_bdev;
-	bio->bi_end_io = end_swap_bio_read;
+	bio->bi_end_io = end_swap_bio;
 
 	if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) {
 		printk(KERN_ERR "PM: Adding page to bio failed at %ld\n",
diff --git a/mm/page_io.c b/mm/page_io.c
index 31a3b96..0e2d4e8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -66,7 +66,7 @@ static void end_swap_bio_write(struct bio *bio, int err)
 	bio_put(bio);
 }
 
-void end_swap_bio_read(struct bio *bio, int err)
+static void end_swap_bio_read(struct bio *bio, int err)
 {
 	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
 	struct page *page = bio->bi_io_vec[0].bv_page;
-- 
1.6.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/3] mm: Swap checksum
  2010-05-22 18:08 [PATCH 0/3] mm: Swap checksum Cesar Eduardo Barros
  2010-05-22 18:08 ` [PATCH 1/3] mm/swapfile.c: better messages for swap_info_get Cesar Eduardo Barros
  2010-05-22 18:08 ` [PATCH 2/3] kernel/power/swap.c: do not use end_swap_bio_read Cesar Eduardo Barros
@ 2010-05-22 18:08 ` Cesar Eduardo Barros
  2010-05-23 15:19   ` Avi Kivity
  2010-05-23 14:03 ` [PATCH 0/3] " Minchan Kim
  3 siblings, 1 reply; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-22 18:08 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Cesar Eduardo Barros

Add support for checksumming the swap pages written to disk, using the
same checksum as btrfs (crc32c). Since the contents of the swap do not
matter after a shutdown, the checksum is kept in memory only.

Note that this code does not checksum the software suspend image.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
---
 include/linux/swap.h |   30 +++++++++
 mm/Kconfig           |   22 ++++++
 mm/Makefile          |    1 +
 mm/page_io.c         |   90 +++++++++++++++++++++++---
 mm/swapcsum.c        |   94 ++++++++++++++++++++++++++
 mm/swapfile.c        |  178 +++++++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 404 insertions(+), 11 deletions(-)
 create mode 100644 mm/swapcsum.c

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 86a0d64..92b24d4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -15,6 +15,9 @@
 struct notifier_block;
 
 struct bio;
+struct bio_vec;
+
+struct workqueue_struct;
 
 #define SWAP_FLAG_PREFER	0x8000	/* set if swap priority specified */
 #define SWAP_FLAG_PRIO_MASK	0x7fff
@@ -180,6 +183,10 @@ struct swap_info_struct {
 	struct swap_extent *curr_swap_extent;
 	struct swap_extent first_swap_extent;
 	struct block_device *bdev;	/* swap device or bdev of swap file */
+#ifdef CONFIG_SWAP_CHECKSUM
+	unsigned short *csum_count;	/* usage count of a csum page */
+	u32 **csum;			/* vmalloc'ed array of swap csums */
+#endif
 	struct file *swap_file;		/* seldom referenced */
 	unsigned int old_block_size;	/* seldom referenced */
 };
@@ -369,6 +376,29 @@ static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
 }
 #endif
 
+#ifdef CONFIG_SWAP_CHECKSUM
+/* linux/mm/swapfile.c */
+extern int swap_csum_set(swp_entry_t entry, u32 crc);
+extern int swap_csum_get(swp_entry_t entry, u32 *crc);
+
+/* linux/mm/swapcsum.c */
+extern bool noswapcsum __read_mostly;
+extern bool swap_csum_verify(struct page *page);
+extern struct workqueue_struct *swapcsum_workqueue;
+#else
+#define noswapcsum true
+#endif
+
+/* linux/mm/swapcsum.c */
+extern int _swap_csum_write(struct page *page);
+
+static inline int swap_csum_write(struct page *page)
+{
+	if (noswapcsum)
+		return 0;
+	return _swap_csum_write(page);
+}
+
 #else /* CONFIG_SWAP */
 
 #define nr_swap_pages				0L
diff --git a/mm/Kconfig b/mm/Kconfig
index 9c61158..890faf4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -287,3 +287,25 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  of 1 says that all excess pages should be trimmed.
 
 	  See Documentation/nommu-mmap.txt for more information.
+
+config SWAP_CHECKSUM
+	bool "Swap checksum"
+	depends on SWAP && EXPERIMENTAL
+	select LIBCRC32C
+	default n
+	help
+	  This option enables checksumming of swap pages when saved to disk.
+
+	  Use the kernel command line options "swapcsum" to enable and
+	  "noswapcsum" to disable. The default value is configurable.
+
+	  Note that this option does not checksum the software suspend image.
+
+config SWAP_CHECKSUM_DEFAULT
+	bool "Enable swap checksum by default"
+	depends on SWAP_CHECKSUM
+	default y
+	help
+	  You can use the kernel command line options "swapcsum" to enable and
+	  "noswapcsum" to disable swap checksumming. This option controls the
+	  default value.
diff --git a/mm/Makefile b/mm/Makefile
index 6c2a73a..677bc43 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,6 +17,7 @@ obj-y += init-mm.o
 
 obj-$(CONFIG_BOUNCE)	+= bounce.o
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SWAP_CHECKSUM)	+= swapcsum.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
diff --git a/mm/page_io.c b/mm/page_io.c
index 0e2d4e8..ed0a856 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -18,6 +18,8 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/workqueue.h>
+#include <linux/slab.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags,
@@ -66,22 +68,71 @@ static void end_swap_bio_write(struct bio *bio, int err)
 	bio_put(bio);
 }
 
+static void end_swap_page_read_error(struct page *page)
+{
+	SetPageError(page);
+	ClearPageUptodate(page);
+	unlock_page(page);
+}
+
+static void end_swap_page_read(struct page *page)
+{
+	SetPageUptodate(page);
+	unlock_page(page);
+}
+
+struct swap_readpage_csum_work {
+	struct work_struct work;
+	struct page *page;
+};
+
+#ifdef CONFIG_SWAP_CHECKSUM
+static void swap_readpage_csum_work_func(struct work_struct *work)
+{
+	struct swap_readpage_csum_work *csum_work =
+		container_of(work, struct swap_readpage_csum_work, work);
+	struct page *page = csum_work->page;
+
+	kfree(csum_work);
+
+	if (unlikely(!swap_csum_verify(page)))
+		end_swap_page_read_error(page);
+	else
+		end_swap_page_read(page);
+}
+
+static void swap_readpage_queue_csum_work(struct page *page, void *bi_private)
+{
+	struct swap_readpage_csum_work *csum_work = bi_private;
+
+	INIT_WORK(&csum_work->work, swap_readpage_csum_work_func);
+	csum_work->page = page;
+	queue_work(swapcsum_workqueue, &csum_work->work);
+}
+#else
+/* The call to this function should be optimized out. */
+extern void swap_readpage_queue_csum_work(struct page *page, void *bi_private);
+#endif
+
 static void end_swap_bio_read(struct bio *bio, int err)
 {
 	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
 	struct page *page = bio->bi_io_vec[0].bv_page;
 
 	if (!uptodate) {
-		SetPageError(page);
-		ClearPageUptodate(page);
+		if (!noswapcsum)
+			kfree(bio->bi_private);
+		end_swap_page_read_error(page);
 		printk(KERN_ALERT "Read-error on swap-device (%u:%u:%Lu)\n",
 				imajor(bio->bi_bdev->bd_inode),
 				iminor(bio->bi_bdev->bd_inode),
 				(unsigned long long)bio->bi_sector);
 	} else {
-		SetPageUptodate(page);
+		if (noswapcsum)
+			end_swap_page_read(page);
+		else
+			swap_readpage_queue_csum_work(page, bio->bi_private);
 	}
-	unlock_page(page);
 	bio_put(bio);
 }
 
@@ -100,11 +151,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	}
 	bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
 	if (bio == NULL) {
-		set_page_dirty(page);
-		unlock_page(page);
 		ret = -ENOMEM;
-		goto out;
+		goto out_error;
 	}
+	ret = swap_csum_write(page);
+	if (unlikely(ret))
+		goto out_error_put;
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG);
 	count_vm_event(PSWPOUT);
@@ -113,6 +165,13 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	submit_bio(rw, bio);
 out:
 	return ret;
+
+out_error_put:
+	bio_put(bio);
+out_error:
+	set_page_dirty(page);
+	unlock_page(page);
+	goto out;
 }
 
 int swap_readpage(struct page *page)
@@ -124,12 +183,25 @@ int swap_readpage(struct page *page)
 	VM_BUG_ON(PageUptodate(page));
 	bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
 	if (bio == NULL) {
-		unlock_page(page);
 		ret = -ENOMEM;
-		goto out;
+		goto out_error;
+	}
+	if (!noswapcsum) {
+		bio->bi_private = kmalloc(
+			sizeof(struct swap_readpage_csum_work), GFP_KERNEL);
+		if (unlikely(!bio->bi_private)) {
+			ret = -ENOMEM;
+			goto out_error_put;
+		}
 	}
 	count_vm_event(PSWPIN);
 	submit_bio(READ, bio);
 out:
 	return ret;
+
+out_error_put:
+	bio_put(bio);
+out_error:
+	unlock_page(page);
+	goto out;
 }
diff --git a/mm/swapcsum.c b/mm/swapcsum.c
new file mode 100644
index 0000000..98ba97d
--- /dev/null
+++ b/mm/swapcsum.c
@@ -0,0 +1,94 @@
+#include <linux/crc32c.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/workqueue.h>
+
+#ifdef CONFIG_SWAP_CHECKSUM_DEFAULT
+#define NOSWAPCSUM_DEFAULT false
+#else
+#define NOSWAPCSUM_DEFAULT true
+#endif
+
+bool noswapcsum __read_mostly = NOSWAPCSUM_DEFAULT;
+
+static int __init swap_csum_enable(char *s)
+{
+	noswapcsum = false;
+	return 1;
+}
+__setup("swapcsum", swap_csum_enable);
+
+static int __init swap_csum_disable(char *s)
+{
+	noswapcsum = true;
+	return 1;
+}
+__setup("noswapcsum", swap_csum_disable);
+
+static u32 swap_csum_page(struct page *page)
+{
+	void *address;
+	u32 crc;
+
+	address = kmap_atomic(page, KM_USER0);
+	crc = ~crc32c(~(u32)0, address, PAGE_SIZE);
+	kunmap_atomic(address, KM_USER0);
+	return crc;
+}
+
+int _swap_csum_write(struct page *page)
+{
+	swp_entry_t entry;
+
+	VM_BUG_ON(!PageSwapCache(page));
+
+	entry.val = page_private(page);
+	return swap_csum_set(entry, swap_csum_page(page));
+}
+
+bool swap_csum_verify(struct page *page)
+{
+	swp_entry_t entry;
+	u32 crc, old_crc;
+
+	VM_BUG_ON(!PageSwapCache(page));
+
+	entry.val = page_private(page);
+
+	if (unlikely(swap_csum_get(entry, &old_crc))) {
+		printk(KERN_ALERT "Missing swap checksum for page "
+			"type %u offset %lu\n",
+			swp_type(entry), swp_offset(entry));
+		WARN_ON(true);
+		return false;
+	}
+
+	crc = swap_csum_page(page);
+	if (unlikely(crc != old_crc)) {
+		printk(KERN_ALERT "Wrong swap checksum for page "
+			"type %u offset %lu (0x%08x != 0x%08x)\n",
+			swp_type(entry), swp_offset(entry),
+			(unsigned)crc, (unsigned)old_crc);
+		return false;
+	}
+
+	return true;
+}
+
+struct workqueue_struct *swapcsum_workqueue;
+
+/* TODO: create the workqueue on swapon, destroy the workqueue on swapoff */
+static int __init swap_csum_init(void)
+{
+	if (noswapcsum)
+		return 0;
+
+	swapcsum_workqueue = create_workqueue("swapcsum");
+	BUG_ON(!swapcsum_workqueue);
+	return 0;
+}
+module_init(swap_csum_init)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index af7d499..50d1cce 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -63,6 +63,50 @@ static inline unsigned char swap_count(unsigned char ent)
 	return ent & ~SWAP_HAS_CACHE;	/* may include SWAP_HAS_CONT flag */
 }
 
+#ifdef CONFIG_SWAP_CHECKSUM
+/*
+ * The swap checksums are stored in checksum pages, with CSUMS_PER_PAGE
+ * checksums per page. The checksum pages are allocated on the first
+ * write, and freed when none of the pages with checksums on that
+ * checksum page is in use anymore.
+ *
+ * To simplify the freeing of the checksum pages, si->csum_count has a
+ * count of the in-use pages corresponding to that checksum page. For
+ * the purpose of this count, pages with any count other than 0 or
+ * SWAP_MAP_BAD are in use.
+ */
+
+#define CSUMS_PER_PAGE (PAGE_SIZE / sizeof(u32))
+
+static void __swap_csum_count_inc(struct swap_info_struct *si,
+					unsigned long offset)
+{
+	if (noswapcsum)
+		return;
+
+	++si->csum_count[offset / CSUMS_PER_PAGE];
+}
+
+static void __swap_csum_count_dec(struct swap_info_struct *si,
+					unsigned long offset)
+{
+	if (noswapcsum)
+		return;
+
+	BUG_ON(!si->csum_count[offset / CSUMS_PER_PAGE]);
+
+	if (!--si->csum_count[offset / CSUMS_PER_PAGE]) {
+		free_page((unsigned long)si->csum[offset / CSUMS_PER_PAGE]);
+		si->csum[offset / CSUMS_PER_PAGE] = NULL;
+	}
+}
+#else
+static inline void __swap_csum_count_inc(struct swap_info_struct *si,
+					unsigned long offset) { }
+static inline void __swap_csum_count_dec(struct swap_info_struct *si,
+					unsigned long offset) { }
+#endif
+
 /* returns 1 if swap entry is freed */
 static int
 __try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset)
@@ -340,6 +384,7 @@ checks:
 		si->highest_bit = 0;
 	}
 	si->swap_map[offset] = usage;
+	__swap_csum_count_inc(si, offset);
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
 
@@ -500,7 +545,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	return (swp_entry_t) {0};
 }
 
-static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *swap_info_get_unlocked(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
@@ -518,7 +563,6 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 		goto bad_offset;
 	if (!p->swap_map[offset])
 		goto bad_free;
-	spin_lock(&swap_lock);
 	return p;
 
 bad_free:
@@ -536,6 +580,14 @@ out:
 	return NULL;
 }
 
+static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+{
+	struct swap_info_struct *p = swap_info_get_unlocked(entry);
+	if (likely(p))
+		spin_lock(&swap_lock);
+	return p;
+}
+
 static unsigned char swap_entry_free(struct swap_info_struct *p,
 				     swp_entry_t entry, unsigned char usage)
 {
@@ -574,6 +626,8 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 
 	/* free if no reference */
 	if (!usage) {
+		__swap_csum_count_dec(p, offset);
+
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
 		if (offset > p->highest_bit)
@@ -1525,6 +1579,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
 	unsigned char *swap_map;
+#ifdef CONFIG_SWAP_CHECKSUM
+	unsigned short *csum_count;
+	u32 **csum;
+#endif
 	struct file *swap_file, *victim;
 	struct address_space *mapping;
 	struct inode *inode;
@@ -1639,10 +1697,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	p->max = 0;
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
+#ifdef CONFIG_SWAP_CHECKSUM
+	csum_count = p->csum_count;
+	csum = p->csum;
+#endif
 	p->flags = 0;
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
+#ifdef CONFIG_SWAP_CHECKSUM
+	vfree(csum_count);
+	vfree(csum);
+#endif
 	/* Destroy swap account informatin */
 	swap_cgroup_swapoff(type);
 
@@ -1798,6 +1864,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	unsigned long maxpages;
 	unsigned long swapfilepages;
 	unsigned char *swap_map = NULL;
+#ifdef CONFIG_SWAP_CHECKSUM
+	unsigned long csum_pages = 0;
+	unsigned short *csum_count = NULL;
+	u32 **csum = NULL;
+#endif
 	struct page *page = NULL;
 	struct inode *inode = NULL;
 	int did_down = 0;
@@ -1983,7 +2054,34 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap;
 	}
 
+#ifdef CONFIG_SWAP_CHECKSUM
+	if (!noswapcsum) {
+		csum_pages = DIV_ROUND_UP(maxpages, CSUMS_PER_PAGE);
+
+		csum = vmalloc(csum_pages * sizeof(*csum));
+		if (!csum) {
+			error = -ENOMEM;
+			goto bad_swap;
+		}
+
+		csum_count = vmalloc(csum_pages * sizeof(*csum_count));
+		if (!csum_count) {
+			error = -ENOMEM;
+			goto bad_swap;
+		}
+	}
+
+	p->csum_count = csum_count;
+	p->csum = csum;
+#endif
+
 	memset(swap_map, 0, maxpages);
+#ifdef CONFIG_SWAP_CHECKSUM
+	if (!noswapcsum) {
+		memset(csum_count, 0, csum_pages * sizeof(*csum_count));
+		memset(csum, 0, csum_pages * sizeof(*csum));
+	}
+#endif
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
@@ -2076,6 +2174,10 @@ bad_swap_2:
 	p->flags = 0;
 	spin_unlock(&swap_lock);
 	vfree(swap_map);
+#ifdef CONFIG_SWAP_CHECKSUM
+	vfree(csum_count);
+	vfree(csum);
+#endif
 	if (swap_file)
 		filp_close(swap_file, NULL);
 out:
@@ -2487,3 +2589,75 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
 		}
 	}
 }
+
+#ifdef CONFIG_SWAP_CHECKSUM
+int swap_csum_set(swp_entry_t entry, u32 crc)
+{
+	int ret = 0;
+	struct swap_info_struct *si;
+	unsigned long offset;
+	u32 *csum_page;
+
+	si = swap_info_get(entry);
+	if (unlikely(!si))
+		return -EINVAL;
+	offset = swp_offset(entry);
+
+	BUG_ON(!si->csum);
+	csum_page = si->csum[offset / CSUMS_PER_PAGE];
+	if (!csum_page) {
+		csum_page = (void *)__get_free_page(GFP_ATOMIC);
+		if (unlikely(!csum_page)) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		si->csum[offset / CSUMS_PER_PAGE] = csum_page;
+	}
+
+	csum_page[offset % CSUMS_PER_PAGE] = crc;
+
+out:
+	spin_unlock(&swap_lock);
+	return ret;
+}
+
+int swap_csum_get(swp_entry_t entry, u32 *crc)
+{
+	int ret = 0;
+	struct swap_info_struct *si;
+	unsigned long offset;
+	u32 *csum_page;
+
+	/*
+	 * Not locking swap_lock here is safe because:
+	 *
+	 * - We are within end_swap_bio_read for a page in this
+	 *   swapfile, thus it is in use and its swap_info_struct
+	 *   cannot be freed.
+	 * - If we are reading a page from the swapfile, its count must
+	 *   be nonzero, thus the corresponding csum_count must also be
+	 *   nonzero, meaning the corresponding checksum page will not
+	 *   be freed.
+	 * - The checksum value itself is only modified when the page
+	 *   is written, but doing so makes no sense since we are
+	 *   currently in the middle of reading it.
+	 */
+	si = swap_info_get_unlocked(entry);
+	if (unlikely(!si))
+		return -EINVAL;
+	offset = swp_offset(entry);
+
+	BUG_ON(!si->csum);
+	csum_page = si->csum[offset / CSUMS_PER_PAGE];
+	if (unlikely(!csum_page)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	*crc = csum_page[offset % CSUMS_PER_PAGE];
+
+out:
+	return ret;
+}
+#endif
-- 
1.6.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] mm/swapfile.c: better messages for swap_info_get
  2010-05-22 18:08 ` [PATCH 1/3] mm/swapfile.c: better messages for swap_info_get Cesar Eduardo Barros
@ 2010-05-22 18:13   ` Borislav Petkov
  2010-05-22 18:18     ` Cesar Eduardo Barros
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2010-05-22 18:13 UTC (permalink / raw)
  To: Cesar Eduardo Barros; +Cc: linux-mm, linux-kernel

From: Cesar Eduardo Barros <cesarb@cesarb.net>
Date: Sat, May 22, 2010 at 03:08:49PM -0300

> swap_info_get() is used for more than swap_free().
> 
> Use "swap_info_get:" instead of "swap_free:" in the error messages.
> 
> Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
> ---
>  mm/swapfile.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6cd0a8f..af7d499 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -522,16 +522,16 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
>  	return p;
>  
>  bad_free:
> -	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
> +	printk(KERN_ERR "swap_info_get: %s%08lx\n", Unused_offset, entry.val);

Why not let the compiler do it for ya:

	printk(KERN_ERR "%s: %s%08lx\n", __func__, Unused_offset, entry.val);

?... etc.

-- 
Regards/Gruss,
    Boris.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] mm/swapfile.c: better messages for swap_info_get
  2010-05-22 18:13   ` Borislav Petkov
@ 2010-05-22 18:18     ` Cesar Eduardo Barros
  0 siblings, 0 replies; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-22 18:18 UTC (permalink / raw)
  To: Borislav Petkov, linux-mm, linux-kernel

Em 22-05-2010 15:13, Borislav Petkov escreveu:
>> @@ -522,16 +522,16 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
>>   	return p;
>>
>>   bad_free:
>> -	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
>> +	printk(KERN_ERR "swap_info_get: %s%08lx\n", Unused_offset, entry.val);
>
> Why not let the compiler do it for ya:
>
> 	printk(KERN_ERR "%s: %s%08lx\n", __func__, Unused_offset, entry.val);
>
> ?... etc.

See the third patch. This function becomes swap_info_get_unlocked(), and 
swap_info_get() becomes a small wrapper around it. Yet, I still want to 
keep printing swap_info_get: in the error message (whether it is locked 
or not makes no difference from the point of view of the error messsage).

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-22 18:08 [PATCH 0/3] mm: Swap checksum Cesar Eduardo Barros
                   ` (2 preceding siblings ...)
  2010-05-22 18:08 ` [PATCH 3/3] mm: Swap checksum Cesar Eduardo Barros
@ 2010-05-23 14:03 ` Minchan Kim
  2010-05-23 18:32   ` Cesar Eduardo Barros
  3 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2010-05-23 14:03 UTC (permalink / raw)
  To: Cesar Eduardo Barros; +Cc: linux-mm, linux-kernel, Hugh Dickins

On Sat, May 22, 2010 at 03:08:07PM -0300, Cesar Eduardo Barros wrote:
> Add support for checksumming the swap pages written to disk, using the
> same checksum as btrfs (crc32c). Since the contents of the swap do not
> matter after a shutdown, the checksum is kept in memory only.
> 
> Note that this code does not checksum the software suspend image.
We have been used swap pages without checksum.

First of all, Could you explain why you need checksum on swap pages?
Do you see any problem which swap pages are broken?

I could miss your claim at old disucussion thread in LKML.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm: Swap checksum
  2010-05-22 18:08 ` [PATCH 3/3] mm: Swap checksum Cesar Eduardo Barros
@ 2010-05-23 15:19   ` Avi Kivity
  2010-05-23 18:58     ` Cesar Eduardo Barros
  0 siblings, 1 reply; 25+ messages in thread
From: Avi Kivity @ 2010-05-23 15:19 UTC (permalink / raw)
  To: Cesar Eduardo Barros; +Cc: linux-mm, linux-kernel

On 05/22/2010 09:08 PM, Cesar Eduardo Barros wrote:
> Add support for checksumming the swap pages written to disk, using the
> same checksum as btrfs (crc32c). Since the contents of the swap do not
> matter after a shutdown, the checksum is kept in memory only.
>
> Note that this code does not checksum the software suspend image.
>
>
>
>   #define SWAP_FLAG_PREFER	0x8000	/* set if swap priority specified */
>   #define SWAP_FLAG_PRIO_MASK	0x7fff
> @@ -180,6 +183,10 @@ struct swap_info_struct {
>   	struct swap_extent *curr_swap_extent;
>   	struct swap_extent first_swap_extent;
>   	struct block_device *bdev;	/* swap device or bdev of swap file */
> +#ifdef CONFIG_SWAP_CHECKSUM
> +	unsigned short *csum_count;	/* usage count of a csum page */
> +	u32 **csum;			/* vmalloc'ed array of swap csums */
> +#endif
>   	struct file *swap_file;		/* seldom referenced */
>   	unsigned int old_block_size;	/* seldom referenced */
>   };
>    

On 64-bit, we may be able to store the checksum in the pte, if the swap 
device is small enough.

If we take the trouble to touch the page, we may as well compare it 
against zero, and if so drop it instead of swapping it out.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-23 14:03 ` [PATCH 0/3] " Minchan Kim
@ 2010-05-23 18:32   ` Cesar Eduardo Barros
  2010-05-24  0:09     ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-23 18:32 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, Hugh Dickins

Em 23-05-2010 11:03, Minchan Kim escreveu:
> On Sat, May 22, 2010 at 03:08:07PM -0300, Cesar Eduardo Barros wrote:
>> Add support for checksumming the swap pages written to disk, using the
>> same checksum as btrfs (crc32c). Since the contents of the swap do not
>> matter after a shutdown, the checksum is kept in memory only.
>>
>> Note that this code does not checksum the software suspend image.
> We have been used swap pages without checksum.
>
> First of all, Could you explain why you need checksum on swap pages?
> Do you see any problem which swap pages are broken?

The same reason we need checksums in the filesystem.

If you use btrfs as your root filesystem, you are protected by checksums 
from damage in the filesystem, but not in the swap partition (which is 
often in the same disk, and thus as vulnerable as the filesystem). It is 
better to get a checksum error when swapping in than having a silently 
corrupted page.

If you add checksums to the swap, the only piece missing (besides the 
partition table and bootloader, and the first one is solved by GPT, 
which also has a checksum) is checksumming the software suspend image. 
But it has a differente read/write path and different requirements (to 
start with, the checksums must be written to the disk too, while for the 
swap they can stay in memory only).

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm: Swap checksum
  2010-05-23 15:19   ` Avi Kivity
@ 2010-05-23 18:58     ` Cesar Eduardo Barros
  2010-05-24  6:41       ` Avi Kivity
  0 siblings, 1 reply; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-23 18:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-mm, linux-kernel

Em 23-05-2010 12:19, Avi Kivity escreveu:
> On 64-bit, we may be able to store the checksum in the pte, if the swap
> device is small enough.

Which pte? Correct me if I am wrong, but I do not think all pages 
written to the swap have exactly one pte pointing to them. And I have 
not looked at the shmem.c code yet, but does it even use ptes?

It might be possible (find all ptes and write the 32-bit checksum to 
them, do something else for shmem, have two different code paths for 
small/large swapfiles), but I do not know if the memory savings are 
worth the extra complexity (especially the need for two separate code 
paths).

> If we take the trouble to touch the page, we may as well compare it
> against zero, and if so drop it instead of swapping it out.

The problem with this is that the page is touched deep inside the crc32c 
code, which might even be using hardware instructions (crc32c-intel). So 
we would need to read it two times to compare against zero.

One possibility could be to compare the full page against zero only if 
its crc is a specific value (the crc32c of a page full of zeros). This 
would not be too slow (we would be wasting time only when we have a very 
high probability of saving much more time), and not need to touch the 
crc32c code at all. I would only have to look at how this messes up the 
state tracking (i.e. how to make it track the fact that, instead of 
getting written out, this is now a zeroed page). Other than that, it 
seems a good idea.

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-23 18:32   ` Cesar Eduardo Barros
@ 2010-05-24  0:09     ` Minchan Kim
  2010-05-24  0:57       ` Cesar Eduardo Barros
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2010-05-24  0:09 UTC (permalink / raw)
  To: Cesar Eduardo Barros; +Cc: linux-mm, linux-kernel, Hugh Dickins

Hi, Cesar.
I am not sure Cesar is first name. :)

On Mon, May 24, 2010 at 3:32 AM, Cesar Eduardo Barros <cesarb@cesarb.net> wrote:
> Em 23-05-2010 11:03, Minchan Kim escreveu:
>>
>> On Sat, May 22, 2010 at 03:08:07PM -0300, Cesar Eduardo Barros wrote:
>>>
>>> Add support for checksumming the swap pages written to disk, using the
>>> same checksum as btrfs (crc32c). Since the contents of the swap do not
>>> matter after a shutdown, the checksum is kept in memory only.
>>>
>>> Note that this code does not checksum the software suspend image.
>>
>> We have been used swap pages without checksum.
>>
>> First of all, Could you explain why you need checksum on swap pages?
>> Do you see any problem which swap pages are broken?
>
> The same reason we need checksums in the filesystem.
>
> If you use btrfs as your root filesystem, you are protected by checksums
> from damage in the filesystem, but not in the swap partition (which is often
> in the same disk, and thus as vulnerable as the filesystem). It is better to
> get a checksum error when swapping in than having a silently corrupted page.

Do you mean "vulnerable" is other file system or block I/O operation
invades swap partition and breaks data of swap?

If it is, I think it's the problem of them. so we have to fix it
before merged into mainline. But I admit human being always take a
mistake so that we can miss it at review time. In such case, it would
be very hard bug when swap pages are broken. I haven't hear about such
problem until now but it might be useful if the problem happens.
(Maybe they can't notice that due to hard bug to find)

But I have a concern about breaking memory which includes crc by
dangling pointer. In this case, swap block is correct but it would
emit crc error.

Do you have an idea making sure memory includes crc is correct?

Before review code, let's discuss we really need this and it's useful.
-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-24  0:09     ` Minchan Kim
@ 2010-05-24  0:57       ` Cesar Eduardo Barros
  2010-05-24  2:05         ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-24  0:57 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, Hugh Dickins

Em 23-05-2010 21:09, Minchan Kim escreveu:
> Hi, Cesar.
> I am not sure Cesar is first name. :)

Yes, it is.

> On Mon, May 24, 2010 at 3:32 AM, Cesar Eduardo Barros<cesarb@cesarb.net>  wrote:
>> Em 23-05-2010 11:03, Minchan Kim escreveu:
>>> We have been used swap pages without checksum.
>>>
>>> First of all, Could you explain why you need checksum on swap pages?
>>> Do you see any problem which swap pages are broken?
>>
>> The same reason we need checksums in the filesystem.
>>
>> If you use btrfs as your root filesystem, you are protected by checksums
>> from damage in the filesystem, but not in the swap partition (which is often
>> in the same disk, and thus as vulnerable as the filesystem). It is better to
>> get a checksum error when swapping in than having a silently corrupted page.
>
> Do you mean "vulnerable" is other file system or block I/O operation
> invades swap partition and breaks data of swap?

Vulnerable in that the same kind of hardware problems which can silently 
damage filesystem data in the disk can damage swap pages in the disk.

This is the reason both btrfs and zfs checksum all their data and 
metadata. However, the swap partition is still vulnerable (using a swap 
file is not a solution, since the swap code bypasses the filesystem). 
And silent data corruption in the swap partition could be even worse 
than in the filesystem - while a program might not trust a file it is 
reading to not be corrupted, almost all programs will trust their 
*memory* to not be corrupted.

The internal ECC of the disk will not save you - a quick Google search 
found an instance of someone with silent data corruption caused by a 
faulty *power supply*.[1]

And if it is silent corruption, without the checksums you will not 
notice it - it will just be dismissed as "oh, Firefox just crashed 
again" or similar (the same as bit flips on RAM without ECC).

> If it is, I think it's the problem of them. so we have to fix it
> before merged into mainline. But I admit human being always take a
> mistake so that we can miss it at review time. In such case, it would
> be very hard bug when swap pages are broken. I haven't hear about such
> problem until now but it might be useful if the problem happens.
> (Maybe they can't notice that due to hard bug to find)
>
> But I have a concern about breaking memory which includes crc by
> dangling pointer. In this case, swap block is correct but it would
> emit crc error.
>
> Do you have an idea making sure memory includes crc is correct?

The swap checksum only protects the page against being silently 
corrupted while on the disk and at least to some degree on the I/O path 
between the memory and the disk. It does not protect against broken 
kernel-mode code writing to the wrong address, nor against broken 
hardware (or hardware misconfigured by broken drivers) doing DMA to 
wrong addresses. It also does not protect against hardware errors in the 
RAM itself (you have ECC memory for that).

That is, the code assumes the memory containing the checksums will not 
be corrupted, because if it is, you have worse problems (and the CRC 
error here would be a *good* thing, since it would make you notice 
something is not quite right).

[1] http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-24  0:57       ` Cesar Eduardo Barros
@ 2010-05-24  2:05         ` Minchan Kim
  2010-05-24 10:50           ` Cesar Eduardo Barros
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2010-05-24  2:05 UTC (permalink / raw)
  To: Cesar Eduardo Barros; +Cc: linux-mm, linux-kernel, Hugh Dickins

On Mon, May 24, 2010 at 9:57 AM, Cesar Eduardo Barros <cesarb@cesarb.net> wrote:
> Em 23-05-2010 21:09, Minchan Kim escreveu:
>>
>> Hi, Cesar.
>> I am not sure Cesar is first name. :)
>
> Yes, it is.
>
>> On Mon, May 24, 2010 at 3:32 AM, Cesar Eduardo Barros<cesarb@cesarb.net>
>>  wrote:
>>>
>>> Em 23-05-2010 11:03, Minchan Kim escreveu:
>>>>
>>>> We have been used swap pages without checksum.
>>>>
>>>> First of all, Could you explain why you need checksum on swap pages?
>>>> Do you see any problem which swap pages are broken?
>>>
>>> The same reason we need checksums in the filesystem.
>>>
>>> If you use btrfs as your root filesystem, you are protected by checksums
>>> from damage in the filesystem, but not in the swap partition (which is
>>> often
>>> in the same disk, and thus as vulnerable as the filesystem). It is better
>>> to
>>> get a checksum error when swapping in than having a silently corrupted
>>> page.
>>
>> Do you mean "vulnerable" is other file system or block I/O operation
>> invades swap partition and breaks data of swap?
>
> Vulnerable in that the same kind of hardware problems which can silently
> damage filesystem data in the disk can damage swap pages in the disk.
>
> This is the reason both btrfs and zfs checksum all their data and metadata.
> However, the swap partition is still vulnerable (using a swap file is not a
> solution, since the swap code bypasses the filesystem). And silent data
> corruption in the swap partition could be even worse than in the filesystem
> - while a program might not trust a file it is reading to not be corrupted,
> almost all programs will trust their *memory* to not be corrupted.
>
> The internal ECC of the disk will not save you - a quick Google search found
> an instance of someone with silent data corruption caused by a faulty *power
> supply*.[1]
>
> And if it is silent corruption, without the checksums you will not notice it
> - it will just be dismissed as "oh, Firefox just crashed again" or similar
> (the same as bit flips on RAM without ECC).

Thanks for kind explanation.

When I read your comment, suddenly some thought occurred to me.
If we can't believe ECC of the disk, why do we separate error
detection logic between file system and swap disk?

I mean it make sense that put crc detection into block layer?
It can make sure any block I/O.

And what's BER of disk?
Is it usual to meet the problem?

In normal desktop, some app killed are not critical. If the
application is critical, maybe app have to logic fault handling.
Firefox has session restore feature and Office program has temporal
save feature.

On the other hand, in server, does it designed well to use swap disk
until we meet bit error of disk?

My feel is that it seem to be rather overkill.

>
>> If it is, I think it's the problem of them. so we have to fix it
>> before merged into mainline. But I admit human being always take a
>> mistake so that we can miss it at review time. In such case, it would
>> be very hard bug when swap pages are broken. I haven't hear about such
>> problem until now but it might be useful if the problem happens.
>> (Maybe they can't notice that due to hard bug to find)
>>
>> But I have a concern about breaking memory which includes crc by
>> dangling pointer. In this case, swap block is correct but it would
>> emit crc error.
>>
>> Do you have an idea making sure memory includes crc is correct?
>
> The swap checksum only protects the page against being silently corrupted
> while on the disk and at least to some degree on the I/O path between the
> memory and the disk. It does not protect against broken kernel-mode code
> writing to the wrong address, nor against broken hardware (or hardware
> misconfigured by broken drivers) doing DMA to wrong addresses. It also does
> not protect against hardware errors in the RAM itself (you have ECC memory
> for that).
>
> That is, the code assumes the memory containing the checksums will not be
> corrupted, because if it is, you have worse problems (and the CRC error here
> would be a *good* thing, since it would make you notice something is not
> quite right).
>

Which is high between BER of RAM and disk?
It's a just question. :)

>
> [1] http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta
>
> --
> Cesar Eduardo Barros
> cesarb@cesarb.net
> cesar.barros@gmail.com
>



-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm: Swap checksum
  2010-05-23 18:58     ` Cesar Eduardo Barros
@ 2010-05-24  6:41       ` Avi Kivity
  2010-05-24  7:32         ` Nick Piggin
  2010-05-24 11:24         ` Cesar Eduardo Barros
  0 siblings, 2 replies; 25+ messages in thread
From: Avi Kivity @ 2010-05-24  6:41 UTC (permalink / raw)
  To: Cesar Eduardo Barros; +Cc: linux-mm, linux-kernel

On 05/23/2010 09:58 PM, Cesar Eduardo Barros wrote:
> Em 23-05-2010 12:19, Avi Kivity escreveu:
>> On 64-bit, we may be able to store the checksum in the pte, if the swap
>> device is small enough.
>
> Which pte? 

All of them.

> Correct me if I am wrong, but I do not think all pages written to the 
> swap have exactly one pte pointing to them. And I have not looked at 
> the shmem.c code yet, but does it even use ptes?

Well, the ptes need the swap address written into them, so they are 
already found and updated somehow.  All that's needed is to update the 
value written to also include the checksum.

> It might be possible (find all ptes and write the 32-bit checksum to 
> them, do something else for shmem, have two different code paths for 
> small/large swapfiles), but I do not know if the memory savings are 
> worth the extra complexity (especially the need for two separate code 
> paths).

Certainly not at first, but later it may be worthwhile.

>
>> If we take the trouble to touch the page, we may as well compare it
>> against zero, and if so drop it instead of swapping it out.
>
> The problem with this is that the page is touched deep inside the 
> crc32c code, which might even be using hardware instructions 
> (crc32c-intel). So we would need to read it two times to compare 
> against zero.

The second read is very cheap since the page is already in cache.  Also, 
we fail early when any word is nonzero, so usually the compare exits 
quickly.

>
> One possibility could be to compare the full page against zero only if 
> its crc is a specific value (the crc32c of a page full of zeros). This 
> would not be too slow (we would be wasting time only when we have a 
> very high probability of saving much more time), and not need to touch 
> the crc32c code at all. I would only have to look at how this messes 
> up the state tracking (i.e. how to make it track the fact that, 
> instead of getting written out, this is now a zeroed page).

Instead of returning a swap pte to be written to the page tables, return 
a zeroed pte.


> Other than that, it seems a good idea.
>


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm: Swap checksum
  2010-05-24  6:41       ` Avi Kivity
@ 2010-05-24  7:32         ` Nick Piggin
  2010-05-24 10:51           ` Avi Kivity
  2010-05-24 11:24         ` Cesar Eduardo Barros
  1 sibling, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2010-05-24  7:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Cesar Eduardo Barros, linux-mm, linux-kernel

On Mon, May 24, 2010 at 09:41:22AM +0300, Avi Kivity wrote:
> On 05/23/2010 09:58 PM, Cesar Eduardo Barros wrote:
> >Em 23-05-2010 12:19, Avi Kivity escreveu:
> >>On 64-bit, we may be able to store the checksum in the pte, if the swap
> >>device is small enough.
> >
> >Which pte?
> 
> All of them.
> 
> >Correct me if I am wrong, but I do not think all pages written to
> >the swap have exactly one pte pointing to them. And I have not
> >looked at the shmem.c code yet, but does it even use ptes?
> 
> Well, the ptes need the swap address written into them, so they are
> already found and updated somehow.  All that's needed is to update
> the value written to also include the checksum.
> 
> >It might be possible (find all ptes and write the 32-bit checksum
> >to them, do something else for shmem, have two different code
> >paths for small/large swapfiles), but I do not know if the memory
> >savings are worth the extra complexity (especially the need for
> >two separate code paths).
> 
> Certainly not at first, but later it may be worthwhile.
> 
> >
> >>If we take the trouble to touch the page, we may as well compare it
> >>against zero, and if so drop it instead of swapping it out.
> >
> >The problem with this is that the page is touched deep inside the
> >crc32c code, which might even be using hardware instructions
> >(crc32c-intel). So we would need to read it two times to compare
> >against zero.
> 
> The second read is very cheap since the page is already in cache.
> Also, we fail early when any word is nonzero, so usually the compare
> exits quickly.

For a page being written back from pagecache to disk, or for a
page being swapped out, the contents are likely cache cold and
likely not to be used in future either. Therefore a crc routine
for that would do well to minimise cache pollution.


> >One possibility could be to compare the full page against zero
> >only if its crc is a specific value (the crc32c of a page full of
> >zeros). This would not be too slow (we would be wasting time only
> >when we have a very high probability of saving much more time),
> >and not need to touch the crc32c code at all. I would only have to
> >look at how this messes up the state tracking (i.e. how to make it
> >track the fact that, instead of getting written out, this is now a
> >zeroed page).
> 
> Instead of returning a swap pte to be written to the page tables,
> return a zeroed pte.

A pte_none pte, to be precise.

I wonder, though. If we no longer trust block devices to give the
correct data back, should we provide a meta block device to do error
detection? No production filesystem on Linux has checksums (well, ext4
has a few). Of the ones that add checksumming, I'd say most will not do
data checksumming (and for direct IO it is not done).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-24  2:05         ` Minchan Kim
@ 2010-05-24 10:50           ` Cesar Eduardo Barros
  2010-05-25 23:52             ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-24 10:50 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, Hugh Dickins

Em 23-05-2010 23:05, Minchan Kim escreveu:
> On Mon, May 24, 2010 at 9:57 AM, Cesar Eduardo Barros<cesarb@cesarb.net>  wrote:
>> The internal ECC of the disk will not save you - a quick Google search found
>> an instance of someone with silent data corruption caused by a faulty *power
>> supply*.[1]
>>
>> And if it is silent corruption, without the checksums you will not notice it
>> - it will just be dismissed as "oh, Firefox just crashed again" or similar
>> (the same as bit flips on RAM without ECC).
>
> When I read your comment, suddenly some thought occurred to me.
> If we can't believe ECC of the disk, why do we separate error
> detection logic between file system and swap disk?
>
> I mean it make sense that put crc detection into block layer?
> It can make sure any block I/O.

There are differences as to where the checksum will be stored.

For the filesystem (and for the software suspend image), you have to 
store the checksums in the disk itself, and the correct place (and 
ordering requirements) depends on the filesystem. Also, most filesystems 
do not currently have on-disk checksums.

For the swap, it is much simpler; the checksums can be stored in memory 
(since they do not matter after a reboot; the swap contents are simply 
discarded). This also gives better performance, since the checksums do 
not have to be separately written, and more flexibility, since the 
kernel can use whatever kind of checksum it wants and can store the 
checksum in whatever data structure it choses, without worrying about 
compatibility.

And, in fact, there is a CRC code in the block layer; it is 
CONFIG_BLK_DEV_INTEGRITY. However, it is not a generic solution; it 
needs some extra prerequisites (like a disk low-level formatted with 
sectors with >512 bytes).

> And what's BER of disk?
> Is it usual to meet the problem?

It is unusual enough that most people who meet it will not notice.

And the filesystem developers (who understand about these things more 
than me) seem to be trending towards adding checksums to their 
filesystems. The point of this patch is to meet the same level of safety 
as btrfs (thus the choice of crc32c, which is what btrfs uses).

> In normal desktop, some app killed are not critical. If the
> application is critical, maybe app have to logic fault handling.
> Firefox has session restore feature and Office program has temporal
> save feature.

In fact, crashing is the "best" outcome here. The worst outcome is your 
application silently corrupting the data you then save to disk.

> On the other hand, in server, does it designed well to use swap disk
> until we meet bit error of disk?

Servers are the ones which would benefit the most, as their RAM is 
usually very reliable (they tend to use ECC memory). Their disk 
subsystems, however, are also more reliable.

Desktop systems (especially "no-name" brand ones) do not gain as much, 
since their RAM is usually unprotected; however, they are also the ones 
which have better chance of having low-quality power supplies, uncommon 
storage media (USB flash drives as the root disk is an example), and 
problematic I/O subsystems.

> My feel is that it seem to be rather overkill.

Yes, it is a bit overkill, except when you are using software suspend. 
While the software suspend image is not protected by this patch (I am 
already thinking of a separate patch to add checksums to it), the 
swapped out pages are (software suspend uses both a memory image saved 
to the swap partition and the normal swapped out pages).

But you do not have to use it if you think it is overkill - I even added 
a kernel parameter to easily disable it.

>> The swap checksum only protects the page against being silently corrupted
>> while on the disk and at least to some degree on the I/O path between the
>> memory and the disk. It does not protect against broken kernel-mode code
>> writing to the wrong address, nor against broken hardware (or hardware
>> misconfigured by broken drivers) doing DMA to wrong addresses. It also does
>> not protect against hardware errors in the RAM itself (you have ECC memory
>> for that).
>>
>> That is, the code assumes the memory containing the checksums will not be
>> corrupted, because if it is, you have worse problems (and the CRC error here
>> would be a *good* thing, since it would make you notice something is not
>> quite right).
>>
>
> Which is high between BER of RAM and disk?
> It's a just question. :)

I have no idea.

However, we can do nothing in software against RAM errors; it would kill 
performance too much. Against disk errors, however, we can do a lot 
(software RAID-1 is just one of the simplest examples).

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm: Swap checksum
  2010-05-24  7:32         ` Nick Piggin
@ 2010-05-24 10:51           ` Avi Kivity
  0 siblings, 0 replies; 25+ messages in thread
From: Avi Kivity @ 2010-05-24 10:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Cesar Eduardo Barros, linux-mm, linux-kernel

On 05/24/2010 10:32 AM, Nick Piggin wrote:
>
> I wonder, though. If we no longer trust block devices to give the
> correct data back, should we provide a meta block device to do error
> detection?

Some block devices do provide space for end-to-end checksums.  For the 
ones that don't, I see no efficient way of adding it (either we turn one 
access into two, or we have a non-power-of-two block size).

> No production filesystem on Linux has checksums (well, ext4
> has a few). Of the ones that add checksumming, I'd say most will not do
> data checksumming (and for direct IO it is not done).
>    

I believe btrfs checksums direct IO.  Unfortunately it has some way to 
go before it can be used in production.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm: Swap checksum
  2010-05-24  6:41       ` Avi Kivity
  2010-05-24  7:32         ` Nick Piggin
@ 2010-05-24 11:24         ` Cesar Eduardo Barros
  1 sibling, 0 replies; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-24 11:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-mm, linux-kernel

Em 24-05-2010 03:41, Avi Kivity escreveu:
> On 05/23/2010 09:58 PM, Cesar Eduardo Barros wrote:
>> One possibility could be to compare the full page against zero only if
>> its crc is a specific value (the crc32c of a page full of zeros). This
>> would not be too slow (we would be wasting time only when we have a
>> very high probability of saving much more time), and not need to touch
>> the crc32c code at all. I would only have to look at how this messes
>> up the state tracking (i.e. how to make it track the fact that,
>> instead of getting written out, this is now a zeroed page).
>
> Instead of returning a swap pte to be written to the page tables, return
> a zeroed pte.

Unfortunately, at this point in the code (swap_writepage) it is not a 
matter of simply returning a zeroed pte; it would have to go looking for 
the page tables.

If I am understanding things correctly, in some situations even looking 
for them via the struct page I have and its rmap stuff is not enough; 
you could have a pte still pointing to the swap instead of pointing to 
the page (which when faulted would go via the swap cache to get the page).

For instance, if the page was shared between several processes, got 
swapped out, swapped in but accessed only by some of the processes, and 
in the process of being swapped out again.

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-24 10:50           ` Cesar Eduardo Barros
@ 2010-05-25 23:52             ` Minchan Kim
  2010-05-26 10:21               ` Cesar Eduardo Barros
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2010-05-25 23:52 UTC (permalink / raw)
  To: Cesar Eduardo Barros; +Cc: linux-mm, linux-kernel, Hugh Dickins

Hi, Cesar.

On Mon, May 24, 2010 at 7:50 PM, Cesar Eduardo Barros <cesarb@cesarb.net> wrote:
> Em 23-05-2010 23:05, Minchan Kim escreveu:
>>
>> On Mon, May 24, 2010 at 9:57 AM, Cesar Eduardo Barros<cesarb@cesarb.net>
>>  wrote:
>>>
>>> The internal ECC of the disk will not save you - a quick Google search
>>> found
>>> an instance of someone with silent data corruption caused by a faulty
>>> *power
>>> supply*.[1]
>>>
>>> And if it is silent corruption, without the checksums you will not notice
>>> it
>>> - it will just be dismissed as "oh, Firefox just crashed again" or
>>> similar
>>> (the same as bit flips on RAM without ECC).
>>
>> When I read your comment, suddenly some thought occurred to me.
>> If we can't believe ECC of the disk, why do we separate error
>> detection logic between file system and swap disk?
>>
>> I mean it make sense that put crc detection into block layer?
>> It can make sure any block I/O.
>
> There are differences as to where the checksum will be stored.
>
> For the filesystem (and for the software suspend image), you have to store
> the checksums in the disk itself, and the correct place (and ordering
> requirements) depends on the filesystem. Also, most filesystems do not
> currently have on-disk checksums.
>
> For the swap, it is much simpler; the checksums can be stored in memory
> (since they do not matter after a reboot; the swap contents are simply
> discarded). This also gives better performance, since the checksums do not
> have to be separately written, and more flexibility, since the kernel can
> use whatever kind of checksum it wants and can store the checksum in
> whatever data structure it choses, without worrying about compatibility.
>
> And, in fact, there is a CRC code in the block layer; it is
> CONFIG_BLK_DEV_INTEGRITY. However, it is not a generic solution; it needs
> some extra prerequisites (like a disk low-level formatted with sectors with
>>512 bytes).

Thanks for good information.

You mean BLK_DEV_INTEGRITY has a dependency with block device driver?
If you want to support checksum into suspend, At last, should we put
the checksum on disk?

I mean could we extend BLK_DEV_INTEGRITY by more generic solution?
As you said, in case of swap, we don't need to put checksum on disk.

If swap case, let it put the one on memory. If non-swap case, let it
put checksum on disk,
I am not sure it's possible.

When we have a unreliable disk, your point is that let's solve it with
(btrfs + swap) which both supports checksum. And my point is that
let's solve it with (any file system + swap) which is put on block
layer which supports checksum.

I am not a expert of block layer. so whatever i says might be nonsense.
And any mm guys don't oppose this idea until now(except Nick).
Could we regard it as ack of others?

If others don't oppose the idea, I will be not against, either.

Thanks, Cesar.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-25 23:52             ` Minchan Kim
@ 2010-05-26 10:21               ` Cesar Eduardo Barros
  2010-05-26 15:31                 ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-26 10:21 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, Hugh Dickins

Em 25-05-2010 20:52, Minchan Kim escreveu:
> On Mon, May 24, 2010 at 7:50 PM, Cesar Eduardo Barros<cesarb@cesarb.net>  wrote:
>> And, in fact, there is a CRC code in the block layer; it is
>> CONFIG_BLK_DEV_INTEGRITY. However, it is not a generic solution; it needs
>> some extra prerequisites (like a disk low-level formatted with sectors with
>>> 512 bytes).
>
> You mean BLK_DEV_INTEGRITY has a dependency with block device driver?
> If you want to support checksum into suspend, At last, should we put
> the checksum on disk?
>
> I mean could we extend BLK_DEV_INTEGRITY by more generic solution?
> As you said, in case of swap, we don't need to put checksum on disk.

CONFIG_BLK_DEV_INTEGRITY writes the checksum to the same sector as the 
data. However, for that to be possible, the sector size is increased on 
the disk itself, from 512 bytes to 520 bytes (and not all disks can do 
that). It is not a generic solution. It also, as far as I can see, does 
nothing against the disk simply failing to write and later returning 
stale data, since the stale checksum would match the stale data.

See the LWN article [1] and the presentations [2] for more detail.

For suspend, the swap checksum pages would be saved together with the 
rest of the memory (they are in the memory, after all), and the suspend 
snapshot would have its own separate checksum (written directly to the 
disk after the image).

> If swap case, let it put the one on memory. If non-swap case, let it
> put checksum on disk,
> I am not sure it's possible.
>
> When we have a unreliable disk, your point is that let's solve it with
> (btrfs + swap) which both supports checksum. And my point is that
> let's solve it with (any file system + swap) which is put on block
> layer which supports checksum.

A generic "checksumming block device" would be less efficient.

For the swap case, it cannot exploit the fact that its state tracking is 
within the swapfile code. Avi Kivity's idea of storing the checksum in 
otherwise wasted bits of the pte is an example of how this could be 
exploited in the future. In fact, the reason I did it on the swap layer 
(instead of interposing something in the block layer) was precisely to 
make it easier to enhance the state tracking in the future (and also 
because it felt the most natural layer to do it).

It would also complicate adding checksums to the software suspend 
snapshot. While normally you do not want to write the swap checksums to 
the disk, you do want to write them when saving the memory snapshot - 
which is written to the same block device. However, the checksums for 
the rest of the swap pages are already being saved as part of the memory 
snapshot (since the checksums were in the memory).

For the generic ("any file system") case, it is worse, since you 
actually have to write the checksum to the disk, and unlike in the 
software suspend case you cannot simply write them all in one pass at 
the end. In the worst case, you would have to write twice for each 
sector/page - once for the data, and once for the checksums 
(CONFIG_BLK_DEV_INTEGRITY completely avoids this issue since with it the 
checksum is together with the data in the same sector). Not to mention 
fun things like write amplification.

A filesystem with data checksums can write the checksum as part of its 
normal metadata updates (which it already has to do anyway).

A generic "checksumming block device" could be a way of "updating" a 
filesystem without checksums (or with only metadata checksums) to have 
them. However, I believe it would be more productive to add them 
directly to the filesystem itself. Even more since the only way I can 
see of doing it efficiently in a generic block layer is by using lots of 
filesystem-style tricks (things like a log-structured list of CRC 
values, dividing the device in "block groups" to keep the checksum close 
to the data, and so on).

[1] Block layer: integrity checking and lots of partitions
     http://lwn.net/Articles/290141/
[2] http://oss.oracle.com/projects/data-integrity/documentation/

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-26 10:21               ` Cesar Eduardo Barros
@ 2010-05-26 15:31                 ` Minchan Kim
  2010-05-26 21:28                   ` Valdis.Kletnieks
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2010-05-26 15:31 UTC (permalink / raw)
  To: Cesar Eduardo Barros; +Cc: linux-mm, linux-kernel, Hugh Dickins

On Wed, May 26, 2010 at 07:21:57AM -0300, Cesar Eduardo Barros wrote:
> Em 25-05-2010 20:52, Minchan Kim escreveu:
> >On Mon, May 24, 2010 at 7:50 PM, Cesar Eduardo Barros<cesarb@cesarb.net>  wrote:
> >>And, in fact, there is a CRC code in the block layer; it is
> >>CONFIG_BLK_DEV_INTEGRITY. However, it is not a generic solution; it needs
> >>some extra prerequisites (like a disk low-level formatted with sectors with
> >>>512 bytes).
> >
> >You mean BLK_DEV_INTEGRITY has a dependency with block device driver?
> >If you want to support checksum into suspend, At last, should we put
> >the checksum on disk?
> >
> >I mean could we extend BLK_DEV_INTEGRITY by more generic solution?
> >As you said, in case of swap, we don't need to put checksum on disk.
> 
> CONFIG_BLK_DEV_INTEGRITY writes the checksum to the same sector as
> the data. However, for that to be possible, the sector size is
> increased on the disk itself, from 512 bytes to 520 bytes (and not
> all disks can do that). It is not a generic solution. It also, as

It means if disk don't support 520 byte sector, CONFIG_BLK_DEV_INTEGRITY
can't work? That means CONFIG_BLK_DEV_INTEGRITY depends on block device?

> far as I can see, does nothing against the disk simply failing to
> write and later returning stale data, since the stale checksum would
> match the stale data.

Sorry. I can't understand your point. 
Who makes stale data? If any layer makes data as stale, integrity is up to 
the layer. Maybe I am missing your point. 
Could you explain more detail?

> 
> See the LWN article [1] and the presentations [2] for more detail.

Thanks for good information. 

> 
> For suspend, the swap checksum pages would be saved together with
> the rest of the memory (they are in the memory, after all), and the
> suspend snapshot would have its own separate checksum (written
> directly to the disk after the image).
> 
> >If swap case, let it put the one on memory. If non-swap case, let it
> >put checksum on disk,
> >I am not sure it's possible.
> >
> >When we have a unreliable disk, your point is that let's solve it with
> >(btrfs + swap) which both supports checksum. And my point is that
> >let's solve it with (any file system + swap) which is put on block
> >layer which supports checksum.
> 
> A generic "checksumming block device" would be less efficient.
> 
> For the swap case, it cannot exploit the fact that its state
> tracking is within the swapfile code. Avi Kivity's idea of storing
> the checksum in otherwise wasted bits of the pte is an example of
> how this could be exploited in the future. In fact, the reason I did
> it on the swap layer (instead of interposing something in the block
> layer) was precisely to make it easier to enhance the state tracking
> in the future (and also because it felt the most natural layer to do
> it).

Hmm. I don't know what is the state you mentioned in future. 
But As view of design, I tend to agree. 

> 
> It would also complicate adding checksums to the software suspend
> snapshot. While normally you do not want to write the swap checksums
> to the disk, you do want to write them when saving the memory
> snapshot - which is written to the same block device. However, the
> checksums for the rest of the swap pages are already being saved as
> part of the memory snapshot (since the checksums were in the
> memory).
> 
> For the generic ("any file system") case, it is worse, since you
> actually have to write the checksum to the disk, and unlike in the
> software suspend case you cannot simply write them all in one pass
> at the end. In the worst case, you would have to write twice for
> each sector/page - once for the data, and once for the checksums
> (CONFIG_BLK_DEV_INTEGRITY completely avoids this issue since with it
> the checksum is together with the data in the same sector). Not to
> mention fun things like write amplification.
> 
> A filesystem with data checksums can write the checksum as part of
> its normal metadata updates (which it already has to do anyway).
> 
> A generic "checksumming block device" could be a way of "updating" a
> filesystem without checksums (or with only metadata checksums) to
> have them. However, I believe it would be more productive to add
> them directly to the filesystem itself. Even more since the only way
> I can see of doing it efficiently in a generic block layer is by
> using lots of filesystem-style tricks (things like a log-structured
> list of CRC values, dividing the device in "block groups" to keep
> the checksum close to the data, and so on).
> 
> 
> [1] Block layer: integrity checking and lots of partitions
>     http://lwn.net/Articles/290141/
> [2] http://oss.oracle.com/projects/data-integrity/documentation/

Thanks for good explanation. 

I agree we don't have any method to detect disk error about swap pages.
I am not sure we _really_ need it and who want it in practice(now even 
many of file systems don't support checksum) but it's optional feature. 
so if there is anyone want it, he just use it by enable. 

Yes. I am not against this patch any more. 
I hope when you send this patch, please, write down things discussed with
me in description. 

1. Why do we need it?(ie, who can use it useful?)
2. Why is CONFIG_BLK_DEV_INTEGRITY's extension bad design?

And

3. Please, Cc Jens Axboe <jens.axboe@oracle.com>, Hugh Dickins <hughd@google.com>

Thanks for good reply on my long bore question. 

> 
> -- 
> Cesar Eduardo Barros
> cesarb@cesarb.net
> cesar.barros@gmail.com
-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-26 15:31                 ` Minchan Kim
@ 2010-05-26 21:28                   ` Valdis.Kletnieks
  2010-05-26 22:45                     ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Valdis.Kletnieks @ 2010-05-26 21:28 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Cesar Eduardo Barros, linux-mm, linux-kernel, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 1516 bytes --]

On Thu, 27 May 2010 00:31:44 +0900, Minchan Kim said:
> On Wed, May 26, 2010 at 07:21:57AM -0300, Cesar Eduardo Barros wrote:
> > far as I can see, does nothing against the disk simply failing to
> > write and later returning stale data, since the stale checksum would
> > match the stale data.
> 
> Sorry. I can't understand your point. 
> Who makes stale data? If any layer makes data as stale, integrity is up to 
> the layer. Maybe I am missing your point. 
> Could you explain more detail?

I'm pretty sure that what Cesar meant was that the following could happen:

1) Write block 11983 on the disk, checksum 34FE9B72.
(... time passes.. maybe weeks)
2) Attempt to write block 11983 on disk with checksum AE9F3581. The write fails
due to a power failure or something.
(... more time passes...)
3) Read block 11983, get back data with checksum 34FE9B72. Checksum matches,
and there's no indication that the write in (2) ever failed. The program
proceeds thinking it's just read back the most recently written data, when in
fact it's just read an older version of that block. Problems can ensue if the
data just read is now out of sync with *other* blocks of data - instant data
corruption.

To be fair, we currently have the "read a stale block" problem after crashes
already.  The issue is that BLK_DEV_INTEGRITY can't provide a solution here,
but most users will form a mental image that it *is* in fact giving them
that guarantee.  The resulting mismatch between reality and expectations
cannot end well.

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-26 21:28                   ` Valdis.Kletnieks
@ 2010-05-26 22:45                     ` Minchan Kim
  2010-05-26 23:19                       ` Cesar Eduardo Barros
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2010-05-26 22:45 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Cesar Eduardo Barros, linux-mm, linux-kernel, Hugh Dickins

On Thu, May 27, 2010 at 6:28 AM,  <Valdis.Kletnieks@vt.edu> wrote:
> On Thu, 27 May 2010 00:31:44 +0900, Minchan Kim said:
>> On Wed, May 26, 2010 at 07:21:57AM -0300, Cesar Eduardo Barros wrote:
>> > far as I can see, does nothing against the disk simply failing to
>> > write and later returning stale data, since the stale checksum would
>> > match the stale data.
>>
>> Sorry. I can't understand your point.
>> Who makes stale data? If any layer makes data as stale, integrity is up to
>> the layer. Maybe I am missing your point.
>> Could you explain more detail?
>
> I'm pretty sure that what Cesar meant was that the following could happen:
>
> 1) Write block 11983 on the disk, checksum 34FE9B72.
> (... time passes.. maybe weeks)
> 2) Attempt to write block 11983 on disk with checksum AE9F3581. The write fails
> due to a power failure or something.
> (... more time passes...)
> 3) Read block 11983, get back data with checksum 34FE9B72. Checksum matches,
> and there's no indication that the write in (2) ever failed. The program
> proceeds thinking it's just read back the most recently written data, when in
> fact it's just read an older version of that block. Problems can ensue if the
> data just read is now out of sync with *other* blocks of data - instant data
> corruption.

Oh, doesn't normal disk support atomicity of sector write?
I have been thought disk must support atomicity of sector write at least.

AFAIK, other device(ex, nand device by FTL) supports atomicity of sector write.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-26 22:45                     ` Minchan Kim
@ 2010-05-26 23:19                       ` Cesar Eduardo Barros
  2010-05-26 23:27                         ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Cesar Eduardo Barros @ 2010-05-26 23:19 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Valdis.Kletnieks, linux-mm, linux-kernel, Hugh Dickins

Em 26-05-2010 19:45, Minchan Kim escreveu:
> On Thu, May 27, 2010 at 6:28 AM,<Valdis.Kletnieks@vt.edu>  wrote:
>> On Thu, 27 May 2010 00:31:44 +0900, Minchan Kim said:
>>> On Wed, May 26, 2010 at 07:21:57AM -0300, Cesar Eduardo Barros wrote:
>>>> far as I can see, does nothing against the disk simply failing to
>>>> write and later returning stale data, since the stale checksum would
>>>> match the stale data.
>>>
>>> Sorry. I can't understand your point.
>>> Who makes stale data? If any layer makes data as stale, integrity is up to
>>> the layer. Maybe I am missing your point.
>>> Could you explain more detail?
>>
>> I'm pretty sure that what Cesar meant was that the following could happen:
>>
>> 1) Write block 11983 on the disk, checksum 34FE9B72.
>> (... time passes.. maybe weeks)
>> 2) Attempt to write block 11983 on disk with checksum AE9F3581. The write fails
>> due to a power failure or something.
>> (... more time passes...)
>> 3) Read block 11983, get back data with checksum 34FE9B72. Checksum matches,
>> and there's no indication that the write in (2) ever failed. The program
>> proceeds thinking it's just read back the most recently written data, when in
>> fact it's just read an older version of that block. Problems can ensue if the
>> data just read is now out of sync with *other* blocks of data - instant data
>> corruption.
>
> Oh, doesn't normal disk support atomicity of sector write?
> I have been thought disk must support atomicity of sector write at least.

It is called a "high fly write" (a write where the disk head was flying 
too high and the data did not get written at all). There are other 
causes than high fly writes for this, of course, but the symptom is the 
same: whatever you were trying to write was not written at all, and the 
old contents are still there.

The write is still atomic: it either did happen completely or did not 
happen at all (in this case, it did not happen at all). You *can* have a 
partial write (with a well-timed power loss, for instance), but the 
disk's own ECC will detect this as a corrupted sector and return an 
error on read.

-- 
Cesar Eduardo Barros
cesarb@cesarb.net
cesar.barros@gmail.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] mm: Swap checksum
  2010-05-26 23:19                       ` Cesar Eduardo Barros
@ 2010-05-26 23:27                         ` Minchan Kim
  0 siblings, 0 replies; 25+ messages in thread
From: Minchan Kim @ 2010-05-26 23:27 UTC (permalink / raw)
  To: Cesar Eduardo Barros
  Cc: Valdis.Kletnieks, linux-mm, linux-kernel, Hugh Dickins

On Thu, May 27, 2010 at 8:19 AM, Cesar Eduardo Barros <cesarb@cesarb.net> wrote:
> Em 26-05-2010 19:45, Minchan Kim escreveu:
>>
>> On Thu, May 27, 2010 at 6:28 AM,<Valdis.Kletnieks@vt.edu>  wrote:
>>>
>>> On Thu, 27 May 2010 00:31:44 +0900, Minchan Kim said:
>>>>
>>>> On Wed, May 26, 2010 at 07:21:57AM -0300, Cesar Eduardo Barros wrote:
>>>>>
>>>>> far as I can see, does nothing against the disk simply failing to
>>>>> write and later returning stale data, since the stale checksum would
>>>>> match the stale data.
>>>>
>>>> Sorry. I can't understand your point.
>>>> Who makes stale data? If any layer makes data as stale, integrity is up
>>>> to
>>>> the layer. Maybe I am missing your point.
>>>> Could you explain more detail?
>>>
>>> I'm pretty sure that what Cesar meant was that the following could
>>> happen:
>>>
>>> 1) Write block 11983 on the disk, checksum 34FE9B72.
>>> (... time passes.. maybe weeks)
>>> 2) Attempt to write block 11983 on disk with checksum AE9F3581. The write
>>> fails
>>> due to a power failure or something.
>>> (... more time passes...)
>>> 3) Read block 11983, get back data with checksum 34FE9B72. Checksum
>>> matches,
>>> and there's no indication that the write in (2) ever failed. The program
>>> proceeds thinking it's just read back the most recently written data,
>>> when in
>>> fact it's just read an older version of that block. Problems can ensue if
>>> the
>>> data just read is now out of sync with *other* blocks of data - instant
>>> data
>>> corruption.
>>
>> Oh, doesn't normal disk support atomicity of sector write?
>> I have been thought disk must support atomicity of sector write at least.
>
> It is called a "high fly write" (a write where the disk head was flying too
> high and the data did not get written at all). There are other causes than
> high fly writes for this, of course, but the symptom is the same: whatever
> you were trying to write was not written at all, and the old contents are
> still there.

It means that disk return _success_ even though data isn't written at
all on disk?

>
> The write is still atomic: it either did happen completely or did not happen
> at all (in this case, it did not happen at all). You *can* have a partial
> write (with a well-timed power loss, for instance), but the disk's own ECC
> will detect this as a corrupted sector and return an error on read.

Yes. still disk support atomicity as I expect.
Thanks.

>
> --
> Cesar Eduardo Barros
> cesarb@cesarb.net
> cesar.barros@gmail.com
>



-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2010-05-26 23:27 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-22 18:08 [PATCH 0/3] mm: Swap checksum Cesar Eduardo Barros
2010-05-22 18:08 ` [PATCH 1/3] mm/swapfile.c: better messages for swap_info_get Cesar Eduardo Barros
2010-05-22 18:13   ` Borislav Petkov
2010-05-22 18:18     ` Cesar Eduardo Barros
2010-05-22 18:08 ` [PATCH 2/3] kernel/power/swap.c: do not use end_swap_bio_read Cesar Eduardo Barros
2010-05-22 18:08 ` [PATCH 3/3] mm: Swap checksum Cesar Eduardo Barros
2010-05-23 15:19   ` Avi Kivity
2010-05-23 18:58     ` Cesar Eduardo Barros
2010-05-24  6:41       ` Avi Kivity
2010-05-24  7:32         ` Nick Piggin
2010-05-24 10:51           ` Avi Kivity
2010-05-24 11:24         ` Cesar Eduardo Barros
2010-05-23 14:03 ` [PATCH 0/3] " Minchan Kim
2010-05-23 18:32   ` Cesar Eduardo Barros
2010-05-24  0:09     ` Minchan Kim
2010-05-24  0:57       ` Cesar Eduardo Barros
2010-05-24  2:05         ` Minchan Kim
2010-05-24 10:50           ` Cesar Eduardo Barros
2010-05-25 23:52             ` Minchan Kim
2010-05-26 10:21               ` Cesar Eduardo Barros
2010-05-26 15:31                 ` Minchan Kim
2010-05-26 21:28                   ` Valdis.Kletnieks
2010-05-26 22:45                     ` Minchan Kim
2010-05-26 23:19                       ` Cesar Eduardo Barros
2010-05-26 23:27                         ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).