Linux 2.6.27.27

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Linux 2.6.27.27
@ 2009-07-20  4:06 Greg KH
  2009-07-20  4:07 ` Greg KH
  2009-07-20 11:51 ` Krzysztof Oledzki
  0 siblings, 2 replies; 43+ messages in thread
From: Greg KH @ 2009-07-20  4:06 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, torvalds, stable; +Cc: lwn

I'm announcing the release of the 2.6.27.26 kernel.  All users of the
2.6.27 kernel series are very strongly encouraged to upgrade.

I'll also be replying to this message with a copy of the patch between
2.6.27.26 and 2.6.27.27

The updated 2.6.27.y git tree can be found at:
        git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
and can be browsed at the normal kernel.org git web browser:
        http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary

thanks,

greg k-h

------------

 Makefile                      |    7 ++-
 drivers/block/floppy.c        |    5 ++
 drivers/md/dm.c               |    4 --
 drivers/net/tulip/interrupt.c |   84 +++++++++++++++++++++++++++---------------
 drivers/net/tulip/tulip.h     |   32 +++++++++++++++-
 drivers/pci/iova.c            |   26 +++++++++++--
 include/linux/mm.h            |    2 -
 include/linux/personality.h   |    5 ++
 include/linux/security.h      |    2 +
 kernel/resource.c             |    2 -
 kernel/sysctl.c               |    2 -
 mm/Kconfig                    |   18 +++++++++
 mm/mmap.c                     |    3 +
 security/Kconfig              |   22 -----------
 security/security.c           |    3 -
 15 files changed, 145 insertions(+), 72 deletions(-)

Christoph Lameter (1):
      security: use mmap_min_addr indepedently of security models

David Woodhouse (1):
      Fix iommu address space allocation

Eugene Teo (1):
      Add '-fno-delete-null-pointer-checks' to gcc CFLAGS

Greg Kroah-Hartman (2):
      Revert "dm: sysfs skip output when device is being destroyed"
      Linux 2.6.27.27

Jiri Slaby (1):
      floppy: fix lock imbalance

Julien Tinnes (1):
      personality: fix PER_CLEAR_ON_SETID (CVE-2009-1895)

Linus Torvalds (1):
      Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x

Tomasz Lemiech (1):
      tulip: Fix for MTU problems with 802.1q tagged frames

Zhang Rui (1):
      kernel/resource.c: fix sign extension in reserve_setup()


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20  4:06 Linux 2.6.27.27 Greg KH
@ 2009-07-20  4:07 ` Greg KH
  2009-07-20 11:51 ` Krzysztof Oledzki
  1 sibling, 0 replies; 43+ messages in thread
From: Greg KH @ 2009-07-20  4:07 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, torvalds, stable; +Cc: lwn

diff --git a/Makefile b/Makefile
index 90764ee..387a5fd 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 27
-EXTRAVERSION = .26
+EXTRAVERSION = .27
 NAME = Trembling Tortoise
 
 # *DOCUMENTATION*
@@ -340,7 +340,8 @@ KBUILD_CPPFLAGS := -D__KERNEL__ $(LINUXINCLUDE)
 
 KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
 		   -fno-strict-aliasing -fno-common \
-		   -Werror-implicit-function-declaration
+		   -Werror-implicit-function-declaration \
+		   -fno-delete-null-pointer-checks
 KBUILD_AFLAGS   := -D__ASSEMBLY__
 
 # Read KERNELRELEASE from include/config/kernel.release (if it exists)
@@ -556,7 +557,7 @@ KBUILD_CFLAGS += $(call cc-option,-Wdeclaration-after-statement,)
 KBUILD_CFLAGS += $(call cc-option,-Wno-pointer-sign,)
 
 # disable invalid "can't wrap" optimzations for signed / pointers
-KBUILD_CFLAGS	+= $(call cc-option,-fwrapv)
+KBUILD_CFLAGS	+= $(call cc-option,-fno-strict-overflow)
 
 # Add user supplied CPPFLAGS, AFLAGS and CFLAGS as the last assignments
 # But warn user when we do so
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 615fcd3..5900f76 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3320,7 +3320,10 @@ static inline int set_geometry(unsigned int cmd, struct floppy_struct *g,
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
 		mutex_lock(&open_lock);
-		LOCK_FDC(drive, 1);
+		if (lock_fdc(drive, 1)) {
+			mutex_unlock(&open_lock);
+			return -EINTR;
+		}
 		floppy_type[type] = *g;
 		floppy_type[type].name = "user format";
 		for (cnt = type << 2; cnt < (type << 2) + 4; cnt++)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 925efaf..ace998c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -265,10 +265,6 @@ static int dm_blk_open(struct inode *inode, struct file *file)
 		goto out;
 	}
 
-	if (test_bit(DMF_FREEING, &md->flags) ||
-	    test_bit(DMF_DELETING, &md->flags))
-		return NULL;
-
 	dm_get(md);
 	atomic_inc(&md->open_count);
 
diff --git a/drivers/net/tulip/interrupt.c b/drivers/net/tulip/interrupt.c
index c6bad98..7faf84f 100644
--- a/drivers/net/tulip/interrupt.c
+++ b/drivers/net/tulip/interrupt.c
@@ -140,6 +140,7 @@ int tulip_poll(struct napi_struct *napi, int budget)
                /* If we own the next entry, it is a new packet. Send it up. */
                while ( ! (tp->rx_ring[entry].status & cpu_to_le32(DescOwned))) {
                        s32 status = le32_to_cpu(tp->rx_ring[entry].status);
+		       short pkt_len;
 
                        if (tp->dirty_rx + RX_RING_SIZE == tp->cur_rx)
                                break;
@@ -151,8 +152,28 @@ int tulip_poll(struct napi_struct *napi, int budget)
 		       if (++work_done >= budget)
                                goto not_done;
 
-                       if ((status & 0x38008300) != 0x0300) {
-                               if ((status & 0x38000300) != 0x0300) {
+		       /*
+			* Omit the four octet CRC from the length.
+			* (May not be considered valid until we have
+			* checked status for RxLengthOver2047 bits)
+			*/
+		       pkt_len = ((status >> 16) & 0x7ff) - 4;
+
+		       /*
+			* Maximum pkt_len is 1518 (1514 + vlan header)
+			* Anything higher than this is always invalid
+			* regardless of RxLengthOver2047 bits
+			*/
+
+		       if ((status & (RxLengthOver2047 |
+				      RxDescCRCError |
+				      RxDescCollisionSeen |
+				      RxDescRunt |
+				      RxDescDescErr |
+				      RxWholePkt)) != RxWholePkt
+			   || pkt_len > 1518) {
+			       if ((status & (RxLengthOver2047 |
+					      RxWholePkt)) != RxWholePkt) {
                                 /* Ingore earlier buffers. */
                                        if ((status & 0xffff) != 0x7fff) {
                                                if (tulip_debug > 1)
@@ -161,30 +182,23 @@ int tulip_poll(struct napi_struct *napi, int budget)
                                                               dev->name, status);
                                                tp->stats.rx_length_errors++;
                                        }
-                               } else if (status & RxDescFatalErr) {
+			       } else {
                                 /* There was a fatal error. */
                                        if (tulip_debug > 2)
                                                printk(KERN_DEBUG "%s: Receive error, Rx status %8.8x.\n",
                                                       dev->name, status);
                                        tp->stats.rx_errors++; /* end of a packet.*/
-                                       if (status & 0x0890) tp->stats.rx_length_errors++;
+				       if (pkt_len > 1518 ||
+					   (status & RxDescRunt))
+					       tp->stats.rx_length_errors++;
+
                                        if (status & 0x0004) tp->stats.rx_frame_errors++;
                                        if (status & 0x0002) tp->stats.rx_crc_errors++;
                                        if (status & 0x0001) tp->stats.rx_fifo_errors++;
                                }
                        } else {
-                               /* Omit the four octet CRC from the length. */
-                               short pkt_len = ((status >> 16) & 0x7ff) - 4;
                                struct sk_buff *skb;
 
-#ifndef final_version
-                               if (pkt_len > 1518) {
-                                       printk(KERN_WARNING "%s: Bogus packet size of %d (%#x).\n",
-                                              dev->name, pkt_len, pkt_len);
-                                       pkt_len = 1518;
-                                       tp->stats.rx_length_errors++;
-                               }
-#endif
                                /* Check if the packet is long enough to accept without copying
                                   to a minimally-sized skbuff. */
                                if (pkt_len < tulip_rx_copybreak
@@ -357,14 +371,35 @@ static int tulip_rx(struct net_device *dev)
 	/* If we own the next entry, it is a new packet. Send it up. */
 	while ( ! (tp->rx_ring[entry].status & cpu_to_le32(DescOwned))) {
 		s32 status = le32_to_cpu(tp->rx_ring[entry].status);
+		short pkt_len;
 
 		if (tulip_debug > 5)
 			printk(KERN_DEBUG "%s: In tulip_rx(), entry %d %8.8x.\n",
 				   dev->name, entry, status);
 		if (--rx_work_limit < 0)
 			break;
-		if ((status & 0x38008300) != 0x0300) {
-			if ((status & 0x38000300) != 0x0300) {
+
+		/*
+		  Omit the four octet CRC from the length.
+		  (May not be considered valid until we have
+		  checked status for RxLengthOver2047 bits)
+		*/
+		pkt_len = ((status >> 16) & 0x7ff) - 4;
+		/*
+		  Maximum pkt_len is 1518 (1514 + vlan header)
+		  Anything higher than this is always invalid
+		  regardless of RxLengthOver2047 bits
+		*/
+
+		if ((status & (RxLengthOver2047 |
+			       RxDescCRCError |
+			       RxDescCollisionSeen |
+			       RxDescRunt |
+			       RxDescDescErr |
+			       RxWholePkt))        != RxWholePkt
+		     || pkt_len > 1518) {
+			if ((status & (RxLengthOver2047 |
+			     RxWholePkt))         != RxWholePkt) {
 				/* Ingore earlier buffers. */
 				if ((status & 0xffff) != 0x7fff) {
 					if (tulip_debug > 1)
@@ -373,31 +408,22 @@ static int tulip_rx(struct net_device *dev)
 							   dev->name, status);
 					tp->stats.rx_length_errors++;
 				}
-			} else if (status & RxDescFatalErr) {
+			} else {
 				/* There was a fatal error. */
 				if (tulip_debug > 2)
 					printk(KERN_DEBUG "%s: Receive error, Rx status %8.8x.\n",
 						   dev->name, status);
 				tp->stats.rx_errors++; /* end of a packet.*/
-				if (status & 0x0890) tp->stats.rx_length_errors++;
+				if (pkt_len > 1518 ||
+				    (status & RxDescRunt))
+					tp->stats.rx_length_errors++;
 				if (status & 0x0004) tp->stats.rx_frame_errors++;
 				if (status & 0x0002) tp->stats.rx_crc_errors++;
 				if (status & 0x0001) tp->stats.rx_fifo_errors++;
 			}
 		} else {
-			/* Omit the four octet CRC from the length. */
-			short pkt_len = ((status >> 16) & 0x7ff) - 4;
 			struct sk_buff *skb;
 
-#ifndef final_version
-			if (pkt_len > 1518) {
-				printk(KERN_WARNING "%s: Bogus packet size of %d (%#x).\n",
-					   dev->name, pkt_len, pkt_len);
-				pkt_len = 1518;
-				tp->stats.rx_length_errors++;
-			}
-#endif
-
 			/* Check if the packet is long enough to accept without copying
 			   to a minimally-sized skbuff. */
 			if (pkt_len < tulip_rx_copybreak
diff --git a/drivers/net/tulip/tulip.h b/drivers/net/tulip/tulip.h
index 19abbc3..0afa2d4 100644
--- a/drivers/net/tulip/tulip.h
+++ b/drivers/net/tulip/tulip.h
@@ -201,8 +201,38 @@ enum desc_status_bits {
 	DescStartPkt = 0x20000000,
 	DescEndRing  = 0x02000000,
 	DescUseLink  = 0x01000000,
-	RxDescFatalErr = 0x008000,
+
+	/*
+	 * Error summary flag is logical or of 'CRC Error', 'Collision Seen',
+	 * 'Frame Too Long', 'Runt' and 'Descriptor Error' flags generated
+	 * within tulip chip.
+	 */
+	RxDescErrorSummary = 0x8000,
+	RxDescCRCError = 0x0002,
+	RxDescCollisionSeen = 0x0040,
+
+	/*
+	 * 'Frame Too Long' flag is set if packet length including CRC exceeds
+	 * 1518.  However, a full sized VLAN tagged frame is 1522 bytes
+	 * including CRC.
+	 *
+	 * The tulip chip does not block oversized frames, and if this flag is
+	 * set on a receive descriptor it does not indicate the frame has been
+	 * truncated.  The receive descriptor also includes the actual length.
+	 * Therefore we can safety ignore this flag and check the length
+	 * ourselves.
+	 */
+	RxDescFrameTooLong = 0x0080,
+	RxDescRunt = 0x0800,
+	RxDescDescErr = 0x4000,
 	RxWholePkt   = 0x00000300,
+	/*
+	 * Top three bits of 14 bit frame length (status bits 27-29) should
+	 * never be set as that would make frame over 2047 bytes. The Receive
+	 * Watchdog flag (bit 4) may indicate the length is over 2048 and the
+	 * length field is invalid.
+	 */
+	RxLengthOver2047 = 0x38000010
 };
 
 
diff --git a/drivers/pci/iova.c b/drivers/pci/iova.c
index 3ef4ac0..078bf8b 100644
--- a/drivers/pci/iova.c
+++ b/drivers/pci/iova.c
@@ -1,9 +1,19 @@
 /*
- * Copyright (c) 2006, Intel Corporation.
+ * Copyright © 2006-2009, Intel Corporation.
  *
- * This file is released under the GPLv2.
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
  *
- * Copyright (C) 2006-2008 Intel Corporation
  * Author: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
  */
 
@@ -123,7 +133,15 @@ move_left:
 	/* Insert the new_iova into domain rbtree by holding writer lock */
 	/* Add new node and rebalance tree. */
 	{
-		struct rb_node **entry = &((prev)), *parent = NULL;
+		struct rb_node **entry, *parent = NULL;
+
+		/* If we have 'prev', it's a valid place to start the
+		   insertion. Otherwise, start from the root. */
+		if (prev)
+			entry = &prev;
+		else
+			entry = &iovad->rbroot.rb_node;
+
 		/* Figure out where to put new node */
 		while (*entry) {
 			struct iova *this = container_of(*entry,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ae9775d..eeb7e56 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -572,12 +572,10 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
  */
 static inline unsigned long round_hint_to_min(unsigned long hint)
 {
-#ifdef CONFIG_SECURITY
 	hint &= PAGE_MASK;
 	if (((void *)hint != NULL) &&
 	    (hint < mmap_min_addr))
 		return PAGE_ALIGN(mmap_min_addr);
-#endif
 	return hint;
 }
 
diff --git a/include/linux/personality.h b/include/linux/personality.h
index a84e9ff..1261208 100644
--- a/include/linux/personality.h
+++ b/include/linux/personality.h
@@ -40,7 +40,10 @@ enum {
  * Security-relevant compatibility flags that must be
  * cleared upon setuid or setgid exec:
  */
-#define PER_CLEAR_ON_SETID (READ_IMPLIES_EXEC|ADDR_NO_RANDOMIZE)
+#define PER_CLEAR_ON_SETID (READ_IMPLIES_EXEC  | \
+			    ADDR_NO_RANDOMIZE  | \
+			    ADDR_COMPAT_LAYOUT | \
+			    MMAP_PAGE_ZERO)
 
 /*
  * Personality types.
diff --git a/include/linux/security.h b/include/linux/security.h
index 80c4d00..1638afd 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -2134,6 +2134,8 @@ static inline int security_file_mmap(struct file *file, unsigned long reqprot,
 				     unsigned long addr,
 				     unsigned long addr_only)
 {
+	if ((addr < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
+		return -EACCES;
 	return 0;
 }
 
diff --git a/kernel/resource.c b/kernel/resource.c
index 03d796c..87f675a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -741,7 +741,7 @@ static int __init reserve_setup(char *str)
 	static struct resource reserve[MAXRESERVE];
 
 	for (;;) {
-		int io_start, io_num;
+		unsigned int io_start, io_num;
 		int x = reserved;
 
 		if (get_option (&str, &io_start) != 2)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6816e6d..1228d65 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1132,7 +1132,6 @@ static struct ctl_table vm_table[] = {
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
-#ifdef CONFIG_SECURITY
 	{
 		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "mmap_min_addr",
@@ -1141,7 +1140,6 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_doulongvec_minmax,
 	},
-#endif
 #ifdef CONFIG_NUMA
 	{
 		.ctl_name	= CTL_UNNUMBERED,
diff --git a/mm/Kconfig b/mm/Kconfig
index 0bd9c2d..07b4ec4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -208,3 +208,21 @@ config VIRT_TO_BUS
 
 config MMU_NOTIFIER
 	bool
+
+config DEFAULT_MMAP_MIN_ADDR
+        int "Low address space to protect from user allocation"
+        default 4096
+        help
+	  This is the portion of low virtual memory which should be protected
+	  from userspace allocation.  Keeping a user from writing to low pages
+	  can help reduce the impact of kernel NULL pointer bugs.
+
+	  For most ia64, ppc64 and x86 users with lots of address space
+	  a value of 65536 is reasonable and should cause no problems.
+	  On arm and other archs it should not be higher than 32768.
+	  Programs which use vm86 functionality would either need additional
+	  permissions from either the LSM or the capabilities module or have
+	  this protection disabled.
+
+	  This value can be changed after boot using the
+	  /proc/sys/vm/mmap_min_addr tunable.
diff --git a/mm/mmap.c b/mm/mmap.c
index 2ae093e..d330758 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -86,6 +86,9 @@ int sysctl_overcommit_ratio = 50;	/* default is 50% */
 int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
 atomic_long_t vm_committed_space = ATOMIC_LONG_INIT(0);
 
+/* amount of vm to protect from userspace access */
+unsigned long mmap_min_addr = CONFIG_DEFAULT_MMAP_MIN_ADDR;
+
 /*
  * Check that a process has enough memory to allocate a new virtual
  * mapping. 0 means there is enough memory for the allocation to
diff --git a/security/Kconfig b/security/Kconfig
index 5592939..38411dd 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -92,28 +92,8 @@ config SECURITY_ROOTPLUG
 
 	  See <http://www.linuxjournal.com/article.php?sid=6279> for
 	  more information about this module.
-	  
-	  If you are unsure how to answer this question, answer N.
-
-config SECURITY_DEFAULT_MMAP_MIN_ADDR
-        int "Low address space to protect from user allocation"
-        depends on SECURITY
-        default 0
-        help
-	  This is the portion of low virtual memory which should be protected
-	  from userspace allocation.  Keeping a user from writing to low pages
-	  can help reduce the impact of kernel NULL pointer bugs.
-
-	  For most ia64, ppc64 and x86 users with lots of address space
-	  a value of 65536 is reasonable and should cause no problems.
-	  On arm and other archs it should not be higher than 32768.
-	  Programs which use vm86 functionality would either need additional
-	  permissions from either the LSM or the capabilities module or have
-	  this protection disabled.
-
-	  This value can be changed after boot using the
-	  /proc/sys/vm/mmap_min_addr tunable.
 
+	  If you are unsure how to answer this question, answer N.
 
 source security/selinux/Kconfig
 source security/smack/Kconfig
diff --git a/security/security.c b/security/security.c
index 3a4b4f5..27a315d 100644
--- a/security/security.c
+++ b/security/security.c
@@ -26,9 +26,6 @@ extern void security_fixup_ops(struct security_operations *ops);
 
 struct security_operations *security_ops;	/* Initialized to NULL */
 
-/* amount of vm to protect from userspace access */
-unsigned long mmap_min_addr = CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR;
-
 static inline int verify(struct security_operations *ops)
 {
 	/* verify the security_operations structure exists */

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20  4:06 Linux 2.6.27.27 Greg KH
  2009-07-20  4:07 ` Greg KH
@ 2009-07-20 11:51 ` Krzysztof Oledzki
  2009-07-20 15:10   ` Greg KH
  1 sibling, 1 reply; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-20 11:51 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, Andrew Morton, torvalds, stable, lwn

On 2009-07-20 06:06, Greg KH wrote:
> I'm announcing the release of the 2.6.27.26 kernel.  All users of the
> 2.6.27 kernel series are very strongly encouraged to upgrade.
> 
> I'll also be replying to this message with a copy of the patch between
> 2.6.27.26 and 2.6.27.27
> 
> The updated 2.6.27.y git tree can be found at:
>         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
> and can be browsed at the normal kernel.org git web browser:
>         http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary
> 
> thanks,

<CUT>
> Linus Torvalds (1):
>       Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x

I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4 hangs 
during boot and it is caused by this fix.

I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with 
additional info & .config.

Best regards,

				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20 11:51 ` Krzysztof Oledzki
@ 2009-07-20 15:10   ` Greg KH
  2009-07-20 16:01     ` Linus Torvalds
  2009-07-23 17:33     ` Krzysztof Olędzki
  0 siblings, 2 replies; 43+ messages in thread
From: Greg KH @ 2009-07-20 15:10 UTC (permalink / raw)
  To: Krzysztof Oledzki; +Cc: linux-kernel, Andrew Morton, torvalds, stable, lwn

On Mon, Jul 20, 2009 at 01:51:33PM +0200, Krzysztof Oledzki wrote:
> On 2009-07-20 06:06, Greg KH wrote:
> >I'm announcing the release of the 2.6.27.26 kernel.  All users of the
> >2.6.27 kernel series are very strongly encouraged to upgrade.
> >
> >I'll also be replying to this message with a copy of the patch between
> >2.6.27.26 and 2.6.27.27
> >
> >The updated 2.6.27.y git tree can be found at:
> >        git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
> >and can be browsed at the normal kernel.org git web browser:
> >        http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary
> >
> >thanks,
> 
> <CUT>
> >Linus Torvalds (1):
> >      Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
> 
> I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
> hangs during boot and it is caused by this fix.
> 
> I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
> additional info & .config.

Wierd.  Linus, any thoughts?  Do we need to check the version of the
compiler that we are using before turning that option off?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20 15:10   ` Greg KH
@ 2009-07-20 16:01     ` Linus Torvalds
  2009-07-20 21:45       ` Krzysztof Oledzki
  2009-07-23 17:33     ` Krzysztof Olędzki
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2009-07-20 16:01 UTC (permalink / raw)
  To: Greg KH; +Cc: Krzysztof Oledzki, linux-kernel, Andrew Morton, stable, lwn

On Mon, 20 Jul 2009, Greg KH wrote:
>
> > >Linus Torvalds (1):
> > >      Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
> > 
> > I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
> > hangs during boot and it is caused by this fix.
> > 
> > I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
> > additional info & .config.
> 
> Wierd.  Linus, any thoughts?  Do we need to check the version of the
> compiler that we are using before turning that option off?

Dang. That was what we hoped to avoid by not using -fwrapv - we didn't 
know which compilers were buggy, although it _looked_ like just 4.1.x (I 
have some dim memory of somebody suspecting a 4.2.x version too, but that 
may have been just a "we know it should only be needed in 4.3.x, so maybe 
we shouldn't use it in 4.2.x either").

But it looks like the "buggy window" for -fno-strict-overflow is 
potentially even _larger_ than it ever was with -fwrapv. And I suspect 
that in both cases the real problem is that almost nobody ever uses either 
optimization flag, so coverage is way way smaller.

For now, I suspect we need to do something like this: go back to -fwrapv 
(because it has gotten more testing), but limit it to 4.2.x and newer.

However, it would be really good to figure out _why_ it breaks. It's 
probably some specific configuration issue in addition to the particular 
kernel vesion (and maybe it even needs some particular hardware to show 
the bug - ie a specific driver that triggers it or whatever).

			Linus

---
 Makefile |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index 79957b3..47a3a1a 100644
--- a/Makefile
+++ b/Makefile
@@ -566,7 +566,7 @@ KBUILD_CFLAGS += $(call cc-option,-Wdeclaration-after-statement,)
 KBUILD_CFLAGS += $(call cc-option,-Wno-pointer-sign,)

 # disable invalid "can't wrap" optimizations for signed / pointers
-KBUILD_CFLAGS	+= $(call cc-option,-fno-strict-overflow)
+KBUILD_CFLAGS	+= $(call cc-ifversion, -ge, 0402, -fwrapv)

 # revert to pre-gcc-4.4 behaviour of .eh_frame
 KBUILD_CFLAGS	+= $(call cc-option,-fno-dwarf2-cfi-asm)

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20 16:01     ` Linus Torvalds
@ 2009-07-20 21:45       ` Krzysztof Oledzki
  2009-07-20 22:08         ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-20 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg KH, Krzysztof Oledzki, linux-kernel, Andrew Morton, stable,
	lwn

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1721 bytes --]



On Mon, 20 Jul 2009, Linus Torvalds wrote:

>
>
> On Mon, 20 Jul 2009, Greg KH wrote:
>>
>>>> Linus Torvalds (1):
>>>>      Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
>>>
>>> I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
>>> hangs during boot and it is caused by this fix.
>>>
>>> I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
>>> additional info & .config.
>>
>> Wierd.  Linus, any thoughts?  Do we need to check the version of the
>> compiler that we are using before turning that option off?
>
> Dang. That was what we hoped to avoid by not using -fwrapv - we didn't
> know which compilers were buggy, although it _looked_ like just 4.1.x (I
> have some dim memory of somebody suspecting a 4.2.x version too, but that
> may have been just a "we know it should only be needed in 4.3.x, so maybe
> we shouldn't use it in 4.2.x either").
>
> But it looks like the "buggy window" for -fno-strict-overflow is
> potentially even _larger_ than it ever was with -fwrapv. And I suspect
> that in both cases the real problem is that almost nobody ever uses either
> optimization flag, so coverage is way way smaller.
>
> For now, I suspect we need to do something like this: go back to -fwrapv
> (because it has gotten more testing), but limit it to 4.2.x and newer.
>
> However, it would be really good to figure out _why_ it breaks. It's
> probably some specific configuration issue in addition to the particular
> kernel vesion (and maybe it even needs some particular hardware to show
> the bug - ie a specific driver that triggers it or whatever).

No problem. Please let me know what should I do to help tracking this 
issue.

Best regards,

 				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20 21:45       ` Krzysztof Oledzki
@ 2009-07-20 22:08         ` Linus Torvalds
  2009-07-20 23:47           ` Marc Dionne
  2009-07-21  6:33           ` Krzysztof Oledzki
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2009-07-20 22:08 UTC (permalink / raw)
  To: Krzysztof Oledzki; +Cc: Greg KH, linux-kernel, Andrew Morton, stable, lwn



On Mon, 20 Jul 2009, Krzysztof Oledzki wrote:
> 
> No problem. Please let me know what should I do to help tracking this issue.

Can you build two kernels: one with -fwrapv, and one with 
-fno-strict-overflow, and then verify that

 - they are otherwise identical (ie exact same source code, same compiler 
   etc)

 - verify that yes, the -fwrapv kernel works, the other does not. Just to 
   avoid the confusion that obviously exists with Debian/sid binutils 
   upgrades that _also_ happens result in nonbootable kernels.

 - upload the 'vmlinux' images somewhere (I'm not sure what the limits for 
   binary attachments are at the kernel bugzilla, but that would be the 
   logical place)

In fact, it would be nice to have a third "identical" kernel build, except 
with neither -fwrapv/-fno-strict-overflow.

			Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20 22:08         ` Linus Torvalds
@ 2009-07-20 23:47           ` Marc Dionne
  2009-07-20 23:56             ` Linus Torvalds
  2009-07-21  6:33           ` Krzysztof Oledzki
  1 sibling, 1 reply; 43+ messages in thread
From: Marc Dionne @ 2009-07-20 23:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Krzysztof Oledzki, Greg KH, linux-kernel, Andrew Morton, stable,
	lwn

On 07/20/2009 06:08 PM, Linus Torvalds wrote:
>
> On Mon, 20 Jul 2009, Krzysztof Oledzki wrote:
>> No problem. Please let me know what should I do to help tracking this issue.
>
> Can you build two kernels: one with -fwrapv, and one with
> -fno-strict-overflow, and then verify that
>
>   - they are otherwise identical (ie exact same source code, same compiler
>     etc)
>
>   - verify that yes, the -fwrapv kernel works, the other does not. Just to
>     avoid the confusion that obviously exists with Debian/sid binutils
>     upgrades that _also_ happens result in nonbootable kernels.
>
>   - upload the 'vmlinux' images somewhere (I'm not sure what the limits for
>     binary attachments are at the kernel bugzilla, but that would be the
>     logical place)
>
> In fact, it would be nice to have a third "identical" kernel build, except
> with neither -fwrapv/-fno-strict-overflow.
>
> 			Linus

I might be seeing a slightly different bug, but in case it's helpful, 
the behaviour here on Fedora rawhide with gcc-4.4.0-14.x86_64 and 
binutils-2.19.51.0.11-27.fc12.x86_64 is that I get various .o files that 
come out as completely empty files (or in one case as a precisely 64K 
sized file that gives a "File format not recognized" error"), and the 
latest 2.6.31-rc git can't be built at all.

If I replace -fno-strict-overflow with -fwrapv in Makefile everything 
builds and runs fine.

Interestingly though, "-fno-strict-overflow -v" also gives me a good 
kernel, and comparing the assembly for one of the affected files doesn't 
show any difference between -fwrapv and -fno-strict-overflow.

Marc

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20 23:47           ` Marc Dionne
@ 2009-07-20 23:56             ` Linus Torvalds
  2009-07-21  0:37               ` Marc Dionne
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2009-07-20 23:56 UTC (permalink / raw)
  To: Marc Dionne
  Cc: Krzysztof Oledzki, Greg KH, linux-kernel, Andrew Morton, stable,
	lwn

On Mon, 20 Jul 2009, Marc Dionne wrote:
> 
> I might be seeing a slightly different bug, but in case it's helpful, the
> behaviour here on Fedora rawhide with gcc-4.4.0-14.x86_64 and
> binutils-2.19.51.0.11-27.fc12.x86_64 is that I get various .o files that come
> out as completely empty files (or in one case as a precisely 64K sized file
> that gives a "File format not recognized" error"), and the latest 2.6.31-rc
> git can't be built at all.

Hmm. This sounds more like the binutils bug that people had. Sounds like 
an assembler bug if the *.o file ends up being empty or at some fixed 
size. If it was cc1 that fails, I'd expect to not see an *.o file at all, 
since it didn't generate good assembly.

In fact, your behavior sounds like the thing that produces the *.o files 
core-dumped or died for other reasons, and had a 64kB buffer that either 
got flushed or not. That would explain the "zero or exactly 64kB" size.

It could be ccache too, of course. 

> If I replace -fno-strict-overflow with -fwrapv in Makefile everything builds
> and runs fine.

.. and this is just really really odd. If it was the cc1 front-end that 
does that with a bad optimization, I'd expect more visible turds. But on 
the other hand, if it's the binutils, then I don't see why -fwrapv would 
matter. Some front-end options get passed down to the assembler, but I 
would definitely not expect -fwrapv/-fno-strict-overflow to be one of 
those.

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20 23:56             ` Linus Torvalds
@ 2009-07-21  0:37               ` Marc Dionne
  2009-07-21  1:01                 ` Linus Torvalds
  2009-07-21  1:05                 ` Linus Torvalds
  0 siblings, 2 replies; 43+ messages in thread
From: Marc Dionne @ 2009-07-21  0:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Krzysztof Oledzki, Greg KH, linux-kernel, Andrew Morton, stable,
	lwn

On 07/20/2009 07:56 PM, Linus Torvalds wrote:
>
> On Mon, 20 Jul 2009, Marc Dionne wrote:
>> I might be seeing a slightly different bug, but in case it's helpful, the
>> behaviour here on Fedora rawhide with gcc-4.4.0-14.x86_64 and
>> binutils-2.19.51.0.11-27.fc12.x86_64 is that I get various .o files that come
>> out as completely empty files (or in one case as a precisely 64K sized file
>> that gives a "File format not recognized" error"), and the latest 2.6.31-rc
>> git can't be built at all.
>
> Hmm. This sounds more like the binutils bug that people had. Sounds like
> an assembler bug if the *.o file ends up being empty or at some fixed
> size. If it was cc1 that fails, I'd expect to not see an *.o file at all,
> since it didn't generate good assembly.
>
> In fact, your behavior sounds like the thing that produces the *.o files
> core-dumped or died for other reasons, and had a 64kB buffer that either
> got flushed or not. That would explain the "zero or exactly 64kB" size.
>
> It could be ccache too, of course.

Actually in my case it turns out that it is ccache after all - if I 
remove it from the picture everything is fine.  If I re-enable it, even 
with a clean cache, I get the problem.

It might just be a coincidence that it's triggered by the -fwrapv change.

Marc

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-21  0:37               ` Marc Dionne
@ 2009-07-21  1:01                 ` Linus Torvalds
  2009-07-21  6:40                   ` Krzysztof Oledzki
  2009-07-21  1:05                 ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2009-07-21  1:01 UTC (permalink / raw)
  To: Marc Dionne
  Cc: Krzysztof Oledzki, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn



On Mon, 20 Jul 2009, Marc Dionne wrote:
> > 
> > Hmm. This sounds more like the binutils bug that people had. Sounds like
> > an assembler bug if the *.o file ends up being empty or at some fixed
> > size. If it was cc1 that fails, I'd expect to not see an *.o file at all,
> > since it didn't generate good assembly.
> > 
> > In fact, your behavior sounds like the thing that produces the *.o files
> > core-dumped or died for other reasons, and had a 64kB buffer that either
> > got flushed or not. That would explain the "zero or exactly 64kB" size.
> > 
> > It could be ccache too, of course.
> 
> Actually in my case it turns out that it is ccache after all - if I remove it
> from the picture everything is fine.  If I re-enable it, even with a clean
> cache, I get the problem.
> 
> It might just be a coincidence that it's triggered by the -fwrapv change.

Ok, so this is getting ridiculous. Do we have _three_ different kernel 
issues going on at the same time, all subtly related to tools issues 
rather than the kernel source tree itself?

That's just completely bizarre.

So right now we have:

 - Krzysztof Oledzki: the only one who so far has really pinpointed it to 
   the -fwrapv change itself.

   It would be good to really double-check that this is not about ccache, 
   since Marc apparently gets a good kernel without ccache, and -fwrapv 
   seems to be involved. 

 - Marc Dionne: ccache getting confused, with 0-byte and 64kB object 
   files. But why -fwrapv vs -fno-strict-overflow would matter is totally 
   unclear. Just happenstance? Something silly like overflowing the 
   length of the ccache argument buffer?

   It would be wonderful to figure out what odd issue ccache might have. 
   The kernel command line isn't _that_ long, but it does end up being 
   something reasonably monstrous like

	gcc -Wp,-MD,kernel/.fork.s.d -nostdinc -isystem 
		/usr/lib/gcc/x86_64-redhat-linux/4.4.0/include -Iinclude 
		-I/home/torvalds/v2.6/linux/arch/x86/include -include 
		include/linux/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes 
		-Wno-trigraphs -fno-strict-aliasing -fno-common 
		-Werror-implicit-function-declaration -Wno-format-security 
		-fno-delete-null-pointer-checks -Os -m64 -march=core2 -mno-red-zone 
		-mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args 
		-DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare 
		-fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow 
		-Wframe-larger-than=2048 -fno-stack-protector -fno-omit-frame-pointer 
		-fno-optimize-sibling-calls -Wdeclaration-after-statement 
		-Wno-pointer-sign -fwrapv -fno-dwarf2-cfi-asm -D"KBUILD_STR(s)=#s" 
		-D"KBUILD_BASENAME=KBUILD_STR(fork)"  -D"KBUILD_MODNAME=KBUILD_STR(fork)"  
		-fverbose-asm -S -o kernel/fork.s kernel/fork.c

   so we are getting into the kilobyte range for it, and mayeb simply the 
   longer argument made something fail. But other build systems do even 
   worse things, I'm sure.

 - the Debian/sid binutils package failure, solved by downgrading 
   binutils.

Crazy, crazy.

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-21  1:01                 ` Linus Torvalds
@ 2009-07-21  6:40                   ` Krzysztof Oledzki
  0 siblings, 0 replies; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-21  6:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marc Dionne, Greg KH, Linux Kernel Mailing List, Andrew Morton,
	stable, lwn

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1809 bytes --]



On Mon, 20 Jul 2009, Linus Torvalds wrote:

>
>
> On Mon, 20 Jul 2009, Marc Dionne wrote:
>>>
>>> Hmm. This sounds more like the binutils bug that people had. Sounds like
>>> an assembler bug if the *.o file ends up being empty or at some fixed
>>> size. If it was cc1 that fails, I'd expect to not see an *.o file at all,
>>> since it didn't generate good assembly.
>>>
>>> In fact, your behavior sounds like the thing that produces the *.o files
>>> core-dumped or died for other reasons, and had a 64kB buffer that either
>>> got flushed or not. That would explain the "zero or exactly 64kB" size.
>>>
>>> It could be ccache too, of course.
>>
>> Actually in my case it turns out that it is ccache after all - if I remove it
>> from the picture everything is fine.  If I re-enable it, even with a clean
>> cache, I get the problem.
>>
>> It might just be a coincidence that it's triggered by the -fwrapv change.
>
> Ok, so this is getting ridiculous. Do we have _three_ different kernel
> issues going on at the same time, all subtly related to tools issues
> rather than the kernel source tree itself?
>
> That's just completely bizarre.
>
> So right now we have:
>
> - Krzysztof Oledzki: the only one who so far has really pinpointed it to
>   the -fwrapv change itself.
>
>   It would be good to really double-check that this is not about ccache,
>   since Marc apparently gets a good kernel without ccache, and -fwrapv
>   seems to be involved.

There is no ccache configured in my systems and the same problem appears 
on a different servers (both i386 and x86-64). However, the configs are 
very similar and the hardware is nearly identical.

I'm pretty sure the only different between bootable and unbootable kernel 
is that fwrapv vs strict-overflow change.

Best regards,

 				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-21  0:37               ` Marc Dionne
  2009-07-21  1:01                 ` Linus Torvalds
@ 2009-07-21  1:05                 ` Linus Torvalds
  2009-07-21  2:38                   ` Marc Dionne
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2009-07-21  1:05 UTC (permalink / raw)
  To: Marc Dionne
  Cc: Krzysztof Oledzki, Greg KH, linux-kernel, Andrew Morton, stable,
	lwn



On Mon, 20 Jul 2009, Marc Dionne wrote:
> > 
> > It could be ccache too, of course.
> 
> Actually in my case it turns out that it is ccache after all - if I remove it
> from the picture everything is fine.  If I re-enable it, even with a clean
> cache, I get the problem.
> 
> It might just be a coincidence that it's triggered by the -fwrapv change.

Btw, do you find any core-files lying around if you enable them before the 
build with

	ulimit -c unlimited

or similar?

And how did you clean ccache? There's "-c" and then there's "-C".

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-21  1:05                 ` Linus Torvalds
@ 2009-07-21  2:38                   ` Marc Dionne
  0 siblings, 0 replies; 43+ messages in thread
From: Marc Dionne @ 2009-07-21  2:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Krzysztof Oledzki, Greg KH, linux-kernel, Andrew Morton, stable,
	lwn

On 07/20/2009 09:05 PM, Linus Torvalds wrote:
>
> On Mon, 20 Jul 2009, Marc Dionne wrote:
>>> It could be ccache too, of course.
>> Actually in my case it turns out that it is ccache after all - if I remove it
>> from the picture everything is fine.  If I re-enable it, even with a clean
>> cache, I get the problem.
>>
>> It might just be a coincidence that it's triggered by the -fwrapv change.
>
> Btw, do you find any core-files lying around if you enable them before the
> build with
>
> 	ulimit -c unlimited
>
> or similar?
>
> And how did you clean ccache? There's "-c" and then there's "-C".
>
> 		Linus

Unfortunately I'm not able to reproduce anymore, after clearing 
/var/cache/ccache completely (rm -rf).  Earlier I had done ccache -C, 
which didn't help.  I didn't think to copy the contents for more analysis.

So perhaps a combination of some odd ccache state along with changing 
gcc, binutils (which I updated today) and the compile flag.  Revving 
binutils and gcc back and forth didn't reproduce it.

What is odd though is that when I straced a single gcc command line that 
produced an empty .o file, it looked normal - a series of successful 
writes with the correct amount of data to a temp file, close, unlink .o 
file, rename temp file -> .o file.  But the resulting file was empty. 
Make me wonder if there was something filesystem/caching related to it.

Marc

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20 22:08         ` Linus Torvalds
  2009-07-20 23:47           ` Marc Dionne
@ 2009-07-21  6:33           ` Krzysztof Oledzki
  2009-07-21 10:16             ` Krzysztof Oledzki
  1 sibling, 1 reply; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-21  6:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Krzysztof Oledzki, Greg KH, linux-kernel, Andrew Morton, stable,
	lwn

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1057 bytes --]



On Mon, 20 Jul 2009, Linus Torvalds wrote:

>
>
> On Mon, 20 Jul 2009, Krzysztof Oledzki wrote:
>>
>> No problem. Please let me know what should I do to help tracking this issue.
>
> Can you build two kernels: one with -fwrapv, and one with
> -fno-strict-overflow, and then verify that
>
> - they are otherwise identical (ie exact same source code, same compiler
>   etc)
>
> - verify that yes, the -fwrapv kernel works, the other does not. Just to
>   avoid the confusion that obviously exists with Debian/sid binutils
>   upgrades that _also_ happens result in nonbootable kernels.
>
> - upload the 'vmlinux' images somewhere (I'm not sure what the limits for
>   binary attachments are at the kernel bugzilla, but that would be the
>   logical place)
>
> In fact, it would be nice to have a third "identical" kernel build, except
> with neither -fwrapv/-fno-strict-overflow.

OK. Right now I'm building the three kernels you asked for 
(fwrapv/fno-strict-overflow/none). I'll test them and upload vmlinux 
images.

Best regards,

 				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-21  6:33           ` Krzysztof Oledzki
@ 2009-07-21 10:16             ` Krzysztof Oledzki
  2009-07-21 16:11               ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-21 10:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Greg KH, linux-kernel, Andrew Morton, stable, lwn

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2621 bytes --]



On Tue, 21 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Mon, 20 Jul 2009, Linus Torvalds wrote:
>
>> 
>> 
>> On Mon, 20 Jul 2009, Krzysztof Oledzki wrote:
>>> 
>>> No problem. Please let me know what should I do to help tracking this 
>>> issue.
>> 
>> Can you build two kernels: one with -fwrapv, and one with
>> -fno-strict-overflow, and then verify that
>> 
>> - they are otherwise identical (ie exact same source code, same compiler
>>   etc)
>> 
>> - verify that yes, the -fwrapv kernel works, the other does not. Just to
>>   avoid the confusion that obviously exists with Debian/sid binutils
>>   upgrades that _also_ happens result in nonbootable kernels.
>> 
>> - upload the 'vmlinux' images somewhere (I'm not sure what the limits for
>>   binary attachments are at the kernel bugzilla, but that would be the
>>   logical place)
>> 
>> In fact, it would be nice to have a third "identical" kernel build, except
>> with neither -fwrapv/-fno-strict-overflow.
>
> OK. Right now I'm building the three kernels you asked for 
> (fwrapv/fno-strict-overflow/none). I'll test them and upload vmlinux images.

OK, there are three kernels, exactly as you requested:

http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2 (Hangs)
http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2 (OK)
http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2 (OK)

# grep -A1 "can't wrap" linux-2.6.27.27-*/Makefile
linux-2.6.27.27-fno-strict-overflow/Makefile:# disable invalid "can't wrap" optimzations for signed / pointers
linux-2.6.27.27-fno-strict-overflow/Makefile-KBUILD_CFLAGS      += $(call cc-option,-fno-strict-overflow)
--
linux-2.6.27.27-fwrapv/Makefile:# disable invalid "can't wrap" optimzations for signed / pointers
linux-2.6.27.27-fwrapv/Makefile-KBUILD_CFLAGS   += $(call cc-option,-fwrapv)
--
linux-2.6.27.27-fnone/Makefile:# disable invalid "can't wrap" optimzations for signed / pointers
linux-2.6.27.27-fnone/Makefile-#KBUILD_CFLAGS   += $(call cc-option,-fno-strict-overflow)

Plwase note the third one has the KBUILD_CFLAGS commented.

Kernels are identical and are compiled from the same config, on the same 
server with gcc-4.2.4, binutils-2.19. There is no ccache installed and the 
kernels are not patched with any additonal patches - just vanilla 
linux-2.6.27.27.

Screenshot from the hanging kernel (-fno-strict-overflow):
  http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/2.6.27.27-hang.png

Dmesg from a bootable kernel:
  http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/dmesg

Best regards,

 				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-21 10:16             ` Krzysztof Oledzki
@ 2009-07-21 16:11               ` Linus Torvalds
  2009-07-21 19:15                 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2009-07-21 16:11 UTC (permalink / raw)
  To: Krzysztof Oledzki; +Cc: Greg KH, linux-kernel, Andrew Morton, stable, lwn

On Tue, 21 Jul 2009, Krzysztof Oledzki wrote:
> 
> OK, there are three kernels, exactly as you requested:
> 
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2 (Hangs)
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2 (OK)
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2 (OK)

Perfect.

And interestingly, the "fno-strict-overflow" kernel is actually much 
closer to the "fnone" kernel than to the "fwrapv" one. I have some silly 
scripts based on 'objdump -d' plus a lot of stupid sed scripting to remove 
the trivial differences due to instruction addresses, and then doing a 
'diff -u' between the munged disassembly of the kernels gives me:

  [torvalds@nehalem fno-strict-overflow]$ wc -l fnone-to-fno-strict-overflow fwrapv-to-fno-strict-overflow 
    39309 fnone-to-fno-strict-overflow
    91423 fwrapv-to-fno-strict-overflow
   130732 total

ie the diff from the kernel with no flags is less than twice the size of 
the diff from fwrapv.

Still - it's almost 40kB of diffs of disassembly, so I'm not going to 
guarantee that I can make any sense of it and find the compiler problem in 
it. But I'll try. And send you test-patches to see if I can pinpoint the 
code that causes the problem.

> Kernels are identical and are compiled from the same config, on the same
> server with gcc-4.2.4, binutils-2.19. There is no ccache installed and the
> kernels are not patched with any additonal patches - just vanilla
> linux-2.6.27.27.

Thank you.

> Screenshot from the hanging kernel (-fno-strict-overflow):
>  http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/2.6.27.27-hang.png
> 
> Dmesg from a bootable kernel:
>  http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/dmesg

Great. This is all about as perfect as could be asked for. Now it's just a 
question of trying to find the right code generation difference...

			Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-21 16:11               ` Linus Torvalds
@ 2009-07-21 19:15                 ` Linus Torvalds
  2009-07-21 21:34                   ` Troy Moure
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2009-07-21 19:15 UTC (permalink / raw)
  To: Krzysztof Oledzki; +Cc: Greg KH, linux-kernel, Andrew Morton, stable, lwn

On Tue, 21 Jul 2009, Linus Torvalds wrote:
> 
> Great. This is all about as perfect as could be asked for. Now it's just a 
> question of trying to find the right code generation difference...

Ok, that "just" is turning out to be really painful.

I've tried to do clever things, but the best I've been able to do is to 
get the relevant differences down to about 22 thousand lines of assembler 
diffs that don't match either of the working kernels. Sadly, 22KLOC of 
assembler diffs isn't something anybody can reasonably read to even start 
to make a guess about which lines are causing problems. 

So what I'd love to do is to narrow the failure down a bit, by using 
-fno-strict-overflow only on _parts_ of the tree and then try a couple of 
kernels to see if they hang, to see which part it is that mis-compiles.

With a newer kernel, we could do something like this:

	diff --git a/Makefile b/Makefile
	index 79957b3..b096be2 100644
	--- a/Makefile
	+++ b/Makefile
	@@ -565,9 +565,6 @@ KBUILD_CFLAGS += $(call cc-option,-Wdeclaration-after-statement,)
	 # disable pointer signed / unsigned warnings in gcc 4.0
	 KBUILD_CFLAGS += $(call cc-option,-Wno-pointer-sign,)

	-# disable invalid "can't wrap" optimizations for signed / pointers
	-KBUILD_CFLAGS	+= $(call cc-option,-fno-strict-overflow)
	-
	 # revert to pre-gcc-4.4 behaviour of .eh_frame
	 KBUILD_CFLAGS	+= $(call cc-option,-fno-dwarf2-cfi-asm)

	diff --git a/drivers/Makefile b/drivers/Makefile
	index bc4205d..1250b55 100644
	--- a/drivers/Makefile
	+++ b/drivers/Makefile
	@@ -5,6 +5,8 @@
	 # Rewritten to use lists instead of if-statements.
	 #

	+subdir-ccflags-y += -fno-strict-overflow
	+
	 obj-y				+= gpio/
	 obj-$(CONFIG_PCI)		+= pci/
	 obj-$(CONFIG_PARISC)		+= parisc/

to say "use -fno-strict-overflow only when compiling objects in the
drivers/ subdirectories", but I'm pretty sure that whole clever
'subdir-ccflags-y' thing was added pretty recently, and won't work in
2.6.27

However, since there is _some_ reason to wonder about whether the problem 
could be in radeonfb (because the last printouts before the hang are about 
that), it would be good to test just that part.

So if you have the time and energy, it would be very interesting if you 
could do something like this:

 - remove the "KBUILD_CFLAGS  += $(call cc-option,-fno-strict-overflow)"
   entirely from the main Makefile.

 - one directory at a time, add

	ccflags-y += -fno-strict-overflow

   to the Makefile in just that particular directory, and compile and test 
   the kernel. Now, since your old kernel doesn't have that nifty new
   "subdir-ccflags-y" thing, you can't do it for big parts of the kernel, 
   you can literally do it for just the contents of one subdirectory 
   (non-recusive!) at a time, but while there's two thousand 
   subdirectories in the Linux kernel sources, judicious sprinking of that 
   into the tree could hopefully make it possible to find.

 - the first Makefile's to test would be 'drivers/video/aty/Makefile'. If 
   that one doesn't work, some scripting might be in order, eg something 
   like

	for i in $(find drivers -name Makefile)
	do
		( echo "ccflags-y += -fno-strict-overflow" ; cat $i ) > $i.new
		mv $i.new $i
	done

   should add it to all the subdirectories under 'drivers', etc.

and if you can find the subdirectory where '-fno-strict-overflow' makes 
the difference, at that point I'd love to see the kernel image where 
things worked (ie the last kernel you booted successfully _before_ the 
kernel that failed) and the kernel that fails - now hopefully the 
differences should be much smaller (how small will obviously depend on 
whether you caught the difference in just one subdirectory or whether you 
scripted it over lots and lots of subdirectories).

Of course, the tighter you can do this, the better. If it happens to be in 
'drivers/video/aty/' for example, and you end up being really gung-ho 
about this and want to narrow it down to not just the subdirectory, but a 
few files, you could remove the per-directory "ccflags-y" line, and do a 
few per-file CFLAGS entries instead, like:

	CFLAGS_radeon_base.o += -fno-strict-overflow

etc.

And hey, if you think this is too much work, then you're right. It's a lot 
of work. So don't worry if you can't be bothered - it would be wonderful 
to try to get this thing resolved, but I do realize I'm asking a lot here.

			Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-21 19:15                 ` Linus Torvalds
@ 2009-07-21 21:34                   ` Troy Moure
  2009-07-22  0:53                     ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Troy Moure @ 2009-07-21 21:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Krzysztof Oledzki, Greg KH, linux-kernel, Andrew Morton, stable,
	lwn



On Tue, 21 Jul 2009, Linus Torvalds wrote:

> > Great. This is all about as perfect as could be asked for. Now it's just a 
> > question of trying to find the right code generation difference...
> 
> Ok, that "just" is turning out to be really painful.

I think I've found something interesting.  Look at the the code generated 
for edid_checksum() in driver/video/fbmon.c.  This is what I see for the 
-fno-strict-overflow kernel:

...
ffffffff803b37ed <edid_checksum>:
ffffffff803b37ed:       53                      push   %rbx
ffffffff803b37ee:       48 89 fb                mov    %rdi,%rbx
ffffffff803b37f1:       e8 8d fd ff ff          callq  ffffffff803b3583 <check_edid>
ffffffff803b37f6:       85 c0                   test   %eax,%eax
ffffffff803b37f8:       89 c6                   mov    %eax,%esi
ffffffff803b37fa:       74 08                   je     ffffffff803b3804 <edid_checksum+0x17>
ffffffff803b37fc:       48 89 df                mov    %rbx,%rdi
ffffffff803b37ff:       e8 c0 fe ff ff          callq  ffffffff803b36c4 <fix_edid>
ffffffff803b3804:       eb fe                   jmp    ffffffff803b3804 <edid_checksum+0x17>

ffffffff803b3806 <fb_parse_edid>:
ffffffff803b3806:       41 54                   push   %r12
ffffffff803b3808:       48 85 ff                test   %rdi,%rdi
...

That last insn in edid_checksum() doesn't look *quite* right to me...

The -fnone kernel has something a lot more sensible-looking:

ffffffff803b39dd <edid_checksum>:
ffffffff803b39dd:       53                      push   %rbx
ffffffff803b39de:       48 89 fb                mov    %rdi,%rbx
ffffffff803b39e1:       e8 8d fd ff ff          callq  ffffffff803b3773 <check_$
ffffffff803b39e6:       85 c0                   test   %eax,%eax
ffffffff803b39e8:       89 c6                   mov    %eax,%esi
ffffffff803b39ea:       74 08                   je     ffffffff803b39f4 <edid_c$
ffffffff803b39ec:       48 89 df                mov    %rbx,%rdi
ffffffff803b39ef:       e8 c0 fe ff ff          callq  ffffffff803b38b4 <fix_ed$
ffffffff803b39f4:       31 c9                   xor    %ecx,%ecx
ffffffff803b39f6:       31 f6                   xor    %esi,%esi
ffffffff803b39f8:       31 d2                   xor    %edx,%edx
ffffffff803b39fa:       eb 0a                   jmp    ffffffff803b3a06 <edid_c$
ffffffff803b39fc:       0f b6 c0                movzbl %al,%eax
ffffffff803b39ff:       8a 04 03                mov    (%rbx,%rax,1),%al
ffffffff803b3a02:       01 c1                   add    %eax,%ecx
ffffffff803b3a04:       09 c6                   or     %eax,%esi
ffffffff803b3a06:       88 d0                   mov    %dl,%al
ffffffff803b3a08:       ff c2                   inc    %edx
ffffffff803b3a0a:       81 fa 81 00 00 00       cmp    $0x81,%edx
ffffffff803b3a10:       75 ea                   jne    ffffffff803b39fc <edid_c$
ffffffff803b3a12:       84 c9                   test   %cl,%cl
ffffffff803b3a14:       0f 94 c0                sete   %al
ffffffff803b3a17:       40 84 f6                test   %sil,%sil
ffffffff803b3a1a:       5b                      pop    %rbx
ffffffff803b3a1b:       0f 95 c2                setne  %dl
ffffffff803b3a1e:       21 d0                   and    %edx,%eax
ffffffff803b3a20:       0f b6 c0                movzbl %al,%eax
ffffffff803b3a23:       c3                      retq

ffffffff803b3a24 <fb_parse_edid>:
ffffffff803b3a24:       41 54                   push   %r12
ffffffff803b3a26:       48 85 ff                test   %rdi,%rdi

Hope that helps narrow things down ;)

	Troy


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-21 21:34                   ` Troy Moure
@ 2009-07-22  0:53                     ` Linus Torvalds
  2009-07-22  1:07                       ` Linus Torvalds
                                         ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Linus Torvalds @ 2009-07-22  0:53 UTC (permalink / raw)
  To: Troy Moure
  Cc: Krzysztof Oledzki, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor


[ Added Ian Lance Taylor to the cc, he knows the background, and unlike me 
  is competent with gcc. ]

On Tue, 21 Jul 2009, Troy Moure wrote:
> 
> I think I've found something interesting.  Look at the the code generated 
> for edid_checksum() in driver/video/fbmon.c.  This is what I see for the 
> -fno-strict-overflow kernel:

Ooh.

Bingo. You're 100% right, and you definitely found it (of course, there 
may be _other_ cases like this, but that's certainly _one_ of the 
problems, and probably the only one).

Just out of curiosity, how did you find it? Now that I know where to look, 
it's very obvious in the assembler diffs, but I didn't notice it until you 
pointed it out just because there is so _much_ of the diffs...

And yes, that's very much a compiler bug. And I also bet it's very easily 
fixed.

The code in question is this loop:

	#define EDID_LENGTH 128

	unsigned char i, ...

	for (i = 0; i < EDID_LENGTH; i++) {
                csum += edid[i];
                all_null |= edid[i];
        }

and gcc -fno-strict-overflow has apparently decided that that is an 
infinite loop, even though it clearly is not. So then the stupid and buggy 
compiler will compile that loop (and the whole rest of the function) to 
the "optimized" version that is just

	loop:
		jmp loop;

I even bet I know why: it looks at "unsigned char", and sees that it is an 
8-bit variable, and then it looks at "i < EDID_LENGTH" and sees that it is 
a _signed_ comparison (it's signed because the C type rules mean that 
'unsigned char' will be extended to 'int' in an expression), and then it 
decides that in a signed comparison an 8-bit entry is always going to be 
smaller than 128.

Anyway, I bet we can work around the compiler bug by just changing the 
type of "i" from "unsigned char" to be a plain "int".

Krzysztof? Mind testing that?

Ian? This is Linux 2.6.27.27 compiled with gcc-4.2.4. I'm not seeing the 
bug in the gcc I have on my machine (gcc-4.4.0), but the bug is very clear 
(once you _find_ it, which was the problem) in the binaries that Krzysztof 
posted. They're still at:

    http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2 (Hangs)
    http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2 (OK)
    http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2 (OK)

and you can clearly see the 'edid_checksum' miscompilation in the objdump 
disassembly.

Thanks Troy,

			Linus

---
Leaving in the context for Ian:

Buggy -fno-strict-overflow compilation:

> ...
> ffffffff803b37ed <edid_checksum>:
> ffffffff803b37ed:       53                      push   %rbx
> ffffffff803b37ee:       48 89 fb                mov    %rdi,%rbx
> ffffffff803b37f1:       e8 8d fd ff ff          callq  ffffffff803b3583 <check_edid>
> ffffffff803b37f6:       85 c0                   test   %eax,%eax
> ffffffff803b37f8:       89 c6                   mov    %eax,%esi
> ffffffff803b37fa:       74 08                   je     ffffffff803b3804 <edid_checksum+0x17>
> ffffffff803b37fc:       48 89 df                mov    %rbx,%rdi
> ffffffff803b37ff:       e8 c0 fe ff ff          callq  ffffffff803b36c4 <fix_edid>
> ffffffff803b3804:       eb fe                   jmp    ffffffff803b3804 <edid_checksum+0x17>
> 
> ffffffff803b3806 <fb_parse_edid>:
> ffffffff803b3806:       41 54                   push   %r12
> ffffffff803b3808:       48 85 ff                test   %rdi,%rdi
> ...
> 
> That last insn in edid_checksum() doesn't look *quite* right to me...
> 
> The -fnone kernel has something a lot more sensible-looking:
> 
> ffffffff803b39dd <edid_checksum>:
> ffffffff803b39dd:       53                      push   %rbx
> ffffffff803b39de:       48 89 fb                mov    %rdi,%rbx
> ffffffff803b39e1:       e8 8d fd ff ff          callq  ffffffff803b3773 <check_$
> ffffffff803b39e6:       85 c0                   test   %eax,%eax
> ffffffff803b39e8:       89 c6                   mov    %eax,%esi
> ffffffff803b39ea:       74 08                   je     ffffffff803b39f4 <edid_c$
> ffffffff803b39ec:       48 89 df                mov    %rbx,%rdi
> ffffffff803b39ef:       e8 c0 fe ff ff          callq  ffffffff803b38b4 <fix_ed$
> ffffffff803b39f4:       31 c9                   xor    %ecx,%ecx
> ffffffff803b39f6:       31 f6                   xor    %esi,%esi
> ffffffff803b39f8:       31 d2                   xor    %edx,%edx
> ffffffff803b39fa:       eb 0a                   jmp    ffffffff803b3a06 <edid_c$
> ffffffff803b39fc:       0f b6 c0                movzbl %al,%eax
> ffffffff803b39ff:       8a 04 03                mov    (%rbx,%rax,1),%al
> ffffffff803b3a02:       01 c1                   add    %eax,%ecx
> ffffffff803b3a04:       09 c6                   or     %eax,%esi
> ffffffff803b3a06:       88 d0                   mov    %dl,%al
> ffffffff803b3a08:       ff c2                   inc    %edx
> ffffffff803b3a0a:       81 fa 81 00 00 00       cmp    $0x81,%edx
> ffffffff803b3a10:       75 ea                   jne    ffffffff803b39fc <edid_c$
> ffffffff803b3a12:       84 c9                   test   %cl,%cl
> ffffffff803b3a14:       0f 94 c0                sete   %al
> ffffffff803b3a17:       40 84 f6                test   %sil,%sil
> ffffffff803b3a1a:       5b                      pop    %rbx
> ffffffff803b3a1b:       0f 95 c2                setne  %dl
> ffffffff803b3a1e:       21 d0                   and    %edx,%eax
> ffffffff803b3a20:       0f b6 c0                movzbl %al,%eax
> ffffffff803b3a23:       c3                      retq
> 
> ffffffff803b3a24 <fb_parse_edid>:
> ffffffff803b3a24:       41 54                   push   %r12
> ffffffff803b3a26:       48 85 ff                test   %rdi,%rdi
> 
> Hope that helps narrow things down ;)
> 
> 	Troy
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  0:53                     ` Linus Torvalds
@ 2009-07-22  1:07                       ` Linus Torvalds
  2009-07-22  6:16                         ` Troy Moure
  2009-07-22  1:16                       ` Linus Torvalds
  2009-07-22 11:49                       ` Krzysztof Oledzki
  2 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2009-07-22  1:07 UTC (permalink / raw)
  To: Troy Moure
  Cc: Krzysztof Oledzki, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor

On Tue, 21 Jul 2009, Linus Torvalds wrote:
> 
> Just out of curiosity, how did you find it? Now that I know where to look, 
> it's very obvious in the assembler diffs, but I didn't notice it until you 
> pointed it out just because there is so _much_ of the diffs...

Ahh. I think I see how you found it. Looking at the diffs, there's only a 
few places where the number of instructions changed by a big fraction. And 
there's only _one_ place that has a factor-of-three difference (26 lines 
in the correct cases, and 7 lines in the wrong one). Clever.

There's also a case in do_page_fault() where -fno-strict-overflow 
generates a _lot_ more instructions than the other cases (but not by a 
factor of three - but it expands 63 instructions to 100). I'm not seeing 
quite _why_ it does that, but it does various stupid things like multiply 
by 0x38 etc. But it doesn't look buggy, it just looks stupid.

Or did you just brute-force it and spend a lot of time eyeballing things?

			Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  1:07                       ` Linus Torvalds
@ 2009-07-22  6:16                         ` Troy Moure
  2009-07-22 15:58                           ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Troy Moure @ 2009-07-22  6:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Troy Moure, Krzysztof Oledzki, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor

On Tue, 21 Jul 2009, Linus Torvalds wrote:

> > Just out of curiosity, how did you find it? Now that I know where to look, 
> > it's very obvious in the assembler diffs, but I didn't notice it until you 
> > pointed it out just because there is so _much_ of the diffs...
> 
> Ahh. I think I see how you found it. Looking at the diffs, there's only a 
> few places where the number of instructions changed by a big fraction. And 
> there's only _one_ place that has a factor-of-three difference (26 lines 
> in the correct cases, and 7 lines in the wrong one). Clever.

Hmm..that's interesting.  But no, I wasn't that clever.

I actually just started poking around the radeonfb code, since you 
mentioned it looked like that might be where the issue was.  The last 
message printed in the hung kernel is "Monitor 2 type no found" - printed 
from radeon_probe_screens().  And the first message after that in the 
non-hung kernel is "Console: switching to colour frame buffer device", 
which I guessed was printed from register_framebuffer() (since that calls 
notifiers).

So I started looking in radeon_fb_register() between the call to 
radeon_probe_screens() and the call to register_framebuffer(), and tracing 
through the calls it made.  I ignored pci_, sysfs_, etc. calls, thinking 
the driver code was more likely to have a device probing loop or something 
odd like that that could be miscompiled.

For any functions that had a loop or anything strange-looking, I checked 
the assembler diffs.  And after a little while (a half-hour or so, I 
think), I found edid_checksum().  Just the name made me think it was a 
likely culprit, even before I looked at the diff. 

Obviously I got a bit lucky that problem was actually basically where I 
started looking for it.  But I figured even if I didn't find it, I'd learn 
something about the radeonfb code. And who would pass up an opportunity to 
learn about that?

	Troy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  6:16                         ` Troy Moure
@ 2009-07-22 15:58                           ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2009-07-22 15:58 UTC (permalink / raw)
  To: Troy Moure
  Cc: Krzysztof Oledzki, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor



On Wed, 22 Jul 2009, Troy Moure wrote:
> 
> Obviously I got a bit lucky that problem was actually basically where I 
> started looking for it.  But I figured even if I didn't find it, I'd learn 
> something about the radeonfb code. And who would pass up an opportunity to 
> learn about that?

Who indeed? <eyeroll> There are few things more interesting than looking 
at assembler code from the frame buffer layer. </eyeroll>

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  0:53                     ` Linus Torvalds
  2009-07-22  1:07                       ` Linus Torvalds
@ 2009-07-22  1:16                       ` Linus Torvalds
  2009-07-22  8:12                         ` Krzysztof Oledzki
  2009-07-22 13:48                         ` Krzysztof Oledzki
  2009-07-22 11:49                       ` Krzysztof Oledzki
  2 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2009-07-22  1:16 UTC (permalink / raw)
  To: Troy Moure
  Cc: Krzysztof Oledzki, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor



On Tue, 21 Jul 2009, Linus Torvalds wrote:
> 
> Anyway, I bet we can work around the compiler bug by just changing the 
> type of "i" from "unsigned char" to be a plain "int".

IOW, like this.

		Linus

---
 drivers/video/fbmon.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
index 5c1a2c0..af4a15c 100644
--- a/drivers/video/fbmon.c
+++ b/drivers/video/fbmon.c
@@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
 
 static int edid_checksum(unsigned char *edid)
 {
-	unsigned char i, csum = 0, all_null = 0;
-	int err = 0, fix = check_edid(edid);
+	unsigned csum = 0, all_null = 0;
+	int i, err = 0, fix = check_edid(edid);
 
 	if (fix)
 		fix_edid(edid, fix);

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  1:16                       ` Linus Torvalds
@ 2009-07-22  8:12                         ` Krzysztof Oledzki
  2009-07-22  8:32                           ` Krzysztof Oledzki
  2009-07-22 13:48                         ` Krzysztof Oledzki
  1 sibling, 1 reply; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-22  8:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Troy Moure, Greg KH, Linux Kernel Mailing List, Andrew Morton,
	stable, lwn, Ian Lance Taylor

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1342 bytes --]



On Tue, 21 Jul 2009, Linus Torvalds wrote:

>
>
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>
>> Anyway, I bet we can work around the compiler bug by just changing the
>> type of "i" from "unsigned char" to be a plain "int".
>
> IOW, like this.
>
> 		Linus
>
> ---
> drivers/video/fbmon.c |    4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> index 5c1a2c0..af4a15c 100644
> --- a/drivers/video/fbmon.c
> +++ b/drivers/video/fbmon.c
> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>
> static int edid_checksum(unsigned char *edid)
> {
> -	unsigned char i, csum = 0, all_null = 0;
> -	int err = 0, fix = check_edid(edid);
> +	unsigned csum = 0, all_null = 0;
> +	int i, err = 0, fix = check_edid(edid);
>
> 	if (fix)
> 		fix_edid(edid, fix);

Wow! You guys rock! ;)

Indeed, this simple change is enough to make my kernel bootable. However, 
there is still something wrong. My console is now 80x30 instead of 128x48:

-Console: switching to colour frame buffer device 128x48
+Console: switching to colour frame buffer device 80x30

So, it looks like the loop may be still miscompiled.

The kernel is here:
  http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow-fixed.bz2

Best regards,


 				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  8:12                         ` Krzysztof Oledzki
@ 2009-07-22  8:32                           ` Krzysztof Oledzki
  2009-07-22  9:55                             ` Krzysztof Oledzki
                                               ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-22  8:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Troy Moure, Greg KH, Linux Kernel Mailing List, Andrew Morton,
	stable, lwn, Ian Lance Taylor

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2302 bytes --]



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>
>> 
>> 
>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>> 
>>> Anyway, I bet we can work around the compiler bug by just changing the
>>> type of "i" from "unsigned char" to be a plain "int".
>> 
>> IOW, like this.
>>
>> 		Linus
>> 
>> ---
>> drivers/video/fbmon.c |    4 ++--
>> 1 files changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
>> index 5c1a2c0..af4a15c 100644
>> --- a/drivers/video/fbmon.c
>> +++ b/drivers/video/fbmon.c
>> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>> 
>> static int edid_checksum(unsigned char *edid)
>> {
>> -	unsigned char i, csum = 0, all_null = 0;
>> -	int err = 0, fix = check_edid(edid);
>> +	unsigned csum = 0, all_null = 0;
>> +	int i, err = 0, fix = check_edid(edid);
>>
>> 	if (fix)
>> 		fix_edid(edid, fix);
>
> Wow! You guys rock! ;)
>
> Indeed, this simple change is enough to make my kernel bootable. However, 
> there is still something wrong. My console is now 80x30 instead of 128x48:
>
> -Console: switching to colour frame buffer device 128x48
> +Console: switching to colour frame buffer device 80x30
>
> So, it looks like the loop may be still miscompiled.
>
> The kernel is here:
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow-fixed.bz2

OK, by adding a simple debug printk I'm now sure that the problem is 
indeed in the loop:

--- linux-2.6.27.27-a/drivers/video/fbmon.c	2009-07-20 05:45:22.000000000 +0200
+++ linux-2.6.27.27-b/drivers/video/fbmon.c	2009-07-22 09:45:34.000000000 +0200
@@ -272,6 +272,8 @@
  		err = 1;
  	}

+	printk("edid_checksum debug: csum=%u, all_null=%u, err=%d\n", csum, all_null, err);
+
  	return err;
  }


Here is a diff between a good and a bad kernel:

-edid_checksum debug: csum=0, all_null=255, err=1
-edid_checksum debug: csum=0, all_null=255, err=1
-Console: switching to colour frame buffer device 128x48
+edid_checksum debug: csum=6400, all_null=255, err=0
+Console: switching to colour frame buffer device 80x30

In the good one the function is called twice and it returns err=1 (==OK). 
In the bad kernel it returns 0 because csum!=0x00 (==6400).

Best regards,

 				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  8:32                           ` Krzysztof Oledzki
@ 2009-07-22  9:55                             ` Krzysztof Oledzki
  2009-07-22 10:44                               ` Krzysztof Oledzki
  2009-07-22  9:58                             ` Jens Rosenboom
                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-22  9:55 UTC (permalink / raw)
  To: Krzysztof Oledzki
  Cc: Linus Torvalds, Troy Moure, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3751 bytes --]



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
>
>> 
>> 
>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>> 
>>> 
>>> 
>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>> 
>>>> Anyway, I bet we can work around the compiler bug by just changing the
>>>> type of "i" from "unsigned char" to be a plain "int".
>>> 
>>> IOW, like this.
>>>
>>> 		Linus
>>> 
>>> ---
>>> drivers/video/fbmon.c |    4 ++--
>>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
>>> index 5c1a2c0..af4a15c 100644
>>> --- a/drivers/video/fbmon.c
>>> +++ b/drivers/video/fbmon.c
>>> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>>> 
>>> static int edid_checksum(unsigned char *edid)
>>> {
>>> -	unsigned char i, csum = 0, all_null = 0;
>>> -	int err = 0, fix = check_edid(edid);
>>> +	unsigned csum = 0, all_null = 0;
>>> +	int i, err = 0, fix = check_edid(edid);
>>>
>>> 	if (fix)
>>> 		fix_edid(edid, fix);
>> 
>> Wow! You guys rock! ;)
>> 
>> Indeed, this simple change is enough to make my kernel bootable. However, 
>> there is still something wrong. My console is now 80x30 instead of 128x48:
>> 
>> -Console: switching to colour frame buffer device 128x48
>> +Console: switching to colour frame buffer device 80x30
>> 
>> So, it looks like the loop may be still miscompiled.
>> 
>> The kernel is here:
>> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow-fixed.bz2
>
> OK, by adding a simple debug printk I'm now sure that the problem is indeed 
> in the loop:
>
> --- linux-2.6.27.27-a/drivers/video/fbmon.c	2009-07-20 05:45:22.000000000 
> +0200
> +++ linux-2.6.27.27-b/drivers/video/fbmon.c	2009-07-22 09:45:34.000000000 
> +0200
> @@ -272,6 +272,8 @@
> 		err = 1;
> 	}
>
> +	printk("edid_checksum debug: csum=%u, all_null=%u, err=%d\n", csum, 
> all_null, err);
> +
> 	return err;
> }
>
>
> Here is a diff between a good and a bad kernel:
>
> -edid_checksum debug: csum=0, all_null=255, err=1
> -edid_checksum debug: csum=0, all_null=255, err=1
> -Console: switching to colour frame buffer device 128x48
> +edid_checksum debug: csum=6400, all_null=255, err=0
> +Console: switching to colour frame buffer device 80x30
>
> In the good one the function is called twice and it returns err=1 (==OK). In 
> the bad kernel it returns 0 because csum!=0x00 (==6400).

More fun.

Actually, the problem is indepenent from -fno-strict-overflow and 
also exists with -fwrapv. It is probably caused by "unsigned char i" to "int i" 
translation and the variables reordering.

With "unsigned char i" before "unsigned char csum = 0":
  edid_checksum loop: edid[0]=0, csum=0
  edid_checksum loop: edid[1]=255, csum=0
  edid_checksum loop: edid[2]=255, csum=255
-edid_checksum loop: edid[3]=255, csum=254
-edid_checksum loop: edid[4]=255, csum=253
-edid_checksum loop: edid[5]=255, csum=252
-edid_checksum loop: edid[6]=255, csum=251
-edid_checksum loop: edid[7]=0, csum=250
-edid_checksum loop: edid[8]=6, csum=250
-edid_checksum loop: edid[9]=207, csum=0
(...)
-edid_checksum debug: csum=0, all_null=255, err=1 -> OK

With "int i" after "unsigned char csum = 0":
  edid_checksum loop: edid[0]=0, csum=0
  edid_checksum loop: edid[1]=255, csum=0
  edid_checksum loop: edid[2]=255, csum=255
+edid_checksum loop: edid[3]=255, csum=510
+edid_checksum loop: edid[4]=255, csum=765
+edid_checksum loop: edid[5]=255, csum=1020
+edid_checksum loop: edid[6]=255, csum=1275
+edid_checksum loop: edid[7]=0, csum=1530
+edid_checksum loop: edid[8]=6, csum=1530
+edid_checksum loop: edid[9]=207, csum=1536
(...)
+edid_checksum debug: csum=6400, all_null=255, err=0 -> ERROR

Funny. ;)

Best regards,

 			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  9:55                             ` Krzysztof Oledzki
@ 2009-07-22 10:44                               ` Krzysztof Oledzki
  0 siblings, 0 replies; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-22 10:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Troy Moure, Greg KH, Linux Kernel Mailing List, Andrew Morton,
	stable, lwn, Ian Lance Taylor

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4483 bytes --]



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
>
>> 
>> 
>> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
>> 
>>> 
>>> 
>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>> 
>>>> 
>>>> 
>>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>>> 
>>>>> Anyway, I bet we can work around the compiler bug by just changing the
>>>>> type of "i" from "unsigned char" to be a plain "int".
>>>> 
>>>> IOW, like this.
>>>>
>>>> 		Linus
>>>> 
>>>> ---
>>>> drivers/video/fbmon.c |    4 ++--
>>>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>>> 
>>>> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
>>>> index 5c1a2c0..af4a15c 100644
>>>> --- a/drivers/video/fbmon.c
>>>> +++ b/drivers/video/fbmon.c
>>>> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>>>> 
>>>> static int edid_checksum(unsigned char *edid)
>>>> {
>>>> -	unsigned char i, csum = 0, all_null = 0;
>>>> -	int err = 0, fix = check_edid(edid);
>>>> +	unsigned csum = 0, all_null = 0;
>>>> +	int i, err = 0, fix = check_edid(edid);
>>>>
>>>> 	if (fix)
>>>> 		fix_edid(edid, fix);
>>> 
>>> Wow! You guys rock! ;)
>>> 
>>> Indeed, this simple change is enough to make my kernel bootable. However, 
>>> there is still something wrong. My console is now 80x30 instead of 128x48:
>>> 
>>> -Console: switching to colour frame buffer device 128x48
>>> +Console: switching to colour frame buffer device 80x30
>>> 
>>> So, it looks like the loop may be still miscompiled.
>>> 
>>> The kernel is here:
>>> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow-fixed.bz2
>> 
>> OK, by adding a simple debug printk I'm now sure that the problem is indeed 
>> in the loop:
>> 
>> --- linux-2.6.27.27-a/drivers/video/fbmon.c	2009-07-20 05:45:22.000000000 
>> +0200
>> +++ linux-2.6.27.27-b/drivers/video/fbmon.c	2009-07-22 09:45:34.000000000 
>> +0200
>> @@ -272,6 +272,8 @@
>> 		err = 1;
>> 	}
>> 
>> +	printk("edid_checksum debug: csum=%u, all_null=%u, err=%d\n", csum, 
>> all_null, err);
>> +
>> 	return err;
>> }
>> 
>> 
>> Here is a diff between a good and a bad kernel:
>> 
>> -edid_checksum debug: csum=0, all_null=255, err=1
>> -edid_checksum debug: csum=0, all_null=255, err=1
>> -Console: switching to colour frame buffer device 128x48
>> +edid_checksum debug: csum=6400, all_null=255, err=0
>> +Console: switching to colour frame buffer device 80x30
>> 
>> In the good one the function is called twice and it returns err=1 (==OK). 
>> In the bad kernel it returns 0 because csum!=0x00 (==6400).
>
> More fun.
>
> Actually, the problem is indepenent from -fno-strict-overflow and also exists 
> with -fwrapv. It is probably caused by "unsigned char i" to "int i" 
> translation and the variables reordering.
>
> With "unsigned char i" before "unsigned char csum = 0":
> edid_checksum loop: edid[0]=0, csum=0
> edid_checksum loop: edid[1]=255, csum=0
> edid_checksum loop: edid[2]=255, csum=255
> -edid_checksum loop: edid[3]=255, csum=254
> -edid_checksum loop: edid[4]=255, csum=253
> -edid_checksum loop: edid[5]=255, csum=252
> -edid_checksum loop: edid[6]=255, csum=251
> -edid_checksum loop: edid[7]=0, csum=250
> -edid_checksum loop: edid[8]=6, csum=250
> -edid_checksum loop: edid[9]=207, csum=0
> (...)
> -edid_checksum debug: csum=0, all_null=255, err=1 -> OK
>
> With "int i" after "unsigned char csum = 0":
> edid_checksum loop: edid[0]=0, csum=0
> edid_checksum loop: edid[1]=255, csum=0
> edid_checksum loop: edid[2]=255, csum=255
> +edid_checksum loop: edid[3]=255, csum=510
> +edid_checksum loop: edid[4]=255, csum=765
> +edid_checksum loop: edid[5]=255, csum=1020
> +edid_checksum loop: edid[6]=255, csum=1275
> +edid_checksum loop: edid[7]=0, csum=1530
> +edid_checksum loop: edid[8]=6, csum=1530
> +edid_checksum loop: edid[9]=207, csum=1536
> (...)
> +edid_checksum debug: csum=6400, all_null=255, err=0 -> ERROR
>
> Funny. ;)

OK. I'm just blind. The oryginal patch is wrong. Instead of:

-       unsigned char i, csum = 0, all_null = 0;
-       int err = 0, fix = check_edid(edid);
+       unsigned csum = 0, all_null = 0;
+       int i, err = 0, fix = check_edid(edid);


It should be:

-       unsigned char i, csum = 0, all_null = 0;
-       int err = 0, fix = check_edid(edid);
+       unsigned char csum = 0, all_null = 0;
+       int i, err = 0, fix = check_edid(edid);

The patch should not change "unsigned char (...) csum" to "unsigned (...) csum".

Best regards,

 			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  8:32                           ` Krzysztof Oledzki
  2009-07-22  9:55                             ` Krzysztof Oledzki
@ 2009-07-22  9:58                             ` Jens Rosenboom
  2009-07-22 10:27                               ` Troy Moure
  2009-07-22 10:54                               ` Krzysztof Oledzki
  2009-07-22 10:24                             ` Troy Moure
  2009-07-22 10:33                             ` Dick Streefland
  3 siblings, 2 replies; 43+ messages in thread
From: Jens Rosenboom @ 2009-07-22  9:58 UTC (permalink / raw)
  To: Krzysztof Oledzki
  Cc: Linus Torvalds, Troy Moure, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor

On Wed, 2009-07-22 at 10:32 +0200, Krzysztof Oledzki wrote:
> 
> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
> 
> >
> >
> > On Tue, 21 Jul 2009, Linus Torvalds wrote:
> >
> >> 
> >> 
> >> On Tue, 21 Jul 2009, Linus Torvalds wrote:
> >>> 
> >>> Anyway, I bet we can work around the compiler bug by just changing the
> >>> type of "i" from "unsigned char" to be a plain "int".
> >> 
> >> IOW, like this.
> >>
> >> 		Linus
> >> 
> >> ---
> >> drivers/video/fbmon.c |    4 ++--
> >> 1 files changed, 2 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> >> index 5c1a2c0..af4a15c 100644
> >> --- a/drivers/video/fbmon.c
> >> +++ b/drivers/video/fbmon.c
> >> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
> >> 
> >> static int edid_checksum(unsigned char *edid)
> >> {
> >> -	unsigned char i, csum = 0, all_null = 0;
> >> -	int err = 0, fix = check_edid(edid);
> >> +	unsigned csum = 0, all_null = 0;

I guess Linus shouldn't have deleted the "char" here...

[...]

> Here is a diff between a good and a bad kernel:
> 
> -edid_checksum debug: csum=0, all_null=255, err=1
> -edid_checksum debug: csum=0, all_null=255, err=1
> -Console: switching to colour frame buffer device 128x48
> +edid_checksum debug: csum=6400, all_null=255, err=0
> +Console: switching to colour frame buffer device 80x30
> 
> In the good one the function is called twice and it returns err=1 (==OK). 
> In the bad kernel it returns 0 because csum!=0x00 (==6400).

6400 is a multiple of 256, so as unsigned char it is ==0.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  9:58                             ` Jens Rosenboom
@ 2009-07-22 10:27                               ` Troy Moure
  2009-07-22 10:54                               ` Krzysztof Oledzki
  1 sibling, 0 replies; 43+ messages in thread
From: Troy Moure @ 2009-07-22 10:27 UTC (permalink / raw)
  To: Jens Rosenboom
  Cc: Krzysztof Oledzki, Linus Torvalds, Troy Moure, Greg KH,
	Linux Kernel Mailing List, Andrew Morton, stable, lwn,
	Ian Lance Taylor



On Wed, 22 Jul 2009, Jens Rosenboom wrote:

> > >> --- a/drivers/video/fbmon.c
> > >> +++ b/drivers/video/fbmon.c
> > >> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
> > >> 
> > >> static int edid_checksum(unsigned char *edid)
> > >> {
> > >> -	unsigned char i, csum = 0, all_null = 0;
> > >> -	int err = 0, fix = check_edid(edid);
> > >> +	unsigned csum = 0, all_null = 0;
> 
> I guess Linus shouldn't have deleted the "char" here...
> 
> [...]
> 

Doh! I missed this - I didn't look closely at his patch.

You can disregard my just-sent message, then...

	Troy


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  9:58                             ` Jens Rosenboom
  2009-07-22 10:27                               ` Troy Moure
@ 2009-07-22 10:54                               ` Krzysztof Oledzki
  1 sibling, 0 replies; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-22 10:54 UTC (permalink / raw)
  To: Jens Rosenboom
  Cc: Linus Torvalds, Troy Moure, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1217 bytes --]



On Wed, 22 Jul 2009, Jens Rosenboom wrote:

> On Wed, 2009-07-22 at 10:32 +0200, Krzysztof Oledzki wrote:
>>
>> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
>>
>>>
>>>
>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>
>>>>
>>>>
>>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>>>
>>>>> Anyway, I bet we can work around the compiler bug by just changing the
>>>>> type of "i" from "unsigned char" to be a plain "int".
>>>>
>>>> IOW, like this.
>>>>
>>>> 		Linus
>>>>
>>>> ---
>>>> drivers/video/fbmon.c |    4 ++--
>>>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
>>>> index 5c1a2c0..af4a15c 100644
>>>> --- a/drivers/video/fbmon.c
>>>> +++ b/drivers/video/fbmon.c
>>>> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>>>>
>>>> static int edid_checksum(unsigned char *edid)
>>>> {
>>>> -	unsigned char i, csum = 0, all_null = 0;
>>>> -	int err = 0, fix = check_edid(edid);
>>>> +	unsigned csum = 0, all_null = 0;
>
> I guess Linus shouldn't have deleted the "char" here...

Yep! However, it took me one more hour to find it. I should have read my 
mails more often. ;)

Thanks.

Best regards,

 			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  8:32                           ` Krzysztof Oledzki
  2009-07-22  9:55                             ` Krzysztof Oledzki
  2009-07-22  9:58                             ` Jens Rosenboom
@ 2009-07-22 10:24                             ` Troy Moure
  2009-07-22 10:33                             ` Dick Streefland
  3 siblings, 0 replies; 43+ messages in thread
From: Troy Moure @ 2009-07-22 10:24 UTC (permalink / raw)
  To: Krzysztof Oledzki
  Cc: Linus Torvalds, Troy Moure, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
> > 
> > Indeed, this simple change is enough to make my kernel bootable. However,
> > there is still something wrong. My console is now 80x30 instead of 128x48:
> > 
> > -Console: switching to colour frame buffer device 128x48
> > +Console: switching to colour frame buffer device 80x30
> > 
> > So, it looks like the loop may be still miscompiled.
> > 

Yes.  I took a look at the -fixed binary you sent out.  edid_checksum() is 
now compiled to this (I added some notes on the side):

ffffffff803b37ed: <edid_checksum>:
       53                      push   %rbx
       48 89 fb                mov    %rdi,%rbx		   %ebx = edid

	[... Calls to check_edid and fix_edid ... ]

       31 c9                   xor    %ecx,%ecx		   csum = 0
       31 f6                   xor    %esi,%esi		   all_null = 0
       31 d2                   xor    %edx,%edx		   i = 0
 L:    0f b6 04 1a             movzbl (%rdx,%rbx,1),%eax   %eax = *(i + edid)
       48 ff c2                inc    %rdx		   i++
       01 c1                   add    %eax,%ecx		   csum += %eax
       09 c6                   or     %eax,%esi		   all_null |= %eax
       48 81 fa 80 00 00 00    cmp    $0x80,%rdx
       75 ec                   jne    L			   if i != 80 goto L
       85 c9                   test   %ecx,%ecx
       0f 94 c0                sete   %al		%al == (csum == 0)
       85 f6                   test   %esi,%esi
       5b                      pop    %rbx
       0f 95 c2                setne  %dl		%dl == (all_null == 0)
       21 d0                   and    %edx,%eax
       0f b6 c0                movzbl %al,%eax		%eax == (%al && %dl)
       c3                      retq			return %eax

The problem is that csum is stored in %ecx (a 32-bit register) at all 
times and is never truncated to a byte.  In other words, the compiler is 
treating csum like it's an 'int', not an 'unsigned char'.

> Here is a diff between a good and a bad kernel:
> 
> -edid_checksum debug: csum=0, all_null=255, err=1
> -edid_checksum debug: csum=0, all_null=255, err=1
> -Console: switching to colour frame buffer device 128x48
> +edid_checksum debug: csum=6400, all_null=255, err=0
> +Console: switching to colour frame buffer device 80x30
> 
> In the good one the function is called twice and it returns err=1 (==OK). In
> the bad kernel it returns 0 because csum!=0x00 (==6400).

That makes sense - since csum is being treated like an 'int', it never 
wraps, so it just ends up holding the total sum of all the bytes, which 
apparently is 6400.  Notice that 6400 % 256 == 0, so if it *had* wrapped, 
it would have ended up being 0, as expected.

One "fix" might be to just make 'csum' an 'int' (since that's what the 
compiler seems to think anyway :p) and do the wrapping by hand (patch 
below, if you want to try this).

However, I wouldn't be surprised if other kernel functions are also being 
miscompiled.  It seems to me that any function that does arithmetic on 
'unsigned char's and depends on the wrapping behaviour could potentially 
be broken... 

Best regards,

	Troy

diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
index 5c1a2c0..6802b4c 100644
--- a/drivers/video/fbmon.c
+++ b/drivers/video/fbmon.c
@@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
 
 static int edid_checksum(unsigned char *edid)
 {
-	unsigned char i, csum = 0, all_null = 0;
-	int err = 0, fix = check_edid(edid);
+	unsigned char all_null = 0;
+	int i, csum = 0, err = 0, fix = check_edid(edid);
 
 	if (fix)
 		fix_edid(edid, fix);
@@ -267,7 +267,7 @@ static int edid_checksum(unsigned char *edid)
 		all_null |= edid[i];
 	}
 
-	if (csum == 0x00 && all_null) {
+	if ((csum & 0xff) == 0x00 && all_null) {
 		/* checksum passed, everything's good */
 		err = 1;
 	}

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  8:32                           ` Krzysztof Oledzki
                                               ` (2 preceding siblings ...)
  2009-07-22 10:24                             ` Troy Moure
@ 2009-07-22 10:33                             ` Dick Streefland
  3 siblings, 0 replies; 43+ messages in thread
From: Dick Streefland @ 2009-07-22 10:33 UTC (permalink / raw)
  To: linux-kernel

Krzysztof Oledzki <olel@ans.pl> wrote:
| Here is a diff between a good and a bad kernel:
| 
| -edid_checksum debug: csum=0, all_null=255, err=1
| -edid_checksum debug: csum=0, all_null=255, err=1
| -Console: switching to colour frame buffer device 128x48
| +edid_checksum debug: csum=6400, all_null=255, err=0
| +Console: switching to colour frame buffer device 80x30
| 
| In the good one the function is called twice and it returns err=1 (==OK). 
| In the bad kernel it returns 0 because csum!=0x00 (==6400).

Linus accidentally dropped the "char" from the declaration of "csum" and
"all_null":

> -	unsigned char i, csum = 0, all_null = 0;
> -	int err = 0, fix = check_edid(edid);
> +	unsigned csum = 0, all_null = 0;
> +	int i, err = 0, fix = check_edid(edid);

-- 
Dick Streefland                      ////                      Altium BV
dick.streefland@altium.nl           (@ @)          http://www.altium.com
--------------------------------oOO--(_)--OOo---------------------------

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  1:16                       ` Linus Torvalds
  2009-07-22  8:12                         ` Krzysztof Oledzki
@ 2009-07-22 13:48                         ` Krzysztof Oledzki
  2009-07-22 15:48                           ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-22 13:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Troy Moure, Greg KH, Linux Kernel Mailing List, Andrew Morton,
	stable, lwn, Ian Lance Taylor

[-- Attachment #1: Type: TEXT/PLAIN, Size: 984 bytes --]



On Tue, 21 Jul 2009, Linus Torvalds wrote:

>
>
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>
>> Anyway, I bet we can work around the compiler bug by just changing the
>> type of "i" from "unsigned char" to be a plain "int".
>
> IOW, like this.
>
> 		Linus
>
> ---
> drivers/video/fbmon.c |    4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> index 5c1a2c0..af4a15c 100644
> --- a/drivers/video/fbmon.c
> +++ b/drivers/video/fbmon.c
> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>
> static int edid_checksum(unsigned char *edid)
> {
> -	unsigned char i, csum = 0, all_null = 0;
> -	int err = 0, fix = check_edid(edid);
> +	unsigned csum = 0, all_null = 0;
> +	int i, err = 0, fix = check_edid(edid);
>
> 	if (fix)
> 		fix_edid(edid, fix);

Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>

On condition, that you keep "unsigned char" here. ;)

Best regards,

 			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22 13:48                         ` Krzysztof Oledzki
@ 2009-07-22 15:48                           ` Linus Torvalds
  2009-07-29 14:57                             ` Pavel Machek
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2009-07-22 15:48 UTC (permalink / raw)
  To: Krzysztof Oledzki
  Cc: Troy Moure, Greg KH, Linux Kernel Mailing List, Andrew Morton,
	stable, lwn, Ian Lance Taylor



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

> 
> 
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
> > 
> > IOW, like this.

Yeah, I'm a moron. Not at _all_ like that. I wish I hadn't sent out the 
patch, you'd have done it correctly by hand from my description.

> > diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> > index 5c1a2c0..af4a15c 100644
> > --- a/drivers/video/fbmon.c
> > +++ b/drivers/video/fbmon.c
> > @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
> > 
> > static int edid_checksum(unsigned char *edid)
> > {
> > -	unsigned char i, csum = 0, all_null = 0;
> > -	int err = 0, fix = check_edid(edid);
> > +	unsigned csum = 0, all_null = 0;
> > +	int i, err = 0, fix = check_edid(edid);

I don't know where the 'char' disappeared, but that was obviously not 
intended. I just meant to move the 'i' from one line to the other.

> > 
> > 	if (fix)
> > 		fix_edid(edid, fix);
> 
> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
> 
> On condition, that you keep "unsigned char" here. ;)

Indeed. I'll commit the fixed version. Thanks for testing and sorry for 
the idiot patch.

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22 15:48                           ` Linus Torvalds
@ 2009-07-29 14:57                             ` Pavel Machek
  2009-07-29 15:59                               ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Pavel Machek @ 2009-07-29 14:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Krzysztof Oledzki, Troy Moure, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor


> > > diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> > > index 5c1a2c0..af4a15c 100644
> > > --- a/drivers/video/fbmon.c
> > > +++ b/drivers/video/fbmon.c
> > > @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
> > > 
> > > static int edid_checksum(unsigned char *edid)
> > > {
> > > -	unsigned char i, csum = 0, all_null = 0;
> > > -	int err = 0, fix = check_edid(edid);
> > > +	unsigned csum = 0, all_null = 0;
> > > +	int i, err = 0, fix = check_edid(edid);
> 
> I don't know where the 'char' disappeared, but that was obviously not 
> intended. I just meant to move the 'i' from one line to the other.
> 
> > > 
> > > 	if (fix)
> > > 		fix_edid(edid, fix);
> > 
> > Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
> > 
> > On condition, that you keep "unsigned char" here. ;)
> 
> Indeed. I'll commit the fixed version. Thanks for testing and sorry for 
> the idiot patch.

So... we are going to just work around the gcc bug in the kernel?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-29 14:57                             ` Pavel Machek
@ 2009-07-29 15:59                               ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2009-07-29 15:59 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Krzysztof Oledzki, Troy Moure, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor

On Wed, 29 Jul 2009, Pavel Machek wrote:
> 
> So... we are going to just work around the gcc bug in the kernel?

Well, the gcc people hopefully will fix it in the 4.2.4 tree too.

Also, it's not exactly the first time we work around compiler bugs. We've 
done it before, I'm sure we'll do it again.

In this case, the work-around is trivial, and in many ways makes the code 
more "normal" (it's just a loop counter, might as well use an "int" for 
it), so there are no downsides to it. 

We could disallow gcc-4.2.4 entirely, of course, and have a big "this 
compiler is known to generate broken code" message and refuse to compile 
the kernel with it, but while that would be a "safer" approach, it would 
be rather user-unfriendly.

Compiler bugs happen. They're really really annoying, and nasty to track 
down. But they aren't the end of the world, and the pattern of this 
particular bug doesn't seem like it would likely trigger anywhere else.

For example, I suspect it really does need that loop induction variable to 
be 'unsigned char', and it really needs that limit to be exactly 128. It 
looks like a combination of loop optimization and broken range logic. I 
doubt it would hit in some random code that just happens to compare an 
unsigned char against 128 in general.

And using 'unsigned char' as a lop induction variable is _very_ rare. 
Which is probably why the gcc bug happened in the first place - no 
testing.

			Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22  0:53                     ` Linus Torvalds
  2009-07-22  1:07                       ` Linus Torvalds
  2009-07-22  1:16                       ` Linus Torvalds
@ 2009-07-22 11:49                       ` Krzysztof Oledzki
  2009-07-22 13:27                         ` Henrique de Moraes Holschuh
                                           ` (2 more replies)
  2 siblings, 3 replies; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-22 11:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Troy Moure, Greg KH, Linux Kernel Mailing List, Andrew Morton,
	stable, lwn, Ian Lance Taylor

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3312 bytes --]



On Tue, 21 Jul 2009, Linus Torvalds wrote:

>
> [ Added Ian Lance Taylor to the cc, he knows the background, and unlike me
>  is competent with gcc. ]
>
> On Tue, 21 Jul 2009, Troy Moure wrote:
>>
>> I think I've found something interesting.  Look at the the code generated
>> for edid_checksum() in driver/video/fbmon.c.  This is what I see for the
>> -fno-strict-overflow kernel:
>
> Ooh.
>
> Bingo. You're 100% right, and you definitely found it (of course, there
> may be _other_ cases like this, but that's certainly _one_ of the
> problems, and probably the only one).
>
> Just out of curiosity, how did you find it? Now that I know where to look,
> it's very obvious in the assembler diffs, but I didn't notice it until you
> pointed it out just because there is so _much_ of the diffs...
>
> And yes, that's very much a compiler bug. And I also bet it's very easily
> fixed.
>
> The code in question is this loop:
>
> 	#define EDID_LENGTH 128
>
> 	unsigned char i, ...
>
> 	for (i = 0; i < EDID_LENGTH; i++) {
>                csum += edid[i];
>                all_null |= edid[i];
>        }
>
> and gcc -fno-strict-overflow has apparently decided that that is an
> infinite loop, even though it clearly is not. So then the stupid and buggy
> compiler will compile that loop (and the whole rest of the function) to
> the "optimized" version that is just
>
> 	loop:
> 		jmp loop;
>
> I even bet I know why: it looks at "unsigned char", and sees that it is an
> 8-bit variable, and then it looks at "i < EDID_LENGTH" and sees that it is
> a _signed_ comparison (it's signed because the C type rules mean that
> 'unsigned char' will be extended to 'int' in an expression), and then it
> decides that in a signed comparison an 8-bit entry is always going to be
> smaller than 128.
>
> Anyway, I bet we can work around the compiler bug by just changing the
> type of "i" from "unsigned char" to be a plain "int".
>
> Krzysztof? Mind testing that?
>
> Ian? This is Linux 2.6.27.27 compiled with gcc-4.2.4. I'm not seeing the
> bug in the gcc I have on my machine (gcc-4.4.0), but the bug is very clear
> (once you _find_ it, which was the problem) in the binaries that Krzysztof
> posted. They're still at:
>
>    http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2 (Hangs)
>    http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2 (OK)
>    http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2 (OK)
>
> and you can clearly see the 'edid_checksum' miscompilation in the objdump
> disassembly.

BTW: here is a simple testcase for this bug:

--- fno-strict-overflow-fixed-bug.c ---
#include <stdio.h>

int main() {

         unsigned char i;

         for (i = 0; i < 128; i++)
                 printf("loop %u\n", i);

         return 0;
}
--- cut here ---

The code should be compiled with:
  cc -o fno-strict-overflow-fixed-bug -Os -fno-strict-overflow fno-strict-overflow-fixed-bug.c
or:
  cc -o fno-strict-overflow-fixed-bug -O2 -fno-strict-overflow fno-strict-overflow-fixed-bug.c

This bug does not exist with -O1 or if the loop is controlled by "i < 127" 
or "i < 129".

So, we should make sure there is no
  unsigned char i; (...) for (i = 0; i < 128; i++)
somewhere inside the kernel.

Best regards,

 				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22 11:49                       ` Krzysztof Oledzki
@ 2009-07-22 13:27                         ` Henrique de Moraes Holschuh
  2009-07-22 13:45                         ` Krzysztof Oledzki
  2009-07-22 15:36                         ` Ian Lance Taylor
  2 siblings, 0 replies; 43+ messages in thread
From: Henrique de Moraes Holschuh @ 2009-07-22 13:27 UTC (permalink / raw)
  To: Krzysztof Oledzki
  Cc: Linus Torvalds, Troy Moure, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn, Ian Lance Taylor

On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
> BTW: here is a simple testcase for this bug:
> 
> --- fno-strict-overflow-fixed-bug.c ---
> #include <stdio.h>
> 
> int main() {
> 
>         unsigned char i;
> 
>         for (i = 0; i < 128; i++)
>                 printf("loop %u\n", i);
> 
>         return 0;
> }
> --- cut here ---
> 
> The code should be compiled with:
>  cc -o fno-strict-overflow-fixed-bug -Os -fno-strict-overflow fno-strict-overflow-fixed-bug.c
> or:
>  cc -o fno-strict-overflow-fixed-bug -O2 -fno-strict-overflow fno-strict-overflow-fixed-bug.c
> 
> This bug does not exist with -O1 or if the loop is controlled by "i
> < 127" or "i < 129".

Thank you.

Debian stable,   ia32, gcc 4.2.4-6:    BUG
Debian stable,   ia32, gcc 4.3.2-1.1:  no bug (default)

Debian unstable, ia32, gcc 4.2.4-6:    BUG
Debian unstable, ia32, gcc 4.3.3-14:   no bug (default)

Someone else already tested gcc 4.4.0, as no bug.

I will file bugs against Debian gcc-4.2...

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22 11:49                       ` Krzysztof Oledzki
  2009-07-22 13:27                         ` Henrique de Moraes Holschuh
@ 2009-07-22 13:45                         ` Krzysztof Oledzki
  2009-07-22 15:36                         ` Ian Lance Taylor
  2 siblings, 0 replies; 43+ messages in thread
From: Krzysztof Oledzki @ 2009-07-22 13:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Troy Moure, Greg KH, Linux Kernel Mailing List, Andrew Morton,
	stable, lwn, Ian Lance Taylor

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3878 bytes --]



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>
>> 
>> [ Added Ian Lance Taylor to the cc, he knows the background, and unlike me
>>  is competent with gcc. ]
>> 
>> On Tue, 21 Jul 2009, Troy Moure wrote:
>>> 
>>> I think I've found something interesting.  Look at the the code generated
>>> for edid_checksum() in driver/video/fbmon.c.  This is what I see for the
>>> -fno-strict-overflow kernel:
>> 
>> Ooh.
>> 
>> Bingo. You're 100% right, and you definitely found it (of course, there
>> may be _other_ cases like this, but that's certainly _one_ of the
>> problems, and probably the only one).
>> 
>> Just out of curiosity, how did you find it? Now that I know where to look,
>> it's very obvious in the assembler diffs, but I didn't notice it until you
>> pointed it out just because there is so _much_ of the diffs...
>> 
>> And yes, that's very much a compiler bug. And I also bet it's very easily
>> fixed.
>> 
>> The code in question is this loop:
>>
>> 	#define EDID_LENGTH 128
>>
>> 	unsigned char i, ...
>>
>> 	for (i = 0; i < EDID_LENGTH; i++) {
>>                csum += edid[i];
>>                all_null |= edid[i];
>>        }
>> 
>> and gcc -fno-strict-overflow has apparently decided that that is an
>> infinite loop, even though it clearly is not. So then the stupid and buggy
>> compiler will compile that loop (and the whole rest of the function) to
>> the "optimized" version that is just
>>
>> 	loop:
>> 		jmp loop;
>> 
>> I even bet I know why: it looks at "unsigned char", and sees that it is an
>> 8-bit variable, and then it looks at "i < EDID_LENGTH" and sees that it is
>> a _signed_ comparison (it's signed because the C type rules mean that
>> 'unsigned char' will be extended to 'int' in an expression), and then it
>> decides that in a signed comparison an 8-bit entry is always going to be
>> smaller than 128.
>> 
>> Anyway, I bet we can work around the compiler bug by just changing the
>> type of "i" from "unsigned char" to be a plain "int".
>> 
>> Krzysztof? Mind testing that?
>> 
>> Ian? This is Linux 2.6.27.27 compiled with gcc-4.2.4. I'm not seeing the
>> bug in the gcc I have on my machine (gcc-4.4.0), but the bug is very clear
>> (once you _find_ it, which was the problem) in the binaries that Krzysztof
>> posted. They're still at:
>>
>>    http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2 
>> (Hangs)
>>    http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2 
>> (OK)
>>    http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2 
>> (OK)
>> 
>> and you can clearly see the 'edid_checksum' miscompilation in the objdump
>> disassembly.
>
> BTW: here is a simple testcase for this bug:
>
> --- fno-strict-overflow-fixed-bug.c ---
> #include <stdio.h>
>
> int main() {
>
>        unsigned char i;
>
>        for (i = 0; i < 128; i++)
>                printf("loop %u\n", i);
>
>        return 0;
> }
> --- cut here ---

Or this better one (no infinite loop):

--- cut here ---
#include <stdio.h>

int main() {

         unsigned char i, j=0;

         for (i = 0; i <= 127; i++) {

                 if (!i && j++) {
                         printf("Buggy GCC\n");
                         return 1;
                 }
         }

         printf("GCC is OK\n");

         return 0;
}
--- cut here ---

> The code should be compiled with:
> cc -o fno-strict-overflow-fixed-bug -Os -fno-strict-overflow 
> fno-strict-overflow-fixed-bug.c
> or:
> cc -o fno-strict-overflow-fixed-bug -O2 -fno-strict-overflow 
> fno-strict-overflow-fixed-bug.c
>
> This bug does not exist with -O1 or if the loop is controlled by "i < 127" or 
> "i < 129".
>
> So, we should make sure there is no
> unsigned char i; (...) for (i = 0; i < 128; i++)
> somewhere inside the kernel.


Best regards,

 				Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-22 11:49                       ` Krzysztof Oledzki
  2009-07-22 13:27                         ` Henrique de Moraes Holschuh
  2009-07-22 13:45                         ` Krzysztof Oledzki
@ 2009-07-22 15:36                         ` Ian Lance Taylor
  2 siblings, 0 replies; 43+ messages in thread
From: Ian Lance Taylor @ 2009-07-22 15:36 UTC (permalink / raw)
  To: Krzysztof Oledzki
  Cc: Linus Torvalds, Troy Moure, Greg KH, Linux Kernel Mailing List,
	Andrew Morton, stable, lwn

Krzysztof Oledzki <olel@ans.pl> writes:

> BTW: here is a simple testcase for this bug:
>
> --- fno-strict-overflow-fixed-bug.c ---
> #include <stdio.h>
>
> int main() {
>
>         unsigned char i;
>
>         for (i = 0; i < 128; i++)
>                 printf("loop %u\n", i);
>
>         return 0;
> }
> --- cut here ---

I can confirm that this is broken in gcc 4.2 using -fno-strict-overflow,
and works correctly in gcc 4.3 and later.  Sigh.

Ian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-20 15:10   ` Greg KH
  2009-07-20 16:01     ` Linus Torvalds
@ 2009-07-23 17:33     ` Krzysztof Olędzki
  2009-07-24 21:13       ` Greg KH
  1 sibling, 1 reply; 43+ messages in thread
From: Krzysztof Olędzki @ 2009-07-23 17:33 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, Andrew Morton, torvalds, stable, lwn

On 2009-07-20 17:10, Greg KH wrote:
> On Mon, Jul 20, 2009 at 01:51:33PM +0200, Krzysztof Oledzki wrote:
>> On 2009-07-20 06:06, Greg KH wrote:
>>> I'm announcing the release of the 2.6.27.26 kernel.  All users of the
>>> 2.6.27 kernel series are very strongly encouraged to upgrade.
>>>
>>> I'll also be replying to this message with a copy of the patch between
>>> 2.6.27.26 and 2.6.27.27
>>>
>>> The updated 2.6.27.y git tree can be found at:
>>>        git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
>>> and can be browsed at the normal kernel.org git web browser:
>>>        http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary
>>>
>>> thanks,
>> <CUT>
>>> Linus Torvalds (1):
>>>      Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
>> I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
>> hangs during boot and it is caused by this fix.
>>
>> I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
>> additional info & .config.
> 
> Wierd.  Linus, any thoughts?  Do we need to check the version of the
> compiler that we are using before turning that option off?

This bug is now fixed by 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3730793d457fed79a6d49bae72996d458c8e4f2d

Could you please include this patch in the next -stable releases?

BTW: the problem exists in gcc-4.2.4 not gcc-2.4.2 so I think it is 
worth to change the subject when adding the patch. ;)

Best regards,

			Krzysztof Olędzki

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Linux 2.6.27.27
  2009-07-23 17:33     ` Krzysztof Olędzki
@ 2009-07-24 21:13       ` Greg KH
  0 siblings, 0 replies; 43+ messages in thread
From: Greg KH @ 2009-07-24 21:13 UTC (permalink / raw)
  To: Krzysztof Olędzki; +Cc: linux-kernel, Andrew Morton, torvalds, stable, lwn

On Thu, Jul 23, 2009 at 07:33:23PM +0200, Krzysztof Olędzki wrote:
> On 2009-07-20 17:10, Greg KH wrote:
> > On Mon, Jul 20, 2009 at 01:51:33PM +0200, Krzysztof Oledzki wrote:
> >> On 2009-07-20 06:06, Greg KH wrote:
> >>> I'm announcing the release of the 2.6.27.26 kernel.  All users of the
> >>> 2.6.27 kernel series are very strongly encouraged to upgrade.
> >>>
> >>> I'll also be replying to this message with a copy of the patch between
> >>> 2.6.27.26 and 2.6.27.27
> >>>
> >>> The updated 2.6.27.y git tree can be found at:
> >>>        git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
> >>> and can be browsed at the normal kernel.org git web browser:
> >>>        http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary
> >>>
> >>> thanks,
> >> <CUT>
> >>> Linus Torvalds (1):
> >>>      Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
> >> I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
> >> hangs during boot and it is caused by this fix.
> >>
> >> I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
> >> additional info & .config.
> > 
> > Wierd.  Linus, any thoughts?  Do we need to check the version of the
> > compiler that we are using before turning that option off?
> 
> This bug is now fixed by 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3730793d457fed79a6d49bae72996d458c8e4f2d
> 
> Could you please include this patch in the next -stable releases?

Yes, I have it queued up to go out today :)

> BTW: the problem exists in gcc-4.2.4 not gcc-2.4.2 so I think it is 
> worth to change the subject when adding the patch. ;)

Hm, I'll see about that.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2009-07-29 16:01 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-07-20  4:06 Linux 2.6.27.27 Greg KH
2009-07-20  4:07 ` Greg KH
2009-07-20 11:51 ` Krzysztof Oledzki
2009-07-20 15:10   ` Greg KH
2009-07-20 16:01     ` Linus Torvalds
2009-07-20 21:45       ` Krzysztof Oledzki
2009-07-20 22:08         ` Linus Torvalds
2009-07-20 23:47           ` Marc Dionne
2009-07-20 23:56             ` Linus Torvalds
2009-07-21  0:37               ` Marc Dionne
2009-07-21  1:01                 ` Linus Torvalds
2009-07-21  6:40                   ` Krzysztof Oledzki
2009-07-21  1:05                 ` Linus Torvalds
2009-07-21  2:38                   ` Marc Dionne
2009-07-21  6:33           ` Krzysztof Oledzki
2009-07-21 10:16             ` Krzysztof Oledzki
2009-07-21 16:11               ` Linus Torvalds
2009-07-21 19:15                 ` Linus Torvalds
2009-07-21 21:34                   ` Troy Moure
2009-07-22  0:53                     ` Linus Torvalds
2009-07-22  1:07                       ` Linus Torvalds
2009-07-22  6:16                         ` Troy Moure
2009-07-22 15:58                           ` Linus Torvalds
2009-07-22  1:16                       ` Linus Torvalds
2009-07-22  8:12                         ` Krzysztof Oledzki
2009-07-22  8:32                           ` Krzysztof Oledzki
2009-07-22  9:55                             ` Krzysztof Oledzki
2009-07-22 10:44                               ` Krzysztof Oledzki
2009-07-22  9:58                             ` Jens Rosenboom
2009-07-22 10:27                               ` Troy Moure
2009-07-22 10:54                               ` Krzysztof Oledzki
2009-07-22 10:24                             ` Troy Moure
2009-07-22 10:33                             ` Dick Streefland
2009-07-22 13:48                         ` Krzysztof Oledzki
2009-07-22 15:48                           ` Linus Torvalds
2009-07-29 14:57                             ` Pavel Machek
2009-07-29 15:59                               ` Linus Torvalds
2009-07-22 11:49                       ` Krzysztof Oledzki
2009-07-22 13:27                         ` Henrique de Moraes Holschuh
2009-07-22 13:45                         ` Krzysztof Oledzki
2009-07-22 15:36                         ` Ian Lance Taylor
2009-07-23 17:33     ` Krzysztof Olędzki
2009-07-24 21:13       ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox