From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from outgoing.mit.edu (outgoing-auth-1.mit.edu [18.9.28.11])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id D392E1C84BC
	for <tech-board-discuss@lists.linux.dev>; Tue, 10 Mar 2026 04:56:10 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=18.9.28.11
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1773118572; cv=none; b=bOWCK1DWgc5u+r0AXdMhfsp4wW/8NArimAFtufyLAjJWduHBkq8dj2ueqr3n9mnLSvEMCCPjCUwjUe6B4Z3ZXxVuNZGIWBYl6OEMedVB2RGm4U0yDow9njzIVc6wOsu869R0TeE6f6C7N9Xx3c2zldfwnaQx0D4U1ail5SVkFB4=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1773118572; c=relaxed/simple;
	bh=NOLMb5KDqJzp0YJkoWtkOIjaVQZoGiTOJv74IJ6yKyA=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=kKIy/yq6AgnF5B2LBoHYkmxD/UibkTi7+OW4WYERSjEGfqsGMITkznx8F3bGjR3xCmZrlNdkXD7/BcwAYxuo1w0hsoA8UH9G9dtMNubFaa0TgzyrdUO0/Wsio64rRRnsGsK1wBSNC3cmeoRPRMKx3A+OMpQes6A8z9/Im6LFmTU=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=mit.edu; spf=pass smtp.mailfrom=mit.edu; dkim=pass (2048-bit key) header.d=mit.edu header.i=@mit.edu header.b=PjurwgRA; arc=none smtp.client-ip=18.9.28.11
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=mit.edu
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=mit.edu
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=mit.edu header.i=@mit.edu header.b="PjurwgRA"
Received: from macsyma.thunk.org (pool-173-48-117-133.bstnma.fios.verizon.net [173.48.117.133])
	(authenticated bits=0)
        (User authenticated as tytso@ATHENA.MIT.EDU)
	by outgoing.mit.edu (8.14.7/8.12.4) with ESMTP id 62A4qANa015611
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
	Tue, 10 Mar 2026 00:52:11 -0400
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mit.edu; s=outgoing;
	t=1773118333; bh=EbcXuVVBuLkoAQ0/37a2sDQMaqYseEX5l0GJJNAHxkA=;
	h=Date:From:Subject:Message-ID:MIME-Version:Content-Type;
	b=PjurwgRAHmC0tHBL1nFWA2av/VUIW7yhR7k2CTVgMmzfOy/P6eCdB0iQRSZ7jrmS3
	 uuT3nOupDAbTgsiz1GfTWfQnN5ieeJcAkx4Co9e+7PGN9rWIozb7jD0PkujotC9qE3
	 ChVKG0F0TP/CLCpAlWWj7id8nMGNY1XAlRjN9r09HO519ppuF+0HYjgKiYdPSWpgZf
	 8TrV4xe75EAjPNJVyhVsbPYl3V4g8HyMfZPksJy6t1bJmziVE2ShKwGSr6cmqD7rjn
	 4UCRzYKq6byUTRyww1ENQFRguWCPNTXDykrkJVQURyGENCnZ1OgwhFIJ8aF7UyjHyt
	 wMOz5MsigCtBg==
Received: by macsyma.thunk.org (Postfix, from userid 15806)
	id A30085C3E567; Tue, 10 Mar 2026 00:52:10 -0400 (EDT)
Date: Tue, 10 Mar 2026 00:52:10 -0400
From: "Theodore Tso" <tytso@mit.edu>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jonathan Corbet <corbet@lwn.net>, Steven Rostedt <rostedt@goodmis.org>,
        Christian Brauner <brauner@kernel.org>,
        tech-board-discuss@lists.linux.dev, linux-kernel@vger.kernel.org,
        ksummit-discuss@lists.linuxfoundation.org,
        christianvanbrauner@gmail.com
Subject: Re: LLM based rewrites
Message-ID: <20260310045210.GA14867@macsyma-wired.lan>
References: <20260307-clean-room-6118793eb175@brauner>
 <20260309095705.7a6b6177@gandalf.local.home>
 <EBF43D48-DA7B-4449-85CF-36351BE07A56@zytor.com>
 <20260309121629.21cabc25@gandalf.local.home>
 <871phtvu7r.fsf@trenco.lwn.net>
 <04B897EF-DEEC-42D0-8E00-888CEEA5318E@zytor.com>
Precedence: bulk
X-Mailing-List: tech-board-discuss@lists.linux.dev
List-Id: <tech-board-discuss.lists.linux.dev>
List-Subscribe: <mailto:tech-board-discuss+subscribe@lists.linux.dev>
List-Unsubscribe: <mailto:tech-board-discuss+unsubscribe@lists.linux.dev>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <04B897EF-DEEC-42D0-8E00-888CEEA5318E@zytor.com>

> >The fact that every version of chardet was surely in its training data
> >is not deemed to be relevant.
>
> That's a question for the lawyers and the courts, really. But it is
> most definitely *not* clean room. That being said, clean room is
> certainly not the only way to rewrite software that can pass legal
> muster, but it is the gold standard

Well, given that researchers were able to elicit 96% of Harry Potter
and the Sorcerer's Stone from Claude 3.7 Sonnet[1], the question I
have is this.  Suppose you have one LLM instance create a
specification by looking at the code that you are trying to clone,
and then you feed that specification to a second LLM instance that
was trained on the code you are trying to clone.  Regardless of
whether this can be considered "clean room" from a process
perspective, the other question is whether there is enough
similarity in the actual *results* that it could also be a problem.

[1] https://arxiv.org/html/2601.02671v1

Of course, we could imagine using the LLM to incrementally rewrite the
C code that was elicited from the specification if the results are too
close to the source program --- that is, "Hey ChatGPT, please file
off the serial number so the source code looks nothing like the GPL
code that I'm trying to rip off."

The thing is, though, this is something that humans could do as well.
It wouldn't surprise me if there are cases of "clean room
implementation" where there might be some incremental rewriting; and
proving that it wasn't a strict clean room procedure might be quite
difficult.  It's just that with AI, it might be easier to do things at
scale.

						- Ted