Date: Tue, 10 Mar 2026 00:52:10 -0400
From: "Theodore Tso"
To: "H. Peter Anvin"
Cc: Jonathan Corbet, Steven Rostedt, Christian Brauner,
    tech-board-discuss@lists.linux.dev, linux-kernel@vger.kernel.org,
    ksummit-discuss@lists.linuxfoundation.org,
    christianvanbrauner@gmail.com
Subject: Re: LLM based rewrites
Message-ID: <20260310045210.GA14867@macsyma-wired.lan>
References: <20260307-clean-room-6118793eb175@brauner>
    <20260309095705.7a6b6177@gandalf.local.home>
    <20260309121629.21cabc25@gandalf.local.home>
    <871phtvu7r.fsf@trenco.lwn.net>
    <04B897EF-DEEC-42D0-8E00-888CEEA5318E@zytor.com>
In-Reply-To: <04B897EF-DEEC-42D0-8E00-888CEEA5318E@zytor.com>
X-Mailing-List: tech-board-discuss@lists.linux.dev

> >The fact that every version of chardet was surely in its training data
> >is not deemed to be relevant.
>
> That's a question for the lawyers and the courts, really. But it is
> most definitely *not* clean room.
> That being said, clean room is certainly not the only way to rewrite
> software that can pass legal muster, but it is the gold standard

Well, given that researchers were able to elicit 96% of Harry Potter
and the Sorcerer's Stone from Claude 3.7 Sonnet[1], the question I
have is this.  Suppose you have one LLM instance create a
specification from looking at the code that you are trying to clone,
and then you feed that specification to a second LLM instance that was
trained on the code you are trying to clone.  Regardless of whether
this can be considered "clean room" from a process perspective, the
other question is whether there is enough similarity in the actual
*results* that it could also be a problem.

[1] https://arxiv.org/html/2601.02671v1

Of course, we could imagine using the LLM to incrementally rewrite the
C code that was elicited from the specification if the results are too
close to the source program --- that is, "Hey ChatGPT, please file off
the serial number so the source code looks nothing like the GPL code
that I'm trying to rip off."

The thing is, though, this is something that humans could do as well.
It wouldn't surprise me if there are cases of "clean room
implementation" where there might be some incremental rewriting, and
proving that it wasn't a strict clean room procedure might be quite
difficult.  It's just that with AI, it might be easier to do things at
scale.

					- Ted