From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4423AC433EF for ; Thu, 14 Apr 2022 09:38:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241871AbiDNJlV (ORCPT ); Thu, 14 Apr 2022 05:41:21 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37448 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S241861AbiDNJlU (ORCPT ); Thu, 14 Apr 2022 05:41:20 -0400 Received: from mail-ed1-x52e.google.com (mail-ed1-x52e.google.com [IPv6:2a00:1450:4864:20::52e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1795470CFE for ; Thu, 14 Apr 2022 02:38:55 -0700 (PDT) Received: by mail-ed1-x52e.google.com with SMTP id g20so5604687edw.6 for ; Thu, 14 Apr 2022 02:38:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=klerks.biz; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=rNr7pmafAGNmbta21cWSnW/PtsEFtz1ZaEXCuuL6N20=; b=Sf4Xw/cVJX2ipAoABre1Z/R54salNa9gK+IAQsMzoxf5N3fa+CTzatIzAuuf9OAi39 x1lZYOGcVrPV6wW/TylRAQ4vmltRTIeZTG1cDQ804QPKBL9uppKaeuNfInqPCpOmXFCL EyMCkr+6DB96c7yLyINuy7CsGiciWM0byDDgU= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=rNr7pmafAGNmbta21cWSnW/PtsEFtz1ZaEXCuuL6N20=; b=u1v7h1kGvMNq1V8+hS2f6qHnJkqXr5XHZ2XhVlzCun2D2W8zkc0I5HUVVh8M1ypXjI 4uKBCC+E+EDlIUAcMuzeiFodDqJwdMTTqskGb+W01flv/DKp8I3FeOQsXkp1bU39y2hP TcCKJdOtn0s81jGZPpIfa2aEz/vET+SvpU8UV1B4ZLkyWQnHaV3CQvUpOfReR9N1Xgyf dDrC+kC5w6Kq7v55/34nQYMOaua+yrti6fkrWBv5/eJaY9WSdTSzVADGhfivUe1T+EHf cSoSaVae+GP3gK2eVWNMqNiYGobVtKm/dPLUIvQygYr33eKBsUq5h7AgLMJj8AvSbErP 45TQ== X-Gm-Message-State: AOAM5318BKQXjA3Dt2iyv8+0C0RD3caZHK76oJEVhX//qPSa9sKr9qjM 4kh69S4ZCQNhdCUOnvT5BuGnmTj01qocTV3CM+3fRVebK9EypGQE X-Google-Smtp-Source: ABdhPJyF7qH5z3aC2NRGVTpKVlybWz+GH61c1p/Exb1HN3JKxTbcUp3rhnmpK6R6eJYVcnO1UA+jqhqHwTIEKu5hMRo= X-Received: by 2002:aa7:cad3:0:b0:410:b188:a49a with SMTP id l19-20020aa7cad3000000b00410b188a49amr1993743edt.416.1649929133567; Thu, 14 Apr 2022 02:38:53 -0700 (PDT) MIME-Version: 1.0 References: <220413.86zgkpf5g7.gmgdl@evledraar.gmail.com> <220413.86sfqgerf7.gmgdl@evledraar.gmail.com> In-Reply-To: <220413.86sfqgerf7.gmgdl@evledraar.gmail.com> From: Tao Klerks Date: Thu, 14 Apr 2022 11:38:42 +0200 Message-ID: Subject: Re: [PATCH v2] [RFC] git-p4: improve encoding handling to support inconsistent encodings To: =?UTF-8?B?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Cc: Tao Klerks via GitGitGadget , git@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Wed, Apr 13, 2022 at 9:04 PM =C3=86var Arnfj=C3=B6r=C3=B0 Bjarmason wrote: > > > AFAICT we only allow selecting between encodings, not "no idea what this > is, but here's some raw sequence of bytes", except by omitting the > header. > I don't understand what you mean here. My reading of the docs is that omitting the header *does not* imply "no idea what this is" - it implies "this is utf-8". Git will do the best it can (convert to utf-8 at output time if the bytes are parseable in the specified or implied encoding, and return the raw bytes otherwise), *regardless* of whether an encoding is explicitly specified or utf-8 is implied. > It seems to me that between legacy/strict/fallback there's a 4th setting > missing here. I.e. a "try-encoding". One where if your data is valid > utf8 (or cp1252 if we want to get fancy and combine it with "fallback") > you get an encoding header, but having tried that we'll just write > whatever raw data we found, but *not* munge it. For utf-8 specifically, the "legacy" strategy achieves this effect with less overhead: It copies the bytes over raw, and in git the implied encoding is utf-8. So if the bytes were utf-8 in the first place, then they're good (we didn't need to munge them in any way), and if they weren't utf-8 we copied them over anyway, which is the "tried and failed" behavior you propose also. For cp1252 or another encoding, the behavior you propose is an enhancement/modification of the current "fallback" behavior, I guess: Currently if the fallback encoding doesn't wash, any remaining "bad bytes" get replaced out of existence. This leads to data loss of those individual bytes, and it also means that if the data really wasn't cp1252 in the first place, you might just have garbled up the data something awful. Of course, you might have done that without any errors anyway - cp1252 only leaves 4 unmapped bytes, so most arbitrary text data will be successfully "interpreted", no matter how erroneously. The advantage of the current "replace" behavior is that you *always* end up with valid utf-8 data in your commit text. The disadvantage is that you can suffer (minor) unrecoverable data loss. I think your proposal is that (optionally?) the fallback-or-raw behavior would simply spit out / leave the original bytes in this situation, not making any attempt to interpret them or convert them to utf-8, but simply bring them to git as-is. This would make that particular commit message "invalidly encoded", in the sense that any git client or git-caller would need to "do what it can" with that sequence of bytes. This tradeoff between avoidance of data loss, and type/encoding coherence, is not one I'm comfortable deciding on. I would ideally prefer a third route, where the data *is* interpreted and converted, but in a fully reversible way. What would you think of a scheme where, *if* the fallback encoding fails to decode successfully, we simply take all x80+ bytes and escape them to a form like "\x8c", so a commit message might end up like "solve the problem with the japanese character: \8f\c4."? Would this way of preserving the bytes, without breaking out of having a known (utf-8) encoding in the git commits, make sense to you? We could even add a suffix to the message like "[original data contained bytes that could not be mapped in targeted encoding cp1252; bytes at or over x80 were escaped as \xNN, and backslashes were escaped as \\ to avoid ambiguity]", or something less horrendously verbose :) > > I haven't worked with p4, but having done some legacy SCM imports it was > nice to be able to map data 1=3D1 to git in those "odd encoding" cases, o= r > even cases where there was raw binary in a commit message or whatever... > My problem here, again, is that these "badly encoded commit messages" endure forever in your commit history: any tool that wants to parse your history will have to deal with them, skirt around them, etc. That just seems like bad discipline. If the intent is to ensure that you *can* reconstruct those bytes if/when you need to, we should find a way to store them safely, in a way that won't randomly trip others up. > Anyway, all of this is fine with me as-is, I just had a drive-by > question after some admittedly not very careful reading, sorry. Thanks for looking into it!