Re: [PATCH RESEND] x86/asm/32: Modernize _memcpy()

public inbox for linux-kernel@vger.kernel.org

From: David Laight <david.laight.linux@gmail.com>
To: Uros Bizjak <ubizjak@gmail.com>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [PATCH RESEND] x86/asm/32: Modernize _memcpy()
Date: Tue, 16 Dec 2025 19:18:52 +0000
Message-ID: <20251216191852.7f5a7354@pumpkin>
In-Reply-To: <CAFULd4ahPq+ui5=mx0VHNkfQrqRB2n-HgmXFA1UyNFYtZaCeEA@mail.gmail.com>

On Tue, 16 Dec 2025 17:30:55 +0100
Uros Bizjak <ubizjak@gmail.com> wrote:

> On Tue, Dec 16, 2025 at 2:14 PM David Laight
> <david.laight.linux@gmail.com> wrote:
> 
> > > 00e778b0 <memcpy>:
> > >   e778b0:     55                      push   %ebp
> > >   e778b1:     89 e5                   mov    %esp,%ebp
> > >   e778b3:     83 ec 08                sub    $0x8,%esp
> > >   e778b6:     89 75 f8                mov    %esi,-0x8(%ebp)
> > >   e778b9:     89 d6                   mov    %edx,%esi
> > >   e778bb:     89 ca                   mov    %ecx,%edx
> > >   e778bd:     89 7d fc                mov    %edi,-0x4(%ebp)
> > >   e778c0:     c1 e9 02                shr    $0x2,%ecx
> > >   e778c3:     89 c7                   mov    %eax,%edi
> > >   e778c5:     f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
> > >   e778c7:     83 e2 03                and    $0x3,%edx
> > >   e778ca:     74 04                   je     e778d0 <memcpy+0x20>
> > >   e778cc:     89 d1                   mov    %edx,%ecx
> > >   e778ce:     f3 a4                   rep movsb %ds:(%esi),%es:(%edi)
> > >   e778d0:     8b 75 f8                mov    -0x8(%ebp),%esi
> > >   e778d3:     8b 7d fc                mov    -0x4(%ebp),%edi
> > >   e778d6:     89 ec                   mov    %ebp,%esp
> > >   e778d8:     5d                      pop    %ebp
> > >   e778d9:     c3                      ret
> > >
> > > due to a better register allocation, avoiding the call-saved
> > > %ebx register.  
> >
> > That'll might be semi-random.  
> 
> Not really, the compiler has more freedom to allocate more optimal register.
> 
> > > +     unsigned long ecx = n >> 2;
> > > +
> > > +     asm volatile("rep movsl"
> > > +                  : "+D" (edi), "+S" (esi), "+c" (ecx)
> > > +                  : : "memory");
> > > +     ecx = n & 3;
> > > +     if (ecx)
> > > +             asm volatile("rep movsb"
> > > +                          : "+D" (edi), "+S" (esi), "+c" (ecx)
> > > +                          : : "memory");
> > >       return to;
> > >  }
> > >  
> 
> > This version seems to generate better code still:
> > see https://godbolt.org/z/78cq97PPj
> >
> > void *__memcpy(void *to, const void *from, unsigned long n)
> > {
> >         unsigned long ecx = n >> 2;
> >
> >         asm volatile("rep movsl"
> >                      : "+D" (to), "+S" (from), "+c" (ecx)
> >                      : : "memory");
> >         ecx = n & 3;
> >         if (ecx)
> >                 asm volatile("rep movsb"
> >                              : "+D" (to), "+S" (from), "+c" (ecx)
> >                              : : "memory");
> >         return (char *)to - n;  
> 
> I don't think that additional subtraction outweighs a move from EAX to
> a temporary.

It saves a read from stack, and/or a register spill to stack
(at least as I compiled the code...).
The extra ALU operation is much cheaper - especially since it will
sit in the 'out of order' instruction queue with nothing waiting
for the result (the write to stack).
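
To make the subtraction concrete, here is a minimal sketch of the
return-by-subtract idea. The function name is made up for illustration,
it uses a single 'rep movsb' rather than the movsl/movsb pair above, and
the non-x86 branch is only a portable stand-in so the snippet builds
elsewhere:

```c
#include <stddef.h>

/*
 * Sketch only: after the copy, 'to' points one past the last byte
 * written, so the original destination is recovered by subtracting n.
 * No copy of the original 'to' has to be kept live across the asm.
 */
void *copy_return_by_subtract(void *to, const void *from, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
	size_t count = n;

	/* rep movsb advances both pointers by 'count' and leaves count == 0. */
	__asm__ __volatile__("rep movsb"
			     : "+D" (to), "+S" (from), "+c" (count)
			     : : "memory");
#else
	/* Portable stand-in that advances the pointers the same way. */
	unsigned char *d = to;
	const unsigned char *s = from;
	size_t i;

	for (i = 0; i < n; i++)
		*d++ = *s++;
	to = d;
#endif
	return (char *)to - n;
}
```

Whether this actually avoids the spill depends on the compiler and
options, so check the generated code rather than trusting the source.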

Since pretty much no code uses the result of memcpy() you could
have an inline wrapper that just preserves the 'to' address and calls
a 'void' function to do the copy itself.
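
Something along these lines, as a rough sketch. Both names are
hypothetical, the noinline attribute is only there to stand in for a
real out-of-line function, and the non-x86 branch exists purely so the
snippet builds elsewhere:

```c
#include <stddef.h>

/* Hypothetical out-of-line helper: does the copy and returns nothing. */
static __attribute__((noinline))
void copy_bytes(void *to, const void *from, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__ __volatile__("rep movsb"
			     : "+D" (to), "+S" (from), "+c" (n)
			     : : "memory");
#else
	unsigned char *d = to;
	const unsigned char *s = from;

	while (n--)
		*d++ = *s++;
#endif
}

/*
 * Hypothetical inline wrapper: only a caller that actually uses the
 * return value has to keep 'to' live across the call.
 */
static inline void *memcpy_wrapper(void *to, const void *from, size_t n)
{
	copy_bytes(to, from, n);
	return to;
}
```

Callers that ignore the result then just see a call to the 'void'
function, and the helper never has to preserve 'to' at all.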

But the real performance killer is the setup time for 'rep movs'.
That will outweigh any minor faffing about by orders of magnitude.

Oh, and start trying to select between algorithms based on length
and/or alignment and you get killed by the mispredicted branch
penalty - that seems to be about 20 clocks on my Zen 5.
I haven't measured the setup cost for 'rep movs'; it is on my
'round tuit' list.
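
For what it's worth, a very rough userspace harness for that kind of
measurement could look like the sketch below. It is an assumption about
how one might go about it, not something that has been run for this
thread: the names are hypothetical, clock() is coarse, and a fixed
length per run keeps the branches predictable, so comparing the per-call
time at a few small lengths gives only a crude estimate of the fixed
overhead.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical copy under test; swap in the routine being measured. */
static void copy_under_test(void *to, const void *from, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__ __volatile__("rep movsb"
			     : "+D" (to), "+S" (from), "+c" (n)
			     : : "memory");
#else
	unsigned char *d = to;
	const unsigned char *s = from;

	while (n--)
		*d++ = *s++;
	/* Compiler barrier so the copy is not optimized away. */
	__asm__ __volatile__("" : : : "memory");
#endif
}

/*
 * Average CPU time per call in nanoseconds for a fixed length.
 * 'iters' must be non-zero.
 */
double ns_per_copy(void *to, const void *from, size_t len,
		   unsigned long iters)
{
	clock_t start, end;
	unsigned long i;

	start = clock();
	for (i = 0; i < iters; i++)
		copy_under_test(to, from, len);
	end = clock();

	return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iters;
}
```

The loop overhead and the call itself are included in the number, so
only differences between runs mean anything.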

	David

> 
> BR,
> Uros.

