Win32: htonl()+etc use 1 CPU ins/1 op, not slow imports to swap byte order
Changing BE/LE byte order is a very common operation, used in many
places inside libperl, in some core bundled XS modules, and in many
CPAN XS modules. Storable.xs and PP pack()/unpack() are the largest and
most frequent users of byte swapping in the Perl interp git repo. Some
Perl interp repo .c, .h, or .xs files DIY their own byte swap algorithms
with macros or static inline functions. Others, like Storable.xs and PP
pack()/unpack(), use the htonl(), htons(), ntohl(), and ntohs()
macros/functions provided by the OS/CC or, depending on config.sh,
equivalent polyfills from perl.h.
On all modern CPU archs (i386/ARM/x64/PowerPC) there is a dedicated CPU
instruction for doing it. Perl on Linux, compiled with Clang or GCC,
automatically uses the appropriate inlined intrinsic CPU opcode
when using the CC's/OS's htonl(). Those 2 CCs also recognize perl.h's
my_swap32() algorithm, and every other .h/.c file that uses that
statement, as a euphemism for htonl(), and inline it to 1 CPU opcode.
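
For reference, the shift-and-mask statement referred to above looks
roughly like the sketch below. The name demo_swap32() is hypothetical
and this is not a copy of perl.h's my_swap32(); it only illustrates the
idiom that GCC and Clang are said to pattern match into one byte swap
instruction when optimizing.

```c
#include <stdint.h>

/* Hypothetical name. Illustrates the portable shift-and-mask idiom,
 * not the actual perl.h definition. */
static inline uint32_t demo_swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24)   /* byte 0 -> byte 3 */
         | ((x & 0x0000FF00u) <<  8)   /* byte 1 -> byte 2 */
         | ((x & 0x00FF0000u) >>  8)   /* byte 2 -> byte 1 */
         | ((x >> 24) & 0x000000FFu);  /* byte 3 -> byte 0 */
}
```

Applying it twice returns the original value, which is why the same
routine can serve for both the hton and ntoh directions.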
On Windows, the situation is very bad compared to the above.
The htonl(), htons(), ntohl(), and ntohs() functions/macros
provided to Perl by both MSVC and GCC are exceptionally slow,
inefficient, and unoptimized. MSVC has further problems: it generates
very inefficient machine code to swap byte order when given the DIY
shift and mask algorithm.
Since day 1 of Win32/64, the C symbols/tokens htonl(), htons(), ntohl(),
and ntohs() have been extern "C" PE symbol table exported functions from
ws2_32.dll. ws2_32.dll is AKA WinSock, Win32/64's public-facing front end
lib for TCP/IP sockets. It is not a lightweight DLL; it loads multiple
other DLLs: filter/FW DLLs, middleware, and backend DLLs. WinPerl delay
loads (RTLD_LAZY) the Winsock DLL until the first attempt is made by
[lib] perl5XX.dll to go on the WWW.
These 4 functions do 1 thing and 1 thing only: swap bytes around.
Storable.xs and the PP keywords pack/unpack heavily use these 4 functions,
but they have nothing to do with Ethernet, Token Ring, or TCP/IP. They
should be using the correct inlined single CPU instruction to do this. All
modern CPUs (i386/x64/ARM) have a dedicated CPU opcode for BE/LE swapping.
x86 introduced the 32 bit/U32 bswap instruction with the release of the
i486. U16 variables can use x86's "ror ax, 8" (bitwise rotate by 8 bits).
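
A 16 bit swap is just a rotate by 8 bits, which can be sketched in C as
below. demo_swap16() is a hypothetical name used only for illustration;
whether a given compiler turns it into a single rotate instruction
depends on the compiler and optimization level.

```c
#include <stdint.h>

/* Hypothetical name. Swapping the two bytes of a 16 bit value is the
 * same as rotating it by 8 bits in either direction. */
static inline uint16_t demo_swap16(uint16_t x)
{
    /* x is promoted to int before shifting; the cast drops the bits
     * that were shifted above bit 15. */
    return (uint16_t)((x << 8) | (x >> 8));
}
```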
Both the Mingw GCC and MSVC CCs will never optimize
htonl/htons/ntohl/ntohs to 1 opcode on Win32/64. Ask for the C linker
symbol, and you will get the linker symbol. So Storable.xs and PP
pack/unpack, on both CCs, are calling Winsock's exported functions to do
the swap. This is slow since it's a function call. Even worse, it's not
1 opcode inside ws2_32.dll but 11 to 15 CPU ops.
MSVC 2022 CC binaries prior to build number 19.37 (released Aug 2023)
don't even recognize "((x & 0xFF)<<24)|((x>>24) & 0xFF)|((x & 0x0000FF00)
<<8)|((x & 0x00FF0000)>>8)" as a C level euphemism for
"I want the byte swap CPU instruction". Mingw GCC does recognize that
macro to mean 1 CPU instruction, but Mingw GCC won't modify or optimize
MS's htonl()/etc identifiers/symbols for you; you need to DIY that macro
yourself to trigger the 1 CPU op optimization.
To fix all this, first hook the 4 functions with the CPP to prevent the 4
tokens from calling the Winsock export table implementations. 2nd, with
all MSVC CC versions, use the official, proper MS-specific way to do a
"byte swap", which is the intrinsic function _byteswap_ulong().
Theoretically RtlUlongByteSwap() also exists, but it's much less mentioned
on MSDN and the general WWW, and lives in "wdm.h", which is intended
for Ring 0 kernel drivers, so header clashes can occur now or in the
future. So pick the easier and simpler _byteswap_ulong() category of
intrinsics, not the Ring 0 ones.
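
The CPP hook described above could look something like the following
minimal sketch. It is not the actual patch: DEMO_BSWAP32 and
demo_bswap32_fallback() are hypothetical names, and the sketch assumes
a little-endian host (as on Win32 x86/x64/ARM), so the hooks always
swap. _byteswap_ulong() is MSVC's intrinsic from <stdlib.h> and
__builtin_bswap32() is the GCC/Clang builtin.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#  include <stdlib.h>                 /* declares _byteswap_ulong() */
#  define DEMO_BSWAP32(x) ((uint32_t)_byteswap_ulong((unsigned long)(x)))
#elif defined(__GNUC__)               /* GCC, Mingw GCC, Clang */
#  define DEMO_BSWAP32(x) __builtin_bswap32((uint32_t)(x))
#else
/* Portable fallback for any other compiler. */
static inline uint32_t demo_bswap32_fallback(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8)
         | ((x & 0x00FF0000u) >>  8) | ((x >> 24) & 0x000000FFu);
}
#  define DEMO_BSWAP32(x) demo_bswap32_fallback((uint32_t)(x))
#endif

/* Hook the tokens so callers never reach an imported function.
 * Assumption: little-endian host, so host<->net always swaps. */
#undef  htonl
#undef  ntohl
#define htonl(x) DEMO_BSWAP32(x)
#define ntohl(x) DEMO_BSWAP32(x)
```

Because both branches map to a function-like intrinsic or an inline
function, the argument is evaluated exactly once.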
Since _byteswap_ulong() is an intrinsic function and not a macro,
S_my_swap32() isn't needed to prevent multi-eval problems.
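
To illustrate the multi-eval problem mentioned above: a shift-and-mask
macro repeats its argument text, so an argument with side effects runs
several times, while a function (or intrinsic) evaluates it once. All
names below are hypothetical; S_my_swap32() itself is not reproduced
here.

```c
#include <stdint.h>

/* Hypothetical macro form: the argument text appears four times. */
#define DEMO_SWAP32_MACRO(x) \
    ( (((uint32_t)(x) & 0x000000FFu) << 24) \
    | (((uint32_t)(x) & 0x0000FF00u) <<  8) \
    | (((uint32_t)(x) & 0x00FF0000u) >>  8) \
    | (((uint32_t)(x) >> 24) & 0x000000FFu) )

/* Function form: the argument is evaluated once, at the call site. */
static inline uint32_t demo_swap32_fn(uint32_t x)
{
    return DEMO_SWAP32_MACRO(x);
}

/* Counts its own calls so repeated evaluation becomes visible. */
static unsigned demo_calls;
static uint32_t demo_next(void)
{
    ++demo_calls;
    return 0x11223344u;
}
```

Passing demo_next() to the macro calls it four times; passing it to
demo_swap32_fn() calls it once.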
htonll()/ntohll() were added simply because it was easy to write them,
and multiple interp core .c files and interp core .xs files want a 64 bit
byte swap function/macro, since most people use 64 bit pointer CPUs.
2 different algorithms for htonll() are included. I picked the one with
fewer C src code ops, but I didn't check whether neither, one or the
other, or both are auto-detected by GCC and Clang as a 64 bit byte-swap
intrinsic request without formally asking for the intrinsic with a named
identifier.
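
As an illustration only, two common ways to write a 64 bit byte swap in
plain C are sketched below. These are assumptions for the sake of
example and are not necessarily the 2 algorithms in this commit; all
demo_* names are hypothetical.

```c
#include <stdint.h>

/* 32 bit shift-and-mask swap, used by the first variant. */
static inline uint32_t demo_swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8)
         | ((x & 0x00FF0000u) >>  8) | ((x >> 24) & 0x000000FFu);
}

/* Variant 1: swap each 32 bit half, then exchange the halves. */
static inline uint64_t demo_swap64_halves(uint64_t x)
{
    return ((uint64_t)demo_swap32((uint32_t)x) << 32)
         | (uint64_t)demo_swap32((uint32_t)(x >> 32));
}

/* Variant 2: three rounds of mask-and-shift on the full 64 bit value,
 * exchanging 32 bit, then 16 bit, then 8 bit groups. */
static inline uint64_t demo_swap64_masks(uint64_t x)
{
    x = ((x & 0x00000000FFFFFFFFull) << 32) | (x >> 32);
    x = ((x & 0x0000FFFF0000FFFFull) << 16)
      | ((x >> 16) & 0x0000FFFF0000FFFFull);
    x = ((x & 0x00FF00FF00FF00FFull) <<  8)
      | ((x >>  8) & 0x00FF00FF00FF00FFull);
    return x;
}
```

Both variants produce the same result; which one a given compiler
collapses into a single instruction would have to be checked in the
generated assembly.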
The vtable hooks that CPerlHost/iperlsys.h use to implement pseudo-fork
and ithreads also had to be disabled for the 4 functions. Otherwise XSUB.h
will redirect all CPAN XS mods (not -DPERL_CORE !) to the CPerlHost
and iperlsys.h vtables, which then redirect to Winsock, negating this fix.
Other less-than-perfect CC/CPU/OS combos might be discovered in the
future, so the PERL_MY_HOST_NET_BYTE_SWAP define is cross OS and not
Win32 specific. Rumors online say GCC only added its __builtin_bswap32(),
and pattern matching of that macro, in 2008/v4.0.
So S_my_swap32() isn't being turned on for any OSes/CCs other than
Mingw and MSVC on Win32, because "if it ain't broke, don't fix it".
If the CCs and .h'es for Linux/Android/OSX are already perfect, and Perl
attempts to hook, intercept, or use a token or identifier from 6 years
ago, not 18 months ago, more harm (deoptimization) than good can happen.
MSVC produced 11-15 CPU ops for S_my_swap32()'s "ISO C" macro before
VC 2022 19.37 (Aug 2023). Winsock has the same exact machine code.
I'll assume MS and other major Windows corporate users assumed that the
"ISO C" byte swap macro is fundamentally wrong, since acknowledging
that endianness exists violates "ISO C", and that code base is
now "some Vendor's C". So why fix htonl(), or that long macro that isn't
CPU arch specific, instead of using the intrinsic? That program is
already aware of OS, vendor, and platform names and isn't portable.