Fix cpuinfo init on Linux without CPU sysfs lists #28230

Open
tianleiwu wants to merge 1 commit into microsoft:main from tianleiwu:tlwu/fix-cpuinfo-sysfs-fallback

Conversation

@tianleiwu
Contributor

Description

Fixes ONNX Runtime startup on Linux ARM64 environments where /sys/devices/system/cpu/possible and /sys/devices/system/cpu/present are unavailable, such as AWS Lambda ARM64/Graviton and restricted build sandboxes.

There are two related failure modes:

  1. PosixEnv may be constructed before ORT's default logger is registered. If cpuinfo_initialize() fails during that early construction path, the existing LOGS_DEFAULT(INFO) call can terminate the process with the error "Attempt to use DefaultLogger but none has been registered."
  2. The bundled pytorch/cpuinfo code treats missing Linux CPU possible/present sysfs cpulists as fatal on ARM Linux. The max-count helpers return UINT32_MAX when the lists are missing. ARM Linux initialization then computes 1 + UINT32_MAX, which wraps to 0 and prevents cpuinfo from reaching the later /proc/cpuinfo and getauxval() based detection paths.
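
The wraparound in the second failure mode can be reproduced with plain 32-bit arithmetic. This is an illustrative sketch only; the helper name add_u32 is hypothetical and is not part of cpuinfo.

```python
UINT32_MAX = 0xFFFFFFFF


def add_u32(a: int, b: int) -> int:
    """Add two values the way C does for uint32_t: modulo 2**32."""
    return (a + b) & UINT32_MAX


# The max-count helper signals "sysfs list unavailable" with UINT32_MAX.
max_index_when_missing = UINT32_MAX

# Initialization turns the max index into a count by adding 1.
processor_count = add_u32(1, max_index_when_missing)
print(processor_count)  # 0
```

A count of 0 is indistinguishable from "no processors", which is why the sentinel has to be replaced by a real fallback value before the addition happens.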

Root Cause

The immediate import crash is caused by unsafe early logging in onnxruntime/core/platform/posix/env.cc. Python bindings can reference Env::Default() during module load before logging is initialized, so a cpuinfo initialization failure must not use LOGS_DEFAULT() unless a default logger exists.

The cpuinfo initialization failure is more subtle. A count-only fallback is not enough: after cpuinfo computes max possible/present CPU counts, it calls cpuinfo_linux_detect_possible_processors() and cpuinfo_linux_detect_present_processors() to set CPUINFO_LINUX_FLAG_POSSIBLE and CPUINFO_LINUX_FLAG_PRESENT on each processor. ARM Linux initialization later marks processors valid only if those flags are set. If only the count fallback is provided, valid_processors can remain zero and cpuinfo can proceed into an invalid partial initialization state.
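
A small simulation of why the count alone is insufficient. The flag values below are hypothetical placeholders (the real constants live in cpuinfo's Linux sources), and the sketch assumes a processor is considered valid only when both flags are set.

```python
# Hypothetical flag bits, chosen only for this illustration.
FLAG_PRESENT = 0x1
FLAG_POSSIBLE = 0x2
FLAG_VALID = FLAG_PRESENT | FLAG_POSSIBLE


def count_valid(flags):
    """Count entries that carry both the present and the possible flag."""
    return sum(1 for f in flags if (f & FLAG_VALID) == FLAG_VALID)


def count_only_fallback(nproc):
    """Size the processor table from nproc but leave every flag word cleared."""
    return [0] * nproc


def count_and_flag_fallback(nproc):
    """Size the table and also mark CPUs 0..nproc-1 as present and possible."""
    return [FLAG_PRESENT | FLAG_POSSIBLE for _ in range(nproc)]


nproc = 4
print(count_valid(count_only_fallback(nproc)))      # 0
print(count_valid(count_and_flag_fallback(nproc)))  # 4
```

With the count-only variant the table has the right size but no usable entries, which corresponds to the valid_processors == 0 state described above.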

Fix

  • Make PosixEnv logging safe when cpuinfo initialization fails before a default logger exists:
    • use logging::LoggingManager::HasDefaultLogger() before LOGS_DEFAULT()
    • fall back to std::cerr when no logger is registered
  • Add a cpuinfo patch for Linux systems where the sysfs CPU cpulists are missing:
    • fall back to sysconf(_SC_NPROCESSORS_ONLN) - 1 for max possible/present processor detection
    • fall back to marking CPUs 0..nproc-1 for present/possible processor flag detection
    • preserve existing sysfs parsing behavior when the cpulist files are available
  • Wire the cpuinfo patch into the existing cpuinfo FetchContent flow for Linux and into the existing ARM64/ARM64EC patch path.
  • Add a simulation test that validates:
    • safe early logging without a registered default logger
    • sysconf(_SC_NPROCESSORS_ONLN) count and present/possible flag fallback behavior
    • hiding /sys/devices/system/cpu/{possible,present} via LD_PRELOAD
    • optional ORT import with hidden sysfs when a built ORT package is importable
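
The fallback values in the list above can be sketched from Python, which exposes the same sysconf query through os.sysconf. The function names are hypothetical, and defaulting to 1 CPU when the query fails is an assumption of this sketch, not something taken from the patch.

```python
import os


def online_cpu_count() -> int:
    """Return the number of online CPUs, or 1 if the value is unavailable."""
    try:
        n = os.sysconf("SC_NPROCESSORS_ONLN")
    except (ValueError, OSError, AttributeError):
        # Unknown name, failed query, or a platform without os.sysconf.
        n = -1
    # Assumption of this sketch: treat a failed query as a single CPU.
    return n if n > 0 else 1


def fallback_max_index() -> int:
    """Highest CPU index: the sysconf count minus one."""
    return online_cpu_count() - 1


def fallback_cpu_ids() -> range:
    """CPU indices 0..nproc-1 that would be flagged present and possible."""
    return range(online_cpu_count())


print(fallback_max_index(), list(fallback_cpu_ids()))
```

Note that the max-index helper returns nproc - 1 rather than nproc, so that the later "1 + max index" step yields the actual CPU count.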

Testing

Ran from a clean branch/worktree:

python onnxruntime/test/common/test_cpuinfo_sysfs_fallback.py

Result:

  • safe logging simulation: PASS
  • sysconf count + flag fallback simulation: PASS
  • LD_PRELOAD sysfs-hiding simulation: PASS
  • ORT import integration: SKIP (onnxruntime.capi not built/importable in this workspace)

Also validated the cpuinfo patch directly:

cd build/cu128/Release/_deps/pytorch_cpuinfo-src
patch --dry-run -p1 < /path/to/cmake/patches/cpuinfo/fix_missing_sysfs_fallback.patch

Also syntax-checked the patched src/linux/processors.c in a temporary tree with the cpuinfo headers.

A full ORT build was not completed in this workspace; a previous build_cu128.sh run was interrupted.

Related Issue

Fixes #10038.

Comment thread onnxruntime/test/common/test_cpuinfo_sysfs_fallback.py Dismissed
Contributor

Copilot AI left a comment

Pull request overview

Fixes ONNX Runtime startup failures on Linux ARM64 environments where /sys/devices/system/cpu/{possible,present} are unavailable by (1) making early cpuinfo-init logging safe before a default logger exists, and (2) patching the bundled pytorch/cpuinfo to fall back to sysconf(_SC_NPROCESSORS_ONLN) for both CPU counts and per-CPU present/possible flags.

Changes:

  • Guard LOGS_DEFAULT(...) usage in PosixEnv so cpuinfo init failures won’t crash when logging hasn’t been initialized yet.
  • Patch pytorch/cpuinfo Linux processor detection to provide robust sysfs-missing fallbacks (counts + flags).
  • Add a standalone simulation script to validate the early-logging and sysfs-missing behaviors (incl. LD_PRELOAD sysfs hiding).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/platform/posix/env.cc | Avoids crashing during early PosixEnv construction by falling back to std::cerr when no default logger exists. |
| cmake/external/onnxruntime_external_deps.cmake | Wires in the new cpuinfo patch during FetchContent dependency setup (Linux + ARM64/ARM64EC patch flow). |
| cmake/patches/cpuinfo/fix_missing_sysfs_fallback.patch | Adds sysconf(_SC_NPROCESSORS_ONLN)-based fallbacks for max CPU count and present/possible flags when sysfs cpulists are missing. |
| onnxruntime/test/common/test_cpuinfo_sysfs_fallback.py | Adds a manual/simulation validation script (compiles small programs + LD_PRELOAD shim). |

Comment on lines +40 to +115
def test_safe_logging_pattern():
    """
    Test 1: Verify the safe logging pattern doesn't crash when no logger exists.

    This simulates the fix in env.cc where we check HasDefaultLogger() before
    calling LOGS_DEFAULT(). We compile a minimal C++ program that:
    - Does NOT register a default logger
    - Calls the safe logging pattern
    - Verifies it writes to stderr instead of crashing
    """
    print("=" * 60)
    print("Test 1: Safe logging pattern (no default logger)")
    print("=" * 60)

    source = textwrap.dedent(r"""
        #include <iostream>
        #include <string_view>

        // Minimal simulation of ORT's logging check pattern
        namespace logging {
        class LoggingManager {
        public:
            // Simulate: no default logger registered
            static bool HasDefaultLogger() { return false; }
        };
        } // namespace logging

        void LogEarlyWarning(std::string_view message) {
            if (logging::LoggingManager::HasDefaultLogger()) {
                // Would call LOGS_DEFAULT(WARNING) here - but logger doesn't exist
                // This path should NOT be taken
                std::cerr << "BUG: should not reach here\n";
                return;
            }
            // Safe fallback to stderr
            std::cerr << "onnxruntime warning: " << message << "\n";
        }

        int main() {
            // This simulates what PosixEnv() does when cpuinfo_initialize() fails
            bool cpuinfo_available = false; // Simulating failure
            if (!cpuinfo_available) {
                LogEarlyWarning("cpuinfo_initialize failed. "
                                "May cause CPU EP performance degradation due to undetected CPU features.");
            }
            std::cout << "PASS: Safe logging pattern works without crash\n";
            return 0;
        }
    """)

    with tempfile.NamedTemporaryFile(suffix=".cc", mode="w", delete=False) as f:
        f.write(source)
        src_path = f.name

    try:
        exe_path = src_path.replace(".cc", "")
        result = subprocess.run(
            ["g++", "-std=c++17", "-o", exe_path, src_path], check=False, capture_output=True, text=True
        )
        if result.returncode != 0:
            print(f"FAIL: Compilation failed: {result.stderr}")
            return False

        result = subprocess.run([exe_path], check=False, capture_output=True, text=True, timeout=10)
        if result.returncode != 0:
            print(f"FAIL: Program crashed with exit code {result.returncode}")
            print(f"stderr: {result.stderr}")
            return False

        if "PASS" in result.stdout:
            print(result.stdout.strip())
            print(f"stderr output (expected): {result.stderr.strip()}")
            return True
        print(f"FAIL: Unexpected output: {result.stdout}")
        return False
    finally:

Copilot AI Apr 28, 2026

This script defines module-level functions named test_* but they return booleans and rely on main()/print output rather than assertions. If this file is ever picked up by a test runner (e.g., pytest discovery), these will not behave as proper tests. Consider converting these to real pytest/unittest tests with assertions + skipping, or renaming/moving the script so it’s clearly an on-demand diagnostic and won’t be auto-collected.
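
One possible shape for such a conversion, using unittest with assertions and a skip condition. This is a sketch under assumptions: run_simulation is a hypothetical stand-in that launches a Python child process so the example stays runnable without a compiler; in the real script it would compile and run the C++ simulation.

```python
import subprocess
import sys
import unittest


def run_simulation():
    """Hypothetical stand-in for compiling and running the C++ simulation."""
    return subprocess.run(
        [sys.executable, "-c", "print('PASS: simulated')"],
        check=False,
        capture_output=True,
        text=True,
        timeout=10,
    )


class SafeLoggingSimulationTest(unittest.TestCase):
    @unittest.skipUnless(sys.platform.startswith("linux"), "Linux-only simulation")
    def test_simulation_exits_cleanly(self):
        result = run_simulation()
        # Assertions replace the print-and-return-bool pattern.
        self.assertEqual(result.returncode, 0, result.stderr)
        self.assertIn("PASS", result.stdout)
```

A runner then reports failures and skips on its own, so the script's main() no longer has to aggregate boolean results.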

Comment on lines +94 to +105
    try:
        exe_path = src_path.replace(".cc", "")
        result = subprocess.run(
            ["g++", "-std=c++17", "-o", exe_path, src_path], check=False, capture_output=True, text=True
        )
        if result.returncode != 0:
            print(f"FAIL: Compilation failed: {result.stderr}")
            return False

        result = subprocess.run([exe_path], check=False, capture_output=True, text=True, timeout=10)
        if result.returncode != 0:
            print(f"FAIL: Program crashed with exit code {result.returncode}")

Copilot AI Apr 28, 2026

The script assumes Linux + gcc/g++ + LD_PRELOAD availability, but it doesn’t guard for non-Linux platforms or for missing compilers/linker support. Adding explicit platform checks (e.g., sys.platform == 'linux') and tool detection (e.g., shutil.which('gcc')) with a clear SKIP message would make this more robust for developers running it in other environments.
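
A minimal sketch of such a guard, using only sys.platform and shutil.which. The function name missing_prerequisites and the default tool list are hypothetical choices for illustration.

```python
import shutil
import sys


def missing_prerequisites(tools=("gcc", "g++")):
    """Return the reasons the simulation cannot run here; empty if none."""
    reasons = []
    if not sys.platform.startswith("linux"):
        reasons.append(f"requires Linux (LD_PRELOAD), found {sys.platform}")
    for tool in tools:
        if shutil.which(tool) is None:
            reasons.append(f"{tool} not found on PATH")
    return reasons


reasons = missing_prerequisites()
if reasons:
    print("SKIP: " + "; ".join(reasons))
else:
    print("prerequisites satisfied")
```

Calling this once at the top of main() would let the script exit early with a single SKIP line instead of failing inside each test with a compiler error.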

Comment on lines +35 to +39
def get_ort_root():
    """Get the ORT repository root."""
    return os.path.dirname(os.path.dirname(os.path.abspath(__file__)))



Copilot AI Apr 28, 2026

get_ort_root() is currently unused. If it’s not needed, please remove it to avoid dead code; if it’s intended for future integration, consider using it (or adding a brief comment explaining why it exists).

Suggested change
- def get_ort_root():
-     """Get the ORT repository root."""
-     return os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

    """
    Test 3: Verify LD_PRELOAD shim can hide sysfs files.

    This compiles a small shim that intercepts open/fopen to return ENOENT

Copilot AI Apr 28, 2026

The docstring says the shim “intercepts open/fopen”, but the shim only overrides fopen. Either update the comment to match the implementation, or also intercept open/open64 if that’s needed to simulate the real failure mode.

Suggested change
-     This compiles a small shim that intercepts open/fopen to return ENOENT
+     This compiles a small shim that intercepts fopen to return ENOENT

Comment on lines +376 to +378
${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/win_arm_fp16_detection_fallback.patch &&
# https://github.com/microsoft/onnxruntime/issues/10038
${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/fix_missing_sysfs_fallback.patch

Copilot AI Apr 28, 2026

fix_missing_sysfs_fallback.patch modifies Linux-only sources (src/linux/processors.c), but it’s being applied unconditionally in the “Windows ARM64/ARM64EC” cpuinfo patch chain. That increases the chance of Windows builds breaking in the future if the Linux patch stops applying cleanly (even though the fix is Linux-specific). Consider applying this patch only under the Linux branch (or gating it by CMAKE_SYSTEM_NAME STREQUAL "Linux").

Suggested change
- ${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/win_arm_fp16_detection_fallback.patch &&
- # https://github.com/microsoft/onnxruntime/issues/10038
- ${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/fix_missing_sysfs_fallback.patch
+ ${Patch_EXECUTABLE} -p1 < ${PROJECT_SOURCE_DIR}/patches/cpuinfo/win_arm_fp16_detection_fallback.patch

Development

Successfully merging this pull request may close these issues.

Importing onnxruntime on AWS Lambdas with ARM64 processor causes crash

3 participants