Skip to content

fix(code-hosting): recognize SSH repository hosts with userinfo#1375

Open
officialasishkumar wants to merge 1 commit intovolcengine:mainfrom
officialasishkumar:fix/ssh-repo-url-host-detection
Open

fix(code-hosting): recognize SSH repository hosts with userinfo#1375
officialasishkumar wants to merge 1 commit intovolcengine:mainfrom
officialasishkumar:fix/ssh-repo-url-host-detection

Conversation

@officialasishkumar
Copy link
Copy Markdown
Contributor

Description

Fix code-hosting URL domain matching so repository helpers use parsed hostnames instead of raw netloc values. This allows common clone URLs such as ssh://git@github.com/org/repo.git and explicit-port URLs such as https://github.com:443/org/repo to be recognized as configured code-hosting repositories.

The repository parser still keeps SSH-style GitHub URLs on the git clone path, so users who provide SSH URLs continue using their SSH credentials instead of being silently routed through GitHub ZIP downloads.

Related Issue

Fixes #1374

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Add shared hostname-based domain matching for code-hosting URL helpers.
  • Recognize ssh://git@... and explicit-port repository URLs in parse/detection helpers.
  • Preserve git-clone behavior for SSH repository sources in CodeRepositoryParser.
  • Add regression coverage for SSH URL userinfo and explicit-port repository URL cases.

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Validated locally:

/tmp/openviking-url-venv/bin/python -m pytest --noconftest -o addopts='' tests/test_code_hosting_utils.py -q
python3 -m py_compile openviking/utils/code_hosting_utils.py openviking/parse/parsers/code/code.py tests/test_code_hosting_utils.py
/tmp/openviking-url-venv/bin/ruff format --check openviking/utils/code_hosting_utils.py openviking/parse/parsers/code/code.py tests/test_code_hosting_utils.py
/tmp/openviking-url-venv/bin/ruff check openviking/utils/code_hosting_utils.py openviking/parse/parsers/code/code.py tests/test_code_hosting_utils.py
git diff --check

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

Full repository pytest was not run because the global test bootstrap imports the bundled AGFS client. The change is covered with the focused code-hosting helper test module, compiled touched files, and Ruff checks.

Use parsed hostnames instead of raw netloc values when matching configured code-hosting domains. This keeps ssh://git@host and explicit-port repository URLs on the supported code-hosting path while preserving git clone handling for SSH repository sources.

Add focused regression coverage for SSH URL userinfo and explicit-port repository URL helpers.
@github-actions
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

1374 - Partially compliant

Compliant requirements:

  • Fix recognition of ssh:// git URLs with userinfo
  • Fix recognition of URLs with explicit ports
  • Ensure parse_code_hosting_url, is_code_hosting_url, is_git_repo_url work correctly
  • Preserve git-clone behavior for SSH repository sources
  • Add regression tests for the fixed cases

Non-compliant requirements:

  • (No unfulfilled requirements)

Requires further human verification:

  • (No items require further human verification)
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Code Duplication

The domain-matching logic in _normalize_repo_url duplicates the implementation of _domain_matches from code_hosting_utils.py. This duplication can lead to maintenance drift and inconsistent behavior if the logic is updated in one place but not the other.

domains = {
    domain.lower() for domain in config.code.github_domains + config.code.gitlab_domains
}
host = (parsed.hostname or "").lower()
host_candidates = {host} if host else set()
try:
    port = parsed.port
except ValueError:
    port = None
if host and port is not None:
    host_candidates.add(f"{host}:{port}")
if host_candidates & domains and len(path_parts) >= 2:

@github-actions
Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Remove duplicate domain matching logic

Replace the duplicate domain matching logic with a call to the shared
_domain_matches function from openviking.utils.code_hosting_utils. This eliminates
code duplication and ensures consistent domain checking behavior across the
codebase.

openviking/parse/parsers/code/code.py [270-281]

-domains = {
-    domain.lower() for domain in config.code.github_domains + config.code.gitlab_domains
-}
-host = (parsed.hostname or "").lower()
-host_candidates = {host} if host else set()
-try:
-    port = parsed.port
-except ValueError:
-    port = None
-if host and port is not None:
-    host_candidates.add(f"{host}:{port}")
-if host_candidates & domains and len(path_parts) >= 2:
+from openviking.utils.code_hosting_utils import _domain_matches
+...
+if _domain_matches(parsed, config.code.github_domains + config.code.gitlab_domains) and len(path_parts) >= 2:
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies duplicate domain matching logic and proposes replacing it with a call to the shared _domain_matches function, which improves maintainability and consistency. However, note that _domain_matches is a private function (prefixed with _), so importing it from another module may not follow best practices.

Low

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Bug]: ssh:// git repository URLs with userinfo are not recognized as code-hosting repos

2 participants