Add knowledge base classification using dit #1538

Merged
dogancanbakir merged 9 commits into dev from feature/knowledge-base-dit on Mar 7, 2026

dogancanbakir (Member) commented Feb 15, 2026

Closes #1537

Add -kb and -fpt flags for ML-based page/form/field classification via dit. Works across all engines. Bumps Go to 1.25.x.

Summary by CodeRabbit

  • New Features

    • Adds knowledge-base page classification included in response output.
    • New CLI flags to enable classification and to filter results by page type.
  • Bug Fixes / Behavior

    • Output can now filter results by page type when classification data is present.
  • Chores

    • Bumped Go toolchain and base build image; updated related dependencies.
  • Documentation

    • Installation notes and CLI usage updated for Go 1.25+ and new flags.
  • Tests

    • Added captcha-detection tests covering multiple scenarios.

- Add -kb flag to enable page/form/field classification via dit
- Add -fpt flag to filter results by page type (auto-enables -kb)
- Classification works across all engines (standard, headless, hybrid)
- KB data included in jsonl output via knowledgebase field
- Bump Go version to 1.25.x across CI, Dockerfile, and docs
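
To illustrate how classification data could surface in JSONL output, here is a minimal sketch. The `Response` struct is a trimmed, hypothetical stand-in for `navigation.Response`: only the `knowledgebase` JSON key comes from this PR, while the `omitempty` tag, the `status_code` key, and the `page_type` key are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Response is a trimmed, hypothetical stand-in for navigation.Response.
// Only the "knowledgebase" JSON key is taken from the PR description;
// the other tag and omitempty are assumptions.
type Response struct {
	StatusCode    int            `json:"status_code"`
	KnowledgeBase map[string]any `json:"knowledgebase,omitempty"`
}

// toJSONLine marshals one response as a single JSONL record.
func toJSONLine(r Response) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// With classification data the knowledgebase field is present.
	withKB := Response{StatusCode: 200, KnowledgeBase: map[string]any{"page_type": "login"}}
	// Without classification the map stays nil and the field is omitted.
	withoutKB := Response{StatusCode: 200}

	for _, r := range []Response{withKB, withoutKB} {
		line, err := toJSONLine(r)
		if err != nil {
			fmt.Println("marshal error:", err)
			continue
		}
		fmt.Println(line)
	}
}
```

Under these assumptions, consumers that never pass `-kb` would see no change in their output records.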
coderabbitai bot (Contributor) commented Feb 15, 2026

Walkthrough

Adds optional ML-based page/form/field classification via the dit library, new CLI flags to enable and filter classifications, integrates classification into crawler engines and responses, and applies optional page-type filtering at output time.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build & deps: Dockerfile, go.mod | Bump build base to golang:1.25.7-alpine; update Go toolchain to go1.25.7; add github.com/happyhackingspace/dit and bump transitive golang.org/x/net. |
| CLI & options: cmd/katana/main.go, pkg/types/options.go, README.md | Add -kb/--knowledge-base (bool) and -fpt/--filter-page-type (string slice); expose fields on Options; update README installation and flag docs. |
| Crawler options & classifier: pkg/types/crawler_options.go | Introduce DitClassifier *dit.Classifier; initialize when KB is enabled; add ClassifyPage(body string) map[string]any. |
| Response model: pkg/navigation/response.go | Add exported KnowledgeBase map[string]any to navigation.Response (JSON knowledgebase). |
| Engine integrations: pkg/engine/common/base.go, pkg/engine/headless/headless.go, pkg/engine/hybrid/crawl.go, pkg/engine/standard/crawl.go | Populate Response.KnowledgeBase by invoking ClassifyPage on page body across crawl engines; headless classifies before clearing body. |
| Output filtering & wiring: pkg/output/options.go, pkg/output/output.go | Add FilterPageType []string option; wire into writers; implement case-insensitive filtering using KnowledgeBase.PageType, rejecting matching results with "result filtered by page type". |
| Captcha detection & tests: pkg/engine/headless/captcha/captcha.go, pkg/engine/headless/captcha/identify_test.go | Add early captcha detection via ditcaptcha.DetectCaptchaInHTML; add TestDetectCaptchaWithDit unit test. |
| Minor housekeeping: pkg/engine/headless/captcha/capsolver/capsolver.go, pkg/engine/headless/captcha/helpers_test.go, pkg/engine/headless/types/..., internal/runner/healthcheck.go | Small error-handling and I/O changes (ignore Close errors, discard write returns), and fmt.Fprintf refactors in String builders. |
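
The case-insensitive page-type filter summarized above could look roughly like the sketch below. The function name and the `page_type` map key are hypothetical, since the PR only says that filtering uses `KnowledgeBase.PageType`; treating unclassified results as never filtered is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// pageTypeKey is an assumed map key; the key actually used by katana/dit may differ.
const pageTypeKey = "page_type"

// isFilteredByPageType reports whether a result should be dropped because its
// classified page type matches one of the filter values, ignoring case.
// Results without classification data are never filtered (an assumption).
func isFilteredByPageType(kb map[string]any, filters []string) bool {
	if len(filters) == 0 || kb == nil {
		return false
	}
	pageType, ok := kb[pageTypeKey].(string)
	if !ok || pageType == "" {
		return false
	}
	for _, f := range filters {
		if strings.EqualFold(pageType, f) {
			return true
		}
	}
	return false
}

func main() {
	kb := map[string]any{pageTypeKey: "Login"}
	fmt.Println(isFilteredByPageType(kb, []string{"login", "error"})) // true
	fmt.Println(isFilteredByPageType(kb, []string{"blog"}))           // false
	fmt.Println(isFilteredByPageType(nil, []string{"login"}))         // false
}
```

`strings.EqualFold` avoids allocating lowercased copies of each value on every comparison.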

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Parser
    participant Opts as Options
    participant Crawler as Crawler Init
    participant Dit as Dit Classifier
    participant Engine as Crawl Engine
    participant Writer as Output Writer
    participant User as User

    CLI->>Opts: parse -kb and -fpt
    Opts->>Crawler: init with KnowledgeBase & FilterPageType
    Crawler->>Dit: create/init DitClassifier
    Engine->>Engine: fetch page content
    Engine->>Dit: ClassifyPage(body)
    Dit-->>Engine: {PageType, Forms, Fields}
    Engine->>Engine: set Response.KnowledgeBase
    Engine->>Writer: Write(Response)
    Writer->>Writer: check KnowledgeBase.PageType vs FilterPageType
    alt PageType matches filter
        Writer-->>User: skip result (filtered)
    else
        Writer-->>User: emit result (JSON/formatted)
    end
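
The `Engine->>Dit: ClassifyPage(body)` step in the diagram can be sketched as a nil-safe wrapper, so engines can call it unconditionally even when classification is off. The `pageClassifier` interface and its `Classify` method are hypothetical stand-ins; the real `*dit.Classifier` API is not shown in this PR and may differ.

```go
package main

import "fmt"

// pageClassifier is a hypothetical interface standing in for *dit.Classifier.
type pageClassifier interface {
	Classify(html string) (map[string]any, error)
}

// crawlerOptions mimics the shape of the crawler options described in the walkthrough.
type crawlerOptions struct {
	classifier pageClassifier // nil unless classification is enabled
}

// ClassifyPage returns classification data for body, or nil when
// classification is disabled, the body is empty, or the classifier fails.
func (o *crawlerOptions) ClassifyPage(body string) map[string]any {
	if o == nil || o.classifier == nil || body == "" {
		return nil
	}
	result, err := o.classifier.Classify(body)
	if err != nil {
		return nil
	}
	return result
}

// stubClassifier is a fake used only to make the sketch runnable.
type stubClassifier struct{}

func (stubClassifier) Classify(html string) (map[string]any, error) {
	return map[string]any{"page_type": "login"}, nil
}

func main() {
	disabled := &crawlerOptions{}
	enabled := &crawlerOptions{classifier: stubClassifier{}}

	fmt.Println(disabled.ClassifyPage("<html></html>") == nil) // true
	fmt.Println(enabled.ClassifyPage("<html></html>"))         // map[page_type:login]
}
```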

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through lines of code tonight,

Dit taught pages to call wrong from right,
Flags unfurled paths for the crawler's roam,
Knowledge packed neatly in each response home,
A rabbit cheered — the crawler's wiser now, huzzah!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Out of Scope Changes check | ⚠️ Warning | Minor out-of-scope changes detected: Go version bumps (Dockerfile, go.mod), dependency updates (golang.org/x/net), and refactoring patterns (error handling in captcha module, string formatting in headless types) are unrelated to knowledge base classification. | Consider moving out-of-scope changes (Go version bumps, string formatting refactoring, error handling patterns) to separate PRs to keep knowledge base integration focused and easier to review. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title concisely and accurately summarizes the primary change: integrating dit-based knowledge base classification into the codebase. |
| Linked Issues check | ✅ Passed | All coding objectives from issue #1537 are met: ML classification via dit is integrated [#1537], two CLI flags are added and wired (-kb and -fpt) [#1537], classification works across all engines [#1537], and results appear in JSONL output under the knowledgebase field [#1537]. |


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@cmd/katana/main.go`:
- Line 217: The help text for the StringSlice flag defined via
flagSet.StringSliceVarP for options.FilterPageType (short name "fpt") doesn't
indicate that using -fpt will implicitly enable the knowledge-base flag (-kb);
update the flag's usage string to mention that "-fpt auto-enables -kb" (or
similar phrasing) so users see this behavior in the CLI help for the
FilterPageType flag.

In `@README.md`:
- Line 180: Update the flag documentation line that describes "-fpt,
-filter-page-type" to explicitly state that using -fpt will automatically enable
the -kb (knowledge-base classification) flag; mention both flags by name (-fpt /
-filter-page-type and -kb) so users know that page-type filtering implicitly
enables knowledge-base classification and won’t need to pass -kb separately.
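
The implicit behavior both comments ask to document can be sketched as a small normalization step. The `options` struct and `normalize` method below are hypothetical; the field names follow the PR summary, but the real types in `pkg/types` may differ.

```go
package main

import "fmt"

// options is a hypothetical subset of the CLI options.
type options struct {
	KnowledgeBase  bool     // set by -kb
	FilterPageType []string // set by -fpt
}

// normalize applies the implicit rule that page-type filtering needs
// classification data, so any -fpt value switches -kb on.
func (o *options) normalize() {
	if len(o.FilterPageType) > 0 {
		o.KnowledgeBase = true
	}
}

func main() {
	opts := &options{FilterPageType: []string{"login"}}
	opts.normalize()
	fmt.Println(opts.KnowledgeBase) // true
}
```

Because the rule lives in code rather than in the flag definition, the help text and README are the only places a user can learn about it, which is why both comments ask for an explicit mention.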
🧹 Nitpick comments (2)
Dockerfile (1)

1-12: Consider adding a non-root user for improved container security.

The Go version bump to 1.25.7 aligns correctly with go.mod. However, static analysis flags that the container runs as root, which goes against container security best practice. Consider adding a non-root user.

🔒 Proposed fix to run as non-root user
 FROM alpine:3.23.2
 RUN apk add --no-cache bind-tools ca-certificates chromium
+RUN adduser -D -u 1000 katana
 COPY --from=build-env /app/katana /usr/local/bin/
+USER katana
 
 ENTRYPOINT ["katana"]
pkg/engine/common/base.go (1)

249-258: Avoid double body-to-string conversion.
You build string(body) twice; reusing a single string reduces allocations on large responses.

♻️ Suggested refactor
-		body, _ := io.ReadAll(resp.Body)
-		reader, _ := goquery.NewDocumentFromReader(bytes.NewReader(body))
+		body, _ := io.ReadAll(resp.Body)
+		bodyStr := string(body)
+		reader, _ := goquery.NewDocumentFromReader(bytes.NewReader(body))
 		var technologyKeys []string
 		if s.Options.Wappalyzer != nil {
 			technologies := s.Options.Wappalyzer.Fingerprint(resp.Header, body)
 			technologyKeys = mapsutil.GetKeys(technologies)
 		}
 		navigationResponse := &navigation.Response{
 			Depth:         depth + 1,
 			RootHostname:  hostname,
 			Resp:          resp,
-			Body:          string(body),
+			Body:          bodyStr,
 			Reader:        reader,
 			Technologies:  technologyKeys,
 			StatusCode:    resp.StatusCode,
 			Headers:       utils.FlattenHeaders(resp.Header),
-			KnowledgeBase: s.Options.ClassifyPage(string(body)),
+			KnowledgeBase: s.Options.ClassifyPage(bodyStr),
 		}

neo-by-projectdiscovery-dev bot commented Mar 4, 2026

Neo - PR Security Review

No security issues found

Highlights

  • Version bump from v1.4.0 to v1.5.0 in banner.go
  • Lint fix in healthcheck.go: improved error handling for fmt.Fprintf

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/engine/headless/captcha/captcha.go`:
- Around line 29-31: Do not treat ditcaptcha.DetectCaptchaInHTML(pageHTML) ==
ditcaptcha.CaptchaTypeNone as authoritative; remove the early return and instead
fall back to DOM-based detection via Identify(page) when DetectCaptchaInHTML
returns None. Update the logic in the function that currently checks "if ct :=
ditcaptcha.DetectCaptchaInHTML(pageHTML); ct == ditcaptcha.CaptchaTypeNone {
return false, nil }" so that when ct == CaptchaTypeNone you call Identify(page)
(or otherwise invoke the existing DOM inspector used elsewhere) and use its
result to decide whether a captcha exists; keep using
ditcaptcha.DetectCaptchaInHTML's positive findings but rely on Identify(page)
for enterprise/generic fallbacks and return the combined detection outcome and
any error accordingly.
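
The control flow requested here (trust a positive HTML match, fall back to DOM inspection when nothing is found) could be sketched as follows. All types and function names are hypothetical; the stubs merely stand in for `ditcaptcha.DetectCaptchaInHTML` and `Identify(page)`, whose real signatures are not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// captchaType is a hypothetical stand-in for the detector's result type.
type captchaType string

const captchaNone captchaType = ""

// detectCaptcha trusts a positive result from the HTML-based detector and
// consults the DOM-based detector whenever the HTML pass finds nothing.
func detectCaptcha(
	html string,
	htmlDetect func(string) captchaType,
	domDetect func() (captchaType, error),
) (captchaType, error) {
	if ct := htmlDetect(html); ct != captchaNone {
		return ct, nil
	}
	return domDetect()
}

// stubHTMLDetect is a fake HTML detector that only knows one marker.
func stubHTMLDetect(html string) captchaType {
	if strings.Contains(html, "g-recaptcha") {
		return "recaptcha"
	}
	return captchaNone
}

// stubDOMDetect is a fake DOM inspector that always finds a generic widget.
func stubDOMDetect() (captchaType, error) {
	return "generic", nil
}

func main() {
	ct, _ := detectCaptcha(`<div class="g-recaptcha"></div>`, stubHTMLDetect, stubDOMDetect)
	fmt.Println(ct) // recaptcha
	ct, _ = detectCaptcha(`<div id="custom-widget"></div>`, stubHTMLDetect, stubDOMDetect)
	fmt.Println(ct) // generic
}
```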

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b050a303-d330-466f-9cc5-b6a10ef2ceb8

📥 Commits

Reviewing files that changed from the base of the PR and between 5150df5 and 411698a.

⛔ Files ignored due to path filters (6)
  • .github/workflows/build-test.yml is excluded by !**/*.yml
  • .github/workflows/functional-test.yml is excluded by !**/*.yml
  • .github/workflows/release-binary.yml is excluded by !**/*.yml
  • .github/workflows/release-test.yml is excluded by !**/*.yml
  • .github/workflows/security-crawl-maze-score.yaml is excluded by !**/*.yaml
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (8)
  • README.md
  • cmd/katana/main.go
  • go.mod
  • pkg/engine/common/base.go
  • pkg/engine/headless/captcha/captcha.go
  • pkg/engine/headless/captcha/identify_test.go
  • pkg/engine/headless/headless.go
  • pkg/types/options.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/engine/headless/headless.go
  • go.mod
  • README.md
  • pkg/types/options.go
  • pkg/engine/common/base.go
  • cmd/katana/main.go

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/runner/healthcheck.go`:
- Line 80: The healthcheck output string is mislabeled — fmt.Fprintf(&test, "UDP
IPv4 connectivity to scanme.sh:80 => %s\n", testResult) reports port 80 while
the actual probe targets scanme.sh:53; update the formatted message to show the
correct port (53) or interpolate the actual address/port variable used for the
probe so the label matches the real check (modify the fmt.Fprintf call that
writes to test using testResult to reference :53 or the probe variable).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 67a3b344-91dc-4e66-8ff6-8b623e903b35

📥 Commits

Reviewing files that changed from the base of the PR and between 38a79ac and ba8cc32.

📒 Files selected for processing (1)
  • internal/runner/healthcheck.go

 testResult = fmt.Sprintf("Ko (%s)", err)
 }
-test.WriteString(fmt.Sprintf("UDP IPv4 connectivity to scanme.sh:80 => %s\n", testResult))
+_, _ = fmt.Fprintf(&test, "UDP IPv4 connectivity to scanme.sh:80 => %s\n", testResult)

⚠️ Potential issue | 🟡 Minor

Fix UDP IPv4 healthcheck label mismatch.

Line 80 reports port 80, but the actual check uses scanme.sh:53 (Line 72). This makes healthcheck output misleading.

Proposed fix
-	_, _ = fmt.Fprintf(&test, "UDP IPv4 connectivity to scanme.sh:80 => %s\n", testResult)
+	_, _ = fmt.Fprintf(&test, "UDP IPv4 connectivity to scanme.sh:53 => %s\n", testResult)

dogancanbakir merged commit 2f9ecf4 into dev on Mar 7, 2026
16 of 17 checks passed
dogancanbakir deleted the feature/knowledge-base-dit branch on March 7, 2026 at 00:08

Linked issue: Add knowledge base classification (dit)