Add knowledge base classification using dit by dogancanbakir · Pull Request #1538 · projectdiscovery/katana

dogancanbakir · 2026-02-15T18:43:18Z

Add -kb and -fpt flags for ML-based page/form/field classification via dit. Works across all engines. Bumps Go to 1.25.x.

Summary by CodeRabbit

New Features
- Adds knowledge-base page classification included in response output.
- New CLI flags to enable classification and to filter results by page type.
Bug Fixes / Behavior
- Output can now filter results by page type when classification data is present.
Chores
- Bumped Go toolchain and base build image; updated related dependencies.
Documentation
- Installation notes and CLI usage updated for Go 1.25+ and new flags.
Tests
- Added captcha-detection tests covering multiple scenarios.

- Add -kb flag to enable page/form/field classification via dit
- Add -fpt flag to filter results by page type (auto-enables -kb)
- Classification works across all engines (standard, headless, hybrid)
- KB data included in jsonl output via knowledgebase field
- Bump Go version to 1.25.x across CI, Dockerfile, and docs
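
For readers unfamiliar with the output shape, here is a minimal sketch of how a `knowledgebase` field could appear in one JSONL record. Only the field name `knowledgebase` and its `map[string]any` type come from this PR; the trimmed `Response` struct, the `status_code` field, the `omitempty` tag, and the `page_type` key are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Response is a trimmed-down, hypothetical stand-in for katana's
// navigation.Response; only the KnowledgeBase field mirrors the PR.
type Response struct {
	StatusCode    int            `json:"status_code"`
	KnowledgeBase map[string]any `json:"knowledgebase,omitempty"` // omitempty is an assumption
}

// encodeLine renders one JSONL record (a single JSON object, no trailing newline).
func encodeLine(r Response) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, err := encodeLine(Response{
		StatusCode:    200,
		KnowledgeBase: map[string]any{"page_type": "login"}, // hypothetical key and value
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

With `omitempty` as assumed here, records from runs without -kb would simply omit the field rather than emit `null`.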

coderabbitai · 2026-02-15T18:43:36Z

Walkthrough

Adds optional ML-based page/form/field classification via the dit library, new CLI flags to enable and filter classifications, integrates classification into crawler engines and responses, and applies optional page-type filtering at output time.

Changes

Cohort / File(s)	Summary
Build & deps `Dockerfile`, `go.mod`	Bump build base to `golang:1.25.7-alpine`; update Go toolchain to `go1.25.7`; add `github.com/happyhackingspace/dit` and bump transitive `golang.org/x/net`.
CLI & options `cmd/katana/main.go`, `pkg/types/options.go`, `README.md`	Add `-kb/--knowledge-base` (bool) and `-fpt/--filter-page-type` (string slice); expose fields on `Options`; update README installation and flag docs.
Crawler options & classifier `pkg/types/crawler_options.go`	Introduce `DitClassifier *dit.Classifier`; initialize when KB is enabled; add `ClassifyPage(body string) map[string]any`.
Response model `pkg/navigation/response.go`	Add exported `KnowledgeBase map[string]any` to `navigation.Response` (JSON `knowledgebase`).
Engine integrations `pkg/engine/common/base.go`, `pkg/engine/headless/headless.go`, `pkg/engine/hybrid/crawl.go`, `pkg/engine/standard/crawl.go`	Populate `Response.KnowledgeBase` by invoking `ClassifyPage` on page body across crawl engines; headless classifies before clearing body.
Output filtering & wiring `pkg/output/options.go`, `pkg/output/output.go`	Add `FilterPageType []string` option; wire into writers; implement case-insensitive filtering using `KnowledgeBase.PageType`, rejecting matching results with "result filtered by page type".
Captcha detection & tests `pkg/engine/headless/captcha/captcha.go`, `pkg/engine/headless/captcha/identify_test.go`	Add early captcha detection via `ditcaptcha.DetectCaptchaInHTML`; add `TestDetectCaptchaWithDit` unit test.
Minor housekeeping `pkg/engine/headless/captcha/capsolver/capsolver.go`, `pkg/engine/headless/captcha/helpers_test.go`, `pkg/engine/headless/types/...`, `internal/runner/healthcheck.go`	Small error-handling and I/O changes (ignore Close errors, discard write returns), and fmt.Fprintf refactors in String builders.
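
The output-filtering row above can be sketched roughly as follows. This is not the code from `pkg/output/output.go`; `checkPageType` and `errFiltered` are hypothetical names, and treating an empty page type as "never filtered" is an assumption based on the note that filtering applies when classification data is present. The error message is the one quoted in the table.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errFiltered reuses the rejection message quoted in the summary table.
var errFiltered = errors.New("result filtered by page type")

// checkPageType returns errFiltered when pageType equals any filter entry,
// ignoring case. An empty pageType (no classification data) is never filtered.
func checkPageType(pageType string, filters []string) error {
	if pageType == "" {
		return nil
	}
	for _, f := range filters {
		if strings.EqualFold(pageType, f) {
			return errFiltered
		}
	}
	return nil
}

func main() {
	filters := []string{"Login", "error"} // hypothetical page-type labels
	for _, pt := range []string{"login", "blog", ""} {
		fmt.Printf("%q -> %v\n", pt, checkPageType(pt, filters))
	}
}
```

`strings.EqualFold` avoids allocating lowercased copies of both strings on every comparison, which matters little here but keeps the check simple.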

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Parser
    participant Opts as Options
    participant Crawler as Crawler Init
    participant Dit as Dit Classifier
    participant Engine as Crawl Engine
    participant Writer as Output Writer
    participant User as User

    CLI->>Opts: parse -kb and -fpt
    Opts->>Crawler: init with KnowledgeBase & FilterPageType
    Crawler->>Dit: create/init DitClassifier
    Engine->>Engine: fetch page content
    Engine->>Dit: ClassifyPage(body)
    Dit-->>Engine: {PageType, Forms, Fields}
    Engine->>Engine: set Response.KnowledgeBase
    Engine->>Writer: Write(Response)
    Writer->>Writer: check KnowledgeBase.PageType vs FilterPageType
    alt PageType matches filter
        Writer-->>User: skip result (filtered)
    else
        Writer-->>User: emit result (JSON/formatted)
    end
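
A rough sketch of the classify step in the diagram, assuming a nil-safe `ClassifyPage` so that every engine can call it unconditionally. The `pageClassifier` interface, its `Classify` method, the `stubClassifier`, and the `page_type` key are all hypothetical; the real `*dit.Classifier` API is not shown on this page.

```go
package main

import "fmt"

// pageClassifier is a hypothetical interface standing in for *dit.Classifier,
// whose real API is not shown in this PR page.
type pageClassifier interface {
	Classify(body string) (map[string]any, error)
}

// crawlerOptions mimics the role of katana's CrawlerOptions for this sketch.
type crawlerOptions struct {
	classifier pageClassifier // nil when -kb is not set
}

// ClassifyPage returns nil when classification is disabled, the body is empty,
// or the classifier fails, so engines can call it unconditionally.
func (o *crawlerOptions) ClassifyPage(body string) map[string]any {
	if o == nil || o.classifier == nil || body == "" {
		return nil
	}
	kb, err := o.classifier.Classify(body)
	if err != nil {
		return nil
	}
	return kb
}

// stubClassifier labels every page the same way; for demonstration only.
type stubClassifier struct{}

func (stubClassifier) Classify(string) (map[string]any, error) {
	return map[string]any{"page_type": "login"}, nil
}

func main() {
	disabled := &crawlerOptions{}
	enabled := &crawlerOptions{classifier: stubClassifier{}}
	fmt.Println(disabled.ClassifyPage("<html></html>") == nil)
	fmt.Println(enabled.ClassifyPage("<html></html>"))
}
```

Returning nil instead of an error keeps the call sites in each engine to a single struct-field assignment, at the cost of hiding classifier failures; whether katana makes the same trade-off is not stated here.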

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through lines of code tonight,

Dit taught pages to call wrong from right,
Flags unfurled paths for the crawler's roam,
Knowledge packed neatly in each response home,
A rabbit cheered — the crawler's wiser now, huzzah!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	⚠️ Warning	Minor out-of-scope changes detected: Go version bumps (Dockerfile, go.mod), dependency updates (golang.org/x/net), and refactoring patterns (error handling in captcha module, string formatting in headless types) are unrelated to knowledge base classification.	Consider moving out-of-scope changes (Go version bumps, string formatting refactoring, error handling patterns) to separate PRs to keep knowledge base integration focused and easier to review.
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title concisely and accurately summarizes the primary change: integrating dit-based knowledge base classification into the codebase.
Linked Issues check	✅ Passed	All coding objectives from issue `#1537` are met: ML classification via dit is integrated [`#1537`], two CLI flags are added and wired (-kb and -fpt) [`#1537`], classification works across all engines [`#1537`], and results appear in JSONL output under the knowledgebase field [`#1537`].


coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@cmd/katana/main.go`:
- Line 217: The help text for the StringSlice flag defined via
flagSet.StringSliceVarP for options.FilterPageType (short name "fpt") doesn't
indicate that using -fpt will implicitly enable the knowledge-base flag (-kb);
update the flag's usage string to mention that "-fpt auto-enables -kb" (or
similar phrasing) so users see this behavior in the CLI help for the
FilterPageType flag.

In `@README.md`:
- Line 180: Update the flag documentation line that describes "-fpt,
-filter-page-type" to explicitly state that using -fpt will automatically enable
the -kb (knowledge-base classification) flag; mention both flags by name (-fpt /
-filter-page-type and -kb) so users know that page-type filtering implicitly
enables knowledge-base classification and won’t need to pass -kb separately.
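
The implicit behavior both comments ask to document can be sketched as a small normalization step. The field names `KnowledgeBase` and `FilterPageType` follow the walkthrough; the trimmed `options` type and the `normalize` function are hypothetical and may not match where katana actually applies this rule.

```go
package main

import "fmt"

// options holds only the two fields relevant here; names follow the PR's
// Options struct, but this trimmed type is hypothetical.
type options struct {
	KnowledgeBase  bool     // -kb
	FilterPageType []string // -fpt
}

// normalize turns on classification whenever a page-type filter is supplied,
// because filtering needs classification data to compare against.
func normalize(o *options) {
	if len(o.FilterPageType) > 0 {
		o.KnowledgeBase = true
	}
}

func main() {
	o := &options{FilterPageType: []string{"login"}}
	normalize(o)
	fmt.Println(o.KnowledgeBase)
}
```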

🧹 Nitpick comments (2)

Dockerfile (1)

1-12: Consider adding a non-root user for improved container security.

The Go version bump to 1.25.7 aligns correctly with go.mod. However, static analysis flags that the container runs as root, which goes against container security best practices. Consider adding a non-root user.
🔒 Proposed fix to run as non-root user
 FROM alpine:3.23.2
 RUN apk add --no-cache bind-tools ca-certificates chromium
+RUN adduser -D -u 1000 katana
 COPY --from=build-env /app/katana /usr/local/bin/
+USER katana
 
 ENTRYPOINT ["katana"]

pkg/engine/common/base.go (1)

249-258: Avoid double body-to-string conversion.
You build string(body) twice; reusing a single string reduces allocations on large responses.

♻️ Suggested refactor

-		body, _ := io.ReadAll(resp.Body)
-		reader, _ := goquery.NewDocumentFromReader(bytes.NewReader(body))
+		body, _ := io.ReadAll(resp.Body)
+		bodyStr := string(body)
+		reader, _ := goquery.NewDocumentFromReader(bytes.NewReader(body))
 		var technologyKeys []string
 		if s.Options.Wappalyzer != nil {
 			technologies := s.Options.Wappalyzer.Fingerprint(resp.Header, body)
 			technologyKeys = mapsutil.GetKeys(technologies)
 		}
 		navigationResponse := &navigation.Response{
 			Depth:         depth + 1,
 			RootHostname:  hostname,
 			Resp:          resp,
-			Body:          string(body),
+			Body:          bodyStr,
 			Reader:        reader,
 			Technologies:  technologyKeys,
 			StatusCode:    resp.StatusCode,
 			Headers:       utils.FlattenHeaders(resp.Header),
-			KnowledgeBase: s.Options.ClassifyPage(string(body)),
+			KnowledgeBase: s.Options.ClassifyPage(bodyStr),
 		}


neo-by-projectdiscovery-dev · 2026-03-04T17:11:37Z

Neo - PR Security Review

No security issues found

Highlights

Version bump from v1.4.0 to v1.5.0 in banner.go
Lint fix in healthcheck.go: improved error handling for fmt.Fprintf


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/engine/headless/captcha/captcha.go`:
- Around line 29-31: Do not treat ditcaptcha.DetectCaptchaInHTML(pageHTML) ==
ditcaptcha.CaptchaTypeNone as authoritative; remove the early return and instead
fall back to DOM-based detection via Identify(page) when DetectCaptchaInHTML
returns None. Update the logic in the function that currently checks "if ct :=
ditcaptcha.DetectCaptchaInHTML(pageHTML); ct == ditcaptcha.CaptchaTypeNone {
return false, nil }" so that when ct == CaptchaTypeNone you call Identify(page)
(or otherwise invoke the existing DOM inspector used elsewhere) and use its
result to decide whether a captcha exists; keep using
ditcaptcha.DetectCaptchaInHTML's positive findings but rely on Identify(page)
for enterprise/generic fallbacks and return the combined detection outcome and
any error accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b050a303-d330-466f-9cc5-b6a10ef2ceb8

📥 Commits

Reviewing files that changed from the base of the PR and between 5150df5 and 411698a.

⛔ Files ignored due to path filters (6)

.github/workflows/build-test.yml is excluded by !**/*.yml
.github/workflows/functional-test.yml is excluded by !**/*.yml
.github/workflows/release-binary.yml is excluded by !**/*.yml
.github/workflows/release-test.yml is excluded by !**/*.yml
.github/workflows/security-crawl-maze-score.yaml is excluded by !**/*.yaml
go.sum is excluded by !**/*.sum

📒 Files selected for processing (8)

README.md
cmd/katana/main.go
go.mod
pkg/engine/common/base.go
pkg/engine/headless/captcha/captcha.go
pkg/engine/headless/captcha/identify_test.go
pkg/engine/headless/headless.go
pkg/types/options.go

🚧 Files skipped from review as they are similar to previous changes (6)

pkg/engine/headless/headless.go
go.mod
README.md
pkg/types/options.go
pkg/engine/common/base.go
cmd/katana/main.go


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/runner/healthcheck.go`:
- Line 80: The healthcheck output string is mislabeled — fmt.Fprintf(&test, "UDP
IPv4 connectivity to scanme.sh:80 => %s\n", testResult) reports port 80 while
the actual probe targets scanme.sh:53; update the formatted message to show the
correct port (53) or interpolate the actual address/port variable used for the
probe so the label matches the real check (modify the fmt.Fprintf call that
writes to test using testResult to reference :53 or the probe variable).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 67a3b344-91dc-4e66-8ff6-8b623e903b35

📥 Commits

Reviewing files that changed from the base of the PR and between 38a79ac and ba8cc32.

📒 Files selected for processing (1)

internal/runner/healthcheck.go

coderabbitai · 2026-03-06T15:19:33Z

internal/runner/healthcheck.go

 		testResult = fmt.Sprintf("Ko (%s)", err)
 	}
-	test.WriteString(fmt.Sprintf("UDP IPv4 connectivity to scanme.sh:80 => %s\n", testResult))
+	_, _ = fmt.Fprintf(&test, "UDP IPv4 connectivity to scanme.sh:80 => %s\n", testResult)


⚠️ Potential issue | 🟡 Minor

Fix UDP IPv4 healthcheck label mismatch.

Line 80 reports port 80, but the actual check uses scanme.sh:53 (Line 72). This makes healthcheck output misleading.

Proposed fix

-	_, _ = fmt.Fprintf(&test, "UDP IPv4 connectivity to scanme.sh:80 => %s\n", testResult)
+	_, _ = fmt.Fprintf(&test, "UDP IPv4 connectivity to scanme.sh:53 => %s\n", testResult)


fix: use golangci-lint v2 action for Go 1.25 compat

b136de5

coderabbitai bot reviewed Feb 15, 2026

fix: use projectdiscovery/actions/setup/go in all workflows

fb60a5c

dogancanbakir requested review from Ice3man543, Mzack9999 and knakul853 February 15, 2026 19:06

dogancanbakir added 2 commits March 4, 2026 18:04

merge dev

f8628da

use dit for captcha type detection pre-check

411698a

coderabbitai bot reviewed Mar 4, 2026

dogancanbakir and others added 4 commits March 4, 2026 20:22

fallback to data-sitekey check when dit detects none

39abb7e

fix lint errors

38a79ac

Merge branch 'dev' into feature/knowledge-base-dit

15dc0ee

fixing lint

ba8cc32

coderabbitai bot reviewed Mar 6, 2026

Mzack9999 approved these changes Mar 6, 2026

dogancanbakir merged commit 2f9ecf4 into dev Mar 7, 2026
16 of 17 checks passed

dogancanbakir deleted the feature/knowledge-base-dit branch March 7, 2026 00:08

dogancanbakir merged 9 commits into dev from feature/knowledge-base-dit
