Add knowledge base classification using dit #1538
- Add -kb flag to enable page/form/field classification via dit
- Add -fpt flag to filter results by page type (auto-enables -kb)
- Classification works across all engines (standard, headless, hybrid)
- KB data included in jsonl output via knowledgebase field
- Bump Go version to 1.25.x across CI, Dockerfile, and docs
Walkthrough
Adds optional ML-based page/form/field classification via the dit library, new CLI flags to enable and filter classifications, integrates classification into crawler engines and responses, and applies optional page-type filtering at output time.
Sequence Diagram
sequenceDiagram
participant CLI as CLI Parser
participant Opts as Options
participant Crawler as Crawler Init
participant Dit as Dit Classifier
participant Engine as Crawl Engine
participant Writer as Output Writer
participant User as User
CLI->>Opts: parse -kb and -fpt
Opts->>Crawler: init with KnowledgeBase & FilterPageType
Crawler->>Dit: create/init DitClassifier
Engine->>Engine: fetch page content
Engine->>Dit: ClassifyPage(body)
Dit-->>Engine: {PageType, Forms, Fields}
Engine->>Engine: set Response.KnowledgeBase
Engine->>Writer: Write(Response)
Writer->>Writer: check KnowledgeBase.PageType vs FilterPageType
alt PageType matches filter
Writer-->>User: skip result (filtered)
else
Writer-->>User: emit result (JSON/formatted)
end
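The output-time branch at the end of the diagram can be sketched as follows. This is a minimal illustration, not the PR's code: the `Response` and `KnowledgeBase` types are trimmed-down, hypothetical stand-ins, `shouldSkip` is an invented helper name, and the case-insensitive comparison is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// KnowledgeBase is a hypothetical stand-in for the classification result
// attached to each response; the real type in the PR may differ.
type KnowledgeBase struct {
	PageType string
}

// Response is a hypothetical, trimmed-down crawl result.
type Response struct {
	URL           string
	KnowledgeBase *KnowledgeBase
}

// shouldSkip reports whether resp is dropped because its page type appears
// in the filter list. Responses without classification data are never skipped.
// Case-insensitive matching is an assumption of this sketch.
func shouldSkip(resp Response, filterPageTypes []string) bool {
	if resp.KnowledgeBase == nil {
		return false
	}
	for _, t := range filterPageTypes {
		if strings.EqualFold(t, resp.KnowledgeBase.PageType) {
			return true
		}
	}
	return false
}

func main() {
	filter := []string{"error", "captcha"}
	results := []Response{
		{URL: "https://example.com/login", KnowledgeBase: &KnowledgeBase{PageType: "login"}},
		{URL: "https://example.com/404", KnowledgeBase: &KnowledgeBase{PageType: "error"}},
		{URL: "https://example.com/raw"}, // not classified
	}
	for _, r := range results {
		if shouldSkip(r, filter) {
			continue // filtered out, nothing is emitted
		}
		fmt.Println(r.URL)
	}
}
```

Keeping the check in the writer, as the diagram does, means every engine shares one filtering path.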
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@cmd/katana/main.go`:
- Line 217: The help text for the StringSlice flag defined via
flagSet.StringSliceVarP for options.FilterPageType (short name "fpt") doesn't
indicate that using -fpt will implicitly enable the knowledge-base flag (-kb);
update the flag's usage string to mention that "-fpt auto-enables -kb" (or
similar phrasing) so users see this behavior in the CLI help for the
FilterPageType flag.
In `@README.md`:
- Line 180: Update the flag documentation line that describes "-fpt,
-filter-page-type" to explicitly state that using -fpt will automatically enable
the -kb (knowledge-base classification) flag; mention both flags by name (-fpt /
-filter-page-type and -kb) so users know that page-type filtering implicitly
enables knowledge-base classification and won’t need to pass -kb separately.
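The implicit behavior both comments ask to document can be sketched as a small normalization step. This is an assumption-laden illustration: the `Options` struct here is a hypothetical two-field subset and `normalize` is an invented helper, not a function from the PR.

```go
package main

import "fmt"

// Options is a hypothetical subset of the crawler options.
type Options struct {
	KnowledgeBase  bool     // set by -kb
	FilterPageType []string // set by -fpt
}

// normalize applies the implicit rule described in the review: a non-empty
// page-type filter needs classification data, so it turns -kb on.
func normalize(o *Options) {
	if len(o.FilterPageType) > 0 {
		o.KnowledgeBase = true
	}
}

func main() {
	// The user passed only -fpt; -kb was not given explicitly.
	opts := &Options{FilterPageType: []string{"error"}}
	normalize(opts)
	fmt.Println("knowledge base enabled:", opts.KnowledgeBase)
}
```

Because the rule lives in code rather than in the flag definition, the help text and README are the only places a user can learn about it, which is why both comments ask for the wording change.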
🧹 Nitpick comments (2)
Dockerfile (1)
1-12: Consider adding a non-root user for improved container security.
The Go version bump to 1.25.7 aligns correctly with go.mod. However, per static analysis, the container runs as root, which violates a security best practice. Consider adding a non-root user.
🔒 Proposed fix to run as non-root user
  FROM alpine:3.23.2
  RUN apk add --no-cache bind-tools ca-certificates chromium
+ RUN adduser -D -u 1000 katana
  COPY --from=build-env /app/katana /usr/local/bin/
+ USER katana
  ENTRYPOINT ["katana"]
pkg/engine/common/base.go (1)
249-258: Avoid double body-to-string conversion.
You build `string(body)` twice; reusing a single string reduces allocations on large responses.
♻️ Suggested refactor
- body, _ := io.ReadAll(resp.Body)
- reader, _ := goquery.NewDocumentFromReader(bytes.NewReader(body))
+ body, _ := io.ReadAll(resp.Body)
+ bodyStr := string(body)
+ reader, _ := goquery.NewDocumentFromReader(bytes.NewReader(body))
  var technologyKeys []string
  if s.Options.Wappalyzer != nil {
      technologies := s.Options.Wappalyzer.Fingerprint(resp.Header, body)
      technologyKeys = mapsutil.GetKeys(technologies)
  }
  navigationResponse := &navigation.Response{
      Depth: depth + 1,
      RootHostname: hostname,
      Resp: resp,
-     Body: string(body),
+     Body: bodyStr,
      Reader: reader,
      Technologies: technologyKeys,
      StatusCode: resp.StatusCode,
      Headers: utils.FlattenHeaders(resp.Header),
-     KnowledgeBase: s.Options.ClassifyPage(string(body)),
+     KnowledgeBase: s.Options.ClassifyPage(bodyStr),
  }
Neo - PR Security Review
No security issues found
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/engine/headless/captcha/captcha.go`:
- Around line 29-31: Do not treat ditcaptcha.DetectCaptchaInHTML(pageHTML) ==
ditcaptcha.CaptchaTypeNone as authoritative; remove the early return and instead
fall back to DOM-based detection via Identify(page) when DetectCaptchaInHTML
returns None. Update the logic in the function that currently checks "if ct :=
ditcaptcha.DetectCaptchaInHTML(pageHTML); ct == ditcaptcha.CaptchaTypeNone {
return false, nil }" so that when ct == CaptchaTypeNone you call Identify(page)
(or otherwise invoke the existing DOM inspector used elsewhere) and use its
result to decide whether a captcha exists; keep using
ditcaptcha.DetectCaptchaInHTML's positive findings but rely on Identify(page)
for enterprise/generic fallbacks and return the combined detection outcome and
any error accordingly.
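The fallback requested above can be sketched like this. The sketch does not call dit or the real `Identify(page)`; `CaptchaType`, `detectCaptcha`, and the two detector parameters are hypothetical stand-ins so that the control flow is runnable on its own.

```go
package main

import "fmt"

// CaptchaType is a hypothetical stand-in for the captcha type returned by
// the HTML-based detector.
type CaptchaType string

// CaptchaTypeNone means the HTML-based detector found nothing.
const CaptchaTypeNone CaptchaType = ""

// detectCaptcha trusts a positive HTML-based finding, but treats "none" as
// inconclusive and defers to the DOM-based detector in that case.
// htmlDetect and domDetect are injected so the sketch needs no browser.
func detectCaptcha(
	pageHTML string,
	htmlDetect func(string) CaptchaType,
	domDetect func() (bool, error),
) (bool, error) {
	if ct := htmlDetect(pageHTML); ct != CaptchaTypeNone {
		return true, nil
	}
	// No early return here: fall back to DOM inspection.
	return domDetect()
}

func main() {
	htmlDetect := func(string) CaptchaType { return CaptchaTypeNone }
	domDetect := func() (bool, error) { return true, nil }

	found, err := detectCaptcha("<html></html>", htmlDetect, domDetect)
	fmt.Println("captcha found:", found, "error:", err)
}
```

The DOM detector's error is returned unchanged, so a failed page inspection is not silently reported as "no captcha".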
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b050a303-d330-466f-9cc5-b6a10ef2ceb8
⛔ Files ignored due to path filters (6)
- .github/workflows/build-test.yml is excluded by !**/*.yml
- .github/workflows/functional-test.yml is excluded by !**/*.yml
- .github/workflows/release-binary.yml is excluded by !**/*.yml
- .github/workflows/release-test.yml is excluded by !**/*.yml
- .github/workflows/security-crawl-maze-score.yaml is excluded by !**/*.yaml
- go.sum is excluded by !**/*.sum
📒 Files selected for processing (8)
- README.md
- cmd/katana/main.go
- go.mod
- pkg/engine/common/base.go
- pkg/engine/headless/captcha/captcha.go
- pkg/engine/headless/captcha/identify_test.go
- pkg/engine/headless/headless.go
- pkg/types/options.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/engine/headless/headless.go
- go.mod
- README.md
- pkg/types/options.go
- pkg/engine/common/base.go
- cmd/katana/main.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/runner/healthcheck.go`:
- Line 80: The healthcheck output string is mislabeled — fmt.Fprintf(&test, "UDP
IPv4 connectivity to scanme.sh:80 => %s\n", testResult) reports port 80 while
the actual probe targets scanme.sh:53; update the formatted message to show the
correct port (53) or interpolate the actual address/port variable used for the
probe so the label matches the real check (modify the fmt.Fprintf call that
writes to test using testResult to reference :53 or the probe variable).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 67a3b344-91dc-4e66-8ff6-8b623e903b35
📒 Files selected for processing (1)
internal/runner/healthcheck.go
  testResult = fmt.Sprintf("Ko (%s)", err)
  }
- test.WriteString(fmt.Sprintf("UDP IPv4 connectivity to scanme.sh:80 => %s\n", testResult))
+ _, _ = fmt.Fprintf(&test, "UDP IPv4 connectivity to scanme.sh:80 => %s\n", testResult)
Fix UDP IPv4 healthcheck label mismatch.
Line 80 reports port 80, but the actual check uses scanme.sh:53 (Line 72). This makes healthcheck output misleading.
Proposed fix
- _, _ = fmt.Fprintf(&test, "UDP IPv4 connectivity to scanme.sh:80 => %s\n", testResult)
+ _, _ = fmt.Fprintf(&test, "UDP IPv4 connectivity to scanme.sh:53 => %s\n", testResult)
Closes #1537
Add `-kb` and `-fpt` flags for ML-based page/form/field classification via dit. Works across all engines. Bumps Go to 1.25.x.
Summary by CodeRabbit
New Features
Bug Fixes / Behavior
Chores
Documentation
Tests