Three workshop submissions sharing ~70% of the body. Build order below is priority order, not chronological. Tasks marked [SHARED] feed all three papers; tasks marked [WORKSHOP] are workshop-specific.
- Three branches off current draft: `aiwild`, `ai4good`, `fmai`
- Shared `core/` for figures, tables, body sections reused across all three
- Single `experiments/` directory; results organized by experiment name
- Each branch owns its own intro, related work, threat model wrapper, discussion
Currently the paper reports PSNR only. Limitations section flags this.
- Dependencies: `lpips` (Alex backbone), `pytorch_msssim`
- Re-run 300-CIFAR eval at ε=2.0 and ε=5.0 reporting PSNR, LPIPS, SSIM
- Re-run cross-VAE eval with same three metrics
- Update Table 1, Figure 4, all reconstruction-quality claims
- Pure metric addition, no new data collection
Reviewers will want operating-point info, not just AUC.
- Compute precision and recall at fixed FPR thresholds (0.1%, 1%, 5%) for each ε
- Report confusion matrices at the 1% FPR operating point
- Same data as existing detector sweep, just different summary stats
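A minimal sketch of the operating-point summary, assuming the existing sweep yields one scalar detector score per image (higher = more suspicious). The function names, and the rule that a score strictly above the threshold counts as flagged, are illustrative assumptions, not the current codebase.

```python
from typing import Dict, Sequence


def threshold_at_fpr(clean_scores: Sequence[float], target_fpr: float) -> float:
    """Return a threshold whose empirical FPR on clean_scores is <= target_fpr.

    An image counts as flagged when its score is strictly above the threshold.
    """
    ranked = sorted(clean_scores, reverse=True)
    # Number of clean images we may flag; the small offset guards against
    # float round-off in products such as 0.07 * 100.
    allowed = int(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[allowed]


def operating_point(
    clean_scores: Sequence[float],
    stego_scores: Sequence[float],
    target_fpr: float,
) -> Dict[str, float]:
    """Confusion matrix, precision and recall at a fixed-FPR threshold."""
    thr = threshold_at_fpr(clean_scores, target_fpr)
    tp = sum(s > thr for s in stego_scores)
    fp = sum(s > thr for s in clean_scores)
    fn = len(stego_scores) - tp
    tn = len(clean_scores) - fp
    return {
        "threshold": thr,
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }
```

Two caveats when reporting: precision depends on the stego/clean ratio in the eval set, so state that ratio next to the table; and with fewer than 1,000 clean images the 0.1% threshold allows zero false positives, so that row is only as fine-grained as the sample size permits.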
Current robustness tests use JPEG Q=10 and aggressive resize. Replace with realistic deployment distortions.
- Add: PNG→JPEG Q=70→PNG roundtrip
- Add: social-media-style downscale (resize to 1024 max edge) + JPEG Q=80
- Keep: VAE re-encode, screenshot simulation, additive noise σ=0.10
- Drop or move-to-appendix: JPEG Q=10 (too unrealistic)
- Optional if fast: one neural-compression baseline (e.g., `compressai` `bmshj2018`)
Highest-leverage new result. Without this, the FMAI paper is "theoretically a covert channel exists." With this, it's "here is the channel running through a realistic agent pipeline."
- Two VLM agents (use Claude Sonnet 4.5 or GPT-4V via API) with PatchSteg embed/extract as a tool
- Shared workspace contains an image and a caption (mimics a multi-agent platform)
- Third agent acts as monitor: sees pixels + caption, outputs flagged/not-flagged + natural-language description of image
- Task: sender embeds 32-bit secret payload into a cover image, posts to workspace with innocuous caption; receiver retrieves and extracts; monitor inspects
- Metrics:
- End-to-end payload recovery rate (target ≥95% at ε=2.0)
- Monitor flag rate on stego vs clean (target: indistinguishable)
- Monitor's natural-language descriptions: do they mention anything visually wrong? (qualitative table)
- Latency and cost per message (operational realism)
- Scale: 50 trials, 32-bit payloads, 5 cover images, varied captions and payloads
- Output: one figure + one table for FMAI paper, one figure for AI4GOOD paper if time permits
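A minimal sketch of how the trial log could be reduced to the headline numbers. The `Trial` record and its field names are hypothetical. It also assumes "payload recovery" means all 32 bits match exactly, with bit-level accuracy reported alongside, and that the monitor sees a clean counterpart for every stego trial.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Trial:
    sent_bits: int        # 32-bit payload embedded by the sender
    received_bits: int    # payload extracted by the receiver
    flagged_stego: bool   # monitor verdict on the stego image
    flagged_clean: bool   # monitor verdict on the matching clean cover


def bit_errors(a: int, b: int, payload_bits: int = 32) -> int:
    """Hamming distance between two payloads of payload_bits bits."""
    mask = (1 << payload_bits) - 1
    return bin((a ^ b) & mask).count("1")


def summarize(trials: Sequence[Trial], payload_bits: int = 32) -> Dict[str, float]:
    """Aggregate end-to-end recovery and monitor flag rates over all trials."""
    n = len(trials)
    exact = sum(t.sent_bits == t.received_bits for t in trials)
    wrong = sum(bit_errors(t.sent_bits, t.received_bits, payload_bits) for t in trials)
    return {
        "payload_recovery_rate": exact / n,
        "bit_accuracy": 1.0 - wrong / (n * payload_bits),
        "flag_rate_stego": sum(t.flagged_stego for t in trials) / n,
        "flag_rate_clean": sum(t.flagged_clean for t in trials) / n,
    }
```

With only 50 trials, the gap between the two flag rates has wide uncertainty, so report an interval or the raw counts, not just the point estimates.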
Current sanitization is described as "preliminary safety probes." FMAI's "verified fixes" output requires a real Pareto.
- Sanitizers to evaluate:
- Latent smoothing (Gaussian blur on latent), σ ∈ {0.1, 0.3, 1.0}
- Quantile reflection (existing implementation)
- Additive Gaussian noise on latent, σ ∈ {0.05, 0.1, 0.3}
- JPEG re-encode at Q ∈ {95, 85, 70, 50}
- SDEdit low-strength regeneration (strength ∈ {0.1, 0.2, 0.3}) — most expensive, do last
- Metrics axes:
- X-axis: bit accuracy after sanitization (lower = better defense; 50% is chance level for binary payloads)
- Y-axis: LPIPS distance from original clean image (lower = better fidelity preservation)
- Each sanitizer is a curve across its parameter sweep
- Scale: 100 stego images at ε=2.0 across natural-image set
- Output: single Pareto plot. Identify which sanitizers Pareto-dominate. Expected finding: no sanitizer is free; all hurt benign quality measurably. State this as a negative result for defenses.
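A minimal sketch of the dominance check behind the Pareto plot. The `Point` record is hypothetical. Measuring leakage as distance from the 50% chance level is an assumption layered on the "lower = better defense" axis, so that an accuracy below chance is not mistaken for a stronger defense.

```python
from typing import List, NamedTuple, Sequence


class Point(NamedTuple):
    sanitizer: str   # e.g. "jpeg", "latent_noise"
    param: float     # sweep value (Q, sigma, strength)
    bit_acc: float   # bit accuracy after sanitization, in [0, 1]
    lpips: float     # LPIPS distance from the original clean image


def leak(p: Point) -> float:
    """Residual channel strength: distance of bit accuracy from chance."""
    return abs(p.bit_acc - 0.5)


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse on both axes and better on one."""
    no_worse = leak(a) <= leak(b) and a.lpips <= b.lpips
    better = leak(a) < leak(b) or a.lpips < b.lpips
    return no_worse and better


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Points not dominated by any other point in the sweep."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Running `pareto_front` over every (sanitizer, parameter) pair shows which settings are worth plotting as the frontier. A sanitizer with no setting on the front is dominated throughout its sweep.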
Current detector AUC tests one fixed sender against one fixed detector. FMAI cares about robustness under adaptation.
- Train latent-statistics logistic regression detector on existing protocol
- Implement gradient-aware sender:
- At each carrier position, compute detector gradient w.r.t. latent
- Project ε·d onto null space of detector gradient direction
- Renormalize to maintain ‖perturbation‖ = ε
- Metrics: detector AUC at ε=2.0 with adaptive vs non-adaptive sender; bit accuracy under both
- Scale: 100 images, single ε
- Output: one row added to detector table; one paragraph in defense section about arms race
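A minimal sketch of the projection step for one carrier position, assuming the latent at that position is a flat vector and the detector gradient has already been computed. The function name and the error handling are illustrative, not the planned implementation.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def evade_direction(
    d: Sequence[float],
    detector_grad: Sequence[float],
    eps: float,
    tol: float = 1e-12,
) -> List[float]:
    """Project eps*d off the detector gradient, then rescale to norm eps."""
    p = [eps * x for x in d]
    g2 = dot(detector_grad, detector_grad)
    if g2 > tol:
        # Remove the component of p along the gradient direction.
        coeff = dot(p, detector_grad) / g2
        p = [x - coeff * g for x, g in zip(p, detector_grad)]
    n = norm(p)
    if n <= tol:
        # d is (numerically) parallel to the gradient: nothing is left.
        raise ValueError("carrier direction lies along the detector gradient")
    return [eps * x / n for x in p]
```

The projection cancels only the first-order change in the detector score, and it rotates the perturbation away from d, which is why bit accuracy is measured under both senders rather than assumed unchanged.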
Removes the upscaled-CIFAR limitation.
- Dataset: 100 native 512×512 images from LAION-Aesthetics or COCO
- Run full PatchSteg pipeline at ε ∈ {2.0, 5.0}
- Report bit accuracy + PSNR + LPIPS + SSIM
- Output: one row in main results table; mention in limitations that 512×512 is now covered
Strongest mechanistic claim in the paper if it works. Cuttable.
- For 30 carrier positions across 10 images:
- Compute local Jacobian J = ∂(Enc∘Dec)/∂z at z̄ via finite differences (4 channels, small δ)
- SVD on J
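- Alignment score = |⟨d, v₁⟩| × σ₁, where v₁ is the top right singular vector of J and σ₁ the top singular value (absolute value because the sign of v₁ is arbitrary)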
- Correlate alignment score with empirical bit-recovery margin
- Output: scatter plot, Pearson correlation, brief mechanistic claim if r > 0.5
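A minimal sketch of the per-position analysis, with several assumptions: `f` stands in for the Enc∘Dec round trip restricted to the 4 channels at one carrier position, central differences are used for the Jacobian, the top singular pair comes from power iteration on JᵀJ instead of a full SVD, and d is normalized before the inner product. All function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vec = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def fd_jacobian(f: Callable[[Vec], Vec], z: Sequence[float], delta: float = 1e-4) -> List[Vec]:
    """Central-difference Jacobian with J[i][j] = d f_i / d z_j."""
    cols = []
    for j in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += delta
        z_minus[j] -= delta
        cols.append([(p - m) / (2 * delta) for p, m in zip(f(z_plus), f(z_minus))])
    return [list(row) for row in zip(*cols)]  # columns -> rows


def top_singular(J: List[Vec], iters: int = 200) -> Tuple[float, Vec]:
    """Top singular value and right singular vector via power iteration."""
    n = len(J[0])
    v = [1.0 / math.sqrt(n)] * n  # assumes this start is not orthogonal to v1
    for _ in range(iters):
        Jv = [dot(row, v) for row in J]
        w = [sum(J[i][j] * Jv[i] for i in range(len(J))) for j in range(n)]
        w_norm = math.sqrt(dot(w, w))
        if w_norm == 0.0:
            break
        v = [x / w_norm for x in w]
    Jv = [dot(row, v) for row in J]
    return math.sqrt(dot(Jv, Jv)), v


def alignment_score(J: List[Vec], d: Sequence[float]) -> float:
    """|<d/||d||, v1>| * sigma1; abs() because the sign of v1 is arbitrary."""
    sigma, v = top_singular(J)
    return abs(dot(d, v)) / math.sqrt(dot(d, d)) * sigma


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between alignment scores and recovery margins."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In the real pipeline each call to `f` costs a decode plus an encode, so a 4-channel central-difference Jacobian needs 8 round trips per position, or 240 for 30 positions. The choice of δ should be checked for stability before trusting the singular values.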
- Full adaptive attack against a learning monitor (out of scope, mention as future work)
- More VAE backbones beyond SD-VAE-MSE, SD-VAE-EMA, SDXL-VAE
- More datasets beyond CIFAR-10, native 512×512 (LAION/COCO)
- CDF-PatchSteg extension experiments (move fully to appendix)
- More structural-image synthetics (move to appendix)
Body reuse: ~95% from current draft after P1-P3 are done.
Intro rewrite (1 page):
- Lead with agentic threat scenario, not interpretability
- Open: multimodal agents are increasingly deployed in environments where they share images with each other and with monitoring infrastructure (cite OpenAI Operator, OpenClaw, AIWILD about-page language)
- If the same VAE backbone is used by sender, receiver, and intermediate processing, an information channel exists in the latent that pixel-domain monitoring cannot see
- Introduce PatchSteg as the probe demonstrating this channel exists, is reliable, and is hard to detect with naive monitors
- Treat interpretability framing as one consequence, not the spine
New threat model section (~3/4 page):
- Sender capabilities: can perturb latent before decode
- Receiver capabilities: can re-encode received image
- Monitor capabilities: sees pixels, may have access to a detector model
- Shared knowledge: sender and receiver share VAE checkpoint and seed
- In scope: covert info passing through a deployed multimodal pipeline
- Out of scope: full adaptive attacks against a learning monitor
Move Section 7 (Safety and Monitoring) earlier and rename:
- New section title: "Implications for Multi-Agent Image-Mediated Communication"
- Every claim should reference the threat model
Expand discussion (~1/2 page):
- Detection-evasion arms race
- Platform-level defenses (mandatory latent sanitization on shared image infrastructure)
- Implications for monitoring agent interactions in deployment
Trim:
- CDF-PatchSteg → appendix
- Synthetic-image content-dependence figure → appendix or compress to one figure
Cite heavily: Motwani et al. (secret collusion) as load-bearing, not side citation.
Page limit: 9 pages regular track. Use AIWILD's modified ICML template.
Body reuse: ~85% from AIWILD branch after AIWILD is shipped.
Intro rewrite (1 page):
- Cooperative AI is about trust, commitment, and verifiability between AI systems
- PatchSteg is the failure case: two AI systems can coordinate covertly through shared media in a way that breaks the verifiability assumption
- Open with vignette: multi-agent system where verifiability is supposed to be enforced through transparent communication, but participants can coordinate hidden side payments or strategy through pixel content that looks innocuous to a monitor
- Cite Motwani et al. heavily; CAIF cares about exactly this kind of coordination failure
New "scenarios where this matters" section (~1/2 page):
- AI-mediated content moderation systems passing images through shared VAE-based preprocessing
- Cooperative multi-agent platforms where agents share generated visualizations as part of their coordination
- AI-assisted civic discourse tools where covert content embedded in shared images undermines legitimacy of public deliberation
- Frame each as a "potentially affected setting," not as an attack we executed
Discussion swap:
- AIWILD's discussion is about agent monitoring
- AI4GOOD's is about cooperative-AI verifiability and what defenses look like at platform/governance level
- One paragraph on policy-relevant implications (don't go full TAIGR, just signal the concern)
Optional inclusion: simplest version of agent-in-loop demo (P4) as one figure + one table if available by submission time. Workshop is interdisciplinary, so concrete scenarios carry more weight here than dense experimental tables.
Submission detail: mark "Cooperative AI" track on OpenReview submission. This is what makes the paper eligible for the CAIF prize and travel funding.
Page limit: 2-8 pages.
Body reuse: ~70% from AIWILD branch, plus all new experiments P4-P7.
Restructure body around FMAI's four research outputs:
- §3 Operational Definition
- What is the failure: covert latent channel surviving decode-reencode round trip
- Boundaries: requires shared VAE checkpoint; baseline protocol requires shared seed
- Falsifiable mechanistic hypothesis: stable directions correspond to high-singular-value directions of composite Enc∘Dec Jacobian aligned with d
- §4 Reproducible Trigger
- PatchSteg pipeline (current Section 3 content)
- Code release commitment with anonymized URL
- §5 Diagnostics
- Stability maps (existing)
- Content correlation (existing)
- Detector AUC + operating-point precision/recall (P2)
- Agent-in-loop measurements (P4)
- Native-res validation (P7)
- §6 Verified Fixes
- Sanitizer Pareto curves (P5)
- Adaptive sender results (P6)
- Honest framing of what works and what doesn't
Add explicit negative results framing:
- Sanitizer Pareto: no defense gives free disruption — frame as negative result for defenses
- Adaptive sender: if it breaks the detector, frame as negative result for detection
- FMAI explicitly welcomes well-documented negative results
Limitations update:
- Composite Jacobian analysis: partial (if P8 done) or future work (if not)
- Agent demo uses two specific VLMs; may not generalize to all multimodal architectures
- Sanitizers tested on 100 images per setting (among the smallest evaluations in the paper)
Page limit: 8 pages excluding references and appendix.
- P1, P2, P3 (shared metric/detector/robustness fixes) → unblocks AIWILD
- AIWILD intro + threat model + section restructure → AIWILD ships
- AI4GOOD intro + scenarios section → AI4GOOD ships
- P4 (agent-in-loop demo) → highest-leverage FMAI addition
- P5 (sanitizer Pareto) → FMAI "verified fixes"
- P6 (adaptive sender) → FMAI rigor
- P7 (native high-res) → removes stated limitation
- FMAI restructure around four-output framework → FMAI ships
- P8 (Jacobian) only if everything else is done
- First cut: P8 (Jacobian)
- Second cut: P7 (native high-res) — keep limitation as-is
- Third cut: P6 (adaptive sender) — substitute with stronger P5 sweep
- Do not cut: P1, P4, P5