[Bug] LTX 2.3 dev produces motion distortion (works fine in comfyui). Distill is fine. #1579

@LostRuins

Description

Git commit

sd-master-0e4ee04-bin-win-vulkan-x64

Operating System & Version

Windows 10

GGML backends

Vulkan

Command-line arguments used

sd-cli.exe -M vid_gen --diffusion-model ltx-2.3-22b-dev-Q4_K_S.gguf --vae ltx-2.3-22b-dev_video_vae.safetensors --audio-vae ltx-2.3-22b-dev_audio_vae.safetensors --llm gemma-3-12b-it-Q4_K_S.gguf --embeddings-connectors ltx-2.3-22b-dev_embeddings_connectors.safetensors -p "HD 4K video, Two men wearing business suits swordfighting in the garden" --cfg-scale 6.0 --sampling-method euler -v -n "worst quality, low quality, blurry, distorted, artifacts" -W 512 -H 512 --diffusion-fa --offload-to-cpu --video-frames 33 --fps 24

Steps to reproduce

Hello! The distilled models work perfectly, but the dev models produce noticeable glitches and artifacts. The artifacts only show up in specific kinds of videos: basically anything with rapidly changing movement. ComfyUI works fine with the dev models too.

The generation settings used are listed above. I used fewer frames at a reduced resolution because of my GPU limitations, but I have tested at multiple resolutions and all of them show the issue. I am not sure whether this is an inherent flaw in the model or a bug in stable-diffusion.cpp; since it works fine in ComfyUI, I am raising it here.
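To make the multi-resolution check easier to repeat, here is a minimal Python sketch that rebuilds the command line above for several sizes. It only uses flags that already appear in the command above. The helper name `build_args`, the list of sizes, and the assumption that the model files sit in the working directory are illustrative and not part of the original runs.

```python
import subprocess

PROMPT = "HD 4K video, Two men wearing business suits swordfighting in the garden"
NEGATIVE = "worst quality, low quality, blurry, distorted, artifacts"


def build_args(width, height, frames=33, exe="sd-cli.exe"):
    """Return the sd-cli argument list from this report for one resolution.

    Only flags taken from the reported command are used; nothing is executed.
    """
    return [
        exe, "-M", "vid_gen",
        "--diffusion-model", "ltx-2.3-22b-dev-Q4_K_S.gguf",
        "--vae", "ltx-2.3-22b-dev_video_vae.safetensors",
        "--audio-vae", "ltx-2.3-22b-dev_audio_vae.safetensors",
        "--llm", "gemma-3-12b-it-Q4_K_S.gguf",
        "--embeddings-connectors", "ltx-2.3-22b-dev_embeddings_connectors.safetensors",
        "-p", PROMPT,
        "--cfg-scale", "6.0",
        "--sampling-method", "euler",
        "-v",
        "-n", NEGATIVE,
        "-W", str(width), "-H", str(height),
        "--diffusion-fa", "--offload-to-cpu",
        "--video-frames", str(frames),
        "--fps", "24",
    ]


# Hypothetical sizes for illustration; the report does not list the ones tested.
SIZES = [(512, 512), (768, 512)]

if __name__ == "__main__":
    for w, h in SIZES:
        # list2cmdline quotes arguments the way a Windows command line expects.
        print(subprocess.list2cmdline(build_args(w, h)))
```

Each printed line can be pasted into a console in the sd-cli folder; keeping the seed and prompt fixed means only the resolution changes between runs.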

First, this is the video output of the exact CLI arguments listed above. It is exactly reproducible on the latest build, sd-master-0e4ee04-bin-win-vulkan-x64.

upstream_exact.mp4

Now compare this with your original basic prompt, "a lovely cat", used with the negative prompt "worst quality, low quality, blurry, distorted, artifacts". That one works okay because there is little movement.

The problem arises whenever there is rapid motion.

HD 4K video, Two men wearing business suits swordfighting in the garden

Image Image

HD 4K video, Gentleman wearing a suit running fast

Image

As you can see, any time there is fast motion everything becomes a glitchy smudge. For troubleshooting, I have also included my generation logs.

Also cc: @henk717 (who raised this issue first) and @wbruna (who might wish to try to reproduce it).

What you expected to happen

A clear video, including during fast motion

What actually happened

Smudgy, blurry video during fast motion

Logs / error messages / stack trace

C:\Users\user\Desktop\sd-master-0e4ee04-bin-win-vulkan-x64>sd-cli.exe -M vid_gen --diffusion-model C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf --vae C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors --audio-vae C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors --llm C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf --embeddings-connectors C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors -p "HD 4K video, Two men wearing business suits swordfighting in the garden" --cfg-scale 6.0 --sampling-method euler -v -n "worst quality, low quality, blurry, distorted, artifacts" -W 512 -H 512 --diffusion-fa --offload-to-cpu --video-frames 33 --fps 24
[DEBUG] main.cpp:597  - version: stable-diffusion.cpp version unknown, commit 0e4ee04
[DEBUG] main.cpp:598  - System Info:
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 0 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 |
[DEBUG] main.cpp:599  - SDCliParams {
  mode: vid_gen,
  output_path: "output.png",
  image_path: "",
  metadata_format: "text",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.avi",
  preview_fps: 24,
  taesd_preview: false,
  preview_noisy: false,
  metadata_raw: false,
  metadata_brief: false,
  metadata_all: false
}
[DEBUG] main.cpp:600  - SDContextParams {
  n_threads: 16,
  model_path: "",
  clip_l_path: "",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "",
  llm_path: "C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf",
  llm_vision_path: "",
  diffusion_model_path: "C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf",
  high_noise_diffusion_model_path: "",
  embeddings_connectors_path: "C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors",
  vae_path: "C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors",
  audio_vae_path: "C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: ".",
  hires_upscalers_dir: "",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  offload_params_to_cpu: true,
  max_vram: 0,
  backend: "",
  params_backend: "",
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  flash_attn: false,
  diffusion_flash_attn: true,
  diffusion_conv_direct: false,
  vae_conv_direct: false,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:601  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "HD 4K video, Two men wearing business suits swordfighting in the garden",
  negative_prompt: "worst quality, low quality, blurry, distorted, artifacts",
  clip_skip: -1,
  width: 512,
  height: 512,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 6.00, img_cfg: 6.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: euler, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=inf, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 33,
  fps: 24,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 42,
  upscale_repeats: 1,
  upscale_tile_size: 128,
  hires: { enabled: false, upscaler: "Latent", model_path: "", scale: 2, target_width: 0, target_height: 0, steps: 0, denoising_strength: 0.7, custom_sigmas: [], upscale_tile_size: 128 },
  vae_tiling_params: { 0, 0, 0, 0, 0.5, 0, 0, "" },
}
[DEBUG] ggml_extend.hpp:60   - ggml_vulkan: Found 2 Vulkan devices:
[DEBUG] ggml_extend.hpp:60   - ggml_vulkan: 0 = Intel(R) RaptorLake-S Mobile Graphics Controller (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
[DEBUG] ggml_extend.hpp:60   - ggml_vulkan: 1 = NVIDIA GeForce RTX 4090 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
[DEBUG] ggml_extend_backend.cpp:311  - Found 3 backend devices:
[DEBUG] ggml_extend_backend.cpp:314  - #0: Vulkan0
[DEBUG] ggml_extend_backend.cpp:314  - #1: Vulkan1
[DEBUG] ggml_extend_backend.cpp:314  - #2: CPU
[DEBUG] ggml_extend_backend.cpp:291  - Initializing backend: Vulkan1
[DEBUG] ggml_extend_backend.cpp:291  - Initializing backend: CPU
[INFO ] stable-diffusion.cpp:272  - loading diffusion model from 'C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf'
[INFO ] model.cpp:216  - load C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf using gguf format
[DEBUG] model.cpp:265  - init from 'C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf'
[INFO ] stable-diffusion.cpp:319  - loading llm from 'C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf'
[INFO ] model.cpp:216  - load C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf using gguf format
[DEBUG] model.cpp:265  - init from 'C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf'
[INFO ] stable-diffusion.cpp:333  - loading vae from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors'
[INFO ] model.cpp:219  - load C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors using safetensors format
[DEBUG] model.cpp:294  - init from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:350  - loading embeddings connectors from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors'
[INFO ] model.cpp:219  - load C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors using safetensors format
[DEBUG] model.cpp:294  - init from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors', prefix = ''
[INFO ] stable-diffusion.cpp:357  - loading LTX audio VAE from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors'
[INFO ] model.cpp:219  - load C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors using safetensors format
[DEBUG] model.cpp:294  - init from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors', prefix = ''
[INFO ] stable-diffusion.cpp:375  - Version: LTXAV
[INFO ] stable-diffusion.cpp:403  - Weight type stat:                      f32: 2961 |    q4_K: 1870 |    q5_K: 210  |    q6_K: 1    |    bf16: 1531
[INFO ] stable-diffusion.cpp:404  - Conditioner weight type stat:          f32: 289  |    q4_K: 326  |    q5_K: 10   |    q6_K: 1
[INFO ] stable-diffusion.cpp:405  - Diffusion model weight type stat:      f32: 2672 |    q4_K: 1544 |    q5_K: 200  |    bf16: 28
[INFO ] stable-diffusion.cpp:406  - VAE weight type stat:                 bf16: 272
[DEBUG] stable-diffusion.cpp:408  - ggml tensor size = 400 bytes
[DEBUG] gemma_tokenizer.cpp:32   - vocab size: 262144
[DEBUG] gemma_tokenizer.cpp:40   - merges size 514905
[DEBUG] llm.hpp:1516 - llm: num_layers = 48, vocab_size = 262208, hidden_size = 3840, intermediate_size = 15360
[INFO ] stable-diffusion.cpp:797  - using VAE for encoding / decoding
[INFO ] stable-diffusion.cpp:899  - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:928  - loading weights
[DEBUG] ggml_extend.hpp:2711 - gemma3_12b params backend buffer size =  9661.05 MB(RAM) (626 tensors)
[DEBUG] ggml_extend.hpp:2711 - ltxav_text_projection params backend buffer size =  2205.02 MB(RAM) (4 tensors)
[DEBUG] ggml_extend.hpp:2711 - ltxav params backend buffer size =  13328.05 MB(RAM) (4444 tensors)
[DEBUG] ggml_extend.hpp:2711 - ltx_video_vae params backend buffer size =  1385.02 MB(RAM) (170 tensors)
[DEBUG] ggml_extend.hpp:2711 - ltx_audio_vae params backend buffer size =  339.88 MB(RAM) (1285 tensors)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf' (mmap disabled by caller)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf' (mmap disabled by caller)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:822  - model files processing completed in 0.01s
[DEBUG] model.cpp:921  - using 16 threads for model loading
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf
  |=================================>                | 4444/6573 - 4.25GB/s
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf
  |======================================>           | 5070/6573 - 3.39GB/s
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors
  |=======================================>          | 5240/6573 - 3.07GB/s
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors
  |=======================================>          | 5244/6573 - 2.71GB/s
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors
  |==================================================| 6573/6573 - 2.68GB/s
[INFO ] model.cpp:1155 - loading tensors completed, taking 8.38s (read: 4.80s, memcpy: 0.00s, convert: 0.37s, copy_to_backend: 0.00s)
[DEBUG] stable-diffusion.cpp:1024 - finished loaded file
[INFO ] stable-diffusion.cpp:1106 - total params memory size = 26919.01MB (VRAM 0.00MB, RAM 26919.01MB): text_encoders 11866.07MB(RAM), diffusion_model 13328.05MB(RAM), vae 1724.89MB(RAM), controlnet 0.00MB(N/A), pmid 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1196 - running in Flux FLOW mode
[INFO ] denoiser.hpp:625  - get_sigmas with LTX2 scheduler
[DEBUG] denoiser.hpp:538  - LTX2 scheduler: tokens=1280, shift=1.0417, stretch=1, terminal=0.1000
[INFO ] stable-diffusion.cpp:3352 - sampling using Euler method
[DEBUG] bpe_tokenizer.cpp:207  - split prompt "HD 4K video, Two men wearing business suits swordfighting in the garden" to tokens ["HD", "▁", "4", "K", "▁video", ",", "▁", "Two", "▁men", "▁wearing", "▁business", "▁suits", "▁sword", "fighting", "▁in", "▁the", "▁garden", ]
[DEBUG] ggml_extend.hpp:1930 - gemma3_12b compute buffer size: 1658.01 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - gemma3_12b offload params (9661.05 MB, 626 tensors) to runtime backend (Vulkan1), taking 3.83s
[DEBUG] ggml_extend.hpp:1930 - ltxav_text_projection compute buffer size: 26.12 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltxav_text_projection offload params (2205.02 MB, 4 tensors) to runtime backend (Vulkan1), taking 0.82s
[DEBUG] conditioner.hpp:2405 - computing LTXAV condition graph completed, taking 14507 ms
[DEBUG] bpe_tokenizer.cpp:207  - split prompt "worst quality, low quality, blurry, distorted, artifacts" to tokens ["worst", "▁quality", ",", "▁", "low", "▁quality", ",", "▁", "bl", "urry", ",", "▁", "dist", "orted", ",", "▁", "artifacts", ]
[DEBUG] ggml_extend.hpp:1930 - gemma3_12b compute buffer size: 1658.01 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - gemma3_12b offload params (9661.05 MB, 626 tensors) to runtime backend (Vulkan1), taking 3.90s
[DEBUG] ggml_extend.hpp:1930 - ltxav_text_projection compute buffer size: 26.12 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltxav_text_projection offload params (2205.02 MB, 4 tensors) to runtime backend (Vulkan1), taking 0.74s
[DEBUG] conditioner.hpp:2405 - computing LTXAV condition graph completed, taking 6795 ms
[INFO ] stable-diffusion.cpp:4816 - get_learned_condition completed, taking 21.31s
[INFO ] stable-diffusion.cpp:5110 - generate_video 512x512x33
[DEBUG] stable-diffusion.cpp:5172 - sample 16x16x5
[DEBUG] ggml_extend.hpp:1930 - ltxav compute buffer size: 306.78 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltxav offload params (13328.05 MB, 4444 tensors) to runtime backend (Vulkan1), taking 5.22s
  |==================================================| 20/20 - 3.12s/it
[INFO ] stable-diffusion.cpp:5210 - sampling completed, taking 62.72s
[INFO ] stable-diffusion.cpp:5361 - generating latent video completed, taking 63.72s
[DEBUG] stable-diffusion.cpp:5377 - decode audio latent 16x35x8x1
[DEBUG] ggml_extend.hpp:1930 - ltx_audio_vae compute buffer size: 84.59 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltx_audio_vae offload params (339.87 MB, 1285 tensors) to runtime backend (Vulkan1), taking 0.08s
[INFO ] ltx_audio_vae.h:1034 - ltx audio vae decode completed, taking 3.01s
[INFO ] stable-diffusion.cpp:5386 - decoding audio latent completed, taking 3.05s
[DEBUG] stable-diffusion.cpp:4841 - decode_video_outputs latent 16x16x5x128
[DEBUG] ggml_extend.hpp:1930 - ltx_video_vae compute buffer size: 6112.63 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltx_video_vae offload params (1385.02 MB, 170 tensors) to runtime backend (Vulkan1), taking 2.06s
[DEBUG] vae.hpp:210  - computing vae decode graph completed, taking 3.94s
[INFO ] stable-diffusion.cpp:4846 - decode_first_stage completed, taking 3.94s
[DEBUG] stable-diffusion.cpp:4858 - decode_video_outputs decoded 512x512x33x3
[INFO ] stable-diffusion.cpp:5408 - generate_video completed in 93.04s
[INFO ] main.cpp:508  - save result video to 'output.avi'
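
The sizes reported in the log are consistent with each other, which can be checked by hand. The sketch below recomputes the latent shape (`sample 16x16x5`), the token count (`tokens=1280`) and the scheduler shift (`shift=1.0417`) from the 512x512x33 request. The compression factors (32x spatial, 8x temporal) and the shift constants (0.95 at 1024 tokens, 2.05 at 4096 tokens, linearly interpolated) are assumptions based on how the reference LTX scheduler is commonly described, not values read from the stable-diffusion.cpp source; they are used here only because they reproduce the logged numbers.

```python
def latent_shape(width, height, frames, spatial=32, temporal=8):
    """Latent (width, height, frames) under the ASSUMED LTX VAE compression."""
    return width // spatial, height // spatial, (frames - 1) // temporal + 1


def ltx_shift(tokens, base_shift=0.95, max_shift=2.05,
              base_tokens=1024, max_tokens=4096):
    """Linear interpolation of the shift by token count (ASSUMED constants)."""
    slope = (max_shift - base_shift) / (max_tokens - base_tokens)
    return base_shift + slope * (tokens - base_tokens)


lat_w, lat_h, lat_t = latent_shape(512, 512, 33)
tokens = lat_w * lat_h * lat_t
shift = ltx_shift(tokens)

print(f"latent: {lat_w}x{lat_h}x{lat_t}")  # log: sample 16x16x5
print(f"tokens: {tokens}")                 # log: tokens=1280
print(f"shift:  {shift:.4f}")              # log: shift=1.0417
```

This only confirms that the logged sizes and shift agree with one another; it says nothing about whether the schedule is related to the motion artifacts.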

Additional context / environment details

No response

Labels: bug