[Bug] LTX 2.3 dev produces motion distortion (works fine in comfyui). Distill is fine.

### Git commit

sd-master-0e4ee04-bin-win-vulkan-x64

### Operating System & Version

Windows 10

### GGML backends

Vulkan

### Command-line arguments used

sd-cli.exe -M vid_gen --diffusion-model ltx-2.3-22b-dev-Q4_K_S.gguf --vae ltx-2.3-22b-dev_video_vae.safetensors --audio-vae ltx-2.3-22b-dev_audio_vae.safetensors --llm gemma-3-12b-it-Q4_K_S.gguf --embeddings-connectors ltx-2.3-22b-dev_embeddings_connectors.safetensors -p "HD 4K video, Two men wearing business suits swordfighting in the garden" --cfg-scale 6.0 --sampling-method euler -v -n "worst quality, low quality, blurry, distorted, artifacts" -W 512 -H 512 --diffusion-fa --offload-to-cpu --video-frames 33 --fps 24

### Steps to reproduce

Hello! The distilled models work perfectly, but I am seeing some pretty weird glitches/artifacts when using the **dev** models. These artifacts only show up in specific kinds of videos, basically anything with **rapidly changing movement**. The dev model works fine in ComfyUI.

The generation settings are listed above. I used fewer frames at a reduced resolution because of my GPU limitations, but I have tested multiple resolutions and all of them have this issue. I am not sure whether this is an inherent flaw in the model or a bug in stable-diffusion.cpp, but since it works fine in ComfyUI I am raising it here.

First, this is the video output from the exact CLI args above. It is exactly reproducible on the latest build, sd-master-0e4ee04-bin-win-vulkan-x64.

https://github.com/user-attachments/assets/ff3eafc1-06fd-4453-9ee5-f1ed3385ef12

Now compare that with your original basic prompt, `a lovely cat`, with the negative prompt `worst quality, low quality, blurry, distorted, artifacts`. **This works okay because there is little movement**.
<img width="640" height="400" alt="1" src="https://github.com/user-attachments/assets/29fef5cb-5422-4188-abcc-7c66e645a1d8" />

The problem arises whenever there is **rapid motion**. 

`HD 4K video, Two men wearing business suits swordfighting in the garden`

<img width="512" height="512" alt="Image" src="https://github.com/user-attachments/assets/b6c698b2-0286-41f1-b852-495435da6f98" />

<img width="640" height="400" alt="Image" src="https://github.com/user-attachments/assets/5fee2ac2-3687-44f6-be32-fd91f014b783" />

`HD 4K video, Gentleman wearing a suit running fast`

<img width="640" height="512" alt="Image" src="https://github.com/user-attachments/assets/6a17f71c-e5fc-4e01-af21-dcd7ffeb55dd" />

As you can see, any time there is fast motion everything becomes a glitchy smudge. For troubleshooting, I have also included my generation logs.
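To make the low-motion vs. fast-motion comparison easier to repeat, here is a minimal sketch of a helper that builds the same command for each prompt. It only uses flags from the command line in this report. The file names are assumed to sit next to `sd-cli.exe`, and running the cat prompt with identical flags is an assumption, since the exact settings for that test are not stated above. `build_command`, `BASE_ARGS` and `PROMPTS` are hypothetical names for illustration.

```python
import subprocess

# Flags copied from the command line in this report. File names are assumed
# to be in the working directory; adjust the paths for your setup.
BASE_ARGS = [
    "sd-cli.exe", "-M", "vid_gen",
    "--diffusion-model", "ltx-2.3-22b-dev-Q4_K_S.gguf",
    "--vae", "ltx-2.3-22b-dev_video_vae.safetensors",
    "--audio-vae", "ltx-2.3-22b-dev_audio_vae.safetensors",
    "--llm", "gemma-3-12b-it-Q4_K_S.gguf",
    "--embeddings-connectors", "ltx-2.3-22b-dev_embeddings_connectors.safetensors",
    "--cfg-scale", "6.0", "--sampling-method", "euler", "-v",
    "-n", "worst quality, low quality, blurry, distorted, artifacts",
    "-W", "512", "-H", "512", "--diffusion-fa", "--offload-to-cpu",
    "--video-frames", "33", "--fps", "24",
]

# One low-motion and one fast-motion prompt, both taken from this report.
PROMPTS = {
    "low_motion": "a lovely cat",
    "high_motion": "HD 4K video, Two men wearing business suits swordfighting in the garden",
}


def build_command(prompt):
    """Return the sd-cli argument list for one prompt (BASE_ARGS is not mutated)."""
    return BASE_ARGS + ["-p", prompt]


if __name__ == "__main__":
    for label, prompt in PROMPTS.items():
        # Printed rather than executed, so the sketch runs without the models.
        print(label + ":", subprocess.list2cmdline(build_command(prompt)))
```

Each printed line can be pasted into a terminal. Since the log shows the result being saved to `output.avi`, rename that file between runs so the two videos can be compared side by side.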

also cc: @henk717 (who raised this issue first) and @wbruna (who might wish to try repro, maybe)

### What you expected to happen

Clear video, including during fast motion.

### What actually happened

Smudgy, blurry video during fast motion.

### Logs / error messages / stack trace
```
C:\Users\user\Desktop\sd-master-0e4ee04-bin-win-vulkan-x64>sd-cli.exe -M vid_gen --diffusion-model C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf --vae C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors --audio-vae C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors --llm C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf --embeddings-connectors C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors -p "HD 4K video, Two men wearing business suits swordfighting in the garden" --cfg-scale 6.0 --sampling-method euler -v -n "worst quality, low quality, blurry, distorted, artifacts" -W 512 -H 512 --diffusion-fa --offload-to-cpu --video-frames 33 --fps 24
[DEBUG] main.cpp:597  - version: stable-diffusion.cpp version unknown, commit 0e4ee04
[DEBUG] main.cpp:598  - System Info:
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 0 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 |
[DEBUG] main.cpp:599  - SDCliParams {
  mode: vid_gen,
  output_path: "output.png",
  image_path: "",
  metadata_format: "text",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.avi",
  preview_fps: 24,
  taesd_preview: false,
  preview_noisy: false,
  metadata_raw: false,
  metadata_brief: false,
  metadata_all: false
}
[DEBUG] main.cpp:600  - SDContextParams {
  n_threads: 16,
  model_path: "",
  clip_l_path: "",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "",
  llm_path: "C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf",
  llm_vision_path: "",
  diffusion_model_path: "C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf",
  high_noise_diffusion_model_path: "",
  embeddings_connectors_path: "C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors",
  vae_path: "C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors",
  audio_vae_path: "C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: ".",
  hires_upscalers_dir: "",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  offload_params_to_cpu: true,
  max_vram: 0,
  backend: "",
  params_backend: "",
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  flash_attn: false,
  diffusion_flash_attn: true,
  diffusion_conv_direct: false,
  vae_conv_direct: false,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:601  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "HD 4K video, Two men wearing business suits swordfighting in the garden",
  negative_prompt: "worst quality, low quality, blurry, distorted, artifacts",
  clip_skip: -1,
  width: 512,
  height: 512,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 6.00, img_cfg: 6.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: euler, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf, extra_sample_args: ),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=inf, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 33,
  fps: 24,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 42,
  upscale_repeats: 1,
  upscale_tile_size: 128,
  hires: { enabled: false, upscaler: "Latent", model_path: "", scale: 2, target_width: 0, target_height: 0, steps: 0, denoising_strength: 0.7, custom_sigmas: [], upscale_tile_size: 128 },
  vae_tiling_params: { 0, 0, 0, 0, 0.5, 0, 0, "" },
}
[DEBUG] ggml_extend.hpp:60   - ggml_vulkan: Found 2 Vulkan devices:
[DEBUG] ggml_extend.hpp:60   - ggml_vulkan: 0 = Intel(R) RaptorLake-S Mobile Graphics Controller (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
[DEBUG] ggml_extend.hpp:60   - ggml_vulkan: 1 = NVIDIA GeForce RTX 4090 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
[DEBUG] ggml_extend_backend.cpp:311  - Found 3 backend devices:
[DEBUG] ggml_extend_backend.cpp:314  - #0: Vulkan0
[DEBUG] ggml_extend_backend.cpp:314  - #1: Vulkan1
[DEBUG] ggml_extend_backend.cpp:314  - #2: CPU
[DEBUG] ggml_extend_backend.cpp:291  - Initializing backend: Vulkan1
[DEBUG] ggml_extend_backend.cpp:291  - Initializing backend: CPU
[INFO ] stable-diffusion.cpp:272  - loading diffusion model from 'C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf'
[INFO ] model.cpp:216  - load C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf using gguf format
[DEBUG] model.cpp:265  - init from 'C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf'
[INFO ] stable-diffusion.cpp:319  - loading llm from 'C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf'
[INFO ] model.cpp:216  - load C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf using gguf format
[DEBUG] model.cpp:265  - init from 'C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf'
[INFO ] stable-diffusion.cpp:333  - loading vae from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors'
[INFO ] model.cpp:219  - load C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors using safetensors format
[DEBUG] model.cpp:294  - init from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:350  - loading embeddings connectors from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors'
[INFO ] model.cpp:219  - load C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors using safetensors format
[DEBUG] model.cpp:294  - init from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors', prefix = ''
[INFO ] stable-diffusion.cpp:357  - loading LTX audio VAE from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors'
[INFO ] model.cpp:219  - load C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors using safetensors format
[DEBUG] model.cpp:294  - init from 'C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors', prefix = ''
[INFO ] stable-diffusion.cpp:375  - Version: LTXAV
[INFO ] stable-diffusion.cpp:403  - Weight type stat:                      f32: 2961 |    q4_K: 1870 |    q5_K: 210  |    q6_K: 1    |    bf16: 1531
[INFO ] stable-diffusion.cpp:404  - Conditioner weight type stat:          f32: 289  |    q4_K: 326  |    q5_K: 10   |    q6_K: 1
[INFO ] stable-diffusion.cpp:405  - Diffusion model weight type stat:      f32: 2672 |    q4_K: 1544 |    q5_K: 200  |    bf16: 28
[INFO ] stable-diffusion.cpp:406  - VAE weight type stat:                 bf16: 272
[DEBUG] stable-diffusion.cpp:408  - ggml tensor size = 400 bytes
[DEBUG] gemma_tokenizer.cpp:32   - vocab size: 262144
[DEBUG] gemma_tokenizer.cpp:40   - merges size 514905
[DEBUG] llm.hpp:1516 - llm: num_layers = 48, vocab_size = 262208, hidden_size = 3840, intermediate_size = 15360
[INFO ] stable-diffusion.cpp:797  - using VAE for encoding / decoding
[INFO ] stable-diffusion.cpp:899  - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:928  - loading weights
[DEBUG] ggml_extend.hpp:2711 - gemma3_12b params backend buffer size =  9661.05 MB(RAM) (626 tensors)
[DEBUG] ggml_extend.hpp:2711 - ltxav_text_projection params backend buffer size =  2205.02 MB(RAM) (4 tensors)
[DEBUG] ggml_extend.hpp:2711 - ltxav params backend buffer size =  13328.05 MB(RAM) (4444 tensors)
[DEBUG] ggml_extend.hpp:2711 - ltx_video_vae params backend buffer size =  1385.02 MB(RAM) (170 tensors)
[DEBUG] ggml_extend.hpp:2711 - ltx_audio_vae params backend buffer size =  339.88 MB(RAM) (1285 tensors)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf' (mmap disabled by caller)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf' (mmap disabled by caller)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:811  - NOT using mmap for 'C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors' (mmap disabled by caller)
[INFO ] model.cpp:822  - model files processing completed in 0.01s
[DEBUG] model.cpp:921  - using 16 threads for model loading
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\ltx-2.3-22b-dev-Q4_K_S.gguf
  |=================================>                | 4444/6573 - 4.25GB/s
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\gemma-3-12b-it-Q4_K_S.gguf
  |======================================>           | 5070/6573 - 3.39GB/s
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\ltx-2.3-22b-dev_video_vae.safetensors
  |=======================================>          | 5240/6573 - 3.07GB/s
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\ltx-2.3-22b-dev_embeddings_connectors.safetensors
  |=======================================>          | 5244/6573 - 2.71GB/s
[DEBUG] model.cpp:937  - loading tensors from C:\Users\user\Desktop\ltx-2.3-22b-dev_audio_vae.safetensors
  |==================================================| 6573/6573 - 2.68GB/s
[INFO ] model.cpp:1155 - loading tensors completed, taking 8.38s (read: 4.80s, memcpy: 0.00s, convert: 0.37s, copy_to_backend: 0.00s)
[DEBUG] stable-diffusion.cpp:1024 - finished loaded file
[INFO ] stable-diffusion.cpp:1106 - total params memory size = 26919.01MB (VRAM 0.00MB, RAM 26919.01MB): text_encoders 11866.07MB(RAM), diffusion_model 13328.05MB(RAM), vae 1724.89MB(RAM), controlnet 0.00MB(N/A), pmid 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1196 - running in Flux FLOW mode
[INFO ] denoiser.hpp:625  - get_sigmas with LTX2 scheduler
[DEBUG] denoiser.hpp:538  - LTX2 scheduler: tokens=1280, shift=1.0417, stretch=1, terminal=0.1000
[INFO ] stable-diffusion.cpp:3352 - sampling using Euler method
[DEBUG] bpe_tokenizer.cpp:207  - split prompt "HD 4K video, Two men wearing business suits swordfighting in the garden" to tokens ["HD", "▁", "4", "K", "▁video", ",", "▁", "Two", "▁men", "▁wearing", "▁business", "▁suits", "▁sword", "fighting", "▁in", "▁the", "▁garden", ]
[DEBUG] ggml_extend.hpp:1930 - gemma3_12b compute buffer size: 1658.01 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - gemma3_12b offload params (9661.05 MB, 626 tensors) to runtime backend (Vulkan1), taking 3.83s
[DEBUG] ggml_extend.hpp:1930 - ltxav_text_projection compute buffer size: 26.12 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltxav_text_projection offload params (2205.02 MB, 4 tensors) to runtime backend (Vulkan1), taking 0.82s
[DEBUG] conditioner.hpp:2405 - computing LTXAV condition graph completed, taking 14507 ms
[DEBUG] bpe_tokenizer.cpp:207  - split prompt "worst quality, low quality, blurry, distorted, artifacts" to tokens ["worst", "▁quality", ",", "▁", "low", "▁quality", ",", "▁", "bl", "urry", ",", "▁", "dist", "orted", ",", "▁", "artifacts", ]
[DEBUG] ggml_extend.hpp:1930 - gemma3_12b compute buffer size: 1658.01 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - gemma3_12b offload params (9661.05 MB, 626 tensors) to runtime backend (Vulkan1), taking 3.90s
[DEBUG] ggml_extend.hpp:1930 - ltxav_text_projection compute buffer size: 26.12 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltxav_text_projection offload params (2205.02 MB, 4 tensors) to runtime backend (Vulkan1), taking 0.74s
[DEBUG] conditioner.hpp:2405 - computing LTXAV condition graph completed, taking 6795 ms
[INFO ] stable-diffusion.cpp:4816 - get_learned_condition completed, taking 21.31s
[INFO ] stable-diffusion.cpp:5110 - generate_video 512x512x33
[DEBUG] stable-diffusion.cpp:5172 - sample 16x16x5
[DEBUG] ggml_extend.hpp:1930 - ltxav compute buffer size: 306.78 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltxav offload params (13328.05 MB, 4444 tensors) to runtime backend (Vulkan1), taking 5.22s
  |==================================================| 20/20 - 3.12s/it
[INFO ] stable-diffusion.cpp:5210 - sampling completed, taking 62.72s
[INFO ] stable-diffusion.cpp:5361 - generating latent video completed, taking 63.72s
[DEBUG] stable-diffusion.cpp:5377 - decode audio latent 16x35x8x1
[DEBUG] ggml_extend.hpp:1930 - ltx_audio_vae compute buffer size: 84.59 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltx_audio_vae offload params (339.87 MB, 1285 tensors) to runtime backend (Vulkan1), taking 0.08s
[INFO ] ltx_audio_vae.h:1034 - ltx audio vae decode completed, taking 3.01s
[INFO ] stable-diffusion.cpp:5386 - decoding audio latent completed, taking 3.05s
[DEBUG] stable-diffusion.cpp:4841 - decode_video_outputs latent 16x16x5x128
[DEBUG] ggml_extend.hpp:1930 - ltx_video_vae compute buffer size: 6112.63 MB(VRAM)
[INFO ] ggml_extend.hpp:2170 - ltx_video_vae offload params (1385.02 MB, 170 tensors) to runtime backend (Vulkan1), taking 2.06s
[DEBUG] vae.hpp:210  - computing vae decode graph completed, taking 3.94s
[INFO ] stable-diffusion.cpp:4846 - decode_first_stage completed, taking 3.94s
[DEBUG] stable-diffusion.cpp:4858 - decode_video_outputs decoded 512x512x33x3
[INFO ] stable-diffusion.cpp:5408 - generate_video completed in 93.04s
[INFO ] main.cpp:508  - save result video to 'output.avi'
```
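As a quick sanity check on the log above, the shapes look internally consistent: a 512x512x33 request is sampled as a 16x16x5 latent, and the scheduler reports `tokens=1280`. The sketch below reproduces those numbers, assuming a 32x spatial and 8x temporal compression with the first frame kept separately. Those factors are inferred from the log lines, not read from the stable-diffusion.cpp source, and `latent_shape` and `token_count` are hypothetical helper names.

```python
def latent_shape(width, height, frames, spatial=32, temporal=8):
    """Latent (width, height, frames) under the assumed compression factors."""
    # Assumption: the first frame maps to its own latent frame, and every
    # following group of `temporal` frames maps to one more.
    return width // spatial, height // spatial, (frames - 1) // temporal + 1


def token_count(width, height, frames):
    """Number of latent positions the diffusion model attends over."""
    lw, lh, lt = latent_shape(width, height, frames)
    return lw * lh * lt


print(latent_shape(512, 512, 33))  # (16, 16, 5), matching "sample 16x16x5"
print(token_count(512, 512, 33))   # 1280, matching "tokens=1280"
```

If these assumptions hold, the latent sizing itself matches the request, which would point the search toward the sampling or decode path rather than the shape setup.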
### Additional context / environment details

_No response_

[Bug] LTX 2.3 dev produces motion distortion (works fine in comfyui). Distill is fine. #1579
