- load the embedding data (it's all saved in .npy files) and convert each embedding to a string
- shove as many examples as we can into the system prompt (we have way more embeddings than will reasonably fit in the context window, so a random-forest-style majority vote over several queries is probably a good idea, provided the LLM doesn't take too long; see the sketch after this list)
- make embeddings for the test set (didn't do this yet; maybe will try later)
- ask it to classify (maybe a couple times)
- parse the result (shouldn't be too bad provided the LLM follows the format)
- pray that it's better than 50/50 (it's probably not)
- Final Result: Correct: 16, Incorrect: 29, Accuracy: 0.35555555555555557
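A minimal sketch of the "convert to string, classify a few times, parse, majority vote" steps. The rounding precision is an arbitrary choice, and the parsing assumes the LLM answers with a token containing "yes" or "no":

```python
from collections import Counter

def embedding_to_string(vec, precision=3):
    # Round aggressively so each in-context example stays short in the prompt.
    return ", ".join(f"{x:.{precision}f}" for x in vec)

def parse_answer(text):
    # Take the first yes/no token in the reply; None if the model rambled.
    for token in text.lower().split():
        if token.strip(".,!:;") in ("yes", "no"):
            return token.strip(".,!:;")
    return None

def majority_vote(answers):
    # Drop unparseable replies and take the most common remaining label.
    votes = [a for a in answers if a is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None
```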
To get started, create a new Python environment (named "ece209as" here, but name it whatever you want):
conda create -n "ece209as" python=3.13.2 ipython
conda activate ece209as
Install the dependencies:
pip install -r requirements.txt
Note: if you have an NVIDIA GPU, you may also want to run pip install -r requirements2.txt, which includes CUDA-enabled builds of PyTorch.
You should also download embeddings.zip from Google Drive and unzip it to get all of the image embeddings. The directory is organized as follows:
embeddings/
    embeddings_test/
        FAKE/   <-- embeddings for AI-generated images
        REAL/   <-- embeddings for real images
    embeddings_train/
        FAKE/
        REAL/
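A small sketch of one way to load this layout, assuming each .npy file holds a single embedding vector; the directory names double as labels:

```python
import glob
import numpy as np

def load_split(split_dir):
    """Return (embeddings, labels) with label 1 for FAKE and 0 for REAL."""
    X, y = [], []
    for label_name, label in (("FAKE", 1), ("REAL", 0)):
        for path in sorted(glob.glob(f"{split_dir}/{label_name}/*.npy")):
            X.append(np.load(path))
            y.append(label)
    return np.stack(X), np.array(y)

X_train, y_train = load_split("embeddings/embeddings_train")
X_test, y_test = load_split("embeddings/embeddings_test")
```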
Embeddings were generated from images in the Kaggle CIFAKE dataset
(picked mostly arbitrarily based on existing recipes)
- Large Language Model: meta-llama/Llama-3.2-3B-Instruct
- Vision Transformer: google/vit-base-patch16-224-in21k
(This is an idea of what a minimum implementation could look like. In-context learning is probably simpler, although we may hit limits on the model's maximum context size, so we'll start with that. Time permitting, we can try fine-tuning.)
- Load a labeled dataset of AI-generated vs. real images.
- Use a Vision Transformer to generate a single 1-D embedding vector for each image (see the sketch after this list).
- Create a system prompt with a bunch of examples (as many as we can fit).
- This could also be done in a user prompt, but that would probably blow up context size even more...
- Give a user prompt with a new image embedding and pray that it classifies it.
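A minimal sketch of the embedding step, assuming the ViT's [CLS] token is taken as the image embedding (the file path is just an example):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def embed_image(path):
    # Preprocess the image and take the [CLS] hidden state as a 1-D embedding.
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = vit(**inputs)
    return outputs.last_hidden_state[0, 0].numpy()  # shape (768,)
```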
Inspired by the LICO paper
Each x is an image embedding and each y is a label: 'yes' if the image is AI-generated, 'no' if it is real.
Predict y given x.
x: <image embedding>, y: <yes|no>
... (many repetitions for in-context learning)
x: <image embedding>, y: <yes|no>
x: <new image embedding>, y:
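A sketch of how the in-context query might be issued with the format above. The one-word instruction in the user turn, the generation settings, and the number of examples that fit are guesses; the gated Llama weights also require a Hugging Face login.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def build_system_prompt(examples, query_embedding):
    # examples: list of (embedding_string, "yes"/"no") in-context pairs.
    header = ("Each x is an image embedding and each y is 'yes' if the image "
              "is AI-generated and 'no' otherwise. Predict y given x.\n")
    body = "\n".join(f"x: {e}, y: {label}" for e, label in examples)
    return header + body + f"\nx: {query_embedding}, y:"

def classify(examples, query_embedding):
    messages = [
        {"role": "system", "content": build_system_prompt(examples, query_embedding)},
        {"role": "user", "content": "Answer with exactly one word: yes or no."},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(llm.device)
    output = llm.generate(input_ids, max_new_tokens=4, do_sample=False)
    reply = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
    return parse_answer(reply)  # parse_answer from the earlier sketch
```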
- LLM In-Context Learning
- Final Result: Correct: 16, Incorrect: 29, Accuracy: 0.35555555555555557
- DistilBERT Encoder-Only Fine-Tuning
{'loss': 0.5475, 'grad_norm': 6.174920082092285, 'learning_rate': 0.0, 'epoch': 500.0}
{'train_runtime': 16860.6709, 'train_samples_per_second': 47.448, 'train_steps_per_second': 0.208, 'train_loss': 0.5886086750030518, 'epoch': 500.0}
100%|█████████████████████████████████████| 3500/3500 [4:41:00<00:00, 4.82s/it]
100%|█████████████████████████████████████████████| 2/2 [00:01<00:00, 1.50it/s]
[2025-03-07 01:31:08] Results:
{'eval_loss': 0.6895899772644043, 'eval_runtime': 2.1466, 'eval_samples_per_second': 186.337, 'eval_steps_per_second': 0.932, 'epoch': 500.0}
100%|█████████████████████████████████████████████| 2/2 [00:00<00:00, 2.15it/s]
[2025-03-07 01:31:10] Accuracy: 0.68
[2025-03-07 01:31:10] Loading embeddings...
[2025-03-07 01:31:14] Loaded embeddings from embeddings/embeddings_test/REAL and embeddings/embeddings_test/FAKE.
Map: 100%|███████████████████████████| 400/400 [00:00<00:00, 2103.73 examples/s]
Map: 100%|█████████████████████████| 1600/1600 [00:00<00:00, 2125.36 examples/s]
100%|█████████████████████████████████████████████| 7/7 [00:06<00:00, 1.16it/s]
[2025-03-07 01:31:23] Test Accuracy: 0.676875
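The numbers above come from a Hugging Face Trainer run. A rough sketch of what that setup might look like, assuming the embeddings are stringified as before and fed to a standard sequence-classification head; the checkpoint name and hyperparameters are placeholders, not the ones used for the logged run:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "distilbert-base-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Reuse X_train/y_train and X_test/y_test from the loading sketch earlier.
def to_text(vec):
    return ", ".join(f"{x:.3f}" for x in vec)

train_ds = Dataset.from_dict({"text": [to_text(v) for v in X_train],
                              "label": y_train.tolist()})
test_ds = Dataset.from_dict({"text": [to_text(v) for v in X_test],
                             "label": y_test.tolist()})

def tokenize(batch):
    # Stringified embeddings are long, so truncate to DistilBERT's 512-token limit.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="distilbert_cifake",
                         per_device_train_batch_size=32,
                         num_train_epochs=10)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=test_ds)
trainer.train()
print(trainer.evaluate())
```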
- CLIP Multimodal Image-Text Pair Prediction
[2025-03-06 19:32:22] Starting training for 10 epochs
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:19<00:00, 3.68it/s]
Epoch 1 Accuracy: 39.19%
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:12<00:00, 3.72it/s]
Epoch 2 Accuracy: 39.07%
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:11<00:00, 3.72it/s]
Epoch 3 Accuracy: 38.33%
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:13<00:00, 3.71it/s]
Epoch 4 Accuracy: 39.68%
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:15<00:00, 3.70it/s]
Epoch 5 Accuracy: 39.20%
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:20<00:00, 3.68it/s]
Epoch 6 Accuracy: 40.73%
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:22<00:00, 3.66it/s]
Epoch 7 Accuracy: 39.95%
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:16<00:00, 3.70it/s]
Epoch 8 Accuracy: 39.51%
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:15<00:00, 3.70it/s]
Epoch 9 Accuracy: 39.44%
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2500/2500 [11:16<00:00, 3.69it/s]
Epoch 10 Accuracy: 39.62%
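The logged run fine-tunes CLIP on image-text pairs for 10 epochs; the exact captions and training loop aren't reproduced here, but a sketch of the prediction side, assuming each image is scored against two candidate captions and the higher-scoring one wins (the checkpoint and caption wording are guesses):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
captions = ["a real photograph", "an AI-generated image"]  # assumed wording

def predict(image_path):
    # Score the image against both captions and pick the more likely one.
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(text=captions, images=image,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image  # shape (1, 2)
    return "FAKE" if logits.argmax(dim=-1).item() == 1 else "REAL"
```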