
Understanding and Using Large Language Models (LLMs)


The Basics

Understanding LLMs: Think of LLMs (Large Language Models) as advanced tools that process and generate text using sophisticated statistical algorithms. These models simulate intelligence by recognizing patterns and structures in large datasets of text. It's important to note that there is no real intelligence at work, not even an artificial one. LLMs generate responses based on learned patterns from their training data. From a mathematical point of view, you could just as well call them function approximators.

LLMs work with text embeddings: Before training and when receiving a "prompt" (a text message to generate a response for), natural language is broken down into smaller units called tokens. These units can be words, subwords, or even characters, depending on the tokenization strategy used. For example, "Let's eat, grandpa" might be tokenized into ["Let's", "eat", ",", "grandpa"]. Each token is then mapped to a numerical vector, shown here in a simplified form with one number per token: [0.53495394, 0.32834801, 0.3334564, 0.39458397]. Such a vector is called a token embedding; the list of all token embeddings for a text is called a text embedding.
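
You can see tokenization in action with OpenAI's open-source tiktoken library. A minimal sketch (the exact token IDs depend on the encoding used):

```python
# Minimal tokenization sketch using OpenAI's open-source tiktoken library.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")     # encoding used by GPT-4-era models
token_ids = encoding.encode("Let's eat, grandpa")   # text -> integer token IDs
pieces = [encoding.decode([t]) for t in token_ids]  # the text fragment behind each ID

print(token_ids)  # a short list of integers, one per token
print(pieces)     # the subword pieces the text was split into
```

Inside the model, each token ID is then looked up in a learned embedding matrix to obtain its embedding vector.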

These embeddings are vectors in a high-dimensional space. This means that not only is each word stored sequentially as a number, but the lexical structure of the text and many other properties are encoded as well. The embeddings convey syntactic and semantic information, capturing relationships between words and their meanings. To turn text into text embeddings, a specific pre-trained embedding model, such as text-embedding-3-large, is used. Such models are trained on large text corpora in multiple languages. This enables text embeddings to convey the individual properties of all languages, including non-Latin ones. The resulting text embedding representation, even of multilingual text, is coherent and captures the text's meaning from a human perspective. An embedding model's performance can be measured using dataset benchmarks such as MTEB, whose leaderboard lists and compares the best-performing embedding models.
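
As a sketch of how embeddings capture meaning, the following compares two paraphrased sentences via cosine similarity, using OpenAI's embeddings endpoint (the example texts are illustrative; requires an OPENAI_API_KEY):

```python
# Sketch: comparing two texts via embeddings and cosine similarity.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # text-embedding-3-large returns a 3072-dimensional vector by default
    response = client.embeddings.create(model="text-embedding-3-large", input=text)
    return np.array(response.data[0].embedding)

a = embed("The cat sat on the mat.")
b = embed("A feline rested on the rug.")

# Cosine similarity approaches 1.0 for semantically similar texts
similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {similarity:.3f}")
```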

Training LLMs from Text Embeddings

When LLMs are trained using text embeddings or later used to respond to user texts/prompts, they do not understand meaning in the way humans do and cannot reason about the text. However, they are highly adept at identifying and generating text that aligns with the patterns and structures they have learned during training. If you'd like to learn more about how attention works in Transformer models, the original 2017 Google Brain paper, "Attention Is All You Need," is a good place to start.
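
For intuition, here is a minimal NumPy sketch of that paper's core operation, scaled dot-product attention; the random matrices stand in for the learned query/key/value projections of a real model:

```python
# Minimal NumPy sketch of scaled dot-product attention, the core operation of
# the Transformer: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token attends to each other token
    weights = softmax(scores)        # each row is a probability distribution
    return weights @ V               # weighted mix of value vectors per token

# Toy example: 4 tokens, dimension 8; random stand-ins for learned projections
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one contextualized vector per token
```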

Data Preparation: Training an LLM involves several steps: collecting raw text, deciding on a text corpus to train on, and preprocessing the raw text data to generate the final model. Initially, vast amounts of text data are collected from diverse sources. This raw text data is embedded using the text embedding models mentioned above. In the latest model releases of the major vendors, pictures, videos, and audio data are embedded alongside text. The result is called a multi-modal large language model. When the training data is well labeled, this helps the model pay attention to how objects in images relate to the meaning of text, or even to more advanced relations, like the movement of objects in video clips or the correspondence between specific sounds in audio and text descriptions.

Training Process: Once the text data is converted into embeddings, LLMs are trained using these embeddings. Like most neural networks, LLMs are trained with an algorithm called backpropagation. This involves feeding the embeddings into the model and adjusting the model's parameters to minimize the difference between the predicted outputs and the actual training text. Measuring this difference is the task of the loss function. Training an LLM is an iterative, step-by-step process; each full pass over the training data is called an "epoch." After each epoch, the loss function indicates whether the result is better or worse than the previous epoch's output. The process of adjusting the weights and biases of an LLM during training is called gradient descent, as it iteratively reduces the error by moving in the direction of the steepest decrease in loss. Eventually, further training no longer improves the model's output, and at this point the training is deemed complete. The result of such a training process is a base LLM that isn't yet optimized for chat or for providing ethical and politically correct responses.
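
A toy illustration of gradient descent with a single weight; a real LLM adjusts billions of parameters the same way, guided by backpropagated gradients:

```python
# Toy gradient descent: fit y = w * x by minimizing mean squared error.
# One loop iteration = one full pass over the data (an "epoch").
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]  # roughly y = 2x

w = 0.0    # a single model "weight"; a real LLM adjusts billions of these
lr = 0.01  # learning rate: the step size along the gradient

for epoch in range(200):
    # Gradient of the loss  mean((w*x - y)^2)  with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # step in the direction of steepest loss decrease

print(round(w, 3))  # converges near 2.0
```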

Conclusions: LLMs are trained to perform well in reproducing a wide variety of texts across different contexts. However, the goal is for the model to generalize the patterns in the training text, not to become a complex and expensive copy-and-paste machine. This is why training and test data are split, and many methods have been developed to prevent models from fitting too perfectly to the training data - a condition called overfitting. A model that fits too well to its training data would almost always return the exact training data for a given user input. In contrast, underfitting - a condition where a model is too simple or has not been trained enough to capture the patterns in its training data - leaves it unable to generalize at all, often resulting in gibberish responses to user inputs. It should now be clearer why a well-trained LLM can be regarded as a proficient function approximator: it iteratively approximates a likely meaningful series of output tokens based on the input tokens provided, purely through sophisticated statistical methods, not reasoning.
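
The train/test split is what makes overfitting measurable. A toy sketch (linear data, two model complexities; exact numbers will vary with the random noise, but the high-degree fit typically fails on the held-out points):

```python
# Toy illustration of overfitting and the train/test split: a degree-5
# polynomial matches the training points far more closely than the data
# warrants and typically errs badly on held-out test points, unlike the
# simple linear fit.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + rng.normal(scale=0.1, size=8)  # noisy linear data

x_train, y_train = x[:6], y[:6]  # training split
x_test, y_test = x[6:], y[6:]    # held-out test split

for degree in (1, 5):
    coefficients = np.polyfit(x_train, y_train, degree)
    test_error = np.mean((np.polyval(coefficients, x_test) - y_test) ** 2)
    print(f"degree {degree}: test error {test_error:.4f}")  # degree 5 errs more
```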

Post base model training: To optimize an LLM for applications such as chatbots and safe use, further training is necessary, which involves:

  • Conversational data: Training the base model further on curated conversational data optimizes it for conversational use in a chatbot.
  • Reinforcement Learning from Human Feedback (RLHF): Letting humans provide feedback on certain model responses: "Is answer A or B better?", "Was this answer helpful?". When industry experts in a certain domain provide such feedback, such as the experts working at OpenAI, the weights and biases of the neural network can be adjusted quickly and effectively to generate more desirable responses (from a human perspective).
  • In production: RLHF, Consensus: A model can be effectively adjusted based on the consensus of many millions of chatbot users. This is what happens when chatbots are in production over many years, and it is why model vendors create snapshots of their models. As a consequence of this continued adjustment, model behavior may change over time: the model may answer more appropriately but can also become less of an expert in some domains. And if a majority of users are opinionated on a topic, learning from RLHF consensus feedback may produce responses that suffer from human confirmation bias.
  • Customer-specific fine-tuning: Some vendors, as part of their commercial and API offerings, allow models to be further specialized through fine-tuning: question/answer-structured data can be uploaded to adjust the weights and biases of a copy of a production LLM (see the sketch after this list).
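
A sketch of what customer-specific fine-tuning can look like with OpenAI's fine-tuning API; the file name and its contents are illustrative, and other vendors offer comparable mechanisms:

```python
# Sketch: customer-specific fine-tuning via OpenAI's fine-tuning API.
# "qa_examples.jsonl" is a hypothetical file with one chat example per line, e.g.:
# {"messages": [{"role": "user", "content": "Q..."}, {"role": "assistant", "content": "A..."}]}
from openai import OpenAI

client = OpenAI()

# 1. Upload the curated question/answer training data
training_file = client.files.create(
    file=open("qa_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Train a specialized copy of a production model on it
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # poll this job ID until it yields the fine-tuned model's name
```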

Data Sources: The data used to train LLMs comes from a wide range of sources, including:

  • Webpages: Large-scale web crawls provide diverse and extensive text data. OpenAI's GPTBot, for example, crawls the internet daily.
  • Books: Digitized books across various genres and languages contribute rich, structured text. BookCorpus was the foundation of OpenAI's GPT-1 model.
  • Articles: News articles, academic papers, and other periodicals offer up-to-date and high-quality text. Some vendors are negotiating with publishers.
  • Social Media: Posts and comments from social media platforms provide conversational and informal text. xAI's models, for example, are trained on data from X (formerly Twitter).
  • Forums and Blogs: Online discussions and personal blogs offer niche and community-specific language usage.
  • Public Domain Texts: Texts that are freely available in the public domain, including historical documents and literature, such as Project Gutenberg and many others.

By leveraging such diverse datasets, LLMs can be trained to generate text across a wide range of contexts and topics. We can say that all major vendors' LLMs that score high across many metrics, including HumanEval, contain a "compression of the meaning of almost all public data of the internet." While these models don't store the exact words, images, and sounds, they store their patterns and relations to each other.

How an LLM Works

An LLM (Large Language Model) is hosted on a server and accelerated by a GPU with large amounts of dedicated video memory, typically 40 GB or more. GPUs specialize in accelerating vector and matrix operations (linear algebra). Vendors such as OpenAI, Anthropic, and Google operate data centers with hundreds of thousands of servers that have access to GPU accelerators produced by NVIDIA and other manufacturers. When a user prompt needs to be processed by an LLM, the entire model is loaded into video memory. The user prompt is usually accepted by an API, whether through ChatGPT, Bing Chat, or any website acting as the user's front-end.

After authenticating the API request, the prompt is turned into a text embedding. It is then fed into a sequence of specific small NLP models to check for content and EULA violations. This process is called guardrailing. A score is usually calculated per guardrail category. The text embedding is then fed into the LLM orchestrator code for running the inference process. In this process, the LLM uses its learned patterns to generate a response based on the input embedding.
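
Vendors do not publish their internal guardrail pipelines, but OpenAI's Moderation API gives a concrete idea of what such a per-category scoring step can look like. A minimal sketch:

```python
# Sketch of a guardrail check before inference. Vendors' internal pipelines are
# not public; OpenAI's Moderation API serves here as one concrete example.
from openai import OpenAI

client = OpenAI()
result = client.moderations.create(input="user prompt goes here")

verdict = result.results[0]
if verdict.flagged:
    # One score per guardrail category (hate, violence, self-harm, ...)
    print("Prompt rejected by guardrails:", verdict.category_scores)
else:
    print("Prompt passed; forwarding to the LLM for inference.")
```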

Inference Process

  1. Inference: The processed embeddings are fed into the LLM. The model processes these embeddings through its layers, applying the learned patterns to generate a response embedding. A good analogy to this process in humans is the act of "thinking aloud": with each word we hear and speak, the next thought forms based on the previous context. While humans can employ advanced cognitive functions like deductive reasoning, the model cannot. It simply predicts the next word based on the previous context without reaching conscious conclusions - it exclusively pays attention to its training data. The quality of the LLM's response depends on several factors: the organization and quality of the training data, the efficiency of the text embedding model, the model's attention mechanism, the hyperparameters, and the clarity and precision of the prompt and its context. If any of these factors are suboptimal, or if the attention mechanism doesn't function well in a specific case, the model may produce hallucinated, distorted, or inappropriate results.

  2. Decoding: The response embedding is converted back into human-readable text through a decoding process. This step involves various techniques and hyperparameters to ensure the generated text is coherent and contextually appropriate (a sampling sketch follows the list below):

    • Temperature: Controls the randomness of the output; the parameter is documented by both OpenAI and Anthropic. A higher temperature (e.g., 0.8) makes the output more random, while a lower temperature (e.g., 0.2) makes it more deterministic. Loosely, this may simulate or correlate with what humans call creativity.
    • Top-k: Limits the sampling pool to the top k most probable next tokens. For instance, if top-k is set to 1000, only the 1000 most likely next words are considered. OpenAI models currently do not expose this parameter; the pool appears to be unrestricted.
    • Top-p (Nucleus Sampling): Limits the sampling pool to the smallest set of tokens whose cumulative probability exceeds a threshold p. For example, with top-p set to 0.9, the model considers the smallest set of words that make up 90% of the probability distribution. Loosely, lowering top-p corresponds to what humans call focus.
    • Penalties: Adjust the likelihood of repeated tokens. For example, a repetition penalty discourages the model from generating the same word or phrase multiple times: positive values increase the variety of words, while negative values make repetition more likely.
    • Stop Words: Specific words or phrases that, when generated, stop the model from producing further output. These are essential for controlling the length and relevance of the response, but also for implementing hard guardrails should the model produce harmful responses.
    • Seed: When an LLM uses seeded random number generators, you can set the initial state of the random number generator using the seed hyperparameter. Every time you prompt the LLM with the same input, the same seed, and the same hyperparameters, you will get the same sequence of random numbers and, consequently, very similar outputs, allowing for reproducibility, albeit not full determinism. Note: RedakTool provides three multi-hyperparameter controls: Creativity, Focus, and Vocabulary richness. These controls tune all hyperparameters of a vendor's model automatically, if not set explicitly.
  3. Post-processing: The generated response may undergo additional checks and modifications to ensure it meets content guidelines and quality standards before being sent back to the user. Stop word filtering may apply here, ensuring that unwanted words do not appear in the final output.
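
A toy sketch of how temperature, top-k, and top-p interact during sampling; the logits and five-word vocabulary stand in for a real model's output over tens of thousands of tokens:

```python
# Toy sketch of how temperature, top-k, and top-p shape next-token sampling.
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    logits = np.asarray(logits, dtype=np.float64)

    # Temperature: <1 sharpens the distribution (more deterministic), >1 flattens it
    logits = logits / max(temperature, 1e-8)

    # Top-k: discard everything but the k most probable tokens
    if top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)

    # Softmax over the remaining candidates
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set whose cumulative probability >= p
    order = np.argsort(probs)[::-1]
    cutoff_index = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = np.zeros_like(probs)
    nucleus[order[:cutoff_index]] = probs[order[:cutoff_index]]
    nucleus /= nucleus.sum()

    return np.random.choice(len(probs), p=nucleus)

vocab = ["eat", "grandpa", ",", "!", "the"]
print(vocab[sample_next_token([2.0, 1.5, 0.3, 0.1, -1.0])])
```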

Infrastructure and Scaling

The deployment of LLMs involves significant infrastructure to handle the computational demands. While LLMs can run locally or be hosted on-premise in private data centers, even using open-source software like ollama, it is often deemed cost-inefficient by industry experts. This is because GPUs and the energy they consume are drastically more expensive when not bought in bulk. Many major model vendors also cross-finance their real expenses using investor capital. Another thing to consider is model performance. While open-source models are improving, they still lag behind commercial offerings as of mid-2024. However, this sentence might not age well.
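
For local or on-premise experiments, a tool like ollama exposes self-hosted models over a simple HTTP API. A minimal sketch, assuming ollama is running on its default port and a model such as llama3 has been pulled:

```python
# Sketch: querying a locally hosted open-source model through ollama's HTTP API.
import json
import urllib.request

payload = {
    "model": "llama3",
    "prompt": "Explain text embeddings in one sentence.",
    "stream": False,  # return one JSON object instead of a token stream
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
```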

Essentials to consider when operating LLMs, whether self-hosted with open-source models or at a vendor's data-center scale:

  • Data Centers: Facilities with vast numbers of servers equipped with high-end GPUs designed for machine learning workloads.
  • GPUs: Graphics Processing Units from manufacturers like NVIDIA that provide the necessary computational power for processing large-scale models efficiently.
  • Scalability: Systems should be designed to scale horizontally, meaning they can add more servers to handle increasing loads in parallel, ensuring consistent performance and availability.

Real-time Usage

When using an LLM in real-time applications like RedakTool, the following steps are typical:

  1. User Interaction: The user types a message/text/query into the application interface. RedakTool.ai simplifies this process by providing content extraction and professionally crafted, task-specific dynamic prompts. It also helps auto-tune hyperparameters to produce the best results.
  2. API Call: The application sends the user's input to the LLM server via an API call (see the sketch after this list). This is why RedakTool.ai asks for an API key per model vendor.
  3. Processing: The server processes the input, as outlined in the inference process, and generates a response. When using commercial offerings that provide API access, most model providers include contract terms that rule out reinforcement learning from API data, meaning your data should be safe from being used for training.
  4. Response Delivery: The server sends the generated response back to the application, like RedakTool, which displays it.
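
A sketch of step 2, the API call, using OpenAI's chat completions endpoint; RedakTool's actual prompts and auto-tuned hyperparameters are not public, so all values below are illustrative:

```python
# Sketch of an application-side API call to a model vendor.
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # the per-vendor API key the application asks for

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a precise text-editing assistant."},
        {"role": "user", "content": "Summarize the extracted article text: ..."},
    ],
    temperature=0.2,  # low randomness for factual editing tasks
    top_p=0.9,
    seed=42,          # improves reproducibility of repeated identical requests
)
print(response.choices[0].message.content)
```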

By leveraging the power of LLMs, applications can provide seemingly intelligent, contextually relevant responses in real-time, improving efficiency when working on specific text-related tasks.

I hope you now understand that LLMs are not intelligent. If they reason correctly, it is because the training data contained the correct information, and the attention mechanisms of the Transformer architecture behind the Generative Pre-trained Transformer (GPT) resampled that text well. But LLMs, and RedakTool with them, can be a great tool or toolbox for improving your work with text. It's better to call these tools Intelligence Assisted (IA) rather than Artificial Intelligence (AI).

Economics