Skip to content

Latest commit

 

History

History
117 lines (94 loc) · 4.02 KB

lexical_analyzer.md

File metadata and controls

117 lines (94 loc) · 4.02 KB

Lexical Analyzer

In C++, a lexer, also known as a lexical analyzer or scanner, is a crucial component of the compiler or interpreter. Its primary role is to process the input source code and convert it into a sequence of tokens. Tokens are meaningful units such as keywords, identifiers, literals, operators, and punctuation.

Here's a detailed breakdown of the lexer's function:

  1. Input Handling: The lexer reads the raw input text of the source code character by character.
  2. Tokenization: It groups sequences of characters into tokens. For example, the characters int would be recognized as a keyword token.
  3. Classification: The lexer classifies each token into a specific category such as identifier, keyword, literal, operator, or delimiter.
  4. Skipping Whitespaces and Comments: Lexers typically ignore whitespaces and comments since they are not meaningful in the context of parsing the language syntax.
  5. Error Handling: If the lexer encounters unrecognized sequences of characters that do not conform to the language's lexical rules, it generates an error.

Example of Tokens in C++

For the following C++ code:

int main() {
    int a = 5;
    std::cout << a << std::endl;
}

The lexer would generate tokens like:

  • int (keyword)
  • main (identifier)
  • ( (left parenthesis)
  • ) (right parenthesis)
  • { (left brace)
  • int (keyword)
  • a (identifier)
  • = (assignment operator)
  • 5 (integer literal)
  • ; (semicolon)
  • std (identifier)
  • :: (scope resolution operator)
  • cout (identifier)
  • << (insertion operator)
  • a (identifier)
  • << (insertion operator)
  • std (identifier)
  • :: (scope resolution operator)
  • endl (identifier)
  • ; (semicolon)
  • } (right brace)

Lexers in Practice

In practice, lexers are usually generated using tools like:

  • Flex (Fast Lexical Analyzer Generator): It takes a set of rules defined in a lex specification file and generates C or C++ code for the lexer.
  • Handwritten Lexers: Sometimes, lexers are written manually in C++ for more control and optimization. This involves defining a set of regular expressions and corresponding actions for token recognition.

Integration with Parsers

The lexer works closely with the parser. While the lexer tokenizes the input, the parser takes these tokens and builds an abstract syntax tree (AST) based on the grammar rules of the programming language.

Simple Example of a Lexer in C++

Here's a simplified example of a basic lexer in C++ without using external tools:

#include <iostream>
#include <string>
#include <cctype>
#include <vector>

enum TokenType {
    TOKEN_INT,
    TOKEN_IDENTIFIER,
    TOKEN_OPERATOR,
    TOKEN_UNKNOWN
};

struct Token {
    TokenType type;
    std::string value;
};

std::vector<Token> lex(const std::string& input) {
    std::vector<Token> tokens;
    size_t i = 0;

    while (i < input.length()) {
        if (std::isspace(input[i])) {
            ++i;
            continue;
        }

        if (std::isalpha(input[i])) {
            size_t start = i;
            while (i < input.length() && std::isalnum(input[i])) ++i;
            tokens.push_back({TOKEN_IDENTIFIER, input.substr(start, i - start)});
        } else if (std::isdigit(input[i])) {
            size_t start = i;
            while (i < input.length() && std::isdigit(input[i])) ++i;
            tokens.push_back({TOKEN_INT, input.substr(start, i - start)});
        } else {
            tokens.push_back({TOKEN_OPERATOR, std::string(1, input[i])});
            ++i;
        }
    }

    return tokens;
}

int main() {
    std::string code = "int main() { int a = 5; }";
    std::vector<Token> tokens = lex(code);

    for (const auto& token : tokens) {
        std::cout << "Token: " << token.value << ", Type: " << token.type << std::endl;
    }

    return 0;
}

In this example, the lex function tokenizes a simple string input, identifying identifiers, integers, and operators. This is a basic and illustrative example; a real-world lexer would handle more complex scenarios and additional token types.