Free Chapter · 14 min · Chapter 2/5

Core Concepts of Large Language Models: Token, Embedding, and Transformer

Deep dive into the core mechanisms, mainstream models, and key parameters of LLMs.

Key Learning Points of This Chapter

1. Understand what a Token is and why context windows, billing, and generation speed are all measured in Tokens

2. Understand how Embeddings convert text into vectors, enabling semantic search and RAG

3. Grasp the Transformer architecture and its core Self-Attention mechanism

4. Know the mainstream closed-source and open-source models and their respective strengths

5. Master the Temperature and Top-P parameters that control output randomness

When you use ChatGPT or Claude, every word you type goes through a series of precise processing steps. This chapter will unveil the core mechanisms of large language models, helping you truly understand the meanings behind high-frequency terms like Token, Embedding, and Transformer.

Token: The AI's Smallest Reading Unit

**What is a Token?** Large language models do not read text word by word; instead, they process text by splitting it into Tokens. A Token is the smallest unit of text the model processes.

**English Tokens**: Approximately 1 Token ≈ 0.75 words, or 1 word ≈ 1.3 Tokens. "Hello world" = 2 Tokens, "Artificial intelligence" = 2-3 Tokens.

**Chinese Tokens**: Token segmentation for Chinese is more complex. A single Chinese character typically occupies 1-2 Tokens, and common phrases may be encoded as 1 Token. "人工智能" (Artificial Intelligence) might be 2-3 Tokens.

**Why are Tokens important?** Because all LLM limitations and billing are based on Tokens: the context window is measured in Tokens, API calls are billed per Token, and generation speed is measured in Tokens/second.

Practical Tip

Quick Token estimation: For English, roughly words × 1.3. For Chinese, roughly characters × 1.5. Most AI platforms provide Token counter tools. In practice, you can test a few typical requests first to estimate costs.
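As a rough illustration of these rules of thumb, here is a tiny Python estimator. The `estimate_tokens` helper and its multipliers are heuristic assumptions, not a real tokenizer; use your platform's Token counter tool for exact figures.

```python
# Rough Token estimator based on the rules of thumb above:
# English words x 1.3, Chinese characters x 1.5. Heuristic only.

def estimate_tokens(text: str) -> int:
    """Estimate the Token count of mixed Chinese/English text."""
    # Count CJK characters in the basic unified ideograph range.
    cjk_chars = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    # Treat everything else as English-like text and count its words.
    non_cjk = "".join(
        " " if "\u4e00" <= ch <= "\u9fff" else ch for ch in text
    )
    english_words = len(non_cjk.split())
    return round(english_words * 1.3 + cjk_chars * 1.5)

print(estimate_tokens("Hello world"))  # 3 by this heuristic
print(estimate_tokens("人工智能"))      # 6 by this heuristic
```

Real tokenizers segment text differently per model, so treat this only as a budgeting aid.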

Context Window

The context window is the maximum number of Tokens a model can process at once. You can think of it as the model's "working memory"—the larger the window, the more information the model can consider simultaneously.

**Context windows of mainstream models**: GPT-4o (128K Tokens), Claude 3.5 Sonnet (200K Tokens), Gemini 1.5 Pro (1M Tokens), Kimi (200K Tokens). 128K Tokens is roughly equivalent to a 300-page book.

**Impact of context window**: Window too small → long conversations or documents cannot be fully processed. Window very large → can analyze an entire book or codebase at once, but Token costs are also higher.
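The window sizes quoted above can be turned into a simple budget check. This sketch assumes a hypothetical `fits_context_window` helper and reserves some Tokens for the model's reply:

```python
# Context window sizes (in Tokens) from the figures above.
CONTEXT_WINDOWS = {
    "gpt-4o": 128_000,
    "claude-3.5-sonnet": 200_000,
    "gemini-1.5-pro": 1_000_000,
}

def fits_context_window(model: str, prompt_tokens: int,
                        reply_budget: int = 4096) -> bool:
    """True if the prompt plus a reserved reply budget fits the window."""
    return prompt_tokens + reply_budget <= CONTEXT_WINDOWS[model]

print(fits_context_window("gpt-4o", 100_000))  # True
print(fits_context_window("gpt-4o", 125_000))  # False: no room left for the reply
```

The `reply_budget` value is an illustrative default; in practice you would reserve however many Tokens you allow the model to generate.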

Embedding: Turning Words into Numbers

**What is Embedding?** Computers cannot directly understand words; they need to convert words into numerical vectors (a string of numbers) for processing. Embedding is this conversion process.

**Vector**: Each Token is converted into a high-dimensional vector, for example, an array containing 768 or 1536 numbers. These numbers encode the semantic information of the Token—words with similar meanings are closer in the vector space.

**Semantic similarity**: In the Embedding space, the vectors for "king" and "queen" are very close, while "king" and "apple" are far apart. This is the foundation for AI's understanding of semantics.

**Applications of Embedding**: Semantic search (finding results with similar meaning, not just keyword matching), RAG (Retrieval-Augmented Generation), recommendation systems, text classification.
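"Closer in vector space" is usually measured with cosine similarity. This minimal sketch uses made-up 3-dimensional vectors purely to illustrate the geometry; real Embeddings have hundreds of dimensions:

```python
import math

# Toy 3-dimensional "Embeddings" (invented values for illustration).
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.88, 0.82, 0.12],
    "apple": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```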

Important Note

Embedding is prerequisite knowledge for understanding RAG (Retrieval-Augmented Generation). The core of RAG is: first convert documents into Embeddings and store them in a vector database; during a query, find relevant documents via Embedding similarity.

Vector Database

While regular databases store structured data (tables, numbers, text), vector databases are specialized for storing and retrieving Embedding vectors. Their core capability is "similarity search"—given a query vector, quickly find the most similar vectors.
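Similarity search can be sketched in a few lines. The in-memory `store` and `similarity_search` function below are illustrative only; real vector databases use approximate indexes (e.g. HNSW) to stay fast at scale:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# A tiny in-memory "vector database": document id -> toy 3-d embedding.
store = {
    "doc-cats":  [0.9, 0.1, 0.0],
    "doc-dogs":  [0.8, 0.2, 0.1],
    "doc-stock": [0.0, 0.1, 0.9],
}

def similarity_search(query_vec, k=2):
    """Brute-force nearest-neighbour search: rank every stored vector
    by similarity to the query and return the top k ids."""
    ranked = sorted(store, key=lambda doc_id: cosine(store[doc_id], query_vec),
                    reverse=True)
    return ranked[:k]

print(similarity_search([0.85, 0.15, 0.05]))  # the two animal documents
```

This is also the retrieval half of RAG: embed the query, find the nearest document vectors, and feed those documents to the model.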

**Mainstream vector databases**: Pinecone (cloud-hosted, ready-to-use), Weaviate (open-source, feature-rich), Milvus (open-source, popular among Chinese users), ChromaDB (lightweight, good for prototyping), Qdrant (high-performance, written in Rust).

Transformer: The Foundational Architecture of Modern AI

The Transformer is a neural network architecture proposed in Google's 2017 paper "Attention Is All You Need." It is the underlying architecture for all mainstream large models like GPT, BERT, and Claude.

**Core innovation—Attention Mechanism**: The core of the Transformer is the "Self-Attention mechanism." Simply put, when the model processes a sentence, the attention mechanism allows each word to "pay attention" to other words in the sentence, understanding the relationships between words.

**Example**: "Xiao Ming gave the apple to Xiao Hong, and she was very happy"—the attention mechanism enables the model to understand that "she" refers to "Xiao Hong" and not "apple," because the model calculates the association strength between "she" and other words in the sentence.

**Why is the Transformer so important?** Earlier RNNs (Recurrent Neural Networks) had to process text sequentially and could not compute in parallel, so training was slow. The Transformer processes entire sequences in parallel, greatly improving training efficiency and making it feasible to train ultra-large-scale models.
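The attention mechanism described above can be sketched as scaled dot-product self-attention. This toy version uses invented 2-dimensional vectors and skips the learned Q/K/V projection matrices that a real Transformer applies:

```python
import math

def softmax(xs):
    """Normalize scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each position's output is a mix of
    every value vector, weighted by how well its query matches each key."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Three toy token vectors; the first two are similar, the third is not.
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# Simplest case: use the raw vectors as queries, keys, AND values
# (real Transformers learn separate projection matrices for each role).
out = self_attention(x, x, x)
print(out[0])  # the first token's output leans toward the similar second token
```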

Overview of Mainstream Large Models

**GPT Series (OpenAI)**: GPT-4o is the current flagship, GPT-4o mini is the cost-effective version. Strengths are general capabilities and tool use.

**Claude Series (Anthropic)**: Claude 3.5 Sonnet is one of the strongest models in overall performance. Strengths are long-text analysis, code, and safety.

**Gemini Series (Google)**: Gemini 1.5 Pro has an ultra-long context window (1M Tokens). Strengths are multimodality and information integration.

**Open-source representatives**: LLaMA 3 (Meta, strong general capabilities), Qwen 2.5 (Alibaba, excellent Chinese capabilities), DeepSeek V3 (DeepSeek, outstanding reasoning), Mistral (France, high efficiency).

Key Parameters: Temperature and Top-P

**Temperature**: Controls the randomness of the output. Temperature=0 yields the most deterministic output (same each time), Temperature=1 yields more random and diverse output. Recommended 0-0.3 for code writing, 0.7-1.0 for creative writing.

**Top-P (Nucleus Sampling)**: Another parameter controlling diversity. Top-P=0.1 means selecting only from the top 10% highest probability Tokens, Top-P=1.0 means selecting from all Tokens. Typically used in conjunction with Temperature.

Caution

Temperature and Top-P usually do not need to be adjusted simultaneously. It's recommended to fix one (e.g., Top-P=1) and only adjust the other (Temperature) to control output style. Adjusting both can easily lead to unpredictable results.
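A sketch of how these two parameters interact during sampling. The `sample_token` function and its logits are made up for illustration; real implementations differ in details:

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0):
    """Pick a token index: temperature rescales the logits, then nucleus
    (Top-P) sampling keeps only the smallest set of tokens whose
    cumulative probability reaches top_p, and samples from that set."""
    if temperature == 0:  # greedy: always the single most likely token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: keep the highest-probability tokens up to top_p mass.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample from the kept tokens, renormalized to their total mass.
    r = random.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [4.0, 2.0, 0.5, 0.1]  # invented scores for 4 candidate tokens
print(sample_token(logits, temperature=0))            # greedy pick: index 0
print(sample_token(logits, temperature=0.7, top_p=0.9))
```

Note how a very low `top_p` collapses the candidate set to a single token, which is why adjusting both parameters at once gives hard-to-predict results.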

Chapter Terminology Quick Reference

**Token**: The smallest unit of text processed by an LLM.
**Context Window**: The maximum number of Tokens a model can process at once.
**Embedding**: The process of converting text into numerical vectors.
**Vector Database**: A database specialized for storing and retrieving Embeddings.
**Transformer**: The foundational architecture of modern large models.
**Attention Mechanism**: The core technology enabling models to understand relationships between words.
**Temperature**: Parameter controlling output randomness.
**Top-P**: Parameter controlling sampling range.

Text Processing Pipeline

Input Text → Tokenization → Embedding (Vectorization) → Transformer Processing → Generate Output Tokens → Decode to Text

Mainstream Model Comparison

GPT-4o (OpenAI, closed-source) | Claude 3.5 (Anthropic, closed-source) | Gemini 1.5 (Google, closed-source) | LLaMA 3 (Meta, open-source) | Qwen 2.5 (Alibaba, open-source)


After understanding the core mechanisms of large language models, the next chapter will teach you key techniques for interacting with AI—Prompt Engineering, RAG, and Function Calling.
