Academy / Data Center Engineer

Free Chapter · 10 min · Chapter 3/5

Introduction to Power Systems: UPS, PDU, and Power Distribution Architecture

Core components of data center power systems and how they work.

Key Learning Points of This Chapter

1. Understand the definition, classification, and scale tiers of data centers
2. Master the growth trends and opportunities in the data center industry in the AI era
3. Learn about the daily work and skill requirements of a data center engineer
4. Define a clear learning path to enter the data center field from scratch

The power system is the lifeline of a data center. Without a stable power supply, even the most advanced servers are just piles of scrap metal. The design of a data center's power system directly determines its reliability tier and is a core area that data center engineers must deeply understand.

Why Are Data Centers So Demanding on Power?

A medium-sized data center may have annual electricity costs reaching tens of millions of yuan, with power costs accounting for 40-60% of total operating expenses. More critically, even a few milliseconds of power interruption can cause server downtime and data loss. For financial trading systems, a single power outage could result in losses of millions of dollars.

The AI era has made this issue even more prominent. A single NVIDIA DGX H100 AI server can consume up to 10kW of power, which is 5-10 times that of a traditional server. An AI training cluster may require tens of megawatts of power—equivalent to the electricity consumption of a small town.
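To make the scale above concrete, here is a minimal sketch of the arithmetic, assuming illustrative figures (4 AI servers per rack, a facility overhead factor of PUE 1.3); the function names and numbers are hypothetical, not from the course.

```python
# Rough capacity estimate for an AI rack and cluster. The 10 kW
# per-server figure is from the text; everything else is illustrative.
def rack_power_kw(servers_per_rack: int, kw_per_server: float) -> float:
    """Total IT load of one rack, in kW."""
    return servers_per_rack * kw_per_server

def cluster_power_mw(racks: int, rack_kw: float, pue: float = 1.3) -> float:
    """Facility power in MW, including cooling overhead via an assumed PUE."""
    return racks * rack_kw * pue / 1000

ai_rack = rack_power_kw(4, 10.0)           # 4 AI servers x 10 kW = 40 kW per rack
print(ai_rack)                             # 40.0
print(cluster_power_mw(500, ai_rack))      # 500 such racks -> 26 MW facility load
```

Even a few hundred AI racks quickly reach the tens-of-megawatts scale mentioned above, which is why power capacity is planned before anything else.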

UPS: Uninterruptible Power Supply

The UPS (Uninterruptible Power Supply) is the core line of defense in a data center's power system. When utility power fails, the UPS provides uninterrupted power during the window before the diesel generator starts, which typically takes 10-15 seconds.

Three Types of UPS

**Online/Double-conversion**: Utility power is first converted to DC to charge the batteries, then converted back to AC by an inverter to supply the load. Power is always processed by the UPS, with zero switching time. This is the most common type in data centers, offering the highest power quality but slightly lower efficiency (approx. 94-96%).

**Offline/Standby**: Under normal conditions, utility power directly supplies the load; it only switches to battery power when utility power fails. Switching time is about 5-12 milliseconds. Low cost but insufficient reliability, not suitable for data centers.

**Line-interactive**: Combines features of online and offline types, offering some voltage regulation capability. Suitable for small server rooms or non-critical loads.

Common Interview Question: Why do data centers prefer online UPS? Answer: Zero switching time, continuous power conditioning, and the highest power quality.

PDU: Power Distribution Unit

The PDU (Power Distribution Unit) is responsible for distributing power from the UPS to each individual server. You can think of a PDU as the data center's 'power splitter'.

PDU Hierarchy

**Main Switchboard**: Receives power from the transformer or UPS and performs primary distribution. Usually located in the electrical room.

**Floor PDU**: Distributes power to various server room areas. Contains circuit breakers and monitoring equipment.

**Rack PDU**: Installed inside server racks, directly powering the servers. Divided into basic types (providing only outlets) and intelligent types (capable of remotely monitoring power and current per outlet).

Intelligent PDUs are standard in modern data centers, allowing operations teams to monitor the power consumption of each server in real-time, promptly detect anomalies, and plan capacity rationally.

Power Distribution Architecture: From Utility to Server

A complete data center power path is: Utility Power → High-Voltage Distribution → Transformer (Step-down) → ATS (Automatic Transfer Switch) → UPS → PDU → Server.

**N+1 Architecture**: N UPS units meet the load demand, with one additional unit added for redundancy. For example, in a 3+1 architecture, 3 UPS units can support the entire load, and the 4th is a backup. Any single unit failure does not affect operation. Corresponds to Tier III standards.

**2N Architecture**: Two completely independent power paths, each capable of independently supporting the entire load. The two paths are completely separate from the utility power intake. This is the basis for Tier IV standards—even if one entire path fails completely, the other can ensure business continuity.

**2N+1 Architecture**: Adds one redundant UPS to each path on top of the 2N architecture. This is the highest level of reliability configuration, typically adopted only by core financial systems and hyperscale data centers.
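The redundancy schemes above boil down to a small capacity calculation. The sketch below is illustrative: the function name and the 400 kW module size are assumptions, not figures from the text.

```python
import math

def ups_units_required(load_kw: float, unit_kw: float, scheme: str) -> int:
    """Number of UPS units needed under a given redundancy scheme.
    scheme: 'N', 'N+1', '2N', or '2N+1', per the architectures above."""
    n = math.ceil(load_kw / unit_kw)   # units needed just to carry the load
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1                   # one spare unit on the single path
    if scheme == "2N":
        return 2 * n                   # two fully independent paths
    if scheme == "2N+1":
        return 2 * (n + 1)             # one spare on each of the two paths
    raise ValueError(f"unknown scheme: {scheme}")

# A hypothetical 1200 kW load served by 400 kW UPS modules:
print(ups_units_required(1200, 400, "N+1"))   # 3 + 1 = 4
print(ups_units_required(1200, 400, "2N"))    # 2 x 3 = 6
```

The jump from N+1 to 2N roughly doubles the UPS count, which is why 2N and 2N+1 are reserved for the highest reliability tiers.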

What Do Data Center Engineers Need to Master?

Core skills related to power systems include:

**Single-line diagrams (SLD)**: Being able to read data center power system design drawings.

**UPS maintenance**: Battery testing, module replacement, firmware upgrades.

**PDU management**: Load balancing, capacity planning, fault troubleshooting.

**ATS testing**: Regularly simulating utility power failure to verify automatic switching functionality.

Practical Advice

When learning about power systems, it's recommended to visit an actual data center. Many IDC operators offer free tours. Seeing UPS and PDU equipment in person will deepen your understanding of the theoretical knowledge tenfold.

Caution

AI server power density is 5-10 times that of traditional servers, meaning traditional N+1 power distribution architectures may be insufficient. New AI data centers should prioritize 2N architectures to meet high power density demands.

Important Reminder

UPS battery lifespan is typically 3-5 years and is the most easily overlooked failure point in data centers. Regular battery testing and timely replacement are key to ensuring power continuity—the root cause of many data center power outage incidents is battery aging.

Power system knowledge is a core exam topic for the CDCP certification and a high-frequency topic in data center engineer interviews. The full course will further explain cooling systems and network architecture.

Data Center Basic Infrastructure Flow

User Request → Load Balancer → Application Server → Database

Before learning Prompt Engineering, you need to understand how Large Language Models (LLMs) work. You don't need to delve into mathematical details, but understanding the basic principles will help you write better prompts and know why some prompts work while others don't.

What is a Large Language Model?

A Large Language Model (LLM) is essentially a 'text predictor' trained on massive amounts of text. Given a piece of text, it predicts the text most likely to come next. ChatGPT, Claude, DeepSeek, Gemini—all are based on LLMs at their core.

To draw an analogy: If you type 'today's weather' in your phone's keyboard, the keyboard might automatically suggest words like 'is great' or 'is nice'. The principle of an LLM is similar, but its 'vocabulary' and 'comprehension ability' are billions of times more powerful—it has read almost all publicly available text on the internet, enabling it to generate coherent, logical, and even creative content.

More precisely, an LLM performs 'conditional probability prediction': given all previous Tokens (we'll explain Tokens in detail later), it calculates the probability distribution for the next Token, then selects one. This seemingly simple mechanism, at a sufficiently large model and data scale, has given rise to astonishing emergent capabilities—including reasoning, programming, translation, and creation.
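As a toy illustration of 'conditional probability prediction', the sketch below counts which word follows which in a tiny corpus and picks the most likely next word. Real LLMs do this over Tokens with a neural network trained on trillions of examples, but the predict-the-next-item loop is the same idea; the corpus here is made up.

```python
from collections import Counter, defaultdict

# A toy next-word predictor: learn "what tends to follow what" from raw
# counts, then predict the highest-probability continuation.
corpus = "the weather is great today the weather is nice".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent word observed after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))      # 'weather' ('weather' follows 'the' twice)
print(predict_next("weather"))  # 'is'
```

An LLM's emergent abilities come from doing this same kind of prediction with vastly more context and a far richer model of language than a frequency table.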

How Did It Learn?

LLM training can be simplified into three stages:

Stage One: Pre-training

The model reads massive amounts of text data from the internet—books, articles, web pages, code, papers, Wikipedia, etc.—learning language patterns and knowledge. This is like a student reading all the books in the world, forming a broad knowledge base.

The amount of pre-training data is staggering. A GPT-4-level model's training data may exceed 13 trillion Tokens (equivalent to tens of billions of pages of text). Training requires thousands of high-end GPUs (like NVIDIA A100/H100) running for months, costing over $100 million. This is why only a few large companies can train foundational models from scratch.

Stage Two: Supervised Fine-tuning (SFT)

The model after pre-training is just a 'completer'—it can continue a half-sentence, but it's not good at answering questions according to instructions. The SFT stage trains the model using a large number of high-quality 'instruction-response' pairs, teaching it to understand user intent and give helpful responses.

This training data is typically written by professional human annotators. For example: the instruction is 'Write a quicksort function in Python', and the response is a complete piece of Python code with explanation. Through tens of thousands to hundreds of thousands of such high-quality examples, the model learns the pattern of 'when a user asks like this, I should answer like that'.

Stage Three: Alignment / RLHF

Using techniques like Reinforcement Learning from Human Feedback (RLHF), the model learns to answer questions according to human values and preferences. This step addresses the 'safety' issue—making the model refuse harmful requests, avoid bias, and remain honest.

The RLHF process is: Have the model generate multiple responses to the same question → Human evaluators select the best one → Train a 'reward model' to learn human preferences → Use the reward model to guide the LLM to generate responses more aligned with human expectations.

You can analogize the entire process to education: Pre-training = Reading extensively (building a knowledge base), SFT = Doing exercises (learning how to answer questions), RLHF = Teacher grading homework (learning what constitutes a good answer).

Practical Advice

Understanding the three training stages is important for Prompt Engineering: Pre-training determines what the model 'knows', SFT determines 'how the model answers', and RLHF determines 'what the model cannot say'. When your prompt triggers a safety filter, it's RLHF at work.

Core LLM Parameters

When using AI tools, you'll encounter several key parameters. Understanding them will help you control the output more precisely:

Temperature

Controls the randomness of the output. When Temperature=0, the model always chooses the highest probability Token, producing the most deterministic, almost identical output each time; when Temperature=1, the model explores more possibilities, producing more diverse but also less predictable output.

**Usage Suggestions**: Factual Q&A, code generation → 0-0.3 (for accuracy); business copywriting, casual conversation → 0.5-0.7 (for naturalness); creative writing, brainstorming → 0.8-1.0 (for freshness).
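The effect of Temperature can be sketched as scaling the model's raw scores (logits) before converting them into probabilities; lower Temperature sharpens the distribution toward the top Token. The logit values below are hypothetical, and in practice Temperature=0 is treated as simply picking the top Token rather than dividing by zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into a probability distribution over Tokens.
    Lower temperature concentrates probability on the highest score."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                      # hypothetical scores for 3 Tokens
low = softmax_with_temperature(logits, 0.2)   # near-deterministic
high = softmax_with_temperature(logits, 1.0)  # more diverse
print(round(low[0], 3), round(high[0], 3))    # top Token dominates at low T
```

This is why low Temperature produces almost identical output each time: nearly all the probability mass sits on one Token.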

Context Window

The maximum number of Tokens the model can process at once, equivalent to the model's 'working memory'. GPT-4o supports 128K Tokens (~a 300-page book), Claude 3.5 supports 200K Tokens. Content exceeding the window will be truncated or lost.

**Impact on Prompt Engineering**: The context window determines how much background information, examples, and instructions you can include in a single conversation. A larger window allows you to give the AI more context, usually improving answer quality—but Tokens are also more expensive.

Top-P (Nucleus Sampling)

Another parameter controlling diversity. Top-P=0.1 means only selecting from the top 10% of Tokens by cumulative probability. It's generally recommended to keep Top-P=1 and only adjust Temperature—adjusting both simultaneously can easily produce unpredictable results.
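Here is a minimal sketch of the nucleus-sampling filter itself, assuming a hypothetical probability list: keep the most likely Tokens until their cumulative probability reaches p, then renormalize so sampling happens only within that set.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of Tokens whose cumulative probability
    reaches p, then renormalize their probabilities."""
    indexed = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in indexed:
        kept.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break                      # nucleus reached; discard the tail
    total = sum(prob for _, prob in kept)
    return {idx: prob / total for idx, prob in kept}

probs = [0.5, 0.3, 0.15, 0.05]         # hypothetical Token probabilities
print(top_p_filter(probs, 0.8))        # keeps only Tokens 0 and 1
```

Because Top-P and Temperature both reshape the same distribution, adjusting both at once makes the combined effect hard to reason about, hence the advice to fix Top-P and tune only Temperature.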

What Can LLMs Do? What Can't They Do?

Understanding the principles of LLMs allows us to define their capabilities boundaries—crucial for writing good prompts:

Areas of Strength

**Language Understanding and Generation**: Translation, summarization, rewriting, expansion, style conversion—this is the core capability of LLMs.

**Code Generation and Debugging**: Generating code from natural language descriptions, explaining code, finding bugs. Top models now rival experienced human developers on standard programming benchmarks.

**Creative Writing**: Stories, poetry, ad copy, scripts. Good at imitating various writing styles and tones.

**Information Integration and Analysis**: Reading large volumes of documents and extracting key information, generating structured summaries, comparative analysis.

**Logical Reasoning**: With sufficiently good prompt guidance, can perform multi-step logical reasoning, causal analysis, and decision support.

Areas of Weakness

**Real-time Information**: Training data has a cutoff date. If you ask about recent events, the model may not know or give outdated information. Solution: Use RAG (Retrieval-Augmented Generation) or web search functionality.

**Precise Mathematical Calculation**: LLMs are 'predicting text' not truly 'calculating'; multi-digit multiplication, complex equation solving often produce errors. Solution: Have the LLM call a code interpreter or calculator tool.

**Long-term Memory**: LLMs have no long-term memory across conversations. Each new conversation starts from scratch. Even within the context window, information in the middle of a very long input may be 'forgotten' by the model (known as the 'lost-in-the-middle' problem).

**100% Accuracy**: LLMs 'hallucinate'—confidently fabricating non-existent facts. This is not a bug, but an inherent limitation of the prediction mechanism.

Important Reminder

The hallucination problem of LLMs is something Prompt Engineers must always be vigilant about. For critical information involving data, law, medicine, finance, etc., always perform manual verification. A good Prompt Engineer knows how to design prompts to reduce hallucination risk (e.g., asking the model to cite sources, say 'I don't know' when uncertain).

Comparison of Mainstream LLMs

Understanding the characteristics of different models helps in choosing the most suitable tool:

**GPT-4o (OpenAI)**: Strong overall capabilities, excellent tool calling and coding abilities, richest ecosystem. Suitable for most scenarios.

**Claude 3.5 Sonnet (Anthropic)**: Outstanding long-text analysis and coding abilities, leading safety design, 200K ultra-long context. Suitable for long document processing and scenarios requiring high safety.

**Gemini 1.5 Pro (Google)**: Ultra-long context window (1M Tokens), strong multimodal capabilities, deep integration with Google ecosystem. Suitable for processing ultra-long documents and video content.

**DeepSeek V3 (DeepSeek)**: Outstanding reasoning abilities, excellent performance in Chinese, open-source and deployable. Suitable for scenarios requiring local deployment or with limited budgets.

**Qwen 2.5 (Alibaba)**: Top-tier Chinese language capabilities, open-source and free for commercial use, many parameter scale options. Suitable for Chinese-dominant application scenarios.

Caution

Don't blindly believe in the 'strongest model'. Different models have their own advantages in different tasks. A good Prompt Engineer selects the most appropriate (not the most expensive) model based on task characteristics. For example, simple text classification can be done with GPT-4o mini, no need for GPT-4o.

How Does This Relate to Prompt Engineering?

Understanding that the essence of an LLM is a 'text predictor' clarifies why prompts are so important—the prompt you give it determines its prediction direction. Prompt Engineering is essentially 'guiding the model's prediction direction through carefully designed input to obtain optimal output'.

A good prompt is like a good exam question: clear, specific, with context. Vague questions get vague answers, just like a poorly worded exam question leaves students confused.

Comparison of Bad vs. Good Prompts

**Bad**: 'Write an article' → The model doesn't know what topic, length, style, or audience, resulting in random and generic output.

**Good**: 'You are a tech journalist with 10 years of experience. Please write an 800-word in-depth analysis for readers of 36Kr on the impact of AI on the accounting industry. It must include data support and specific cases, with a professional but not obscure tone.' → The model has a clear direction, and output quality will be several times higher.

**Why is it good?** Because the good prompt contains 5 key pieces of information: Role (tech journalist), Audience (36Kr readers), Format (800-word in-depth analysis), Topic (AI's impact on accounting), Requirements (data + cases + tone). This information helps the model significantly narrow the prediction space, thereby generating more precise content.
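The five pieces of information can also be assembled programmatically, which is handy once you reuse the same prompt structure across tasks. This is a minimal sketch with illustrative field names, not a standard template.

```python
# Assemble a prompt from the five components named above: role,
# audience, format, topic, and requirements. Field names are illustrative.
def build_prompt(role: str, audience: str, fmt: str,
                 topic: str, requirements: str) -> str:
    return (
        f"You are {role}. "
        f"Write {fmt} for {audience} on {topic}. "
        f"Requirements: {requirements}"
    )

prompt = build_prompt(
    role="a tech journalist with 10 years of experience",
    audience="readers of 36Kr",
    fmt="an 800-word in-depth analysis",
    topic="the impact of AI on the accounting industry",
    requirements="include data support and specific cases; professional but not obscure tone",
)
print(prompt)
```

Filling each slot explicitly forces you to decide the role, audience, format, topic, and requirements before sending the prompt, which is exactly what narrows the model's prediction space.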

The Value of Prompt Engineering

Prompt Engineering is currently the AI skill with the highest return on investment, for three reasons:

**Zero Barrier to Entry**: No programming foundation or math background required. You can start learning if you can type.

**Immediate Results**: You can apply a technique to your work immediately after learning it. Unlike learning programming, which requires long accumulation before producing results.

**High Versatility**: No matter your industry—marketing, law, finance, education, design—prompt skills can improve your efficiency in using AI.

**High Career Value**: According to LinkedIn 2025 data, positions requiring Prompt Engineering skills offer salaries 15-25% higher on average than similar roles. This skill is spreading from being a standalone role to an essential skill for almost all knowledge workers.

Practical Advice

The best way to learn Prompt Engineering is 'learning by doing'. Starting today, consciously optimize your prompts every time you use an AI tool, comparing the effects before and after optimization. Within a week, you'll feel significant progress.

In the next chapter, we'll move into practical application, learning the three foundational techniques of role setting, Few-shot prompting, and format control—these three techniques alone can increase your AI efficiency by at least 3 times.

Three Stages of LLM Training

Pre-training (Read Extensively / Build Knowledge) → SFT Supervised Fine-tuning (Do Exercises / Learn to Answer) → RLHF Alignment (Grade Homework / Learn Safety)

LLM Capability Boundaries

Strengths: Language Understanding + Code Generation + Creative Writing + Information Analysis | Weaknesses: Real-time Info + Precise Calculation + Long-term Memory + 100% Accuracy

Temperature vs. Output Quality Relationship

Temperature 0 (Most Deterministic / Repetitive) → 0.3 (Accurate / Suitable for Code) → 0.7 (Natural / Suitable for Copywriting) → 1.0 (Diverse / Suitable for Creativity)
