Guide to Setting Up a Proxmox Cluster and High Availability (HA)

What Are Proxmox Cluster and High Availability?

A Proxmox cluster lets you manage multiple Proxmox nodes from a single interface, share storage, and live-migrate VMs between nodes without downtime. High Availability (HA) automatically restarts VMs on another node when their original node fails, which minimizes downtime and supports business continuity.

Prerequisites

At least 3 nodes (recommended, so the cluster keeps quorum when one node fails)
Enough disk space on each node for VMs and, if you use Ceph, for the OSDs
A dedicated network for Corosync (cluster communication)
SSH key-based authentication between the nodes
Synchronized system clocks (NTP)

Step 1: Prepare the Network

# On each node, configure a dedicated network for the cluster (optional)
# /etc/network/interfaces

auto eno2
iface eno2 inet static
    # Node 1 (use 10.10.10.12/24 on Node 2 and 10.10.10.13/24 on Node 3)
    # Comments must sit on their own line; ifupdown does not support end-of-line comments
    address 10.10.10.11/24

# Firewall rules for the cluster
# Allow UDP 5405-5412 (Corosync), TCP 22 (SSH), TCP 8006 (web UI)
iptables -A INPUT -p udp --dport 5405:5412 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 8006 -j ACCEPT
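
The per-node stanza differs only in the last octet of the address, so it can be generated instead of edited by hand. The following is a minimal sketch, assuming the addressing scheme used in this guide (Node N gets host address 10+N on a /24) and the interface name eno2. The function name render_cluster_iface is hypothetical and not part of Proxmox.

```shell
# Print an /etc/network/interfaces stanza for the cluster NIC of node N.
# Assumption: node N uses host address 10+N on the given /24 prefix.
render_cluster_iface() {
    iface="$1"      # interface name, e.g. eno2
    prefix="$2"     # first three octets, e.g. 10.10.10
    node="$3"       # node number: 1, 2, 3, ...
    printf 'auto %s\n' "$iface"
    printf 'iface %s inet static\n' "$iface"
    printf '    address %s.%d/24\n' "$prefix" "$((10 + node))"
}

# Example: the stanza for Node 2
render_cluster_iface eno2 10.10.10 2
```

Review the printed stanza before appending it to /etc/network/interfaces on the matching node.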

Step 2: Create the Cluster

# On Node 1 (the first node)
# Create the cluster with a name and this node's own Corosync link address
pvecm create vnhte-cluster --link0 10.10.10.11
# For a redundant link, add --link1 with Node 1's own address on a second, separate network

# Check the cluster status
pvecm status

# Fields to look for in the output:
# Nodes: 1
# Quorate: Yes
# Quorum: 1

Step 3: Add Nodes to the Cluster

# On Node 2 and Node 3:
# First install a fresh Proxmox VE (a joining node must not hold any VMs or containers) and log in to its web UI

# On Node 2 - Web UI:
# Datacenter > Cluster > Join Cluster
# Paste the Join Information copied from Node 1 (Datacenter > Cluster > Join Information), then enter Node 1's root password

# Or via CLI on Node 2 (you are prompted for Node 1's root password):
pvecm add 10.10.10.11 --link0 10.10.10.12

# Verify on Node 1:
pvecm status
# Fields to look for after Node 2 has joined:
# Nodes: 2
# Quorate: Yes
# Quorum: 2
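
When joining nodes from a script, it helps to check for quorum before moving on to the next node. The sketch below assumes that the status output contains a line of the form "Quorate: Yes", which is how pvecm status reports it on current releases. The helper name is_quorate is hypothetical.

```shell
# Succeed (exit 0) if the status text on stdin reports a quorate cluster.
# Assumption: the output contains a line of the form "Quorate:   Yes".
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Intended use on a cluster node (not executed here):
#   pvecm status | is_quorate && echo "cluster is quorate"
```

Because the helper only reads stdin, it can be tried out on saved output before it is wired to a live cluster.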

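Step 4: Configure the HA Manager

# The HA stack ships with Proxmox VE as the pve-ha-manager package; install it only if it is missing
apt install pve-ha-manager -y

# Check the HA state (HA becomes active once the first resource is added)
ha-manager status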

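# Restrict resources to a group of nodes (optional; GROUP_NAME is a placeholder)
# Newer Proxmox VE releases replace HA groups with HA rules, so check the documentation for your version
# /etc/pve/ha/groups.cfg
group: GROUP_NAME
    nodes node1,node2,node3
# Or create the same group from the CLI
ha-manager groupadd GROUP_NAME --nodes node1,node2,node3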

# Create an HA resource (VM)
ha-manager add vm:100 --state started

# Verify the resource
ha-manager status
# Expected: service vm:100 (node1, started)
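
To put several VMs under HA at once, the command above can be generated per VM ID. This is a minimal sketch that only prints the commands so they can be reviewed first. The function name ha_add_cmds and the VM IDs 100 to 102 are illustrative assumptions.

```shell
# Print one "ha-manager add" command per VM ID; nothing is executed.
# Non-numeric arguments are skipped with a warning on stderr.
ha_add_cmds() {
    for vmid in "$@"; do
        case "$vmid" in
            ''|*[!0-9]*) echo "skipping invalid VM ID: $vmid" >&2; continue ;;
        esac
        printf 'ha-manager add vm:%s --state started\n' "$vmid"
    done
}

# Example: print the commands for three VMs
ha_add_cmds 100 101 102
```

Once the output looks right, pipe it to sh on a cluster node to apply it.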

Step 5: Configure Shared Storage

HA needs shared storage so that a VM can run on any node. Proxmox supports several types:

NFS Storage

# On Node 1 - add the NFS storage
# Datacenter > Storage > Add > NFS

# Or via CLI:
pvesm add nfs shared-storage \
    --server 192.168.1.50 \
    --export /export/path \
    --content rootdir,images

# NFS storage is always treated as shared, so no extra flag is needed; confirm that it is active
pvesm status

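Ceph Storage

# Install Ceph (run on every node)
pveceph install

# Initialize the Ceph configuration (once, on the first node)
pveceph init --network 10.10.10.0/24

# Create a monitor and a manager (create monitors on 3 nodes for redundancy)
pveceph mon create
pveceph mgr create

# Create OSDs (on every node, one per data disk)
pveceph osd create /dev/sdb

# Create a pool for VMs
pveceph pool create VMStorage --pg_num 32

# Add the pool as RBD storage
pvesm add rbd VMStorage \
    --content images,rootdir \
    --pool VMStorage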
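
When sizing the disks for the OSDs, keep in mind that a replicated pool stores every object several times. The sketch below estimates usable capacity under the assumption of a replicated pool with 3 copies (the usual default) and equally sized OSDs. The function name ceph_usable_gb and the disk sizes are hypothetical.

```shell
# Rough usable capacity of a replicated Ceph pool, in GB.
# Assumption: all OSDs have the same size and the pool keeps "replicas" copies.
ceph_usable_gb() {
    osds="$1"          # total number of OSDs in the cluster
    gb_per_osd="$2"    # size of each OSD in GB
    replicas="$3"      # pool size (number of copies)
    echo $(( osds * gb_per_osd / replicas ))
}

# Example: 3 nodes with one 1000 GB OSD each, 3 replicas
ceph_usable_gb 3 1000 3
```

This is an upper bound; in practice you need to leave free space so Ceph can rebalance after a disk or node failure.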

Step 6: Test HA Failover

# Show the state of HA resources
ha-manager status

# List the resources managed by HA
ha-manager config

# Move a VM to another node (live migration); replace <vmid> and <target-node>
qm migrate <vmid> <target-node> --online

# Simulate a node failure (actually power off a node)
# Once the failed node has been fenced, its VMs start on a remaining node; expect a minute or two rather than a few seconds

# Check the logs if something goes wrong
journalctl -u pve-ha-lrm
journalctl -u pve-ha-crm
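
To measure how long a failover actually takes in your environment, you can poll the HA status until the VM shows up on another node. This is a minimal sketch of a generic polling helper; the name wait_until is hypothetical, and the status pattern in the usage comment assumes the "service vm:100 (node2, started)" output format.

```shell
# Poll a command once per second until it succeeds or the timeout (seconds) expires.
# Returns 0 on success, 1 on timeout.
wait_until() {
    timeout="$1"; shift
    elapsed=0
    until "$@"; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$(( elapsed + 1 ))
    done
    return 0
}

# Intended use during a failover test (not executed here; vm:100 and node2 are examples):
#   start=$(date +%s)
#   wait_until 600 sh -c 'ha-manager status | grep -q "vm:100 (node2, started)"'
#   echo "failover took $(( $(date +%s) - start )) seconds"
```

Run the measurement from a node that stays up, and start the timer at the moment you power off the test node.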

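Quorum and Fencing

Quorum ensures that only the part of the cluster holding a majority of votes may change cluster state. Fencing ensures that a failed node is really down before its VMs are restarted elsewhere, so the same VM never runs on two nodes at once (split-brain).

# Check quorum
pvecm status

# Expected with 2 nodes: Quorum: 2 (both nodes must be up)
# Expected with 3 nodes: Quorum: 2 (majority)

# Temporarily lower the expected votes (emergency recovery only, e.g. when a single node is left)
pvecm expected 1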

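# Fencing - watchdog timer
# Proxmox HA falls back to the softdog kernel module when no hardware watchdog is configured
modprobe softdog

# To use a hardware watchdog instead, set its module name in /etc/default/pve-ha-manager
WATCHDOG_MODULE=<module-name>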
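
The quorum values quoted above follow from a simple majority rule: more than half of all votes must be present. The sketch below works through that arithmetic, assuming one vote per node. The function names quorum_needed and failures_tolerated are hypothetical.

```shell
# Votes needed for quorum: a strict majority of all votes.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# How many votes can be lost while the remainder still holds quorum.
failures_tolerated() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for votes in 2 3 5; do
    echo "votes=$votes quorum=$(quorum_needed "$votes") tolerated=$(failures_tolerated "$votes")"
done
```

With 2 votes the cluster tolerates no failure at all, while a third vote (a third node or an external QDevice) lets it survive the loss of one.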

Troubleshooting Common Issues

Problem	Cause	Fix
Nodes cannot see each other	Firewall blocks cluster traffic	Open UDP 5405-5412, TCP 22 and TCP 8006
Quorum lost	Network partition	Check the cluster network; use at least 3 nodes so a majority can survive
HA VM does not start	No shared storage	Configure shared storage first
Split-brain VMs	No fencing	Configure watchdog fencing
Cluster join failed	Hostname conflict	Give every node a unique hostname

Best Practices for Production

Minimum 3 nodes: guarantees a majority quorum
Network redundancy: use a bond or multiple links for Corosync
Shared storage: Ceph, or NFS with redundancy
Watchdog fencing: required to avoid split-brain
Regular testing: test failover periodically to confirm it still works

FAQ: Frequently Asked Questions

Can I run a 2-node cluster? Yes, but it needs an extra quorum vote, typically a QDevice on a third machine, to avoid split-brain.
Does the cluster need internet access? No, the local network is enough; you only need an NTP source for clock synchronization.
How do I remove a node from the cluster? Move its VMs elsewhere and shut the node down, then run pvecm delnode nodename on a remaining node.
Does HA work with local storage? No, HA requires shared storage so that a VM can run on any node.
How long does failover take? Usually a minute or two, depending on configuration and storage speed, because the failed node must be fenced before its VMs are restarted.