
ZFS FAQ

Tip

abbr.   stands for
SPA     Storage Pool Allocator
vdev    Virtual Device
ZIL     ZFS Intent Log
TXG     Transaction Group
SLOG    Separate Intent Log
ARC     Adaptive Replacement Cache
L2ARC   Level 2 ARC
zfs get all | grep -E 'used\b|logicalused|compression|\bcompress'

zfs get all | grep -E 'sync'

How to choose between RAIDZ, mirror, and dRAID

  • RAIDZ - striped vdevs - comparable to RAID5/6/7
    • ~66% usable space at typical widths
      • 3-wide RAIDZ1
      • 6-wide RAIDZ2
      • 9-wide RAIDZ3
    • N*W RAIDZx layout
      • N groups (vdevs)
      • W disks wide each
    • cannot be expanded (or only awkwardly)
    • fixed parity
  • mirror - comparable to RAID10
    • 50% usable space
    • better performance while degraded
    • fast resilver
    • easy to expand
  • dRAID
    • more flexible
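
A minimal creation sketch for each layout (pool and disk names are hypothetical):

# 6-wide RAIDZ2
zpool create tank raidz2 sda sdb sdc sdd sde sdf
# striped mirrors (RAID10-like); grows by adding another mirror pair
zpool create tank mirror sda sdb mirror sdc sdd
# dRAID2 with default layout; see zpoolconcepts(7) for the full draid syntax
zpool create tank draid2 sda sdb sdc sdd sde sdf sdg sdh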


Repair: replacing a disk

# -t takes the device offline only temporarily; it comes back after a reboot
zpool offline main scsi-0000
zpool replace main scsi-0000 scsi-1111

# add -e (expand) if the new disk is larger
zpool online main scsi-1111

resilver

  • scans all disks in the vdev group
  • can be very slow
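
Resilver progress can be watched with zpool status (pool name from the example above):

zpool status -v main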

raidz1 to raidz2

Not possible.

Checking actual size

# size as stored on disk (after compression)
du -h .
# apparent (uncompressed) size
du --apparent-size -h .

Very slow when a directory contains many files

Try disabling atime.
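
For example, on a hypothetical dataset data/files:

zfs set atime=off data/files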

cannot create '/data/db': pool must be upgraded to set this property or value

sudo zpool upgrade -a

Calculating space usage

  • compressratio - the compression ratio
    • 1/compressratio = the compressed fraction of the original size
    • compressratio = logicalused / used
  • used - physical space consumed
  • logicalused - logical (uncompressed) space
  • space consumption also depends on when compression was enabled
    • only data written after compression is enabled gets compressed
  • allocations are aligned, so physical usage can exceed the logical size
zfs get all | grep -E 'used\b|logicalused|compression|\bcompress'
data  used           884G   -
data  compressratio  1.47x  -
data  compression    lz4    local
data  logicalused    1.24T  -
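
To recompute the ratio from exact byte counts instead of the rounded sizes above (a sketch, assuming the dataset is named data):

zfs get -Hp -o value used,logicalused data | paste - - | awk '{printf "%.2fx\n", $2 / $1}'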

zfs compression vs application compression

  • ZFS compression
    • compresses everything transparently; simple to use
    • the compression ratio is affected by block size
    • supports lz4 and zstd
  • application compression
    • depends on whether the application supports it
    • covers a different scope than ZFS compression
      • applications usually compress only the data itself
    • its ratio is not necessarily higher than ZFS's

  • zfs vs pg (see the psql sketch after this list)
    • PostgreSQL 14 supports LZ4 TOAST compression
      • default_toast_compression=lz4
      • can also be set per column at table creation: col1 text COMPRESSION lz4
    • PostgreSQL 15 supports LZ4 WAL compression
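
A minimal sketch of the PostgreSQL side (requires a server built with LZ4 support; the table and column names are illustrative):

# PostgreSQL 14+: LZ4 for TOAST
psql -c "ALTER SYSTEM SET default_toast_compression = 'lz4';"
psql -c "CREATE TABLE t (col1 text COMPRESSION lz4);"

# PostgreSQL 15+: LZ4 for WAL
psql -c "ALTER SYSTEM SET wal_compression = 'lz4';"
psql -c "SELECT pg_reload_conf();"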

ZFS caches

  • ZIL - ZFS Intent Log - buffers WRITE operations
  • SLOG - Separate Intent Log
    • zpool add tank log
    • does not need a large device - e.g. a 16G or 64G SSD is enough
  • ARC - caches READ operations - Adaptive Replacement Cache
    • held in memory
  • L2ARC
    • zpool add tank cache
    • does not need a large device - e.g. a 128G SSD
    • the cache remains usable after a reboot
zpool add tank log ada3             # add a ZIL (SLOG) - single disk
zpool add tank log mirror ada3 ada4 # add a ZIL (SLOG) - mirrored - data in flight survives the loss of one SSD
zpool add tank cache ada3           # add L2ARC
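
To check how effective ARC/L2ARC are (a Linux sketch; arc_summary ships with OpenZFS):

arc_summary | head -n 40
grep -E '^(hits|misses|l2_hits|l2_misses)' /proc/spl/kstat/zfs/arcstats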

Estimating ZFS performance

Tuning should start by finding where the bottleneck is.

  • RAIDZn sequential 4KB reads - no cache
    • RAIDZ1 - N/(N-1) * IOPS
    • RAIDZ2 - N/(N-2) * IOPS
    • RAIDZ3 - N/(N-3) * IOPS
    • with a cache, the ceiling becomes the cache disk's IOPS
  • write performance
    • cannot be estimated directly; ZFS batches writes through the ZIL asynchronously
    • an extra ZIL (SLOG) device can improve write performance
    • by default, space is reserved on each disk to hold the ZIL
  • factors that affect performance (see the tuning sketch after this list)
    • recordsize - default 128k
    • compression
    • ashift
    • dedup - off by default - deduplication only helps in special workloads
    • atime - on by default - usually unneeded; disabling it improves read performance
    • logbias - default latency; can be set to throughput to reduce use of a separate ZIL device
    • sync
      • disabling it loses at most ~30s of data - no impact if the workload can tolerate that loss
      • if a UPS guarantees the storage will not lose power unexpectedly, disabling sync can be considered
    • primarycache
    • secondarycache
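
An illustrative pass over those knobs on a hypothetical dataset data/db (validate every change against the real workload):

zfs set recordsize=16k data/db          # match the application's typical I/O size
zfs set compression=lz4 data/db
zfs set atime=off data/db
zfs set logbias=throughput data/db
zfs set primarycache=all data/db
zfs set secondarycache=metadata data/db
zfs set sync=disabled data/db           # only if losing the last seconds of writes is acceptable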

zfs import

  • on a normal boot, pools are imported from the cache file - zpool import -c /etc/zfs/zpool.cache
  • if the cache file is lost, the disks can be scanned directly
    • e.g. after the operating system has been replaced
  • zpool-import.8
# list the pools that can be imported
# (scans block devices, cf. lsblk)
zpool import
# perform the import - import all pools
zpool import -a

# manually specify the directory to search
zpool import -d /dev/disk/by-id

Disable atime everywhere

zfs get atime | grep '\son\s' | cut -d ' ' -f 1 | xargs -n1 sudo zfs set atime=off

Set atime temporarily via mount options

MOUNT_EXTRA_OPTIONS="-o atime=off"

zvol vs zfs

  • zvol - a block device (see the creation sketch after this list)
    • still gets raidz and compression
    • lacks the other capabilities that come with zfs datasets
    • volblocksize=8k by default
  • zfs - a filesystem - dataset
    • snapshots, clones
    • the filesystem brings its own traits - and its own drawbacks
      • main drawback: no renameat2/overlayfs support (added in ZFS 2.2+)
    • recordsize=128k by default
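
A creation sketch for both (dataset names hypothetical):

# zvol: exposes a block device at /dev/zvol/data/vol0
zfs create -V 100G -o volblocksize=8k data/vol0
# regular filesystem dataset
zfs create -o recordsize=128k -o compression=lz4 data/files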

High System Usage

ZFS/SPL kernel threads that commonly show up when the system is busy:

  • z_wr_iss
  • spl_dynamic_tas
  • z_wr_iss_h
  • l2arc_feed
  • z_wr_int_h
  • rcu_sched
  • txg_sync
  • z_ioctl_int
  • kworker/0:1-events
  • z_null_iss
  • z_null_int
  • dp_sync_taskq
  • z_wr_int
  • arc_reap
  • ksoftirqd
  • dbuf_evict
  • mmp
  • migration/0
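
To see which of these threads are actually consuming CPU (a sketch):

ps -eo comm,%cpu --sort=-%cpu | head -n 20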

zfs list slow

  • zfs list becomes very slow once there are many datasets
time zfs list | wc -l
758

real 0m1.777s
user 0m0.177s
sys  0m1.599s

# command used by the docker zfs volume plugin
zfs list -s name -o name,guid,available -H -p
zfs list -r -t all -Hp -o name,origin,used,available,mountpoint,compression,type,volsize,quota,referenced,written,logicalused,usedbydataset main/docker

# command used by containerd's zfs snapshotter
zfs list -Hp -o name,origin,used,available,mountpoint,compression,type,volsize,quota,referenced,written,logicalused,usedbydataset data/var/k3s/snapshotter/60519

time zfs list -s name -o name,guid,available -H -p > zfs-list.txt
real    2m10.183s
user    0m3.016s
sys     2m6.836s

wc -l zfs-list.txt
# 20177 zfs-list.txt
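
Scoping the listing and limiting recursion depth is usually much faster than listing everything (a sketch, dataset name taken from the example above):

time zfs list -d 1 -o name -H main/docker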

ZFS vs Hard RAID

  • ZFS has checksums and can catch problems such as bit flips, whereas RAID mainly protects against the loss of a whole disk
  • ZFS only needs HBAs (host bus adapters), not RAID controllers
  • RAIDZ2 is enough at most; RAIDZ3 is rarely used, can be problematic, and there are other ways to guard against possible failures
  • ZFS is not RAID - it is software, a filesystem
  • ZFS rebuilds faster than RAID: for a 1TB cloud disk that actually holds only 100MB of data, ZFS needs only 100MB of I/O, while RAID needs 1TB
  • scrub is there to keep data safe, not to keep disks healthy; it is not automatic and has to be scheduled (see the cron sketch after this list)
  • extra features
    • carve storage up into datasets as needed
    • tune per application
    • encryption
    • incremental replication
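
A minimal scheduling sketch, assuming a pool named data and a monthly cadence (root crontab entry):

# scrub at 03:00 on the 1st of every month
0 3 1 * * /sbin/zpool scrub data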

z0 is write-protected but explicit read-write mode requested

umount /dev/z0
e2fsck /dev/z0
mount /dev/z0

Superblock needs_recovery flag is clear, but journal has data.

Buffer I/O error on dev zd0, logical block 0, lost async page write

Disk full

zfs list -o space,mountpoint
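
Snapshots are a common culprit; a quick check (sketch):

zfs list -t snapshot -o name,used -s used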

is in use and contains a unknown filesystem

  • the device may already be claimed by mdraid, lvm, or multipath
cat /proc/mdstat

mdadm --stop /dev/md127

Expanding a zvol

zfs get volsize data/vol        # current size
zfs set volsize=500G data/vol   # grow the volume
resize2fs /dev/zvol/data/vol    # grow the filesystem inside

cannot label 'sdf': failed to detect device partitions on '/dev/sdf1': 19

Missing /dev/zvol

apk add zfs zfs-{scripts,udev}

udevadm trigger

cannot trim: no devices in pool support trim operations

zpool trim data

hdparm -I /dev/sda | grep -i trim # check whether the disk supports TRIM
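
If the devices do turn out to support TRIM, trimming can also be automated at the pool level (pool name from above):

zpool set autotrim=on data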

Retry UNAVAIL devices

zpool online data DISK
zpool clear data
zpool scrub data # recommended

remount zvol rw

mount -o remount,rw /data/docker
  • a cache device failure caused the zvol's filesystem to be remounted read-only
  • even after clearing the cache errors it still could not be mounted, because the filesystem was damaged
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 5767264 starting block 14909936)
Buffer I/O error on device zd0, logical block 14909936
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 5898267 starting block 11927556)
Buffer I/O error on device zd0, logical block 11927556
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 5898258 starting block 20496389)
Buffer I/O error on device zd0, logical block 20496389
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 5898266 starting block 2630818)
Buffer I/O error on device zd0, logical block 2630818
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 2919521 starting block 16194810)
Buffer I/O error on device zd0, logical block 16194810
Buffer I/O error on device zd0, logical block 16194811
Buffer I/O error on device zd0, logical block 16194812
Buffer I/O error on device zd0, logical block 16194813
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 2920494 starting block 14332529)
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 2883634 starting block 24493815)
Buffer I/O error on device zd0, logical block 24493815
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 2883634 starting block 24493816)
Buffer I/O error on device zd0, logical block 14332529
Buffer I/O error on dev zd0, logical block 0, lost async page write
Buffer I/O error on dev zd0, logical block 1, lost async page write
Buffer I/O error on dev zd0, logical block 2, lost async page write
EXT4-fs error (device zd0): ext4_check_bdev_write_error:217: comm kworker/u8:0: Error while async write back metadata
EXT4-fs (zd0): previous I/O error to superblock detected
Buffer I/O error on dev zd0, logical block 5, lost async page write
Buffer I/O error on dev zd0, logical block 6, lost async page write
Buffer I/O error on dev zd0, logical block 8, lost async page write
Buffer I/O error on dev zd0, logical block 1048588, lost async page write
Buffer I/O error on dev zd0, logical block 1048589, lost async page write
Buffer I/O error on dev zd0, logical block 1466067, lost async page write
Buffer I/O error on dev zd0, logical block 1505175, lost async page write
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 2883634 starting block 24493838)
EXT4-fs error (device zd0): ext4_check_bdev_write_error:217: comm VM Periodic Tas: Error while async write back metadata
EXT4-fs warning (device zd0): ext4_end_bio:343: I/O error 3 writing to inode 2883634 starting block 24493839)
Aborting journal on device zd0-8.
EXT4-fs error (device zd0) in ext4_convert_unwritten_io_end_vec:4859: IO failure
EXT4-fs (zd0): failed to convert unwritten extents to written extents -- potential data loss! (inode 2883634, error -5)
JBD2: I/O error when updating journal superblock for zd0-8.
EXT4-fs error (device zd0): ext4_journal_check_start:83: comm k3s-server: Detected aborted journal
EXT4-fs (zd0): previous I/O error to superblock detected
EXT4-fs error (device zd0): ext4_journal_check_start:83: comm http-nio-8080-P: Detected aborted journal
EXT4-fs (zd0): I/O error while writing superblock
EXT4-fs (zd0): Remounting filesystem read-only
EXT4-fs (zd0): I/O error while writing superblock
  • stop the service from starting automatically
  • reboot
  • fsck
umount /dev/zd0
fsck -y /dev/zd0
mount -a

# ensure the mount point works as expected
touch /data/docker/test

# start service

zfs destroy container snapshots

zfs list > zfs.txt
# example of a docker-created dataset name: main/1poezhz45yv210xqwve9vft0d
grep -E '^main/\w{25}\W' zfs.txt | cut -f 1 -d ' ' | xargs -n 1 sudo zfs destroy -r -R
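
A dry run first shows what would be destroyed without removing anything (-n with -v; same pattern as above):

grep -E '^main/\w{25}\W' zfs.txt | cut -f 1 -d ' ' | xargs -n 1 sudo zfs destroy -n -v -r -R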

Feature Flags

zpool get all | grep feature@
zpool upgrade -v
async_destroy (read-only compatible)
Destroy filesystems asynchronously.
empty_bpobj (read-only compatible)
Snapshots use less space.
lz4_compress
LZ4 compression algorithm support.
multi_vdev_crash_dump
Crash dumps to multiple vdev pools.
spacemap_histogram (read-only compatible)
Spacemaps maintain space histograms.
enabled_txg (read-only compatible)
Record txg at which a feature is enabled
hole_birth
Retain hole birth txg for more precise zfs send
extensible_dataset
Enhanced dataset functionality, used by other features.
embedded_data
Blocks which compress very well use even less space.
bookmarks (read-only compatible)
"zfs bookmark" command
filesystem_limits (read-only compatible)
Filesystem and snapshot limits.
large_blocks
Support for blocks larger than 128KB.
large_dnode
Variable on-disk size of dnodes.
sha512
SHA-512/256 hash algorithm.
skein
Skein hash algorithm.
edonr
Edon-R hash algorithm.
userobj_accounting (read-only compatible)
User/Group object accounting.
encryption
Support for dataset level encryption
project_quota (read-only compatible)
space/object accounting based on project ID.
device_removal
Top-level vdevs can be removed, reducing logical pool size.
obsolete_counts (read-only compatible)
Reduce memory used by removed devices when their blocks are freed or remapped.
zpool_checkpoint (read-only compatible)
Pool state can be checkpointed, allowing rewind later.
spacemap_v2 (read-only compatible)
Space maps representing large segments are more efficient.
allocation_classes (read-only compatible)
Support for separate allocation classes.
resilver_defer (read-only compatible)
Support for deferring new resilvers when one is already running.
bookmark_v2
Support for larger bookmarks
redaction_bookmarks
Support for bookmarks which store redaction lists for zfs redacted send/recv.
redacted_datasets
Support for redacted datasets, produced by receiving a redacted zfs send stream.
bookmark_written
Additional accounting, enabling the written#<bookmark> property (space written since a bookmark), and estimates of send stream sizes for incrementals from bookmarks.
log_spacemap (read-only compatible)
Log metaslab changes on a single spacemap and flush them periodically.
livelist (read-only compatible)
Improved clone deletion performance.
device_rebuild (read-only compatible)
Support for sequential mirror/dRAID device rebuilds
zstd_compress
zstd compression algorithm support.
draid
Support for distributed spare RAID
zilsaxattr (read-only compatible)
Support for xattr=sa extended attribute logging in ZIL.
head_errlog
Support for per-dataset on-disk error logs.
blake3
BLAKE3 hash algorithm.
block_cloning (read-only compatible)
Support for block cloning via Block Reference Table.
vdev_zaps_v2
Support for root vdev ZAP.