Linux Kernel Log FAQ
Changing the dmesg buffer size
- Kernel parameter
log_buf_len=n[KMG]
- Default: 2^CONFIG_LOG_BUF_SHIFT bytes
- CONFIG_LOG_CPU_MAX_BUF_SHIFT
- admin-guide/kernel-parameters.txt
- https://elinux.org/Debugging_by_printing
# CONFIG_LOG_BUF_SHIFT=14 -> 16K
cat /boot/config-lts | grep LOG_BUF_SHIFT
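As a quick check of the shift arithmetic, the sketch below converts a CONFIG_LOG_BUF_SHIFT value into a buffer size in bytes. The helper name `log_buf_bytes` is made up for illustration.

```shell
# log_buf_bytes SHIFT: size in bytes of a log buffer built with the given shift.
log_buf_bytes() {
    echo $(( 1 << $1 ))
}

log_buf_bytes 14   # 16384  (16K)
log_buf_bytes 17   # 131072 (128K)
```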
TCP: too many orphaned sockets
Usually caused by memory exhaustion
# three values (min, pressure, max), in pages
# pages * 4K = actual memory usage
cat /proc/sys/net/ipv4/tcp_mem
cat /proc/sys/net/ipv4/tcp_max_orphans
/proc/sys/net/ipv4/tcp_max_orphans
Maximal number of TCP sockets not attached to any user file handle,
held by system. If this number is exceeded orphaned connections are
reset immediately and warning is printed. This limit exists only to
prevent simple DoS attacks, you _must_ not rely on this or lower the
limit artificially, but rather increase it (probably, after increasing
installed memory), if network conditions require more than default value,
and tune network services to linger and kill such states more aggressively.
Let me remind you again: each orphan eats up to 64K of unswappable memory.
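To put these numbers in perspective, here is a rough sketch of the arithmetic. It assumes a 4K page size, as in the comment above (check with `getconf PAGESIZE`), and takes the 64K-per-orphan figure from the quoted text as a worst case. Both helper names are hypothetical.

```shell
# pages_to_mib PAGES [PAGE_SIZE]: convert a page count to MiB (integer division).
# PAGE_SIZE defaults to 4096 bytes, which is an assumption about the platform.
pages_to_mib() {
    echo $(( $1 * ${2:-4096} / 1024 / 1024 ))
}

# orphans_worst_case_mib COUNT: upper bound in MiB at 64 KiB per orphaned socket.
orphans_worst_case_mib() {
    echo $(( $1 * 64 / 1024 ))
}

# On a live system:
# for p in $(cat /proc/sys/net/ipv4/tcp_mem); do pages_to_mib "$p"; done
# orphans_worst_case_mib "$(cat /proc/sys/net/ipv4/tcp_max_orphans)"
```

For example, a tcp_max_orphans of 16384 corresponds to at most 1024 MiB of unswappable memory.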
EXT4-fs error (device sde2): comm containerd: bad entry in directory: inode out of bounds
EXT4-fs error (device sde2): htree_dirblock_to_tree:1106: inode #399580: block 1582282: comm containerd: bad entry in directory: inode out of bounds - offset=140, inode=4593891, rec_len=20, size=4096 fake=0
traps: tmux: server[5422] general protection fault ip:7f3fbcbb80be sp:7fff1eeff140 error:0 in ld-musl-x86_64.so.1[7f3fbcba6000+4c000]
traps: apk[21618] trap stack segment ip:7f862d16cd85 sp:7ffdf388efb0 error:0 in libapk.so.2.14.0[7f862d16b000+1a000]
TCP: request_sock_TCP: Possible SYN flooding on port 20247. Sending cookies. Check SNMP counters
# >= 2048
sysctl net.core.somaxconn
# >= 512
sysctl net.ipv4.tcp_max_syn_backlog
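The two checks above can be wrapped in a small helper. This is only a sketch: the thresholds are the ones suggested in the comments above, and `check_min` is a hypothetical name.

```shell
# check_min NAME VALUE MIN: report whether VALUE meets the suggested minimum.
check_min() {
    if [ "$2" -ge "$3" ]; then
        echo "$1=$2 ok"
    else
        echo "$1=$2 low (suggested >= $3)"
    fi
}

# On a live system, read the current values with `sysctl -n`:
# check_min net.core.somaxconn "$(sysctl -n net.core.somaxconn)" 2048
# check_min net.ipv4.tcp_max_syn_backlog "$(sysctl -n net.ipv4.tcp_max_syn_backlog)" 512
```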
HP HPSA Driver
EDAC MC0: 1 UE UE overwrote CE on any memory
- MC0 - memory controller #0
- CE - Correctable Errors
- UE - Uncorrectable Errors
- EDAC - Error Detection and Correction, for memory errors
- csrowX - Chip-Select Row
- chX - Channel table
Memory error messages
EDAC MC0: 1 UE ie31200 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x0 offset:0x0 grain:8)
EDAC MC0: 1 UE UE overwrote CE on any memory ( page:0x0 offset:0x0 grain:8)
- /sys/devices/system/edac
- mc/ - memory controller system
- pci/
lsmod | grep edac
ie31200_edac 16384 0
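The error counters can also be read from sysfs. The sketch below assumes each controller directory under mc/ exposes `ce_count` and `ue_count` files, as described in the kernel's EDAC documentation. The helper name `edac_counts` and its optional root argument are illustrative only.

```shell
# edac_counts [ROOT]: print correctable/uncorrectable error counts per memory controller.
# ROOT defaults to the EDAC sysfs directory; pass another path for testing.
edac_counts() {
    root=${1:-/sys/devices/system/edac/mc}
    for mc in "$root"/mc*; do
        # Skip entries without counters (also covers an unmatched glob).
        [ -r "$mc/ce_count" ] || continue
        echo "${mc##*/} ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
    done
}

# On a live system:
# edac_counts
```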
Disabling error logging
echo 0 > /sys/module/edac_core/parameters/edac_mc_log_ce
# pci_parity_count
echo "1" > /sys/devices/system/edac/pci/check_pci_parity
Invalid ELF header magic
Invalid ELF header magic: != \x7fELF
Disk failure
sd 0:0:8:0: [sdh] tag#383 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
sd 0:0:8:0: [sdh] tag#383 Sense Key : 0x4 [current]
sd 0:0:8:0: [sdh] tag#383 ASC=0x15 ASCQ=0x1
sd 0:0:8:0: [sdh] tag#383 CDB: opcode=0x2a 2a 00 3b aa 6b c0 00 00 a8 00
blk_update_request: I/O error, dev sdh, sector 1001024448 op 0x1:(WRITE) flags 0x700 phys_seg 17 prio class 0
zio pool=data vdev=/dev/disk/by-id/scsi-35000c5008953f263-part1 error=5 type=2 offset=512523468800 size=86016 flags=40080c80
hpsa 0000:03:00.0: scsi 0:0:8:0: resetting physical Direct-Access SEAGATE ST600MP0005 PHYS DRV SSDSmartPathCap- En- Exp=1
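The CDB line in the log above can be decoded by hand to cross-check the failing sector. A minimal sketch, assuming the standard 10-byte READ(10)/WRITE(10) layout, where bytes 2-5 hold the big-endian LBA and bytes 7-8 the transfer length in blocks. `decode_cdb10` is a hypothetical helper name.

```shell
# decode_cdb10 "B0 B1 ... B9": decode a 10-byte READ(10)/WRITE(10) CDB given as hex bytes.
# Prints "<lba> <blocks>" in decimal.
decode_cdb10() {
    # Split the single string argument into ten positional parameters.
    set -- $1
    [ "$#" -eq 10 ] || { echo "expected 10 bytes" >&2; return 1; }
    lba=$(( 0x$3$4$5$6 ))     # bytes 2-5: logical block address
    blocks=$(( 0x$8$9 ))      # bytes 7-8: transfer length
    echo "$lba $blocks"
}

decode_cdb10 '2a 00 3b aa 6b c0 00 00 a8 00'   # 1001024448 168
```

The decoded LBA 1001024448 is the sector reported by blk_update_request, and 168 blocks of 512 bytes (an assumed block size) is 86016 bytes, the size shown in the zio line.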
AER: Corrected error received: 0000:04:00.0
mpt3sas 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
mpt3sas 0000:04:00.0:  device [1000:0087] error status/mask=00000001/00002000
mpt3sas 0000:04:00.0:  [ 0] RxErr (First)
pcieport 0000:00:02.2: AER: Multiple Corrected error received: 0000:04:00.0
pci=nomsi pci=noaer pcie_aspm=off
Write Protect is on
- The USB flash drive is damaged and has entered write-protect mode
- If the disk is actually healthy, try clearing the read-only flag
hdparm -r0 /dev/sdc
usb 2-3: new SuperSpeed Gen 1 USB device number 11 using xhci_hcd
usb 2-3: New USB device found, idVendor=0781, idProduct=5583, bcdDevice= 1.00
usb 2-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 2-3: Product: Ultra Fit
usb 2-3: Manufacturer: SanDisk
usb 2-3: SerialNumber: 4C530001180206120545
usb-storage 2-3:1.0: USB Mass Storage device detected
scsi host7: usb-storage 2-3:1.0
scsi 7:0:0:0: Direct-Access SanDisk Ultra Fit 1.00 PQ: 0 ANSI: 6
sd 7:0:0:0: [sdc] 60063744 512-byte logical blocks: (30.8 GB/28.6 GiB)
sd 7:0:0:0: [sdc] Write Protect is on
sd 7:0:0:0: [sdc] Mode Sense: 43 00 80 00
sd 7:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sdc: sdc1 sdc2 sdc3
sd 7:0:0:0: [sdc] Attached SCSI removable disk
Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
- DPO - Disable Page Out
- caching hint that indicates the data referenced by the command is not likely to be accessed again and therefore is not a good candidate to keep or maintain within cache.
- FUA - Force Unit Access
- caching hint that indicates the data should be referenced directly from the media of the device. That is cache should be bypassed for this command.
- References
rcu_sched detected stalls on CPUs/tasks
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: 1-....: (1 GPs behind) idle=ad1/1/0x4000000000000000 softirq=22550431/22550433 fqs=2
 (detected by 3, t=18024 jiffies, g=33604573, q=182)
Sending NMI from CPU 3 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 2394 Comm: z_wr_iss Tainted: P W O 5.15.16-0-lts #1-Alpine
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C232D2I, BIOS P2.20 07/20/2017
RIP: 0010:raidz_copy_abd_cb+0x20/0x90 [zfs]
Code: 39 f0 72 c3 31 c0 c3 0f 1f 00 0f 1f 44 00 00 48 89 d1 48 c1 e9 05 74 75 48 83 ee 80 48 83 ef 80 31 c0 48 8d 56 80 c5 fd 6f 02 <c5> fd 6f 4a 20 c5 fd 6f 52 40 c5 fd 6f 5a 60 48 8d 57 80 c5 fd 7f
RSP: 0000:ffffb4168094fab8 EFLAGS: 00000083
RAX: 0000000000002368 RBX: 0000000000056000 RCX: 0000000000002b00
RDX: ffffb416de69dd00 RSI: ffffb416de69dd80 RDI: ffffb416b0a93d80
RBP: ffffb4168094fb28 R08: 0000000000056000 R09: ffffffffc1b646d0
R10: 0000000000000002 R11: 0000000000056000 R12: ffffb4168094faf8
R13: 0000000000056000 R14: ffff953bf2c8fb60 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff95410fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c002aff000 CR3: 0000000301026001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 abd_iterate_func2+0x1ec/0x340 [zfs]
 ? raidz_zero_abd_cb+0x60/0x60 [zfs]
 avx2_gen_p+0x40/0x90 [zfs]
 vdev_raidz_math_generate+0x4b/0x70 [zfs]
 vdev_raidz_generate_parity_row+0x30/0x440 [zfs]
 ? vdev_raidz_map_alloc+0x2f4/0x390 [zfs]
 vdev_raidz_io_start+0x1fb/0x320 [zfs]
 zio_vdev_io_start+0x109/0x350 [zfs]
 zio_nowait+0xc5/0x1b0 [zfs]
 vdev_mirror_io_start+0xa2/0x250 [zfs]
 zio_vdev_io_start+0x2d3/0x350 [zfs]
 zio_execute+0x83/0x120 [zfs]
 taskq_thread+0x2d0/0x500 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_gang_tree_free+0x60/0x60 [zfs]
 ? taskq_thread_spawn+0x50/0x50 [spl]
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
 </TASK>
rcu: rcu_sched kthread timer wakeup didn't happen for 2506 jiffies! g33604573 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
rcu: Possible timer handling issue on cpu=2 timer-softirq=4398823
rcu: rcu_sched kthread starved for 2508 jiffies! g33604573 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_sched state:I stack: 0 pid: 14 ppid: 2 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x31f/0x14e0
 ? lock_timer_base+0x61/0x80
 ? __mod_timer+0x170/0x3e0
 schedule+0x44/0xa0
 schedule_timeout+0x95/0x140
 ? __bpf_trace_tick_stop+0x10/0x10
 rcu_gp_fqs_loop+0x100/0x320
 rcu_gp_kthread+0xab/0x140
 ? rcu_gp_init+0x4a0/0x4a0
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
 </TASK>
ACPI Error: No handler for Region POWR
The errors stop after loading acpi_ipmi
# try loading the modules
modprobe ipmi_si
modprobe acpi_ipmi
ACPI Error: No handler for Region [POWR] (00000000a03df149) [IPMI] (20190816/evregion-127)
ACPI Error: Region IPMI (ID=7) has no handler (20190816/exfldio-261)
ACPI Error: Aborting method _SB.PMI0._PMM due to previous error (AE_NOT_EXIST) (20190816/psparse-529)
ACPI Error: AE_NOT_EXIST, Evaluating _PMM (20190816/power_meter-325)
L1TF CPU bug present and SMT on, data leak possible
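- Only a warning: the CPU has Hyper-Threading/SMT enabled
- SMT can be disabled in the BIOS, but this is not recommended
- Linux enables mitigations by default (mitigations=auto); consider mitigations=off to improve performance, at the cost of security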
- mitigations
L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
ext4 filesystem being mounted at /boot supports timestamps until 2038 (0x7fffffff)
- Increase the ext4 inode size to get past the year 2038 limit
- inode size 128 -> inode size 256
- When creating the filesystem
mkfs.ext4 -I 256 /dev/sda1
- Checking an existing filesystem
dev=$(findmnt /boot -no SOURCE)
tune2fs -l "$dev" | grep "Inode size:"
# Inode size: 128
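The check above can be scripted. The sketch below parses `tune2fs -l` output from stdin and applies the rule of thumb from these notes (128-byte inodes are limited to 2038, 256-byte inodes are not). Both helper names are hypothetical.

```shell
# inode_size: read `tune2fs -l` output on stdin and print the inode size in bytes.
inode_size() {
    awk -F: '/^Inode size:/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# supports_post_2038 SIZE: succeed if the inode size is at least 256 bytes.
# This follows the 128 -> 256 rule of thumb above, not the kernel's exact check.
supports_post_2038() {
    [ "$1" -ge 256 ]
}

# On a live system:
# size=$(tune2fs -l "$(findmnt /boot -no SOURCE)" | inode_size)
# supports_post_2038 "$size" || echo "inode size $size: timestamps end in 2038"
```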
device reported invalid CHS sector
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/f8:f8:d0:01:72/05:00:18:00:00/40 tag 31 ncq dma 782336 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: configured for UDMA/133
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
sd 0:0:0:0: [sda] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06 cmd_age=94s
sd 0:0:0:0: [sda] tag#18 CDB: opcode=0x2a 2a 00 18 6e 11 a8 00 05 c8 00
blk_update_request: I/O error, dev sda, sector 409866664 op 0x1:(WRITE) flags 0x0 phys_seg 94 prio class 0
Disk failure accompanied by filesystem errors.
smartctl -a /dev/sda
FS-Cache: Duplicate cookie detected
- Caused by NFS
- Little practical impact
- References
ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length
- Possibly related to ACPI power monitoring
- On HP servers, this may be caused by HP's ACPI implementation not conforming to the standard
ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180810/exfield-393)
ACPI Error: Method parse/execution failed _SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180810/psparse-516)
ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180810/power_meter-338)
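# if lm_sensors is in use,
# the power reading should show 0 at this point
sensors
Disable power monitoring in the sensors configuration
/etc/sensors3.conf
chip "power_meter-acpi-0"
ignore power1
# try disabling the power meter module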
echo "blacklist acpi_power_meter" >> /etc/modprobe.d/hwmon.conf
ext4 filesystem being remounted at /newroot/run/redis supports timestamps until 2038 (0x7fffffff)
- Warning about ext4 timestamp support
FW version command failed -5
mei 0000:00:16.0-56213584-9a29-4916-badf-0fb7ed682aeb: Could not read FW version
mei 0000:00:16.0-56213584-9a29-4916-badf-0fb7ed682aeb: FW version command failed -5
EDAC DEBUG: ie31200_check: MC0
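- Memory problem; try replacing the memory.
- If the board is dual-channel but only one DIMM is installed, try populating the second channel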
pstore: crypto_comp_decompress failed, ret = -22!
pstore: crypto_comp_decompress failed, ret = -22!
pstore: decompression failed: -22
- fs/pstore/platform.c#L280
- Related to this directory
/sys/fs/pstore/
- Related to kernel upgrades
- See: pstore: crypto_comp_decompress failed
# run in a root shell; with sudo the glob is not expanded
rm /sys/fs/pstore/dmesg*