跳到主要内容

Linux Kernel 日志常见问题

修改 dmesg 大小

# CONFIG_LOG_BUF_SHIFT=14 -> 16K
cat /boot/config-lts | grep LOG_BUF_SHIFT

TCP: too many orphaned sockets

一般内存满了导致

# x x pages
# pages*4k 实际内存使用
cat /proc/sys/net/ipv4/tcp_mem
cat /proc/sys/net/ipv4/tcp_max_orphans
/proc/sys/net/ipv4/tcp_max_orphans

Maximal number of TCP sockets not attached to any user file handle,
held by system. If this number is exceeded orphaned connections are
reset immediately and warning is printed. This limit exists only to
prevent simple DoS attacks, you _must_ not rely on this or lower the
limit artificially, but rather increase it (probably, after increasing
installed memory), if network conditions require more than default value,
and tune network services to linger and kill such states more aggressively.

Let me remind you again: each orphan eats up to 64K of unswappable memory.

EXT4-fs error (device sde2): comm containerd: bad entry in directory: inode out of bounds

EXT4-fs error (device sde2): htree_dirblock_to_tree:1106: inode #399580: block 1582282: comm containerd: bad entry in directory: inode out of bounds - offset=140, inode=4593891, rec_len=20, size=4096 fake=0
traps: tmux: server[5422] general protection fault ip:7f3fbcbb80be sp:7fff1eeff140 error:0 in ld-musl-x86_64.so.1[7f3fbcba6000+4c000]
traps: apk[21618] trap stack segment ip:7f862d16cd85 sp:7ffdf388efb0 error:0 in libapk.so.2.14.0[7f862d16b000+1a000]

TCP: request_sock_TCP: Possible SYN flooding on port 20247. Sending cookies. Check SNMP counters

# >= 2048
sysctl net.core.somaxconn
# >= 512
sysctl net.ipv4.tcp_max_syn_backlog

HP HPSA Driver

EDAC MC0: 1 UE UE overwrote CE on any memory

  • MC0 为 #0 内存条
  • CE - Correctable Errors
  • UE - Uncorrectable Errors
  • EDAC - Error Detection and Correction - 内存错误检测和矫正
  • csrowX - Chip-Select Row
  • chX - Channel table

内存异常信息

EDAC MC0: 1 UE ie31200 UE on mc#0csrow#0channel#1 (csrow:0 channel:1 page:0x0 offset:0x0 grain:8)
EDAC MC0: 1 UE UE overwrote CE on any memory ( page:0x0 offset:0x0 grain:8)
  • /sys/devices/system/edac
    • mc/ - memory controller system
    • pci/
lsmod | grep edac
ie31200_edac           16384  0

关闭 Log 异常信息

echo 0 > /sys/module/edac_core/parameters/edac_mc_log_ce
# pci_parity_count
echo "1" > /sys/devices/system/edac/pci/check_pci_parity

Invalid ELF header magic

Invalid ELF header magic: != \x7fELF

磁盘损坏

sd 0:0:8:0: [sdh] tag#383 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
sd 0:0:8:0: [sdh] tag#383 Sense Key : 0x4 [current]
sd 0:0:8:0: [sdh] tag#383 ASC=0x15 ASCQ=0x1
sd 0:0:8:0: [sdh] tag#383 CDB: opcode=0x2a 2a 00 3b aa 6b c0 00 00 a8 00
blk_update_request: I/O error, dev sdh, sector 1001024448 op 0x1:(WRITE) flags 0x700 phys_seg 17 prio class 0
zio pool=data vdev=/dev/disk/by-id/scsi-35000c5008953f263-part1 error=5 type=2 offset=512523468800 size=86016 flags=40080c80
hpsa 0000:03:00.0: scsi 0:0:8:0: resetting physical Direct-Access SEAGATE ST600MP0005 PHYS DRV SSDSmartPathCap- En- Exp=1

AER: Corrected error received: 0000:04:00.0

mpt3sas 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
mpt3sas 0000:04:00.0:  device [1000:0087] error status/mask=00000001/00002000
mpt3sas 0000:04:00.0:  [ 0] RxErr (First)
pcieport 0000:00:02.2: AER: Multiple Corrected error received: 0000:04:00.0
pci=nomsi pci=noaer pcie_aspm=off

Write Protect is on

  • USB Flash Driver 已损坏,进入写保护模式
  • 如果是正常的磁盘,可以尝试关闭 hdparm -r0 /dev/sdc
usb 2-3: new SuperSpeed Gen 1 USB device number 11 using xhci_hcd
usb 2-3: New USB device found, idVendor=0781, idProduct=5583, bcdDevice= 1.00
usb 2-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 2-3: Product: Ultra Fit
usb 2-3: Manufacturer: SanDisk
usb 2-3: SerialNumber: 4C530001180206120545
usb-storage 2-3:1.0: USB Mass Storage device detected
scsi host7: usb-storage 2-3:1.0
scsi 7:0:0:0: Direct-Access SanDisk Ultra Fit 1.00 PQ: 0 ANSI: 6
sd 7:0:0:0: [sdc] 60063744 512-byte logical blocks: (30.8 GB/28.6 GiB)
sd 7:0:0:0: [sdc] Write Protect is on
sd 7:0:0:0: [sdc] Mode Sense: 43 00 80 00
sd 7:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sdc: sdc1 sdc2 sdc3
sd 7:0:0:0: [sdc] Attached SCSI removable disk

Write cache: disabled, read cache: enabled, doesn't support DPO or FUA

  • DPO - Disable Page Out
    • caching hint that indicates the data referenced by the command is not likely to be accessed again and therefore is not a good candidate to keep or maintain within cache.
  • FUA - Force Unit Access
    • caching hint that indicates the data should be referenced directly from the media of the device. That is cache should be bypassed for this command.
  • 参考

rcu_sched detected stalls on CPUs/tasks

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: 1-....: (1 GPs behind) idle=ad1/1/0x4000000000000000 softirq=22550431/22550433 fqs=2
 (detected by 3, t=18024 jiffies, g=33604573, q=182)
Sending NMI from CPU 3 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 2394 Comm: z_wr_iss Tainted: P W O 5.15.16-0-lts #1-Alpine
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C232D2I, BIOS P2.20 07/20/2017
RIP: 0010:raidz_copy_abd_cb+0x20/0x90 [zfs]
Code: 39 f0 72 c3 31 c0 c3 0f 1f 00 0f 1f 44 00 00 48 89 d1 48 c1 e9 05 74 75 48 83 ee 80 48 83 ef 80 31 c0 48 8d 56 80 c5 fd 6f 02 <c5> fd 6f 4a 20 c5 fd 6f 52 40 c5 fd 6f 5a 60 48 8d 57 80 c5 fd 7f
RSP: 0000:ffffb4168094fab8 EFLAGS: 00000083
RAX: 0000000000002368 RBX: 0000000000056000 RCX: 0000000000002b00
RDX: ffffb416de69dd00 RSI: ffffb416de69dd80 RDI: ffffb416b0a93d80
RBP: ffffb4168094fb28 R08: 0000000000056000 R09: ffffffffc1b646d0
R10: 0000000000000002 R11: 0000000000056000 R12: ffffb4168094faf8
R13: 0000000000056000 R14: ffff953bf2c8fb60 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff95410fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c002aff000 CR3: 0000000301026001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 abd_iterate_func2+0x1ec/0x340 [zfs]
 ? raidz_zero_abd_cb+0x60/0x60 [zfs]
 avx2_gen_p+0x40/0x90 [zfs]
 vdev_raidz_math_generate+0x4b/0x70 [zfs]
 vdev_raidz_generate_parity_row+0x30/0x440 [zfs]
 ? vdev_raidz_map_alloc+0x2f4/0x390 [zfs]
 vdev_raidz_io_start+0x1fb/0x320 [zfs]
 zio_vdev_io_start+0x109/0x350 [zfs]
 zio_nowait+0xc5/0x1b0 [zfs]
 vdev_mirror_io_start+0xa2/0x250 [zfs]
 zio_vdev_io_start+0x2d3/0x350 [zfs]
 zio_execute+0x83/0x120 [zfs]
 taskq_thread+0x2d0/0x500 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_gang_tree_free+0x60/0x60 [zfs]
 ? taskq_thread_spawn+0x50/0x50 [spl]
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
 </TASK>
rcu: rcu_sched kthread timer wakeup didn't happen for 2506 jiffies! g33604573 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
rcu: Possible timer handling issue on cpu=2 timer-softirq=4398823
rcu: rcu_sched kthread starved for 2508 jiffies! g33604573 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_sched state:I stack: 0 pid: 14 ppid: 2 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x31f/0x14e0
 ? lock_timer_base+0x61/0x80
 ? __mod_timer+0x170/0x3e0
 schedule+0x44/0xa0
 schedule_timeout+0x95/0x140
 ? __bpf_trace_tick_stop+0x10/0x10
 rcu_gp_fqs_loop+0x100/0x320
 rcu_gp_kthread+0xab/0x140
 ? rcu_gp_init+0x4a0/0x4a0
 kthread+0x127/0x150
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
 </TASK>

ACPI Error: No handler for Region POWR

添加 acpi_ipmi 后异常停止

# 尝试添加 module
modprobe ipmi_si
modprobe acpi_ipmi
ACPI Error: No handler for Region [POWR] (00000000a03df149) [IPMI] (20190816/evregion-127)
ACPI Error: Region IPMI (ID=7) has no handler (20190816/exfldio-261)
ACPI Error: Aborting method _SB.PMI0._PMM due to previous error (AE_NOT_EXIST) (20190816/psparse-529)
ACPI Error: AE_NOT_EXIST, Evaluating _PMM (20190816/power_meter-325)

L1TF CPU bug present and SMT on, data leak possible

  • 只是警告,CPU 有 Hyper-Threading/SMT 特性
  • 可以在 BIOS 关闭 SMT - 但不建议
  • Linux 默认开启了 mitigations=on - 可以考虑关闭以提高性能
  • mitigations
L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.

ext4 filesystem being mounted at /boot supports timestamps until 2038 (0x7fffffff)

  • 提高 ext4 inode size 以克服 2038y 问题
  • inode size 128 -> inode size 256
    • 初始化分区时 mkfs.ext4 -I 256 /dev/sda1
dev=$(findmnt /boot -no SOURCE)
tune2fs -l $dev | grep "Inode size:"
# Inode size: 128

device reported invalid CHS sector

ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/f8:f8:d0:01:72/05:00:18:00:00/40 tag 31 ncq dma 782336 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: configured for UDMA/133
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
sd 0:0:0:0: [sda] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06 cmd_age=94s
sd 0:0:0:0: [sda] tag#18 CDB: opcode=0x2a 2a 00 18 6e 11 a8 00 05 c8 00
blk_update_request: I/O error, dev sda, sector 409866664 op 0x1:(WRITE) flags 0x0 phys_seg 94 prio class 0

磁盘异常并伴随 fs 错误。

smartctl -a /dev/sda

ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length

ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180810/exfield-393)
ACPI Error: Method parse/execution failed _SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180810/psparse-516)
ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180810/power_meter-338)
# 如果使用了 lm_sensors
# 此时的电源显示应该为 0
sensors

配置关闭电源监控

/etc/sensors3.conf

chip "power_meter-acpi-0"
ignore power1
# 尝试关闭电源监控
echo "blacklist acpi_power_meter" >> /etc/modprobe.d/hwmon.conf

ext4 filesystem being remounted at /newroot/run/redis supports timestamps until 2038 (0x7fffffff)

  • 警告 ext4 时间支持问题

FW version command failed -5

mei 0000:00:16.0-56213584-9a29-4916-badf-0fb7ed682aeb: Could not read FW version
mei 0000:00:16.0-56213584-9a29-4916-badf-0fb7ed682aeb: FW version command failed -5

EDAC DEBUG: ie31200_check: MC0

  • 内存问题,尝试更换内存。
  • 如果是双通道,但是只有一根内存条,尝试补齐

pstore: crypto_comp_decompress failed, ret = -22!

pstore: crypto_comp_decompress failed, ret = -22!
pstore: decompression failed: -22
# root 执行 - sudo 不会展开
rm /sys/fs/pstore/dmesg*

pcieport 0000:00:1c.7: AER: Corrected error received: 0000:00:1c.7

定位问题

[519708.849337] pcieport 0000:00:1c.7: AER: Corrected error received: 0000:00:1c.7
[519708.849346] pcieport 0000:00:1c.7: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[519708.849349] pcieport 0000:00:1c.7: AER: device [8086:a297] error status/mask=00000001/00002000
[519708.849352] pcieport 0000:00:1c.7: AER: [ 0] RxErr
apk add pciutils
lspci -vs 0000:00:1c.7
00:1c.7 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #8 (rev f0) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 124
Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
I/O behind bridge: [disabled]
Memory behind bridge: df100000-df1fffff [size=1M]
Prefetchable memory behind bridge: [disabled]
Capabilities: <access denied>
Kernel driver in use: pcieport

perf: interrupt took too long

  • Linux perf 日志
  • 对系统没有影响,可以理解为在自动调整处理频率
[109932.035738] perf: interrupt took too long (2511 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[110540.025443] perf: interrupt took too long (3146 > 3138), lowering kernel.perf_event_max_sample_rate to 63300
[111374.568374] perf: interrupt took too long (3935 > 3932), lowering kernel.perf_event_max_sample_rate to 50700
[112979.009891] perf: interrupt took too long (4927 > 4918), lowering kernel.perf_event_max_sample_rate to 40500
[121152.410414] perf: interrupt took too long (6159 > 6158), lowering kernel.perf_event_max_sample_rate to 32400

ata1.00: exception Emask 0x0 SAct 0x80800000 SErr 0x0 action 0x6

硬盘异常

ata1.00: exception Emask 0x0 SAct 0x1e020 SErr 0x0 action 0x6
ata1.00: irq_stat 0x40000008
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/88:28:78:2d:07/00:00:3f:00:00/40 tag 5 ncq dma 69632 in
res 41/84:88:c0:2d:07/00:00:3f:00:00/00 Emask 0x410 (ATA bus error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { ICRC ABRT }
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: configured for UDMA/33
sd 0:0:0:0: [sda] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda] tag#5 Sense Key : 0xb [current]
sd 0:0:0:0: [sda] tag#5 ASC=0x47 ASCQ=0x0
sd 0:0:0:0: [sda] tag#5 CDB: opcode=0x28 28 00 3f 07 2d 78 00 00 88 00
blk_update_request: I/O error, dev sda, sector 1057435000 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 0
ata1: EH complete

Longhorn iSCSI 异常后日志

ata1.00: exception Emask 0x0 SAct 0x140180 SErr 0x0 action 0x6
ata1.00: irq_stat 0x40000008
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/20:40:00:37:c7/00:00:47:00:00/40 tag 8 ncq dma 16384 in
res 41/84:20:00:37:c7/00:00:47:00:00/00 Emask 0x410 (ATA bus error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { ICRC ABRT }
ata1: hard resetting link
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: configured for UDMA/133
sd 0:0:0:0: [sda] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda] tag#8 Sense Key : 0xb [current]
sd 0:0:0:0: [sda] tag#8 ASC=0x47 ASCQ=0x0
sd 0:0:0:0: [sda] tag#8 CDB: opcode=0x28 28 00 47 c7 37 00 00 00 20 00
blk_update_request: I/O error, dev sda, sector 1204238080 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
ata1: EH complete
scsi host4: iSCSI Initiator over TCP/IP
scsi 4:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
scsi 4:0:0:1: Direct-Access IET VIRTUAL-DISK 0001 PQ: 0 ANSI: 5
sd 4:0:0:1: Power-on or device reset occurred
sd 4:0:0:1: [sdb] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
sd 4:0:0:1: [sdb] Write Protect is off
sd 4:0:0:1: [sdb] Mode Sense: 69 00 10 08
sd 4:0:0:1: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
sd 4:0:0:1: [sdb] Attached SCSI disk
session2: session recovery timed out after 120 secs
sd 5:0:0:1: rejecting I/O to offline device
blk_update_request: I/O error, dev sdc, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
sd 5:0:0:1: rejecting I/O to offline device
blk_update_request: I/O error, dev sdc, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
sd 5:0:0:1: rejecting I/O to offline device
blk_update_request: I/O error, dev sdc, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
sd 5:0:0:1: rejecting I/O to offline device
blk_update_request: I/O error, dev sdc, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
sd 5:0:0:1: rejecting I/O to offline device
blk_update_request: I/O error, dev sdc, sector 21241888 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
buffer_io_error: 24 callbacks suppressed
Buffer I/O error on dev sdc, logical block 2655236, lost sync page write
JBD2: Error -5 detected when updating journal superblock for sdc-8.
Aborting journal on device sdc-8.
sd 5:0:0:1: rejecting I/O to offline device
blk_update_request: I/O error, dev sdc, sector 21241888 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
Buffer I/O error on dev sdc, logical block 2655236, lost sync page write
JBD2: Error -5 detected when updating journal superblock for sdc-8.
sd 5:0:0:1: rejecting I/O to offline device
blk_update_request: I/O error, dev sdc, sector 0 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
Buffer I/O error on dev sdc, logical block 0, lost sync page write
EXT4-fs: 24 callbacks suppressed
EXT4-fs (sdc): I/O error while writing superblock

Unable to access opcode bytes at RIP 0xa906

apk[5410]: segfault at a930 ip 000000000000a930 sp 00007fff2eb32cd8 error 14 in apk[55df206ac000+4000]
Code: Unable to access opcode bytes at RIP 0xa906.

BTRFS critical (device dm-2): corrupt leaf

只能离线 btrfs check,访问局部文件出现错误,但是整体文件系统还是可以访问的。

  • RAID 5
md2: [Self Heal] Retry sector [3253080] round [1/3] choose d-disk
md2: [Self Heal] Retry sector [3253080] round [1/3] finished: get same result, retry next round
md2: [Self Heal] Retry sector [3253080] round [2/3] start: sh-sector [1626584], d-disk [2:sdj3], p-disk [0:sdh3], q-disk [1:sdi3]
md2: [Self Heal] Retry sector [3253080] round [2/3] choose p-disk
md2: [Self Heal] Retry sector [3253080] round [2/3] finished: get same result, retry next round
md2: [Self Heal] Retry sector [3253080] round [3/3] start: sh-sector [1626584], d-disk [2:sdj3], p-disk [0:sdh3], q-disk [1:sdi3]
md2: [Self Heal] Retry sector [3253080] round [3/3] choose q-disk
md2: [Self Heal] Retry sector [3253080] round [3/3] finished: get same result, give up
BTRFS warning (device dm-2): corrupt leaf fixed, bad key order, block=570261504, root=264, slot=161
BTRFS critical (device dm-2): corrupt leaf: root=264 block=570261504 slot=161 ino=69311, name hash mismatch with key, have 0x000000001dc4dc92 expect 0x000000001e00be13
BTRFS error (device dm-2): cannot fix 570261504, record in meta_err

watchdog: BUG: soft lockup

HAProxy crash

Mon May 1 07:02:03 2023

[754398.348700] watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [haproxy:1889]
[754398.688670] Modules linked in: sch_fq tcp_bbr ipv6 af_packet tun crct10dif_pclmul ghash_clmulni_intel mousedev aesni_intel crypto_simd cryptd evdev psmouse qemu_fw_cfg button virtio_net net_failover failover virtio_scsi crc32_pclmul bochs drm_vram_helper drm_ttm_helper ttm simpledrm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm fb fbdev i2c_core drm_panel_orientation_quirks firmware_class loop ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2 usb_storage usbcore usb_common sd_mod t10_pi
[754398.688753] CPU: 0 PID: 1889 Comm: haproxy Not tainted 5.15.103-0-virt #1-Alpine
[754398.688761] Hardware name: Linode Compute Instance, BIOS Not Specified
[754398.688763] RIP: 0033:0x7f12519916b7
[754398.688774] Code: 0e 8b 71 18 85 f6 75 0b 8b 49 1c 85 c9 75 04 48 83 c2 03 48 83 fa 0d 0f 43 c5 48 63 d0 48 63 e8 49 8b 5c d7 50 48 85 db 74 2d <8b> 43 18 89 c2 f7 da 21 c2 74 22 29 d0 89 43 18 89 d0 f7 d8 21 d0
[754398.688777] RSP: 002b:00007ffe2c3201e0 EFLAGS: 00000202
[754398.688779] RAX: 000000000000000e RBX: 0000561d34aaf540 RCX: 0000000000000018
[754398.688781] RDX: 0000000000000019 RSI: 00007f12519f53c0 RDI: 000000000000017d
[754398.688782] RBP: 000000000000000e R08: 00007f1250c9c010 R09: 00007f1250dc2c78
[754398.688783] R10: 00007f12512bc0e0 R11: 0000000000000000 R12: 000000000000017a
[754398.688784] R13: 0000000000000153 R14: 00007f1251a03fdc R15: 00007f1251a01b00
[754398.688785] FS: 00007f1251466880 GS: 0000000000000000

Sat May 13 04:16:27 2023

[1781262.070997] watchdog: BUG: soft lockup - CPU#0 stuck for 29s! [kworker/u2:2:7241]
[1781272.465968] Modules linked in: sch_fq tcp_bbr ipv6 af_packet tun crct10dif_pclmul ghash_clmulni_intel mousedev aesni_intel crypto_simd cryptd evdev psmouse qemu_fw_cfg button virtio_net net_failover failover virtio_scsi crc32_pclmul bochs drm_vram_helper drm_ttm_helper ttm simpledrm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm fb fbdev i2c_core drm_panel_orientation_quirks firmware_class loop ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2 usb_storage usbcore usb_common sd_mod t10_pi
[1781272.753556] CPU: 0 PID: 7241 Comm: kworker/u2:2 Tainted: G L 5.15.103-0-virt #1-Alpine
[1781272.753561] Hardware name: Linode Compute Instance, BIOS Not Specified
[1781272.753574] Workqueue: 0x0 (events_unbound)
[1781272.753580] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x30
[1781272.756309] Code: 88 cc cc cc cc cc cc cc cc c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 75 0b 31 c0 31 f6 31 ff e9 36 8b 20 00 fb 66 0f 1f 44 00 00 <31> c0 31 f6 31 ff e9 24 8b 20 00 0f 1f 40 00 8b 07 a9 ff 01 00 00
[1781272.756324] RSP: 0018:ffffaf59c057bd60 EFLAGS: 00000206
[1781272.756327] RAX: 0000000000000000 RBX: ffff8ab7c1173d00 RCX: 0000000000000000
[1781272.756329] RDX: 0000000000000000 RSI: 0000000000000283 RDI: ffff8ab7c1174534
[1781272.756330] RBP: ffff8ab7fec2bb00 R08: 0000000000000000 R09: 0000000000000000
[1781272.756332] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[1781272.757089] R13: 0000000000000283 R14: ffff8ab7c1174534 R15: 0000000000000000
[1781272.757093] FS: 0000000000000000(0000) GS:ffff8ab7fec00000(0000) knlGS:0000000000000000
[1781272.757095] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1781272.757097] CR2: 00007fffc2444e18 CR3: 000000000442a000 CR4: 0000000000150eb0
[1781272.757101] Call Trace:
[1781272.757107] <TASK>
[1781272.757109] try_to_wake_up+0x1ea/0x450
[1781272.757128] ? process_one_work+0x360/0x360
[1781272.757131] __kthread_create_on_node+0xda/0x1b0
[1781272.757138] kthread_create_on_node+0x44/0x70
[1781272.757141] create_worker+0xef/0x1e0
[1781272.757145] worker_thread+0x2d9/0x3c0
[1781272.757147] ? process_one_work+0x360/0x360
[1781272.757149] kthread+0x113/0x130
[1781272.757151] ? set_kthread_struct+0x50/0x50
[1781272.757154] ret_from_fork+0x22/0x30
[1781272.757161] </TASK>

mlx4

ConnectX2 异常

pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: 0000:00:02.0
pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:00:02.0: device [8086:0e04] error status/mask=00000020/00318000
pcieport 0000:00:02.0: [ 5] SDES (First)
mlx4_core 0000:04:00.0: mlx4_pci_err_detected was called
mlx4_core 0000:04:00.0: device is going to be reset
mlx4_core 0000:04:00.0: crdump: FW doesn't support health buffer access, skipping
mlx4_core 0000:04:00.0: device was reset successfully
mlx4_en 0000:04:00.0: Internal error detected, restarting device
<mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error was started
<mlx4_ib> mlx4_ib_handle_catas_error: mlx4_ib_handle_catas_error ended
mlx4_en: eth4: Close port called
mlx4_core 0000:04:00.0: Fail to set mac in port 1 during unregister
pcieport 0000:00:02.0: AER: Root Port link has been reset (0)
mlx4_core 0000:04:00.0: mlx4_pci_slot_reset was called
mlx4_core 0000:04:00.0: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.0: mlx4_pci_resume was called
mlx4_core 0000:04:00.0: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
mlx4_en 0000:04:00.0: Activating port:1
mlx4_en: 0000:04:00.0: Port 1: Using 24 TX rings
mlx4_en: 0000:04:00.0: Port 1: Using 16 RX rings
mlx4_en: 0000:04:00.0: Port 1: Initializing port
<mlx4_ib> mlx4_ib_add: counter index 1 for port 1 allocated 1
pcieport 0000:00:02.0: AER: device recovery successful

TODO

mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627000: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
snd_hda_codec_hdmi hdaudioC0D2: Monitor plugged-in, Failed to power up codec ret=[-13]

mlx4_core Internal error detected

[  149.148339] mlx4_core 0000:82:00.0: Internal error detected:
[ 149.148368] mlx4_core 0000:82:00.0: buf[00]: ffffffff
[ 149.148375] mlx4_core 0000:82:00.0: buf[01]: ffffffff
[ 149.148398] mlx4_core 0000:82:00.0: buf[02]: ffffffff
[ 149.148421] mlx4_core 0000:82:00.0: buf[03]: ffffffff
[ 149.148427] mlx4_core 0000:82:00.0: buf[04]: ffffffff
[ 149.148433] mlx4_core 0000:82:00.0: buf[05]: ffffffff
[ 149.148478] mlx4_core 0000:82:00.0: buf[06]: ffffffff
[ 149.148483] mlx4_core 0000:82:00.0: buf[07]: ffffffff
[ 149.148489] mlx4_core 0000:82:00.0: buf[08]: ffffffff
[ 149.148494] mlx4_core 0000:82:00.0: buf[09]: ffffffff
[ 149.148500] mlx4_core 0000:82:00.0: buf[0a]: ffffffff
[ 149.148523] mlx4_core 0000:82:00.0: buf[0b]: ffffffff
[ 149.148546] mlx4_core 0000:82:00.0: buf[0c]: ffffffff
[ 149.148568] mlx4_core 0000:82:00.0: buf[0d]: ffffffff
[ 149.148574] mlx4_core 0000:82:00.0: buf[0e]: ffffffff
[ 149.148579] mlx4_core 0000:82:00.0: buf[0f]: ffffffff
[ 149.148603] mlx4_core 0000:82:00.0: device is going to be reset
[ 149.148607] mlx4_core 0000:82:00.0: crdump: FW doesn't support health buffer access, skipping

[Firmware Bug]: TSC_DEADLINE disabled due to Errata: please update microcode to version: 0x52 (or later)

[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 30)

EDAC sbridge: Failed to register device with error -19.

The NVM Checksum Is Not Valid

e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
e1000e 0000:00:1f.6: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
e1000e 0000:00:1f.6: The NVM Checksum Is Not Valid
e1000e: probe of 0000:00:1f.6 failed with error -5

lpc_ich: Resource conflict(s) found affecting gpio_ich

[    5.019593] ACPI Warning: SystemIO range 0x0000000000001C00-0x0000000000001C2F conflicts with OpRegion 0x0000000000001C00-0x0000000000001FFF (\GPR) (20190816/utaddress-204)
[ 5.019594] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver


[ 4.297023] wmi_bus wmi_bus-PNP0C14:00: WQBC data block query control method not found

[ 0.172443] pmd_set_huge: Cannot satisfy [mem 0xf8000000-0xf8200000] with a huge-page mapping due to MTRR override.

[ 0.172443] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    0.165743] MDS: Mitigation: Clear CPU buffers
[ 0.165864] Freeing SMP alternatives memory: 28K
[ 0.166679] smpboot: CPU0: Intel(R) Xeon(R) CPU E3-1265L v3 @ 2.50GHz (family: 0x6, model: 0x3c, stepping: 0x3)
[ 0.166754] Performance Events: PEBS fmt2+, Haswell events, 16-deep LBR, full-width counters, Intel PMU driver.
[ 0.166766] ... version: 3
[ 0.166767] ... bit width: 48
[ 0.166767] ... generic registers: 4
[ 0.166768] ... value mask: 0000ffffffffffff
[ 0.166768] ... max period: 00007fffffffffff
[ 0.166768] ... fixed-purpose events: 3
[ 0.166769] ... event mask: 000000070000000f
[ 0.166792] rcu: Hierarchical SRCU implementation.
[ 0.167184] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 0.167241] smp: Bringing up secondary CPUs ...
[ 0.167292] x86: Booting SMP configuration:
[ 0.167292] .... node #0, CPUs: #1 #2 #3 #4
[ 0.167683] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.167683] #5 #6 #7
[ 0.169202] smp: Brought up 1 node, 8 CPUs
[ 0.169202] smpboot: Max logical packages: 1
[ 0.169202] ----------------
[ 0.169202] | NMI testsuite:
[ 0.169202] --------------------
[ 0.169202] remote IPI: ok |
[ 0.169202] local IPI: ok |
[ 0.169202] --------------------
[ 0.169202] Good, all 2 testcases passed! |
[ 0.169202] ---------------------------------
[ 0.169202] smpboot: Total of 8 processors activated (39924.80 BogoMIPS)
Buffer I/O error on dev sdb, logical block 2655236, lost sync page write
JBD2: Error -5 detected when updating journal superblock for sdb-8.
Aborting journal on device sdb-8.
Buffer I/O error on dev sdb, logical block 2655236, lost sync page write
JBD2: Error -5 detected when updating journal superblock for sdb-8.
sd 3:0:0:1: [sdc] Synchronizing SCSI cache