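[Discussion] PVE PCIe passthrough problem: the PVE host hangs after a passthrough PCI device is added
Last edited by guitarbug on 2023-5-23 21:14

Problem:
After a passthrough PCI device is added to a VM, the PVE host hangs. The passed-through device is a PCIe-to-SATA card (ASM1061 chip, reported by lspci as ASM1062).
Configuration:
CPU: E3-1225 V2
Motherboard: Dell 9010 MT, Q77 (modded BIOS that supports booting from a PCIe-to-NVMe adapter).
PCIe: PCIe x16 (empty), PCIe x4 (PCIe-to-NVMe adapter), PCIe x1 (PCIe-to-SATA card). Moving the SATA card to the PCIe x16 slot gives the same result: the host hangs.
PVE: installed on the NVMe drive in the PCIe x4 slot, version 7.3-6.
Memory: 4 × 4 GB.
Boot parameters: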
Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.102-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction
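To confirm that the passthrough-related parameters really reached the running kernel, the command line can be checked directly. The sketch below is only an illustration: the function name check_cmdline is made up here, and it looks only for the three parameters quoted above.

```shell
# Report which passthrough-related parameters appear on a kernel command line.
# With no argument, the running kernel's /proc/cmdline is inspected.
# check_cmdline is a hypothetical helper name, not part of PVE.
check_cmdline() {
    cmdline="${1:-$(cat /proc/cmdline 2>/dev/null)}"
    for want in intel_iommu=on iommu=pt pcie_acs_override; do
        case " $cmdline " in
            # match either the exact token or "name=value"
            *" $want "*|*" $want="*) echo "$want: present" ;;
            *)                        echo "$want: absent" ;;
        esac
    done
}

check_cmdline
```

On a GRUB-booted PVE host these parameters are normally set in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and take effect after update-grub and a reboot.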
root@Q77:~# dmesg | grep -e DMAR -e IOMMU
[ 0.000000] Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
[ 0.013432] ACPI: DMAR 0x00000000D7FFEDB8 0000B8 (v01 INTEL SNB 00000001 INTL 00000001)
[ 0.013459] ACPI: Reserving DMAR table memory at
[ 0.042044] DMAR: IOMMU enabled
[ 0.118620] DMAR: Host address width 36
[ 0.118621] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[ 0.118626] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap c0000020e60262 ecap f0101a
[ 0.118628] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[ 0.118631] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap c9008020660262 ecap f0105a
[ 0.118632] DMAR: RMRR base: 0x000000daf77000 end: 0x000000daf9dfff
[ 0.118634] DMAR: RMRR base: 0x000000db800000 end: 0x000000df9fffff
[ 0.118636] DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 1
[ 0.118637] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[ 0.118638] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 0.119129] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 0.266769] DMAR: No ATSR found
[ 0.266770] DMAR: No SATC found
[ 0.266771] DMAR: IOMMU feature pgsel_inv inconsistent
[ 0.266773] DMAR: IOMMU feature pass_through inconsistent
[ 0.266774] DMAR: dmar0: Using Queued invalidation
[ 0.266779] DMAR: dmar1: Using Queued invalidation
[ 0.333148] DMAR: Intel(R) Virtualization Technology for Directed I/O
[ 3.911191] i915 0000:00:02.0: DMAR active, disabling use of stolen memory
root@Q77:~#
root@Q77:~# dmesg | grep 'remapping'
[ 0.118638] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 0.119129] DMAR-IR: Enabled IRQ remapping in x2apic mode
root@Q77:~#
root@Q77:~# find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/7/devices/0000:00:1c.0
/sys/kernel/iommu_groups/5/devices/0000:00:1a.0
/sys/kernel/iommu_groups/13/devices/0000:02:00.0
/sys/kernel/iommu_groups/3/devices/0000:00:16.0
/sys/kernel/iommu_groups/11/devices/0000:00:1e.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/8/devices/0000:00:1c.2
/sys/kernel/iommu_groups/6/devices/0000:00:1b.0
/sys/kernel/iommu_groups/14/devices/0000:03:00.0
/sys/kernel/iommu_groups/4/devices/0000:00:19.0
/sys/kernel/iommu_groups/12/devices/0000:00:1f.2
/sys/kernel/iommu_groups/12/devices/0000:00:1f.0
/sys/kernel/iommu_groups/12/devices/0000:00:1f.3
/sys/kernel/iommu_groups/2/devices/0000:00:14.0
/sys/kernel/iommu_groups/10/devices/0000:00:1d.0
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/9/devices/0000:00:1c.4
root@Q77:~#
root@Q77:~# lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/Ivy Bridge DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 3 (rev c4)
00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a4)
00:1f.0 ISA bridge: Intel Corporation Q77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)
02:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
03:00.0 Non-Volatile memory controller: Solid State Storage Technology Corporation Device 9100 (rev 03)
root@Q77:~# lspci -vvt
--+-00.0 Intel Corporation Xeon E3-1200 v2/Ivy Bridge DRAM Controller
+-02.0 Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller
+-14.0 Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller
+-16.0 Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1
+-19.0 Intel Corporation 82579LM Gigabit Network Connection (Lewisville)
+-1a.0 Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2
+-1b.0 Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller
+-1c.0---
+-1c.2-----00.0 ASMedia Technology Inc. ASM1062 Serial ATA Controller
+-1c.4-----00.0 Solid State Storage Technology Corporation Device 9100
+-1d.0 Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1
+-1e.0---
+-1f.0 Intel Corporation Q77 Express Chipset LPC Controller
+-1f.2 Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller
-1f.3 Intel Corporation 7 Series/C216 Chipset Family SMBus Controller
root@Q77:~# qm config 104
boot: order=sata0;ide2;net0
cores: 2
description: wqf%0A123456
hostpci0: 0000:02:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 2048
meta: creation-qemu=7.1.0,ctime=1670485787
name: Ubuntu
net0: virtio=22:82:F8:E1:6C:83,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
sata0: local-lvm:vm-104-disk-0,size=64G
scsihw: virtio-scsi-single
smbios1: uuid=4c074828-b68c-4b12-95b7-74ddff8374e8
sockets: 1
vmgenid: 237f068b-8811-4486-9de7-284eac69bb6e
root@Q77:~#
root@Q77:~# ./iommu.sh
Group 0: 00:00.0 Host bridge Xeon E3-1200 v2/Ivy Bridge DRAM Controller
Group 1: 00:02.0 VGA compatible controller Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller
Group 2: 00:14.0 USB controller 7 Series/C210 Series Chipset Family USB xHCI Host Controller
    USB: Bus 001 Device 001 Linux Foundation 2.0 root hub
    USB: Bus 002 Device 001 Linux Foundation 3.0 root hub
Group 3: 00:16.0 Communication controller 7 Series/C216 Chipset Family MEI Controller #1
Group 4: 00:19.0 Ethernet controller 82579LM Gigabit Network Connection (Lewisville)
Group 5: 00:1a.0 USB controller 7 Series/C216 Chipset Family USB Enhanced Host Controller #2
    USB: Bus 003 Device 002 Intel Corp. Integrated Rate Matching Hub
    USB: Bus 003 Device 001 Linux Foundation 2.0 root hub
Group 6: 00:1b.0 Audio device 7 Series/C216 Chipset Family High Definition Audio Controller
Group 7: 00:1c.0 PCI bridge 7 Series/C216 Chipset Family PCI Express Root Port 1
Group 8: 00:1c.2 PCI bridge 7 Series/C210 Series Chipset Family PCI Express Root Port 3
Group 9: 00:1c.4 PCI bridge 7 Series/C210 Series Chipset Family PCI Express Root Port 5
Group 10: 00:1d.0 USB controller 7 Series/C216 Chipset Family USB Enhanced Host Controller #1
    USB: Bus 004 Device 002 Intel Corp. Integrated Rate Matching Hub
    USB: Bus 004 Device 001 Linux Foundation 2.0 root hub
Group 11: 00:1e.0 PCI bridge 82801 PCI Bridge
Group 12: 00:1f.0 ISA bridge Q77 Express Chipset LPC Controller
    00:1f.2 SATA controller 7 Series/C210 Series Chipset Family 6-port SATA Controller
    00:1f.3 SMBus 7 Series/C216 Chipset Family SMBus Controller
Group 13: 02:00.0 SATA controller ASM1062 Serial ATA Controller
Group 14: 03:00.0 Non-Volatile memory controller Device 9100
root@Q77:~#
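The iommu.sh script itself is not shown in the post. A minimal sketch of such a script, assuming the standard sysfs layout /sys/kernel/iommu_groups/<group>/devices/<address>, might look like this. The function name list_iommu_groups and the optional root argument are made up for illustration, and unlike the output above it prints only the group number and PCI address, not the device descriptions.

```shell
#!/bin/sh
# List every IOMMU group and the PCI addresses it contains.
# The optional first argument overrides the sysfs root, which is only
# useful for trying the function against a fake directory tree.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob did not match: no groups exposed
        group="${dev%/devices/*}"      # drop the trailing "/devices/<address>"
        group="${group##*/}"           # keep only the group number
        printf 'Group %s: %s\n' "$group" "${dev##*/}"
    done | sort -t' ' -k2,2n           # order numerically by group number
}

list_iommu_groups "$@"
```

A device is generally only safe to pass through when every other address listed in its group is either passed to the same VM or is a bridge, which is the check discussed in the replies below.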
02:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02) (prog-if 01 )
Subsystem: ASMedia Technology Inc. ASM1062 Serial ATA Controller
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 29
IOMMU group: 13
Region 0: I/O ports at e050
Region 1: I/O ports at e040
Region 2: I/O ports at e030
Region 3: I/O ports at e020
Region 4: I/O ports at e000
Region 5: Memory at f7d10000 (32-bit, non-prefetchable)
Expansion ROM at f7d00000
Capabilities: MSI: Enable+ Count=1/1 Maskable- 64bit-
Address: fee002d8  Data: 0000
Capabilities: Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (ok), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Kernel driver in use: ahci
Kernel modules: ahci
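Try booting ESXi from a USB stick, or try a different PVE version.

Did you check the IOMMU groups before passing the device through? If anything else shares the group, passthrough is guaranteed to hang the host. I once forgot to check, and had also set the VM to start at boot; I ended up reinstalling in tears.

iooo posted on 2023-5-23 21:32
Try booting ESXi from a USB stick, or try a different PVE version.
I just upgraded to the latest PVE 7.4 and the problem is still there... [grumbling]
I have never used ESXi. If I really can't get PVE working, I'll give ESXi a try when I get the chance.

大头吃小头 posted on 2023-5-23 21:33
Did you check the IOMMU groups before passing the device through? If anything else shares the group, passthrough is guaranteed to hang the host. I once forgot to check, and had also set the VM to start at boot; I ended up reinstalling in tears ...
This PCIe-to-SATA card is in group 13, and it is the only device in that group.
Group 13: 02:00.0 SATA controller ASM1062 Serial ATA Controller

guitarbug posted on 2023-5-23 21:37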
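I just upgraded to the latest PVE 7.4 and the problem is still there...
I have never used ESXi. If I really can't get PVE working, I'll give es ...
Are you sure this ASM1062 card actually works?
The card itself could be faulty.

大头吃小头 posted on 2023-5-23 21:49
Are you sure this ASM1062 card actually works?
The card itself could be faulty.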
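The BIOS detects the two drives attached to the ASM1062.
The PVE host also detects the two drives attached to the ASM1062.
So the card should be fine.

PVE can manage at most 6 SATA devices; adding an ASM expansion card doesn't help, you can still only use 6.
With passthrough under ESXi, the Synology disks were unstable in RAID mode and kept dropping out. After I switched to RDM it was stable, with disk read/write performance close to passthrough, but the NAS could not use an SSD cache.
In the end I installed the unofficial Synology DSM (黑群晖) on bare metal from a USB stick, where the SSD cache works.

nnmm999 posted on 2023-5-23 23:00
PVE can manage at most 6 SATA devices; adding an ASM expansion card doesn't help, you can still only use 6.
With passthrough under ESXi, the Synology disks were unstable in RAID mode and kept dropping out ...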
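My board has 4 SATA ports plus 2 on the expansion card, which adds up to exactly 6, hahaha [laughing].
I used RDM before as well, but I couldn't see the drives' SMART data and the drives couldn't spin down, which is why I wanted passthrough... [snicker]

Is the kernel parameter pcie_acs_override=downstream,multifunction really needed? I usually don't add it.
Also consider whether the modded BIOS is the problem; you could try the stock BIOS first and see whether the host still hangs.
The motherboard itself may also have some odd quirk. A no-name board I bought a while ago simply could not do passthrough, even though its BIOS had a VT-d switch. That said, Dell's quality shouldn't be that bad.

大头吃小头 posted on 2023-5-23 21:33
Did you check the IOMMU groups before passing the device through? If anything else shares the group, passthrough is guaranteed to hang the host. I once forgot to check, and had also set the VM to start at boot; I ended up reinstalling in tears ...
There was no need to reinstall. Just disable VT in the BIOS, boot into PVE, and fix the VM configuration.

Icarus_Radio posted on 2023-5-23 23:50
Is the kernel parameter pcie_acs_override=downstream,multifunction really needed? I usually don't add it.
Also consider whether the modded BIOS ...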
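pcie_acs_override=downstream,multifunction // I didn't add it at first either. I only added this kernel parameter after the passthrough hangs started, based on articles I found online.
The modded BIOS is just the stock BIOS with an NVMe module inserted. With the stock BIOS I would have to boot the PVE host from SATA or USB instead; my guess is that PCIe passthrough might then work. [grimace]

Did the OP ever solve this? My host hangs too, though not because of passthrough; it just hangs at random every so often.

Last edited by yuri_2234 on 2023-8-15 00:38
gfhgth posted on 2023-8-4 11:34
Did the OP ever solve this? My host hangs too, though not because of passthrough; it just hangs at random every so often. ...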
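I happened to run into this problem yesterday, so here is one approach for anyone who hits it later.
My symptom: with the card passed through, the host hangs a short while after the VM boots. Environment: PVE 7.4-3, guest openmediavault, BIOS SeaBIOS, machine type i440fx, PCIe-to-SATA expansion card ASM1062.
The fix came from a user on the PVE forum: https://forum.proxmox.com/threads/pcie-sata-asm1062-passthrough-problem.84135/
In short, changing the machine type to q35 and the BIOS to OVMF, so that the VM boots via UEFI, fixes the problem. Once the fix is known, the rest is straightforward. My VM used the default BIOS, that is MBR with legacy boot, so converting it to GPT (GUID partition table) with UEFI boot should work, and I consulted the articles listed further down.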
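For an existing VM, the change described above (machine type q35, firmware OVMF) could be applied from the host shell roughly as follows. This is a sketch rather than the poster's exact steps: the helper name to_q35_ovmf, the VM ID 104, and the storage name local-lvm are placeholders, and by default the helper only prints the qm commands instead of running them.

```shell
# Print (or run) the qm commands that switch a VM to q35 + OVMF.
# RUN defaults to "echo", so nothing is changed unless the caller
# sets RUN to an empty string. to_q35_ovmf is a hypothetical name.
to_q35_ovmf() {
    vmid="$1"
    storage="${2:-local-lvm}"          # placeholder storage for the EFI vars disk
    run="${RUN-echo}"
    $run qm set "$vmid" --machine q35
    $run qm set "$vmid" --bios ovmf
    # OVMF needs a small disk to store its EFI variables
    $run qm set "$vmid" --efidisk0 "${storage}:1,efitype=4m,pre-enrolled-keys=0"
}

# Dry run: shows the three commands for VM 104
to_q35_ovmf 104 local-lvm
```

Switching the firmware does not convert a guest that was installed for legacy MBR boot; the guest's disk still needs a GPT layout and an EFI bootloader before it will start under OVMF.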
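Note: snapshot the VM before doing any of the conversion steps, and preferably clone a spare VM to experiment on. The articles I followed:
https://yuweizzz.github.io/post/convert_legacy_bios_to_uefi/
https://bbs.pcbeta.com/viewthread-1881340-1-1.html
However, after trying both of these methods the VM still could not find a GRUB boot entry. I eventually gave up on the conversion and installed a fresh VM with OVMF and q35 and passed the card through to it, which solved the problem. [puzzled]

yuri_2234 posted on 2023-8-14 22:39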
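I happened to run into this problem yesterday, so here is one approach for anyone who hits it later.
My symptom: after passthrough, the VM boots ...
The VM's machine type and BIOS choice can hang the host? That's the first time I've heard of it. What OS is this VM running?

星河 posted on 2023-8-14 22:56
The VM's machine type and BIOS choice can hang the host? That's the first time I've heard of it. What OS is this VM running? ...
By BIOS I mean the BIOS emulated for the VM. The guest is openmediavault (Debian). Problems with passthrough devices are actually fairly common, so it really is "all in boom".

Last edited by ntuchenxy on 2023-8-14 23:30
I have passed through NICs, SATA controllers, and USB external drives without any failures. Adding an SR-IOV NIC did cause a hang once, which I fixed by disabling memory ballooning. I have always used the q35 machine type.

yuri_2234 posted on 2023-8-14 22:39
I happened to run into this problem yesterday, so here is one approach for anyone who hits it later.
My symptom: after passthrough, the VM boots ...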
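I run a lot of VMs, some UEFI and some SeaBIOS, so I doubt that is the cause. After a hang, check the syslog; it will contain hints. Mine shows soft lockup and hard lockup messages. Some people say this is BIOS-related and that the BIOS should be upgraded, but I haven't touched mine yet. My syslog points at the IOMMU, so I suspect it is related to enabling the ACS override, which can make a system unstable, although plenty of people enable it without trouble.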
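As a rough illustration of that check, the lockup and IOMMU messages can be pulled out of the log with grep after the host comes back up. The helper name find_lockups and the default path /var/log/syslog are assumptions here; the exact wording of the kernel messages varies between kernel versions, so the pattern is deliberately broad.

```shell
# Show log lines that mention CPU lockups or the IOMMU.
# The log path defaults to /var/log/syslog; pass another file to override.
# find_lockups is a hypothetical helper name.
find_lockups() {
    grep -iE 'soft lockup|hard lockup|DMAR|iommu' "${1:-/var/log/syslog}"
}

# Usage after a hang and reboot:
#   find_lockups                    # current syslog
#   find_lockups /var/log/syslog.1  # previous, rotated syslog
```

If the hang happened before the last log rotation, the relevant lines may be in a rotated file rather than the current one.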
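My next step is to turn the ACS override off and try again.

Passthrough of third-party controllers doesn't seem to work very well.

I once debugged an ASM1062 with kgdboc and found execution spinning inside one KVM function and never getting out (it was too long ago for me to remember the details). With the same settings and the same group, a Marvell 9215 worked fine, which struck me as very odd at the time. The approach in post #14 looks like a lead.

gfhgth posted on 2023-8-15 09:01
I run a lot of VMs, some UEFI and some SeaBIOS, so I doubt that is the cause. After a hang, check the syslog; it will contain hints, ...
Oh, I see, your problem is different from mine. I had already sorted out an IOMMU problem earlier, when passing through a PCIe-to-M.2 drive, so at first I assumed this was the IOMMU again and spent a long time troubleshooting at that step.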