Hung/frozen machine with X370 board, GTX 1060 card, Ryzen 5 CPU - Xid 32 & 69 - all driver versions

TranHung

My current machine freezes occasionally, roughly once every few days, with no log entry and no way to recover.

I get Xid errors 32 and 69 frequently (on every boot). I believe these are related. If I am reading the Xid documentation correctly, these two Xids can only be caused by driver issues. I sometimes get other Xid errors, but those logs have been truncated; I will keep a lookout for them.

I believe this is caused by some race condition: the issue gets better (crashes, flickering, slowness, and Xids are all rarer) when I select maximum performance instead of auto performance. The issue is worse when I play games or run other GPU-intensive workloads. WebGL sometimes crashes, though video games seem to recover pretty well.

The output of nvidia-bug-report.sh is attached.

The machine is used for only a few hours a day, but it is mostly online, so I can run experiments or other code to attempt to reproduce the issue if someone can send me the code. I am familiar with C/C++ development. I can also run things in any kind of super-debug mode if someone can tell me how. I have not set up a machine to receive the logs (a remote logger), because I am unsure whether that would help; I doubt the messages get out of the buffer before the kernel hangs. The kernel module should be in persistence mode. I am running Arch Linux with the latest kernel and NVIDIA driver.
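Whether persistence mode is actually on can be confirmed rather than assumed. The following is a minimal sketch, assuming nvidia-smi is installed and on the PATH; the function names are illustrative and not part of any NVIDIA tooling.

```python
import subprocess


def parse_persistence(output):
    """Turn nvidia-smi output (one 'Enabled'/'Disabled' line per GPU) into bools."""
    return [
        line.strip().lower() == "enabled"
        for line in output.splitlines()
        if line.strip()
    ]


def query_persistence():
    """Ask nvidia-smi for the persistence mode of every GPU it can see."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_persistence(result.stdout)
```

Calling query_persistence() on the affected machine should return one boolean per GPU; a False entry would mean the module is not in persistence mode after all.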

What I get from dmesg | grep -i 'nvrm' is something like:

Sep 25 02:02:18 RockCruncher kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.69 Wed Aug 16 19:34:54 PDT 2017 (using threaded interrupts)
Sep 25 02:02:19 RockCruncher kernel: NVRM: Your system is not currently configured to drive a VGA console
Sep 25 17:34:47 RockCruncher kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.69 Wed Aug 16 19:34:54 PDT 2017 (using threaded interrupts)
Sep 25 17:34:49 RockCruncher kernel: NVRM: Your system is not currently configured to drive a VGA console
Sep 25 18:40:57 RockCruncher kernel: NVRM: GPU at PCI:0000:0c:00: GPU-bc43403c-41f2-3d53-37da-dd090bfda690
Sep 25 18:40:57 RockCruncher kernel: NVRM: GPU Board Serial Number:
Sep 25 18:40:57 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 18:41:13 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:20:58 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:21:03 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 69, Class Error: ChId 001b, Class 0000c197, Offset 00001688, Data 00008000, ErrorCode 0000000c
Sep 25 19:21:09 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:27:23 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:42:46 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:43:15 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:43:22 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:43:24 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 19:59:13 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 20:54:24 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
Sep 25 23:37:29 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 0000238c HCE_DBG1 00000020
Sep 25 23:37:29 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000345
Sep 25 23:49:56 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000
Sep 25 23:50:44 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000
Sep 26 00:35:58 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 0000238c HCE_DBG1 00000020
Sep 26 00:35:58 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000345
Sep 26 01:08:19 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000000
Sep 26 01:08:19 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002394 HCE_DBG1 00000000
Sep 26 01:40:01 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002390 HCE_DBG1 00000000
Sep 26 01:40:01 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr1 00000008 HCE_DBG0 00002394 HCE_DBG1 00000000
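
The Xid lines above are easier to compare across boots once they are tallied by Xid code and channel. Here is a rough sketch of one way to do that, assuming the log lines keep the format shown in this post; the function name tally_xids and both regular expressions are illustrative, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches the Xid code in lines such as:
#   NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000
#   NVRM: Xid (PCI:0000:0c:00): 69, Class Error: ChId 001b, Class 0000c197, ...
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+),")
# The channel is printed as "Channel ID <hex>" (Xid 32) or "ChId <hex>" (Xid 69).
CHANNEL_RE = re.compile(r"(?:Channel ID|ChId) (?P<chan>[0-9a-fA-F]+)")


def tally_xids(lines):
    """Count Xid events per (xid, channel) pair; channel is None if not printed."""
    counts = Counter()
    for line in lines:
        xid_match = XID_RE.search(line)
        if xid_match is None:
            continue  # not an Xid line (e.g. module load messages)
        chan_match = CHANNEL_RE.search(line)
        channel = int(chan_match.group("chan"), 16) if chan_match else None
        counts[(int(xid_match.group("xid")), channel)] += 1
    return counts


# Three lines copied from the log in this post.
sample = [
    "Sep 25 18:40:57 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 0000001b intr 00040000",
    "Sep 25 19:21:03 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 69, Class Error: ChId 001b, Class 0000c197, Offset 00001688, Data 00008000, ErrorCode 0000000c",
    "Sep 25 23:49:56 RockCruncher kernel: NVRM: Xid (PCI:0000:0c:00): 32, Channel ID 00000010 intr 00040000",
]
for (xid, channel), count in tally_xids(sample).most_common():
    print(f"Xid {xid} on channel {channel}: {count} event(s)")
```

Passing an open log file (or any iterable of lines) to tally_xids works the same way, since the function only iterates over its argument.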

nvidia-bug-report.log.gz (161 KB)