Periodically get ‘device not responding’ error on my HiveOS rig

Every 2 or 3 minutes I get the following error in the miner log:

grep -i error /var/log/miner/gminer/gminer.log
Error on GPU6: Device not responding, check overclocking settings

and the following message on the miner screen:

miner
GPU6: DAG has been damaged, check overclocking settings
Miner terminated, watchdog will restart process after 10 seconds
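
Note that grep -i error also matches harmless lines such as "DAG verification completed without errors". To see which card actually fails most often, the log can be tallied per GPU. The following is a minimal Python sketch, assuming the error lines always have the form "Error on GPU<n>: <message>" as in the excerpts in this post; the function name count_gpu_errors is only illustrative.

```python
import re
from collections import Counter

# Matches gminer lines such as "Error on GPU6: Device not responding, ..."
# with or without a leading "HH:MM:SS " timestamp.
ERROR_RE = re.compile(r"Error on GPU(\d+): (.+)")


def count_gpu_errors(lines):
    """Return a Counter mapping GPU index -> number of error lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Sample lines copied from the log excerpts in this post.
sample = [
    "Error on GPU6: Device not responding, check overclocking settings",
    "22:21:11 GPU2: DAG verification completed without errors",
    "22:24:05 Error on GPU3: unspecified launch failure",
]
print(count_gpu_errors(sample))
```

To run it on the real log, pass an open file instead of the sample list, for example count_gpu_errors(open("/var/log/miner/gminer/gminer.log")).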

The GPU temperature is about 50-60°C, but it looks like the memory chips are overheating, because the NVIDIA driver logged this error:

journalctl -p err | grep NVRM
Nov 06 21:14:14 rig0 kernel: NVRM: Xid (PCI:0000:08:00): 31, pid=12382, Ch 0000001e, intr 10000000.
MMU Fault: ENGINE HOST7 HUBCLIENT_HOST_CPU faulted @ 0x2_24607000.
Fault is of type FAULT_PDE ACCESS_TYPE_READ

Sometimes the errors look like this instead:

root@rig0:/var/log/miner/gminer# grep -i error gminer.log
22:21:11 GPU2: DAG verification completed without errors
22:21:11 GPU4: DAG verification completed without errors
22:21:11 GPU5: DAG verification completed without errors
22:21:12 GPU1: DAG verification completed without errors
22:21:12 GPU3: DAG verification completed without errors
22:21:12 GPU0: DAG verification completed without errors
22:24:05 Error on GPU3: unspecified launch failure
root@rig0:/var/log/miner/gminer# journalctl -p err | grep NVRM
Nov 06 22:24:05 rig0 kernel: NVRM: Xid (PCI:0000:05:00): 32, pid=2766, Channel ID 0000001e intr 00040000
Nov 06 22:24:05 rig0 kernel: NVRM: Xid (PCI:0000:05:00): 32, pid=2766, Channel ID 0000001e intr1 00000008 HCE_DBG0 00001cc4 HCE_DBG1 00000272
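
The NVRM lines carry two useful pieces of information: the PCI address of the failing card and the Xid code. Below is a small Python sketch that extracts both. The short descriptions in XID_NAMES are written from memory of NVIDIA's Xid documentation, so treat them as an assumption and check the official table; parse_xid is just an illustrative name.

```python
import re

# Descriptions as recalled from NVIDIA's Xid documentation (assumption,
# verify against the official table before relying on them).
XID_NAMES = {
    31: "GPU memory page fault",
    32: "Invalid or corrupted push buffer stream",
}

# Matches e.g. "NVRM: Xid (PCI:0000:08:00): 31, pid=12382, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def parse_xid(line):
    """Return (pci_address, xid_code, description), or None if the line has no Xid."""
    match = XID_RE.search(line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, XID_NAMES.get(code, "unknown")


# Lines copied from the journalctl output in this post.
line_a = "Nov 06 21:14:14 rig0 kernel: NVRM: Xid (PCI:0000:08:00): 31, pid=12382, Ch 0000001e, intr 10000000."
line_b = "Nov 06 22:24:05 rig0 kernel: NVRM: Xid (PCI:0000:05:00): 32, pid=2766, Channel ID 0000001e intr 00040000"
print(parse_xid(line_a))
print(parse_xid(line_b))
```

The PCI address can then be matched to a card with nvidia-smi --query-gpu=index,pci.bus_id --format=csv, keeping in mind that the miner's GPU numbering is not guaranteed to match the nvidia-smi index.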

I used an “Arctic Cooling Thermal Pad”, 6 W/mK, ceramic, 1 mm x 50 mm x 50 mm:

but I probably need a thermal pad that is 1.5 mm thick.

The memory chips measure 1×1.5 cm.

My first idea was to try the “Thermal Grizzly Minus Pad 8”, but as far as I understand it can’t be used with aluminum heatsinks, so I ordered the 1.5 mm “Arctic Cooling Thermal Pad”.

The original layout of thermal pads:

I searched for d9tcb on aliexpress.com (is it Micron?):

There are also some results for K4G41325FE-HC25 (is it Samsung?).

Testing the card on Windows

I started a BTG miner on my Windows 10 machine with +600 memory clock:

+---+-----+-----------+------+-----+-----------+
| ID  GPU     Speed    Shares Power Efficiency |
+---+-----+-----------+------+-----+-----------+
|  0  1060  37.2 Sol/s  0/0/0 116 W 0.32 Sol/W |
+---+-----+-----------+------+-----+-----------+
+---+-----+----+---+----+----+
| ID  GPU  Temp Fan Core Mem |
+---+-----+----+---+----+----+
|  0  1060   65 0 %    0   0 |
+---+-----+----+---+----+----+
03:24:10 Pool: btg.2miners.com:4040 Shares/Minute: 0.00
03:24:10 Pool Hashrate: 0.0 Sol/s Efficiency: 0.00 %
03:24:10 Uptime: 0d 00:10:08 Electricity: 0.019 kWh
03:24:24 GPU0: Share #1 verified on CPU, difficulty: 19.83K
03:24:24 GPU0: Share #1 accepted 61 ms

Mining slowed down Task Manager, GPU-Z and MSI Afterburner.

I tried to mine ETP with t-rex:

t-rex -a ethash -o stratum+tcp://eu.etp.k1pool.com:8008 -u MNpoZqo8VDeDTJVMU72YMpzxUYguzDGc7J -p x --worker win-rig

but got “not enough free memory to mine ethash at epoch 196”. It is not clear why it can’t be mined on Windows, because on Linux it uses only 2666MiB.
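
For a rough idea of how tight the memory is, the Ethash DAG size for a given epoch can be computed. The sketch below follows the sizing rule from the Ethash specification (1 GiB at epoch 0, growing by 8 MiB per epoch, then reduced until the number of 128-byte rows is prime). It assumes ETP uses these standard parameters, which has not been checked here.

```python
DATASET_BYTES_INIT = 2**30    # 1 GiB at epoch 0
DATASET_BYTES_GROWTH = 2**23  # grows by 8 MiB per epoch
MIX_BYTES = 128               # size of one DAG row pair used for the prime check


def is_prime(n):
    """Trial division; fast enough for row counts of a few tens of millions."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def dag_size_bytes(epoch):
    """Ethash full dataset size in bytes for the given epoch number."""
    size = DATASET_BYTES_INIT + DATASET_BYTES_GROWTH * epoch - MIX_BYTES
    # Shrink until the number of 128-byte rows is a prime number.
    while not is_prime(size // MIX_BYTES):
        size -= 2 * MIX_BYTES
    return size


size = dag_size_bytes(196)
print(f"epoch 196 DAG: {size / 2**20:.2f} MiB")
```

Under these assumptions the DAG at epoch 196 is just under 2592 MiB, which is consistent with the 2666MiB seen on Linux and leaves less than 500 MB of headroom on a 3072 MB card. If Windows or the driver keeps part of the video memory for itself, which is a possible explanation but not verified here, the allocation would fail.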

I tried to mine ERGO:

t-rex.exe -a autolykos2 -o stratum+tcp://erg.2miners.com:8888 -u 9eZSBAg38A5KvQWjeHXdubo9owj8VPrxCHa16GSXn4rBSSx1bmg.rig0 -p x --no-watchdog
20211109 04:36:29 T-Rex NVIDIA GPU miner v0.24.5  -  [Windows]
20211109 04:36:29 r.3ed63f02e8cb
20211109 04:36:29
20211109 04:36:29
20211109 04:36:29 NVIDIA Driver v456.71
20211109 04:36:29
20211109 04:36:29 + GPU #0: [00:01.0|1c02] MSI GeForce GTX 1060 3GB, 3072 MB
20211109 04:36:29
20211109 04:36:29 WARN: DevFee 2% (autolykos2)
20211109 04:36:29
20211109 04:36:29 URL : stratum+tcp://erg.2miners.com:8888
20211109 04:36:29 USER: 9eZSBAg38A5KvQWjeHXdubo9owj8VPrxCHa16GSXn4rBSSx1bmg.rig0
20211109 04:36:29 PASS: x
20211109 04:36:29
20211109 04:36:29 Starting on: erg.2miners.com:8888
20211109 04:36:29 ApiServer: HTTP server started on 127.0.0.1:4067
20211109 04:36:29 ---------------------------------------------------
20211109 04:36:29 For control navigate to: http://127.0.0.1:4067/trex
20211109 04:36:29 ---------------------------------------------------
20211109 04:36:29 GPU #0: intensity 21.2
20211109 04:36:30 Extranonce is set to: e3e0
20211109 04:36:30 Authorizing...
20211109 04:36:30 Authorized successfully.
20211109 04:36:30 autolykos2 block: 615558, diff: 8.73 G
20211109 04:36:30 GPU #0: allocated memory for the dataset, memory left: 333.86 MB
20211109 04:36:35 GPU #0: dataset generated for block 615558 [time: 4689 ms]
20211109 04:36:35 GPU #0: failed to allocate second dataset buffer, falling back to single buffer mode
20211109 04:36:42 [ OK ] 1/1 - 41.08 MH/s, 84ms ... GPU #0

-------------20211109 04:37:40 -------------
Mining at erg.2miners.com:8888, diff: 8.73 G
GPU #0: MSI GTX 1060 3GB - 41.08 MH/s, [T:53C, P:80W, F:45%, E:514kH/W], 1/1 R:0%
Shares/min: 1 (Avg. 4.615)
Uptime: 1 min 10 secs | Algo: autolykos2 | T-Rex v0.24.5
WD: 1 min 12 secs, shares: 1/1

20211109 04:37:41 autolykos2 block: 615559, diff: 8.73 G
20211109 04:37:46 GPU #0: dataset generated for block 615559 [time: 4706 ms]
20211109 04:38:08 autolykos2 block: 615560, diff: 8.73 G

-------------20211109 04:38:10 -------------
Mining at erg.2miners.com:8888, diff: 8.73 G
GPU #0: MSI GTX 1060 3GB - 41.08 MH/s, [T:55C, P:80W, F:47%, E:495kH/W], 1/1 R:0%
Shares/min: 1 (Avg. 4.615)
Uptime: 1 min 40 secs | Algo: autolykos2 | T-Rex v0.24.5
WD: 1 min 42 secs, shares: 1/1

20211109 04:38:13 GPU #0: dataset generated for block 615560 [time: 4712 ms]
20211109 04:38:30 autolykos2 block: 615561, diff: 8.73 G
20211109 04:38:35 GPU #0: dataset generated for block 615561 [time: 4717 ms]
20211109 04:38:38 [ OK ] 2/2 - 41.08 MH/s, 97ms ... GPU #0

After switching to the P0 state, Task Manager, MSI Afterburner and GPU-Z slowed down again, and I got this error:

20211109 19:57:58 TREX: Can't find nonce with device [ID=1, GPU #1], cuda exception: CUDA_ERROR_LAUNCH_FAILED, try to reduce overclock to stabilize GPU state

I switched back to the P2 state and the t-rex miner continued to work at about 87 MH/s.

Below are GPU-Z screenshots for both cards:


1 Response to Periodically get ‘device not responding’ error on my HiveOS rig

  1. dmitriano says:

    Started mining ERGO with two cards at +600 memory:
    GPU #0: MSI GTX 1060 3GB - 41.08 MH/s, [T:56C, P:120W, F:90%, E:456kH/W], 15/15 R:0%
    GPU #1: Gigabyte GTX 1060 3GB - 46.33 MH/s, [T:49C, P:110W, F:90%, E:515kH/W], 15/15 R:0%
    Hashrate: 87.41 MH/s, Shares/min: 0.855 (Avg. 0.703), Avg.P: 180W, Avg.E: 486kH/W
