Troubleshooting

When a node is online but earning nothing, or a freshly installed node is failing validator verification, start here. This page covers the two diagnostic surfaces providers reach for first:

  1. Grafana Job Logs: the canonical place to read raw validator-side error messages for a specific node.
  2. The verifyx-test benchmark: runs the same network, storage, and memory checks the validator runs, on demand.

If neither narrows down the problem, the Provider Portal monitoring view and the full Grafana suite cover the rest of the day-2 telemetry.

1. Pull error details from Grafana Job Logs

The Job Logs dashboard is the per-node scoring log. Every validator run for your node (successful, failed, or unreachable) lands here with the exact error message the validator produced.

Open the dashboard with your filters

  1. Go to grafana.lium.io → Job Logs.
  2. In the top toolbar, set the two template variables:
    • miner_hotkey: your provider SS58 hotkey (the one registered on subnet 51).
    • executor_id: the UUID of the specific node you're debugging. You can copy it from the Provider Portal node detail page or from that page's URL, provider.lium.io/executors/{executor-id}.
  3. Set the time range (top-right) to cover the period you care about. The default is the last hour; widen it if your node has been silent.
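
If you revisit the same node often, a pre-filtered link saves you from setting the variables by hand each time. Grafana reads template variables from var-<name> query parameters and the time range from from/to, so one bookmark can cover all three steps. This is a sketch. job_logs_url is a hypothetical helper. This page does not give the dashboard's /d/<uid>/... address, so the first argument is assumed to be the URL you copy from your browser after opening Job Logs once.

```shell
# Hypothetical helper: build a pre-filtered Grafana Job Logs link.
# $1 = dashboard URL copied from the browser (without any query string)
# $2 = miner_hotkey (SS58), $3 = executor_id (UUID), $4 = look-back window (default 1h)
job_logs_url() {
  local dashboard_url="$1" hotkey="$2" executor="$3" range="${4:-1h}"
  # var-<name> sets a Grafana template variable; from/to set the time range.
  printf '%s?var-miner_hotkey=%s&var-executor_id=%s&from=now-%s&to=now\n' \
    "$dashboard_url" "$hotkey" "$executor" "$range"
}
```

For example, job_logs_url "<dashboard URL>" "<your hotkey>" "<executor UUID>" 24h prints a link covering the last 24 hours. SS58 hotkeys and UUIDs contain only URL-safe characters, so no escaping is needed.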

You can also reach the same view without leaving the portal. Each node's detail page has a Grafana tab that pre-fills both variables for that node (see Monitoring Nodes → Grafana Integration).

Read the three log feeds

The dashboard shows three separate panels, all keyed by the same miner_hotkey + executor_id:

| Panel | When it has rows | What to look for |
| --- | --- | --- |
| Job Logs | Every scoring run | The score the validator gave you, the uptime, and any non-fatal warnings. |
| Job Error Logs (Score = 0 AND GPU > 0) | The node was reachable but failed scoring | Synthetic-job errors, missing sysbox-runc runtime, GPU verification failures, network-metric failures. |
| Machine scrape Error Logs (Score = 0 AND GPU = 0) | The validator could not even read your machine | SSH / auth failures, unreachable host, no GPU detected, port closed. |

Triage rule of thumb:

  • Rows in panel 3 (Machine scrape Error Logs) → a connectivity or configuration problem (network, firewall, SSH, machine offline). Fix the host before doing anything else.
  • Rows in panel 2 (Job Error Logs) → the host is up but a specific check is failing. Read the error message; for verifyx network or storage failures, jump to §2 and reproduce the failure locally.
  • Rows only in panel 1, with non-zero scores → your node is being scored normally; revenue concerns belong in Rewards and the rewards FAQ.
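
The rule of thumb is an ordered check: panel 3 takes precedence over panel 2, and panel 2 takes precedence over panel 1. The hypothetical triage helper below makes that ordering explicit. You pass it the number of rows you see in each panel for the selected time range, and it prints the next step. It only restates the rules above and does not query Grafana.

```shell
# Hypothetical helper encoding the triage rule of thumb.
# Arguments: row counts seen in panel 1 (Job Logs), panel 2 (Job Error Logs)
# and panel 3 (Machine scrape Error Logs) for the chosen time range.
triage() {
  local p1="$1" p2="$2" p3="$3"
  if [ "$p3" -gt 0 ]; then
    # Validator could not read the machine at all: fix the host first.
    echo "connectivity/config: fix network, firewall, SSH or host first"
  elif [ "$p2" -gt 0 ]; then
    # Host reachable, but a specific check failed.
    echo "failing check: read the error; reproduce network/storage with verifyx-test"
  elif [ "$p1" -gt 0 ]; then
    # Panels 2 and 3 are the Score = 0 subsets, so every run here scored above zero.
    echo "scored normally: see Rewards and the rewards FAQ"
  else
    echo "no runs in range: widen the Grafana time range"
  fi
}
```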

2. Run the verifyx benchmark to reproduce validator checks

When Job Logs panel 2 (Job Error Logs) reports a network or storage failure, the fastest way to confirm it and iterate on a fix is to run the validator's own checks locally. We ship them as a Docker image, daturaai/verifyx-test, that runs on the node host itself. Run it before registering a new node, and re-run it whenever validator-side network or storage scoring drops.

Quick start (network only)

docker run --gpus all --rm daturaai/verifyx-test:latest

Runs a 10-iteration network benchmark and prints download and upload speeds. Finishes in roughly 2–3 minutes.

Full benchmark (network + storage + memory)

docker run --gpus all --rm daturaai/verifyx-test:latest python3 benchmark.py --full

Runs all three test suites. Expect 5–10 minutes, mostly because of the 5 GB storage I/O test.
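
Because the goal is to iterate on a fix, it helps to keep each run's output so you can compare results before and after a change. The following is a minimal sketch, meant to be run on the node host. run_verifyx and the ./verifyx-logs directory are hypothetical choices; only the two docker run commands come from this page.

```shell
# Hypothetical wrapper: run verifyx-test and keep a timestamped copy of the output.
# Usage: run_verifyx quick|full [log_dir]
run_verifyx() {
  local mode="${1:-quick}" log_dir="${2:-./verifyx-logs}"
  case "$mode" in
    quick|full) ;;
    *) echo "usage: run_verifyx quick|full [log_dir]" >&2; return 2 ;;
  esac
  mkdir -p "$log_dir" || return 1
  local log="$log_dir/verifyx-$mode-$(date +%Y%m%d-%H%M%S).log"
  if [ "$mode" = full ]; then
    # Network + storage + memory; expect 5-10 minutes.
    docker run --gpus all --rm daturaai/verifyx-test:latest python3 benchmark.py --full 2>&1 | tee "$log"
  else
    # Network only; roughly 2-3 minutes.
    docker run --gpus all --rm daturaai/verifyx-test:latest 2>&1 | tee "$log"
  fi
  echo "saved: $log"
}
```

Run run_verifyx full, change one thing on the host, run it again, and diff the two log files. The function does not pass through the benchmark's exit status, so read the log itself for errors.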

What is tested

| Test | What it measures | Minimum requirement |
| --- | --- | --- |
| Network download | PyPI package download speed | 50 Mbps |
| Network upload | Cloudflare speedtest upload | No minimum |
| Storage write | Sequential write throughput (5 GB) | 100 GB free space |
| Storage read | Sequential read throughput (5 GB) | No minimum |
| Memory allocation | Allocates 75% of available RAM (8–128 GB range) | 8 GB RAM |
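
You can check the disk and memory minimums before pulling the image. This sketch reads free space with GNU df and available memory from /proc/meminfo, so it is Linux-only, and it reports sizes in GiB. It also works out the memory test's target size. That calculation assumes "75% of available RAM (8–128 GB range)" means the 75% figure is clamped to between 8 and 128 GB. All function names are hypothetical.

```shell
# Hypothetical pre-flight helpers mirroring the minimums in the table above.

# Memory test target in GB: 75% of available RAM, clamped to 8..128 (assumed reading).
mem_test_gb() {
  local avail_gb="$1" target
  target=$(( avail_gb * 75 / 100 ))
  if [ "$target" -lt 8 ]; then target=8; fi
  if [ "$target" -gt 128 ]; then target=128; fi
  echo "$target"
}

# Exit 0 if free disk >= 100 GB and RAM >= 8 GB; print each shortfall otherwise.
check_minimums() {
  local free_gb="$1" ram_gb="$2" rc=0
  if [ "$free_gb" -lt 100 ]; then echo "disk: ${free_gb} GB free, need 100 GB"; rc=1; fi
  if [ "$ram_gb" -lt 8 ]; then echo "ram: ${ram_gb} GB available, need 8 GB"; rc=1; fi
  return "$rc"
}

# Live values (Linux, GNU coreutils): free GiB on a path, and MemAvailable in GiB.
host_free_gb() { df -BG --output=avail "${1:-/}" | tail -n 1 | tr -dc '0-9'; }
host_avail_ram_gb() { awk '/^MemAvailable:/ { printf "%d\n", $2 / 1048576 }' /proc/meminfo; }
```

A typical call is check_minimums "$(host_free_gb /var/lib/docker)" "$(host_avail_ram_gb)". Point host_free_gb at whichever filesystem holds Docker's data root (/var/lib/docker by default), since that is probably where the container's storage test writes.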

Interpreting results

The benchmark runs 10 iterations and reports an Exponential Moving Average (EMA) with alpha = 0.3 (decay = 0.7), so recent runs carry more weight. The final printed values are EMA-smoothed, which means a single slow run will not dominate the result.
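
To see how the smoothing behaves, here is the EMA written out. This is a minimal sketch for intuition only. It assumes the EMA is seeded with the first iteration's value, because the page does not say how it is initialised, and updated as EMA_t = 0.3 × x_t + 0.7 × EMA_(t-1). The ema function name is hypothetical.

```shell
# Hypothetical re-implementation of the benchmark's smoothing, for intuition only.
# EMA_1 = x_1 (assumed seed); EMA_t = alpha * x_t + (1 - alpha) * EMA_(t-1), alpha = 0.3
ema() {
  printf '%s\n' "$@" | awk -v alpha=0.3 '
    NR == 1 { e = $1; next }              # seed with the first run
    { e = alpha * $1 + (1 - alpha) * e }  # newer runs weigh more
    END { printf "%.1f\n", e }'
}
```

With nine runs at 100 Mbps and a final run at 10 Mbps, ema 100 100 100 100 100 100 100 100 100 10 gives 73.0 (0.3 × 10 + 0.7 × 100). If the same slow run comes first, its weight decays to 0.7^9 ≈ 0.04 and the result is 96.4. A plain mean would give 91.0 either way.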

The last section of the output also shows your hardware stats (total RAM, free disk space, utilization) from the most recent successful run.

Prerequisites

Your host must have the NVIDIA Container Toolkit installed so Docker can pass GPU access through the --gpus all flag. If you followed the Node setup guide, this is already configured.

# Verify the toolkit is installed
docker run --gpus all --rm nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi
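
To avoid starting a 5–10 minute run on a host where GPU passthrough is broken, you can gate the benchmark on the check above. gpu_preflight is a hypothetical name. It runs exactly the nvidia/cuda command shown and reports only whether that command exited successfully.

```shell
# Hypothetical gate: succeed only if Docker can pass the GPUs through.
gpu_preflight() {
  if docker run --gpus all --rm nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi >/dev/null 2>&1; then
    echo "GPU passthrough OK"
  else
    echo "GPU passthrough failed: check the NVIDIA Container Toolkit (see the Node setup guide)" >&2
    return 1
  fi
}
```

For example, gpu_preflight && docker run --gpus all --rm daturaai/verifyx-test:latest python3 benchmark.py --full starts the full benchmark only when the toolkit check passes.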

3. Common issues

| Symptom (Job Logs message or benchmark output) | Likely cause | Fix |
| --- | --- | --- |
| Download speed below 50 Mbps | ISP throttling or congested uplink | Contact your hosting provider; consider a dedicated port. |
| Storage test fails: not enough free space | Less than 100 GB free on the node disk | Free space or resize the volume. See Docker storage. |
| Memory allocation fails | Less than 8 GB RAM available to Docker | Reduce other processes consuming RAM; upgrade RAM if needed. |
| --gpus all flag not recognised | NVIDIA Container Toolkit not installed | Follow the Node setup guide. |
| Job Logs panel 3 has rows; SSH / auth errors | Validator cannot reach the node | Check the firewall, the node's public port, the SSH service, and the credentials registered in the Provider Portal. |
| Job Logs panel 2 mentions sysbox-runc | Sysbox runtime missing or not the default | Install Sysbox (see Sysbox setup). It is required; validators reject any node without it. |
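
You can check the last two rows from a shell. The following is a sketch under stated assumptions. sysbox_status and port_open are hypothetical helpers. sysbox_status only inspects the two docker info fields you pass to it (DefaultRuntime and Runtimes). port_open uses netcat's -z (connect without sending data) and -w (timeout) options, which both OpenBSD and traditional netcat provide. Run port_open from a machine outside your network, because the validator connects from outside. Use the host and port you registered in the Provider Portal.

```shell
# Hypothetical checks for the panel 2 (Sysbox) and panel 3 (reachability) rows.

# Usage:
#   sysbox_status "$(docker info --format '{{.DefaultRuntime}}')" \
#                 "$(docker info --format '{{.Runtimes}}')"
sysbox_status() {
  local default_rt="$1" runtimes="$2"
  case "$runtimes" in
    *sysbox-runc*) ;;
    *) echo "missing: sysbox-runc is not registered with Docker"; return 1 ;;
  esac
  if [ "$default_rt" != "sysbox-runc" ]; then
    echo "not default: default runtime is $default_rt"
    return 1
  fi
  echo "ok: sysbox-runc is the default runtime"
}

# Succeeds if a TCP connection to host $1, port $2 opens within 5 seconds.
port_open() {
  nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}
```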

Next steps