Node telemetry metrics

There is currently no way for nodes to communicate metrics for monitoring network status. Things such as these are being considered:

  • Block count
  • Cemented block count
  • Set bandwidth cap
  • Protocol version number
  • Node vendor version
  • Peer count
  • Account count

We are looking to see if there are any others node operators may want.

The reason for doing this is that, even though we are connected to many peers, we don't actually share much information about node state. It can therefore be difficult to know how far along the upgrade or bootstrap process we are, or whether an error has been encountered. Sharing this data will enable the node to adjust automatically to these conditions.

This relates to issue https://github.com/nanocurrency/nano-node/issues/2225

6 Likes

I fully agree with your remark. In the future we could then add real-time node monitoring. I have an idea: to reduce network impact, why don't we use MQTT for this transfer?

3 Likes

We'd like the telemetry to be available over the regular peer-to-peer network protocol. However, we are considering MQTT (or other message brokers) for our callback mechanisms, in addition to HTTP and websockets, for their guaranteed delivery aspect :+1:

2 Likes

Now that we are adding the voter count to the election information for websocket/RPC, I wonder if tracking the average voters per block seen over a time period, and reporting it to others, could help give a better picture of the decentralization actively seen across the network. Do we think there is value in that? https://github.com/nanocurrency/nano-node/pull/2414

1 Like

Given feedback, we currently have 10 pieces of telemetry data available:

  1. block count
  2. cemented count
  3. node vendor version
  4. protocol version
  5. peer count
  6. bandwidth cap
  7. account count
  8. unchecked count
  9. uptime (seconds)
  10. genesis block hash

More can be added in future node versions, and the format is forwards/backwards compatible. Nodes running newer versions will of course not receive newly added data from older nodes that do not understand it, but the messages will still be valid. The node does not yet use the new data (https://github.com/nanocurrency/nano-node/pull/2446) for anything useful, but it does allow requests to specific nodes and to a random selection through the new "node_telemetry" RPC. I anticipate this will be ready for DB4.
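To illustrate what forwards/backwards compatibility could look like on the receiving side, here is a minimal sketch in Python. It is not the node's implementation: the wire format is not described in this thread, so the dictionary representation, the `Telemetry` class, the `decode_telemetry` helper and the exact field names are all assumptions made for illustration. The idea is only that unknown fields are ignored and missing fields get a default, so a message from either an older or a newer node stays valid.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class Telemetry:
    # Hypothetical field names mirroring the ten items listed above.
    # Every field is optional so a message lacking some of them still decodes.
    block_count: Optional[int] = None
    cemented_count: Optional[int] = None
    vendor_version: Optional[int] = None
    protocol_version: Optional[int] = None
    peer_count: Optional[int] = None
    bandwidth_cap: Optional[int] = None
    account_count: Optional[int] = None
    unchecked_count: Optional[int] = None
    uptime: Optional[int] = None  # seconds
    genesis_block: Optional[str] = None


def decode_telemetry(message: Dict[str, Any]) -> Telemetry:
    """Build a Telemetry record, ignoring fields this version does not know."""
    known = {f.name for f in fields(Telemetry)}
    return Telemetry(**{k: v for k, v in message.items() if k in known})


# A message from an older peer: fields it does not send stay None.
from_older = decode_telemetry({"block_count": 100, "peer_count": 5})

# A message from a newer peer: the unknown field is dropped, the rest is kept.
from_newer = decode_telemetry({"block_count": 100, "future_field": 1})
```

The trade-off in this sketch is that a consumer has to handle `None` for every field, which is the price of accepting messages from any version.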

6 Likes

That's amazing! To make full use of this for NanoTicker, retrieve data from ALL nodes, and completely remove the use of nodeMonitors, I suggest adding:

  • Active difficulty
  • Confirmation time (at least the median for the past 2048 blocks, but more stats like the average and the 90th and 99th percentiles would be nice). The full history can't be sent, so some kind of aggregate is needed here.
  • Weight of the representative. Maybe not possible, since you need to know which account is the rep, but the rep address and a custom rep name could optionally be entered in the config and included in the response. The weight could then be calculated manually from that. I know the weight can be obtained from the "confirmation_quorum" RPC with "peer_details" by matching the IP, but is that reliable enough? A custom rep name would be nice regardless. I can't really use telemetry on NanoTicker until the weight can be obtained, because I separate PRs/non-PRs by weight.
  • Store vendor version

Other useful stats not currently on NanoTicker:

  • Average data bandwidth used in/out over the past minute, or the current bandwidth load. A total since the node restarted is probably better, since the average can then be calculated elsewhere as needed.
  • Current node load as a percentage of maximum possible CPU capacity. Or average over some time.
  • Current node memory use in MB. Or average over some time.
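As a rough sketch of the "total since restart" idea, a monitor could derive the average bandwidth itself from two samples of cumulative counters. The counter names below (`bytes_in_total`, `bytes_out_total`) are hypothetical, since no such telemetry fields exist yet; only the uptime in seconds is in the current list.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Sample:
    uptime_seconds: int   # node uptime when the sample was taken
    bytes_in_total: int   # hypothetical cumulative counter since restart
    bytes_out_total: int  # hypothetical cumulative counter since restart


def average_rates(earlier: Sample, later: Sample) -> Optional[Tuple[float, float]]:
    """Return average (in, out) bytes per second between two samples.

    Returns None when the samples cannot be compared, for example because
    the node restarted in between and its counters were reset.
    """
    elapsed = later.uptime_seconds - earlier.uptime_seconds
    delta_in = later.bytes_in_total - earlier.bytes_in_total
    delta_out = later.bytes_out_total - earlier.bytes_out_total
    if elapsed <= 0 or delta_in < 0 or delta_out < 0:
        return None
    return delta_in / elapsed, delta_out / elapsed


# Two samples taken 60 seconds apart.
rates = average_rates(Sample(100, 1000, 2000), Sample(160, 7000, 5000))
```

With this approach the node only has to maintain two counters, and each consumer can choose its own averaging interval.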

4 Likes

Great to hear. We'll take a look at what it would take to include these.

7 Likes

For some things we can use a moving average to stay O(1): A := A + (1/N)*a. Once the queue has N items, and the oldest item is z, it's simply A := A + (1/N)*a - (1/N)*z = A + (1/N)(a-z). Percentiles are harder, since they would have to be computed on demand, so they are only feasible if we cache the telemetry response for some time.
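The update rule above can be sketched in a few lines of Python. The class name `MovingAverage` and its interface are made up for illustration and are not taken from the node's code. Note that, following the formula as written, the value is only a true average once the window holds N samples; before that it is the running sum divided by N.

```python
from collections import deque


class MovingAverage:
    """Fixed-window moving average with O(1) work per new sample."""

    def __init__(self, n: int) -> None:
        if n <= 0:
            raise ValueError("window size must be positive")
        self.n = n
        self.window = deque()  # holds at most n samples, oldest first
        self.average = 0.0

    def add(self, a: float) -> float:
        if len(self.window) == self.n:
            # Window is full: drop the oldest sample z, A := A + (a - z)/N.
            z = self.window.popleft()
            self.average += (a - z) / self.n
        else:
            # Window is still filling: A := A + a/N.
            self.average += a / self.n
        self.window.append(a)
        return self.average


ma = MovingAverage(3)
results = [ma.add(x) for x in (3, 6, 9, 12)]
# After 3, 6, 9 the window is full and the average is 6.0;
# adding 12 evicts 3, giving (6 + 9 + 12) / 3 = 9.0.
```

Because the average is updated incrementally, floating-point error can accumulate over a very long run; recomputing it from the window occasionally would be one way to bound that, at the cost of an O(N) pass.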

Getting load and memory usage cross-platform is quite messy.

2 Likes

Welcome to the community! :joy:

2 Likes