Monitoring
Once your node is up and running, it's important to keep an eye on it to make sure it stays afloat and continues to contribute to the health of the overall network. To help with that, Bantu Core exposes vital information that you can use to monitor your node and diagnose potential problems.
You can access this information by running commands and inspecting Bantu Core's output, which is what the first half of this doc covers. You can also connect Prometheus to make monitoring easier, combine it with Alertmanager to automate notifications, and use pre-built Grafana dashboards to create visual representations of your node's well-being.
However you decide to monitor, the most important thing is that you have a system in place to ensure that your integration keeps ticking.
General Node Information
If you run:
$ bantu-core http-command 'info'
The output will look something like this:
Some notable fields in info are:
build: the build number for this Bantu Core instance
ledger: the local state of your node, which may be different from the network state if your node was disconnected from the network. Some important sub-fields:
  age: time elapsed since this ledger closed (during normal operation, less than 10 seconds)
  num: ledger number
  version: protocol version supported by this ledger
network: the network passphrase that this core instance is using to decide whether to connect to the testnet or the public network
peers: information on the connectivity to the network
  authenticated_count: the number of live connections
  pending_count: the number of connections that are not fully established yet
protocol_version: the maximum version of the protocol that this instance recognizes
state: the node's synchronization status relative to the network
quorum: summarizes the state of the SCP protocol participants, the same as the information returned by the quorum command (see below).
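The fields listed above lend themselves to a simple automated check. The following is a minimal Python sketch, not part of Bantu Core. It assumes the command output is JSON and that you have already extracted the object holding these fields into a dict. The helper name check_info, the min_peers threshold, and the sample values are hypothetical; the 10-second limit is the normal-operation figure quoted for age.

```python
def check_info(info, max_age_seconds=10, min_peers=1):
    """Return a list of warnings for a parsed `info` object.

    `info` is assumed to be a dict holding the fields described above
    (ledger.age, peers.authenticated_count). Thresholds are illustrative.
    """
    warnings = []

    # A ledger older than the normal close time suggests the node is lagging.
    age = info.get("ledger", {}).get("age")
    if age is None:
        warnings.append("ledger.age is missing")
    elif age >= max_age_seconds:
        warnings.append(f"last ledger closed {age}s ago")

    # Without live connections the node cannot follow the network.
    authenticated = info.get("peers", {}).get("authenticated_count", 0)
    if authenticated < min_peers:
        warnings.append(f"only {authenticated} authenticated peer(s)")

    return warnings


# Hypothetical values for illustration only (not real node output).
healthy = {
    "ledger": {"age": 3, "num": 1000},
    "peers": {"authenticated_count": 8, "pending_count": 1},
}
stale = {
    "ledger": {"age": 45, "num": 1000},
    "peers": {"authenticated_count": 0, "pending_count": 4},
}

print(check_info(healthy))
print(check_info(stale))
```

An empty list means none of the illustrative thresholds were crossed; it does not by itself prove that the node is in sync.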
Overlay information
The peers command returns information on the peers your node is connected to.
This list is the result of both inbound connections from other peers and outbound connections from this node to other peers.
$ bantu-core http-command 'peers'
Quorum Health
To help node operators monitor their quorum sets and maintain the health of the overall network, Bantu Core also provides metrics on other nodes in your quorum set. You should monitor them to make sure they're up and running, and that your quorum set is maintaining good overlap with the rest of the network.
Quorum set diagnostics
The quorum command lets you diagnose problems with the quorum set of the local node.
If you run:
$ bantu-core http-command 'quorum'
The output will look something like:
This output has two main sections: qset and transitive. The former describes the node and its quorum set; the latter describes the transitive closure of the node's quorum set.
Per-node Quorum-set Information
Entries to watch for in the qset section, which describes the node and its quorum set, are:
agree: the number of nodes in the quorum set that agree with this instance.
delayed: the nodes that are participating in consensus but seem to be behind.
disagree: the nodes that are participating but disagreed with this instance.
fail_at: the number of failed nodes that would cause this instance to halt.
fail_with: an example of such potential failure.
missing: the nodes that were missing during this consensus round.
value: the quorum set used by this node (t is the threshold expressed as a number of nodes).
In the example above, 6 nodes are functioning properly, one is down (sampl1), and the instance will fail if any two nodes still working (or one node and one inner-quorum-set) fail as well.
If a node is stuck in the Joining SCP state, this command lets you quickly find the reason:
too many validators are missing (down or without good connectivity). Solutions are:
  adjust your quorum set based on the nodes that are not missing
  try to get a better connectivity path to the missing validators
a network split is causing SCP to stall because of nodes that disagree. This would happen if there is a bug in SCP, the network does not have quorum intersection, or the disagreeing nodes are misbehaving (compromised, etc.).
Note that this node being unable to reach consensus does not mean that the network as a whole will be unable to reach consensus (and the opposite is also true: the network may fail because of a different set of validators failing).
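The troubleshooting steps for a stuck node can be sketched as a small triage function. This is a hedged Python illustration, not a Bantu Core tool. It assumes the qset section has been parsed into a dict and that delayed, disagree, and missing hold either a list of node names or null. The names diagnose_qset and min_fail_at are hypothetical, and the sample values are reconstructed from the prose description of the example (one node down, failure after two more), not copied from real output.

```python
def _as_list(value):
    """Treat a missing or null entry as an empty list."""
    return list(value) if value else []


def diagnose_qset(qset, min_fail_at=2):
    """Return findings that may explain why a node is stuck.

    `min_fail_at` is an illustrative safety margin, not a documented default.
    """
    findings = []

    disagree = _as_list(qset.get("disagree"))
    if disagree:
        findings.append(
            "disagreeing nodes (possible split or misbehavior): " + ", ".join(disagree)
        )

    missing = _as_list(qset.get("missing"))
    if missing:
        findings.append(
            "missing validators (check connectivity or adjust quorum set): "
            + ", ".join(missing)
        )

    delayed = _as_list(qset.get("delayed"))
    if delayed:
        findings.append("delayed nodes: " + ", ".join(delayed))

    fail_at = qset.get("fail_at")
    if fail_at is not None and fail_at < min_fail_at:
        findings.append(f"fail_at={fail_at} is below the chosen margin of {min_fail_at}")

    return findings


# Hypothetical qset section, for illustration only.
sample = {"agree": 6, "delayed": None, "disagree": None, "fail_at": 2, "missing": ["sampl1"]}

for finding in diagnose_qset(sample):
    print(finding)
```

Disagreeing nodes are reported first because, per the text above, they point to the more serious causes (a split or misbehaving validators).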
You can get a sense of the quorum set health of a different node using:
$ bantu-core http-command 'quorum?node=$sdf1'
or
$ bantu-core http-command 'quorum?node=@GABCDE'
Overall network health can be evaluated by walking through all nodes and looking at their health. Note that this is only an approximation, as remote nodes may not have received the same messages (in particular: missing for other nodes is not reliable).
Transitive Closure Summary Information
When showing quorum-set information about the local node, a summary of the transitive closure of the quorum set is also provided in the transitive field. This has several important sub-fields:
last_check_ledger: the last ledger in which the transitive closure was checked for quorum intersection. This will reset when the node boots and whenever a node in the transitive quorum changes its quorum set. It may lag behind the last-closed ledger by a few ledgers depending on the computational cost of checking quorum intersection.
node_count: the number of nodes in the transitive closure, which are considered when calculating quorum intersection.
intersection: whether or not the transitive closure enjoyed quorum intersection at the most recent check. This is of utmost importance in preventing network splits. It should always be true. If it is ever false, one or more nodes in the transitive closure of the quorum set is currently misconfigured, and the network is at risk of splitting. Corrective action should be taken immediately, for which two additional sub-fields will be present to help suggest remedies:
  last_good_ledger: this will note the last ledger for which the intersection field was evaluated as true; if some node reconfigured at or around that ledger, reverting that configuration change is the easiest corrective action to take.
  potential_split: this will contain a pair of lists of validator IDs, which is a potential pair of disjoint quorums allowed by the current configuration. In other words, a possible split in consensus allowed by the current configuration. This may help narrow down the cause of the misconfiguration: likely it involves too low a consensus threshold in one of the two potential quorums, and/or the absence of a mandatory trust relationship that would bridge the two.
critical: an "advance warning" field that lists nodes that could cause the network to fail to enjoy quorum intersection, if they were misconfigured sufficiently badly. In a healthy transitive network configuration, this field will be null. If it is non-null, then the network is essentially "one misconfiguration" (of the quorum sets of the listed nodes) away from no longer enjoying quorum intersection, and again, corrective action should be taken: careful adjustment to the quorum sets of nodes that depend on the listed nodes, typically to strengthen quorums that depend on them.
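The rules for the transitive block translate directly into an alerting check. The following Python sketch is one possible reading of those rules, assuming the block has been parsed into a dict with the sub-fields named above. The function name check_transitive, the severity strings, and the sample values are hypothetical.

```python
def check_transitive(transitive):
    """Return (severity, message) pairs for a parsed transitive block."""
    alerts = []

    intersection = transitive.get("intersection")
    if intersection is False:
        # The two extra sub-fields are only expected when intersection is false.
        alerts.append((
            "critical",
            "no quorum intersection; last good ledger: {}; potential split: {}".format(
                transitive.get("last_good_ledger"),
                transitive.get("potential_split"),
            ),
        ))
    elif intersection is None:
        alerts.append(("warning", "intersection has not been reported yet"))

    # A non-null critical field is an advance warning, not an outage.
    critical_nodes = transitive.get("critical")
    if critical_nodes is not None:
        alerts.append(("warning", "critical nodes listed: {}".format(critical_nodes)))

    return alerts


# Hypothetical blocks for illustration only (not real node output).
healthy = {"last_check_ledger": 1000, "node_count": 21, "intersection": True, "critical": None}
split = {
    "last_check_ledger": 1000,
    "node_count": 21,
    "intersection": False,
    "last_good_ledger": 990,
    "potential_split": [["A", "B"], ["C", "D"]],
    "critical": None,
}

print(check_transitive(healthy))
print(check_transitive(split))
```

Treating an absent intersection value as a warning rather than a failure is a design choice of this sketch, made because the check may not have run yet after boot.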
Detailed transitive quorum analysis
The quorum endpoint can also retrieve detailed information for the transitive quorum.
This is a format that's easier to process than what scp returns, as it doesn't contain all SCP messages.
$ bantu-core http-command 'quorum?transitive=true'
The output looks something like:
The output begins with the same summary information as in the transitive block of the non-transitive query (if queried for the local node), but also includes a nodes array that represents a walk of the transitive quorum centered on the query node.
Fields are:
node: the identity of the validator
distance: how far that node is from the root node (i.e. how many quorum set hops)
heard: the latest ledger sequence number that this node voted on
qset: the node's quorum set
status: one of behind|tracking|ahead (compared to the root node) or missing|unknown (when there are no recent SCP messages for that node)
value_id: a unique ID for what the node is voting for (lets you quickly tell whether nodes are voting for the same thing)
value: what the node is voting for
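Because value_id exists to make votes easy to compare, the nodes array can be summarized in a few lines. This Python sketch assumes the array has been parsed into a list of dicts with the fields listed above. The function name summarize_nodes and the sample entries are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_nodes(nodes):
    """Count nodes per status and group voting nodes by value_id."""
    status_counts = Counter(n.get("status", "unknown") for n in nodes)

    by_value = defaultdict(list)
    for n in nodes:
        # Nodes with no recent SCP messages have no vote to compare.
        if n.get("status", "unknown") in ("missing", "unknown"):
            continue
        by_value[n.get("value_id")].append(n.get("node"))

    return dict(status_counts), dict(by_value)


# Hypothetical entries for illustration only (not real node output).
sample_nodes = [
    {"node": "a", "distance": 0, "status": "tracking", "value_id": 1},
    {"node": "b", "distance": 1, "status": "tracking", "value_id": 1},
    {"node": "c", "distance": 1, "status": "behind", "value_id": 2},
    {"node": "d", "distance": 2, "status": "missing"},
]

counts, votes = summarize_nodes(sample_nodes)
print(counts)
print(votes)
```

More than one key in the second result means the nodes with recent messages are not all voting for the same value.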
Using Prometheus
Monitoring bantu-core using Prometheus is by far the simplest solution, especially if you already have a Prometheus server within your infrastructure. Prometheus is a free and open source time-series database with a simple yet incredibly powerful query language, PromQL. Prometheus is also tightly integrated with Grafana, so you can render complex visualisations with ease.
In order for Prometheus to scrape bantu-core application metrics, you will need to install the bantu-core-prometheus-exporter (apt-get install bantu-core-prometheus-exporter) and configure your Prometheus server to scrape this exporter (default port: 9473). On top of that, Grafana can be used to visualize metrics.
Install a Prometheus server within your infrastructure
Installing and configuring a Prometheus server is out of scope of this document; however, it is a fairly simple process: Prometheus is a single Go binary, which you can download from https://prometheus.io/docs/prometheus/latest/installation/.
Install the bantu-core-prometheus-exporter
The bantu-core-prometheus-exporter is an exporter that scrapes the bantu-core metrics endpoint (http://localhost:11626/metrics) and renders these metrics in the Prometheus text-based format, ready for Prometheus to scrape and store in its time-series database.
The exporter needs to be installed on every Bantu Core node you wish to monitor.
apt-get install bantu-core-prometheus-exporter
You will need to open up port 9473 between your Prometheus server and all your Bantu Core nodes for your Prometheus server to be able to scrape metrics.
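Before pointing Prometheus at a node, it can help to confirm that the exporter is reachable and serving the text-based format. The following Python sketch is a simplified illustration under stated assumptions. The /metrics path on port 9473 is assumed from common exporter convention and is not confirmed here. The parser covers only the basic "name value" and "name{labels} value" line shapes. The metric names in the sample are hypothetical and are not real exporter metrics.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text-format samples into a {series: value} dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled series: everything up to the last "}" is the series name.
            name, _, rest = line.rpartition("}")
            name += "}"
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            samples[name] = float(fields[0])
        except ValueError:
            continue  # ignore lines this simplified parser does not understand
    return samples


def fetch_metrics(host, port=9473, path="/metrics", timeout=5):
    """Fetch and parse metrics from an exporter (the path is an assumption)."""
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical exposition text for illustration only.
SAMPLE = """\
# HELP example_metric_total Hypothetical counter.
# TYPE example_metric_total counter
example_metric_total 42
example_labelled{network="test"} 1.5
"""

print(parse_metrics(SAMPLE))
```

Calling fetch_metrics with a node's hostname from the Prometheus server is a quick way to verify that port 9473 is open between the two machines.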
Point Prometheus to bantu-core-prometheus-exporter
Pointing your Prometheus instance to the exporter can be achieved by manually configuring a scrape job; however, depending on the number of hosts you need to monitor, this can quickly become unwieldy. Luckily, the process can also be automated using Prometheus' various "service discovery" plugins. For example, with AWS-hosted instances you can use the ec2_sd_config plugin.
Manual
Using Service Discovery (EC2)
Create Alerting Rules
Once Prometheus is scraping metrics, you can add alerting rules. Recommended rules are here (they require Prometheus 2.0 or later). Copy the rules to /etc/prometheus/bantu-core-alerting.rules on the Prometheus server and add the following to the Prometheus configuration file to include the file:
Rules are documented inline, and we strongly recommend that you review and verify all of them, as every environment is different.
Configure Notifications Using Alertmanager
Alertmanager is responsible for sending notifications. Installing and configuring an Alertmanager server is out of scope of this document; however, it is a fairly simple process. Official documentation is here.
All recommended alerting rules have a "severity" label:
critical: normally requires immediate attention. Critical alerts indicate an ongoing or very likely outage. We recommend that critical alerts notify administrators 24x7.
warning: normally can wait until working hours. Warnings indicate problems that likely do not have production impact but may lead to critical alerts or outages if left unhandled.
The following example Alertmanager configuration demonstrates how to send notifications using different methods based on the severity label:
In the above example, alerts with severity "critical" are sent to PagerDuty and warnings are sent to Slack.
Useful Exporters
You may find the below exporters useful for monitoring your infrastructure as they provide incredible insight into your operating system and database metrics. Installing and configuring these exporters is out of the scope of this document but should be relatively straightforward.
node_exporter can be used to track all operating system metrics.
postgresql_exporter can be used to monitor the local bantu-core database.
Visualize metrics using Grafana
Once you've configured Prometheus to scrape and store your bantu-core metrics, you will want a nice way to render this data for human consumption. Grafana offers the simplest and most effective way to achieve this. Installing Grafana is out of scope of this document but is a very simple process, especially when using the prebuilt apt packages.
We recommend that administrators import the following two dashboards into their Grafana deployments:
Bantu Core Monitoring - shows the most important metrics and node status, and tries to surface common problems. It's a good starting point for troubleshooting.
Bantu Core Full - shows a simple health summary as well as all metrics exposed by the bantu-core-prometheus-exporter. It's much more detailed than Bantu Core Monitoring and might be useful during in-depth troubleshooting.