Combining InfluxDB + Telegraf and Grafana for ESXi monitoring

With several VMs and complex network setups, there is nothing more satisfying than watching your system come to life in the form of a dashboard. Grafana is a beautiful dashboard tool with support for several data stores and complex queries. This will be a basic guide on how to set up system monitoring with Grafana.

First things first: my setup might differ from yours, but the guide should work unless you use Windows or an unusual Linux distribution. All of my VMs run Debian 8, but almost all commands should be identical on Ubuntu Server as well. I also use UFW on all of my machines, so my guide will reflect that. Commands that stand on a line of their own are meant to be run in the CLI.

In my setup I have four virtual machines (VMs):

  • Webserver - nginx - 192.168.1.42
  • Database - InfluxDB - 192.168.1.43
  • Plex - 192.168.1.40
  • Media - 192.168.1.41

Let's start by setting up InfluxDB on Database.


InfluxDB

"InfluxDB is a time series database built from the ground up to handle high write and query loads."

NOTE: As /u/ztherion pointed out, this guide was missing retention policies and the necessary users. This has now been resolved, and both are covered below.

Start by adding the official repo for InfluxDB:
sudo vim /etc/apt/sources.list

Add the following at the bottom of the list:
# InfluxDB Repository
deb https://repos.influxdata.com/debian jessie stable

Save the file and close it. Next, add the GPG signature:

curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -

Then install it with apt:

sudo apt update && sudo apt install influxdb -y

If you get an error message saying something about a missing driver for https, install the following and then run the command above again:

sudo apt install apt-transport-https

InfluxDB should now be installed on the system and almost ready to go. Add an exception for your local network to UFW, otherwise InfluxDB will not be able to receive any data.

sudo ufw allow from 192.168.1.0/24 to any port 8086
This allows clients only from the internal network 192.168.1.0/24. You should replace it with your own local network.

Next up, add the database which Telegraf will write data to in InfluxDB. This will create the database "telegraf" which will be used later. We will create it with a retention of 1 week but you can change it to whatever value you want.

  • influx -> This brings up the influx CLI
  • create database telegraf
  • CREATE RETENTION POLICY "oneweek" ON "telegraf" DURATION 1w REPLICATION 1 DEFAULT

Next, let's create some users: one with read access for Grafana and one with write access for Telegraf. Start by creating an admin account with full access.

CREATE USER "admin" WITH PASSWORD 'inputpasswordhere' WITH ALL PRIVILEGES

Next, let's create the users grafana and telegraf.

  • CREATE USER "grafana" WITH PASSWORD 'inputpasswordhere'
  • CREATE USER "telegraf" WITH PASSWORD 'inputpasswordhere'

The user grafana only needs read access to the database telegraf and the user telegraf only needs write access to the database telegraf.

  • GRANT READ ON "telegraf" TO "grafana"
  • GRANT WRITE ON "telegraf" TO "telegraf"
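
If you would rather script this than type it into the influx CLI, the statements above can be collected in one place. This is only a minimal sketch: it assumes the influx CLI is installed on the Database VM and that authentication is still disabled. setup_statements and apply_statements are hypothetical helper names, and the passwords are the same placeholders as above.

```shell
# Print the InfluxQL statements from this section, one per line.
# The passwords are placeholders; replace them before running anything.
setup_statements() {
    cat <<'EOF'
CREATE DATABASE telegraf
CREATE RETENTION POLICY "oneweek" ON "telegraf" DURATION 1w REPLICATION 1 DEFAULT
CREATE USER "admin" WITH PASSWORD 'inputpasswordhere' WITH ALL PRIVILEGES
CREATE USER "grafana" WITH PASSWORD 'inputpasswordhere'
CREATE USER "telegraf" WITH PASSWORD 'inputpasswordhere'
GRANT READ ON "telegraf" TO "grafana"
GRANT WRITE ON "telegraf" TO "telegraf"
EOF
}

# Feed each statement to the influx CLI (-execute runs one command and exits).
# stdin is redirected so influx cannot swallow the remaining statements.
apply_statements() {
    setup_statements | while IFS= read -r stmt; do
        influx -execute "$stmt" </dev/null
    done
}
```

Running setup_statements on its own only prints the statements, which is a safe way to review them. apply_statements sends them without credentials, so it has to run before authentication is switched on.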

Next we'll have to enable authentication, which is disabled by default, so go ahead and exit the influx CLI. Edit the file /etc/influxdb/influxdb.conf and find the section [http]. Now simply uncomment the value auth-enabled and set it to true.

### [http]
###
### Controls how the HTTP endpoints are configured. These are the primary
### mechanism for getting data into and out of InfluxDB.
###

 [http]
  # Determines whether HTTP endpoint is enabled.
  # enabled = true

  # The bind address used by the HTTP service.
  # bind-address = ":8086"

  # Determines whether HTTP authentication is enabled.
  auth-enabled = true

Restart InfluxDB (sudo systemctl restart influxdb) and it's ready to collect data from clients in your local network. You log in to the influx CLI with the following command:

influx -username admin -password yourpasswordhere

Onwards to Telegraf.


Telegraf

Telegraf gathers statistics from machines and sends them to a data store such as InfluxDB. You should install Telegraf on all the machines you want to gather data from.

Start by adding the official repo for Telegraf and adding the GPG signature. It's the same procedure as above, so follow that. Next up, update and install Telegraf:

sudo apt update && sudo apt install telegraf

Before we start the Telegraf service we will configure it. There is a default configuration file which we will use. A lot of its sections are commented out and you won't use nearly all of them, but I think it's nice to see what can be achieved with Telegraf. Telegraf will capture a lot of data by default with no special configuration, but you should take a look to see just what data you can gather.

sudo vim /etc/telegraf/telegraf.conf

Find the section [[outputs.influxdb]] and edit it to send data to your InfluxDB machine.

urls = ["http://192.168.1.43:8086"] # required

You'll also have to specify the InfluxDB user you created for Telegraf above. Edit the values username and password.

[[outputs.influxdb]]
  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["http://192.168.1.43:8086"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "telegraf" # required

  ## Retention policy to write to. Empty string writes to the default rp.
  retention_policy = ""
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  username = "telegraf"
  password = "inputpasswordhere"

Next find the section [[inputs.net]] and simply uncomment the first line where it says [[inputs.net]]. This is to collect network statistics which we will be using later with Grafana.

Save the file and close it. There is not much more to it if you just want to gather the basics about a system, such as cpu, disk, system, load and network. At the bottom of the guide there is a special section on using Telegraf with SNMP to gather data directly from ESXi.
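
Before restarting the service, it can be worth letting Telegraf do a dry run of the configuration. This is a minimal sketch, assuming the telegraf binary from the package above is on the PATH; check_telegraf_config is a hypothetical wrapper name. The --test flag gathers metrics once and prints them to stdout without sending anything to InfluxDB, and --input-filter limits the run to the named input.

```shell
# Dry-run a Telegraf configuration: collect once, print to stdout, send nothing.
# $1 = input plugin to test (default: net)
# $2 = config path (default: the location used by the Debian package)
check_telegraf_config() {
    input="${1:-net}"
    conf="${2:-/etc/telegraf/telegraf.conf}"
    if ! command -v telegraf >/dev/null 2>&1; then
        echo "telegraf not found on PATH" >&2
        return 127
    fi
    telegraf --config "$conf" --input-filter "$input" --test
}
```

Calling check_telegraf_config with no arguments should print a handful of net lines if the [[inputs.net]] section was uncommented correctly. A syntax error in the file shows up here instead of in the service log.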

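Restart the service (sudo systemctl restart telegraf) and you should be getting data in the database telegraf you created. Since authentication is now enabled, log in with your admin user to check it:

  • influx -username admin -password yourpasswordhere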
  • use telegraf
  • show measurements
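
The same check can be done over HTTP, which is also how Telegraf and Grafana talk to InfluxDB. This is a sketch assuming curl is installed and using the InfluxDB 1.x /query endpoint. influx_query is a hypothetical helper, and INFLUX_URL, INFLUX_USER and INFLUX_PASS are variable names made up for this example.

```shell
# Run one InfluxQL query against the HTTP API with basic authentication.
# $1 = database, $2 = query. Credentials are read from the environment.
influx_query() {
    curl -sG "${INFLUX_URL:-http://192.168.1.43:8086}/query" \
        -u "${INFLUX_USER}:${INFLUX_PASS}" \
        --data-urlencode "db=$1" \
        --data-urlencode "q=$2"
}
```

With INFLUX_USER and INFLUX_PASS exported for the grafana user, influx_query telegraf 'SHOW MEASUREMENTS' should return the measurement names as JSON. If it returns an authorization error instead, the user or the GRANT from the InfluxDB section is the first place to look.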

Next up, Grafana!


Grafana

Grafana takes your data and presents it in the form of beautiful graphs and statistics. Dashboards for everyone!

Start by adding the official Grafana repository along with the GPG signature.
sudo vim /etc/apt/sources.list

Add the following to the bottom of the file:
# Grafana
deb https://packagecloud.io/grafana/stable/debian/ jessie main

Save it and close the file. Now add the GPG signature:
curl -s https://packagecloud.io/gpg.key | sudo apt-key add -

Time to install Grafana!
sudo apt update && sudo apt install grafana

We're almost there, just reload systemd and enable the Grafana service so that it starts automatically on boot.
sudo systemctl daemon-reload
sudo systemctl enable grafana-server

Before starting Grafana, an exception needs to be added to UFW to allow access from the local network. You should change the command so that it allows your local network.
sudo ufw allow from 192.168.1.0/24 to any port 3000

Now just start Grafana:
sudo systemctl start grafana-server

Grafana should now be accessible at your server's ip.
"http://192.168.1.42:3000"

You should be greeted with the login page:

[Screenshot: Grafana login page]

The default credentials are admin:admin. When you log in you will be greeted with a guide that wants you to add a data source, so let's do it.

[Screenshot: adding the InfluxDB data source in Grafana]

As you can see, I've selected InfluxDB as the Type and named my data source Telegraf, but you can name it whatever you like. Under Http settings you should input your database's internal IP address along with the port 8086, like this: "http://192.168.1.43:8086". Under InfluxDB Details you have to specify which database you want to fetch data from in the field "Database"; if you did not name the database telegraf, enter your own name there. Also don't forget to input the user grafana and its password, which you created in the InfluxDB section to give Grafana read-only access.

Just add the data source and at the bottom it should say that everything is working.
[Screenshot: data source test succeeded]

Next, navigate up to the top left corner where the Grafana symbol is located and select Dashboards - new. Let the graphing begin!

We'll start slow by creating a few singlestat panels, like the one I use to show the uptime of all my VMs. Start by adding a row.
[Screenshot: adding a row]

You should get the option to choose a panel type, so select a singlestat panel. Next, click on the title of the panel called "Panel title" and a small submenu should appear; select Edit.
You should be presented with the following:
[Screenshot: singlestat panel in edit mode]

You can see that I have set the Panel data source to telegraf, which is our database from InfluxDB. I've also modified the SQL query to get the uptime from my host "ghost".
This is the raw SQL query:
SELECT mean("uptime") FROM "system" WHERE "host" = 'ghost' AND $timeFilter GROUP BY time($interval) fill(null)

As you can see, we want to get the value "uptime" from "system" where the host is "ghost". This is where you change the value to the hostname of the machine you want to display data from.
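
$timeFilter and $interval are placeholders that Grafana fills in before the query reaches InfluxDB, so the query cannot be pasted into the influx CLI as it stands. Here is a rough sketch for testing a panel query by hand. expand_query is a hypothetical helper, and the substituted forms (time > now() - 1h and 1m) are assumptions about a typical relative time range rather than exactly what Grafana generates.

```shell
# Replace Grafana's $timeFilter and $interval placeholders with fixed values.
# $1 = query text, $2 = look-back range (e.g. 1h), $3 = GROUP BY interval (e.g. 1m)
expand_query() {
    printf '%s\n' "$1" |
        sed -e 's/[$]timeFilter/time > now() - '"$2"'/' \
            -e 's/[$]interval/'"$3"'/'
}

# The quoted heredoc keeps the quotes and dollar signs in the query intact.
query=$(cat <<'EOF'
SELECT mean("uptime") FROM "system" WHERE "host" = 'ghost' AND $timeFilter GROUP BY time($interval) fill(null)
EOF
)
expand_query "$query" 1h 1m
```

The printed query can then be run in the influx CLI after use telegraf, which makes it easier to tell whether an empty panel is caused by the query or by missing data.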
If we move on to the next tab called "Options", there are a few settings which should be changed as well.

"Stat" should be changed to current because we want to display the current uptime. "Unit" is where you select how the value will be displayed. Telegraf reports uptime in seconds, so if we set the "Unit" to "duration(s)" we will get the uptime in hours, and eventually in days, weeks and so on. Other values require a bit of fiddling, but you should eventually get the right display for your data.

I've also changed the Coloring to display a green background whenever the VM has been up more than a second. This makes it so that if the VM shuts down it will display a red background.

Next, go to the tab General and just change the title of your panel to something like "host1 Uptime". When you're done, navigate up to the top left corner where there is a save icon. Click it and your panel should be on display on your dashboard.

If you want to stack several of these for all of your machines, simply hover to the left of the row where there are three small dots and select Add panel.

Next up, let's try adding a graph! Just follow the same steps as above, but this time select Graph instead of Singlestat.
Below you'll find exactly how I produce my network statistics. I cannot guarantee that the values are 100% correct, but this is at least what I use for displaying network statistics.

[Screenshot: graph query for bytes_sent]

After you've configured network statistics for "bytes_sent" you can duplicate the query and just change the value from "bytes_sent" to "bytes_recv". This makes it so you can display both sent and received traffic.

This is what it should look like after you've duplicated the query:
[Screenshot: graph with both bytes_sent and bytes_recv queries]

Next, navigate to the tab "Axes" and select the "Unit" bytes/sec. Now you should have a network graph displaying the peaks and lows of your network traffic. Awesome! However, I do one additional small tweak to make the graph a little bit clearer. Navigate to the tab "Display" and add a "Series override" so that bytes_recv is drawn on the negative Y axis.
[Screenshot: negative-y series override]

The values to the left will be inverted but if you hover with your mouse at any given point in the graph, the correct values will be displayed.

Now just change the title of your graph and save it. And there you have a really basic Grafana dashboard! Play around with different values and data to display and you should figure it out pretty quickly.


SNMP with ESXi

This is not meant to be an all-out definitive guide for SNMP, only what I've managed to get working. So far that is polling ESXi for its uptime and for the current workload on the CPU cores and threads.

First things first, you'll have to SSH in to your ESXi machine. After you've logged in to your machine, run the following commands:

esxcli system snmp set --communities YOUR_STRING_HERE
esxcli system snmp set --enable true

Replace YOUR_STRING_HERE with something you'll remember. Many times this is demonstrated as PUBLIC or PRIVATE but you should change it to something else.

After that you'll have to allow snmp in ESXi's firewall with the following commands:

esxcli network firewall ruleset set --ruleset-id snmp --allowed-all false
esxcli network firewall ruleset allowedip add --ruleset-id snmp --ip-address 192.168.1.0/24
esxcli network firewall ruleset set --ruleset-id snmp --enabled true

Make sure you change the ip range to your network as well as your subnet mask. Lastly just restart the SNMP daemon:

/etc/init.d/snmpd restart

Next we'll have to configure Telegraf to poll ESXi for data.

Here is my configuration for SNMP in Telegraf:

 
[[inputs.snmp]]
   agents = [ "192.168.1.5:161" ]
#   ## Timeout for each SNMP query.
   timeout = "5s"
#   ## Number of retries to attempt within timeout.
   retries = 3
#   ## SNMP version, values can be 1, 2, or 3
   version = 2 
#
#   ## SNMP community string.
   community = "YOUR_STRING_HERE"
#
#   ## The GETBULK max-repetitions parameter
   max_repetitions = 10
#
#   ## SNMPv3 auth parameters
#   sec_name = "victor"
#   auth_protocol = "md5"      # Values: "MD5", "SHA", ""
#   auth_password = ""
#   #sec_level = "authNoPriv"   # Values: "noAuthNoPriv", "authNoPriv", "authPriv"
#   #context_name = ""
#   #priv_protocol = ""         # Values: "DES", "AES", ""
#   #priv_password = ""
#
#   ## measurement name
   name = "system"
   [[inputs.snmp.field]]
     name = "esxi-uptime"
     oid = "iso.3.6.1.2.1.25.1.1.0"
   [[inputs.snmp.field]]
     name = "esxi-cpuload1"
     oid = ".1.3.6.1.2.1.25.3.3.1.2.1"
   [[inputs.snmp.field]]
     name = "esxi-cpuload2"
     oid = ".1.3.6.1.2.1.25.3.3.1.2.2"
   [[inputs.snmp.field]]
     name = "esxi-cpuload3"
     oid = ".1.3.6.1.2.1.25.3.3.1.2.3"
   [[inputs.snmp.field]]
     name = "esxi-cpuload4"
     oid = ".1.3.6.1.2.1.25.3.3.1.2.4"

Some explanations are in order. The field agents = [ "192.168.1.5:161" ] simply tells Telegraf that my ESXi host is at 192.168.1.5, that SNMP is reached on port 161, and that I want to poll data from this server.
Timeout and retries are self-explanatory. Under each [[inputs.snmp.field]] there is a name and an oid. The name is the field name under which the value is stored, and it can be whatever you like. The oid, however, identifies exactly which value is fetched from the ESXi host, so it has to be right.

I've managed to get the uptime and cpuload to work. The oid for uptime is: iso.3.6.1.2.1.25.1.1.0

The following oid is for each CPU core AND thread. It does not differentiate a core from a thread, so if your system, like mine, has 2 cores with hyperthreading, you will poll for 4 CPUs. The oid is as follows: .1.3.6.1.2.1.25.3.3.1.2 However, you'll have to add a number at the end of the string for each individual core and thread, like this:

1st core = .1.3.6.1.2.1.25.3.3.1.2.1
2nd core = .1.3.6.1.2.1.25.3.3.1.2.2
1st hyperthread = .1.3.6.1.2.1.25.3.3.1.2.3
2nd hyperthread = .1.3.6.1.2.1.25.3.3.1.2.4
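
With more cores the list gets long, so the field blocks can be generated instead of typed. This is a small sketch, assuming the indexes are consecutive and start at 1 as in the list above, which may not hold on every host. snmp_cpu_fields is a hypothetical helper name.

```shell
# Print one [[inputs.snmp.field]] block per logical CPU (cores and threads alike).
# $1 = number of logical CPUs to poll
snmp_cpu_fields() {
    base=".1.3.6.1.2.1.25.3.3.1.2"   # per-CPU load oid, as used in the config above
    i=1
    while [ "$i" -le "$1" ]; do
        printf '   [[inputs.snmp.field]]\n'
        printf '     name = "esxi-cpuload%s"\n' "$i"
        printf '     oid = "%s.%s"\n' "$base" "$i"
        i=$((i + 1))
    done
}

snmp_cpu_fields 4
```

The output can be pasted under [[inputs.snmp]] in telegraf.conf. To see which indexes your host really exposes, a walk of the base oid from the Telegraf machine (for example snmpwalk -v 2c -c YOUR_STRING_HERE 192.168.1.5 .1.3.6.1.2.1.25.3.3.1.2, if the net-snmp tools are installed) should list them.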

And that's pretty much it. Just restart Telegraf and head over to Grafana and you'll find the values which you specified as names above.

Here is the raw SQL query that I use:

SELECT mean("esxi-cpuload1") FROM "system" WHERE "host" = 'database' AND $timeFilter GROUP BY time(1m) fill(null)

I use GROUP BY time(1m) here, just so that I don't get spiky graphs. You can change it as you please. I set the unit to percent (0-100).

Here is the SQL query that I used for displaying the uptime for my ESXi machine:

SELECT mean("esxi-uptime") * 10 FROM "system" WHERE "host" = 'database' AND "agent_host" = '192.168.1.5' AND $timeFilter GROUP BY time($interval) fill(null)

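I set the unit to milliseconds here: SNMP reports uptime in hundredths of a second, and the * 10 in the query converts that to milliseconds.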
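
The uptime oid returns SNMP TimeTicks, which count in hundredths of a second, so multiplying by 10 gives milliseconds. Here is the same arithmetic as a tiny shell sketch, where ticks_to_ms is a hypothetical helper name:

```shell
# Convert SNMP TimeTicks (hundredths of a second) to milliseconds.
ticks_to_ms() {
    echo $(( $1 * 10 ))
}

# One day of uptime: 86400 s * 100 ticks/s = 8640000 ticks.
ticks_to_ms 8640000   # prints 86400000
```

If the panel shows an uptime that is off by a factor of 10 or 100, the multiplier and the selected unit most likely do not match.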

And at the end of this guide you should now have successfully set up a Grafana dashboard monitoring your network, uptime and cpuload from your ESXi host. This is just the tip of the iceberg of what can be done with Grafana, Telegraf and InfluxDB, but it should provide you with a solid understanding of monitoring with Grafana. This is all from me for now, but in the future there will be more guides, so stay tuned.

Victor