This is a living, community-editable document. We are believers in Cunningham's Law (relevant xkcd), so we value and deeply appreciate feedback, thoughts, rants (and even complaints.)
Please feel free to leave a comment or, if you're feeling particularly ambitious, go ahead and make some edits on GitHub and submit a pull request.
Nginx is an increasingly popular open-source HTTP and reverse proxy server that's known for its high concurrency, high performance, and low memory usage. It's become the second most popular public-facing web server (as of December 2014) among the top 1M busiest sites online, and it's pretty awesome.
But, like any long-running software (or a small child), it can get into trouble if left completely unattended.
This guide will take you through a series of recommendations and best practices for monitoring a production nginx deployment. It will make no recommendations as to how to monitor a small child.
While nginx's open source variant (nginx F/OSS, or "plain ol' nginx") is the most popular, a commercial version (NGINX Plus) is also available and offers load balancing, session persistence, advanced management, and finer-grained monitoring metrics. This guide uses "nginx" in the universal sense to refer to both versions; the metrics discussed here are available in both.
While there's a body of conventional wisdom and some basic material on the web, we found no definitive resource on the subject.
At Scalyr, our approach to monitoring is based on years of frontline operational experience. Good monitoring will help you detect problems more quickly, spot incipient / slow-burn issues before they become full-scale outages, and ultimately save you (and your users) headaches.
This guide can be read as-is and is designed to provide quick, actionable recommendations and best practices.
If you want to dig deeper, take a spin through Zen and the Art of System Monitoring. It will give you a systematic framework for thinking about system monitoring.
Next, read How to Set Alerts to get a deeper understanding of how to build intelligent alerts and set notification thresholds properly. The primary goal here is to minimize false alarms without missing real incidents.
Finally, take a look at our In-Depth Guide to NGINX Metrics. Think of it as an appendix to this guide: it examines the complete list of available nginx metrics, exactly what each one measures, and what it means.
We recommend a layered approach to monitoring, starting from the application layer and moving down through process, server, hosting provider, external services, and user activity. By monitoring the metrics listed here, you'll get good coverage for both active and incipient problems with your nginx site.
The essential job of nginx is to serve content to client devices. That content is delivered in response to requests from those clients, so it makes sense that the first metric we care about is requests per second - the rate at which nginx is doing its job.
Spikes in RPS can indicate benign events (increased customer activity) or malignant ones (a DDoS attack). A spike can also be connected to errors from upstream servers -- if nginx is load balancing a set of servers and those servers go down, nginx will return errors very quickly.
Drops in RPS, on the other hand, can be signs of network connectivity issues or saturation of an essential system resource like CPU or RAM.
Whatever the cause - significant changes in RPS are events you'll want to know about and investigate further.
Capture requests from ngx_http_stub_status_module.
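If you haven't already enabled the stub status page, a minimal sketch of the configuration looks like this (the listen address, port, and path here are our own choices, not requirements):

    # Hypothetical status endpoint, reachable only from localhost
    server {
        listen 127.0.0.1:8080;
        location /nginx_status {
            stub_status on;
            allow 127.0.0.1;
            deny all;
        }
    }

The page returns plain text along these lines (the numbers are illustrative):

    Active connections: 291
    server accepts handled requests
     16630948 16630948 31070465
    Reading: 6 Writing: 179 Waiting: 106

Note that requests is a cumulative counter, so requests per second is the difference between two samples divided by the sampling interval.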
Response time measures how quickly requests are being handled - one of the primary indicators of application performance. To capture it, add the $request_time variable to your nginx log configuration. This measures the elapsed time for nginx to receive the full client request, process the request, and transmit the response.
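For example, a log format that includes $request_time might look like the following (the format name "timed" and the $upstream_response_time addition are our own choices):

    # Hypothetical "timed" format: the standard combined format plus timings
    log_format timed '$remote_addr - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$http_user_agent" '
                     '$request_time $upstream_response_time';
    access_log /var/log/nginx/access.log timed;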
There is a (configurable) hard limit on the total number of connections that nginx can handle. It's important to know that limit and alert before it is reached.
The limit is equal to worker_connections * worker_processes in your nginx configuration. Once the limit is reached, connections will be dropped and users will see errors. Note that this limit includes all connections (from clients and to upstream servers).
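For instance, with the following (hypothetical) settings in nginx.conf, the ceiling would be 4 * 1024 = 4096 simultaneous connections:

    worker_processes  4;
    events {
        worker_connections  1024;
    }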
Capture Active connections from ngx_http_stub_status_module.

Capture accepts and handled, respectively, from ngx_http_stub_status_module. Under normal circumstances, these should be equal. Alert if handled falls below accepts - this means that nginx is dropping connections before completion and is an indication that a resource limit has been reached.
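As a quick sketch, you can compare the two counters directly from the status page (this assumes the hypothetical /nginx_status endpoint shown earlier):

    curl -s http://127.0.0.1:8080/nginx_status | awk '
        /^server accepts handled requests/ {
            getline                     # the counters are on the following line
            printf "accepts=%s handled=%s dropped=%d\n", $1, $2, $1 - $2
        }'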
Nginx accepts connections very quickly, but in extremely high-traffic situations, a connection backlog can still happen at the system level (a distinct bottleneck from the application-level connection handling described above). When this occurs, new connections will be refused.
Check netstat -s for the "SYNs to LISTEN sockets dropped" and "times the listen queue of a socket overflowed" values. The connection queue size can be increased by modifying the somaxconn and tcp_max_syn_backlog kernel variables. The details are beyond the scope of this article, but this kernel.org documentation contains more information.
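A sketch of both the check and the remedy (the queue sizes are illustrative; persist any changes in /etc/sysctl.conf):

    # Look for listen-queue overflows
    netstat -s | grep -i -E 'listen|SYNs'

    # Raise the queue limits at runtime (Linux sysctl names)
    sysctl -w net.core.somaxconn=4096
    sysctl -w net.ipv4.tcp_max_syn_backlog=4096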
Nginx logs the HTTP response code returned for each request, and this can be a rich source of information about the health of both nginx and your upstream servers.
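For example, a quick way to tally status codes from the access log (this assumes the default combined format, where the status code is the ninth whitespace-separated field):

    awk '{ codes[$9]++ } END { for (c in codes) print c, codes[c] }' /var/log/nginx/access.log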
Nginx uses a system file descriptor for each connection it handles - one for each client and one for each upstream connection. The number of simultaneous connections, therefore, cannot exceed the system's limit on open files. When there are no more file handles available, nginx will drop new connection requests.
The maximum number of open files is defined in several places:

- fs.file-max (or a close variation thereof, depending on your UNIX flavor) is a kernel variable that defines the system-wide maximum number of open files.
- /etc/security/limits.conf defines a nofile entry, which sets the maximum number of open files per user.
- ulimit -n reports the per-process limit.

You'll want to know each of these values, but the smallest one will be the limiting value. By monitoring the number of open files relative to this limiting value, you'll know how much headroom your system has for additional connections.
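A sketch of checking each limit and the current usage on a Linux box (the pgrep pattern assumes the master process is named nginx):

    sysctl fs.file-max                      # system-wide maximum
    grep nofile /etc/security/limits.conf   # per-user limits
    ulimit -n                               # limit for the current shell
    # Open file descriptors held by the nginx master process:
    ls /proc/$(pgrep -o -x nginx)/fd | wc -l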
A note on increasing the limit: if you change the system / user / process limit, you'll normally have to restart the nginx master process to apply the change. If for some reason you cannot allow a restart, nginx provides a workaround in the worker_rlimit_nofile directive. Set this directive to match the new system / user / process limit, do a configuration reload, and nginx will apply the new limit without needing a restart.
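For example, set a hypothetical value in the main (top-level) context of nginx.conf, then apply it with nginx -s reload:

    worker_rlimit_nofile  65535;    # should match the new system / user / process limit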
Nginx spawns multiple OS processes - a master process and a separate process for each worker. If you've enabled caching, there's an additional process for that too...("You get a process! And you get a process! And YOU get a process!") It's critical to keep an eye on these processes and make sure they stay healthy.
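A quick way to eyeball the process tree and states (Linux ps; the STAT column shows the state codes):

    ps -C nginx -o pid,ppid,stat,cmd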
Each process will be in one of the standard states: Running, Sleeping, Idle, Defunct/Zombie, or Stopped/Exited. Alert if the master process is not in the running state. Note that if a worker process terminates, the master process will restart it automatically, so it's less important to alert on failure of a worker process. (However, worker failure could be a symptom of a deeper issue, so you might still want to monitor it.)

Complete server monitoring is a topic beyond the scope of this guide, but we'll cover a few key high-level metrics.
First, the big simple picture - you want to monitor each box's status. Whether you're running nginx on dedicated hardware or a cloud-based VPS, your hosting provider will likely provide you with an overall status indication.
Load average is a metric provided by UNIX systems that summarizes CPU and disk usage into a single number. The number indicates how many threads or processes are attempting to do work at any given time. Loosely speaking, if the number is equal to the number of CPU cores plus the number of disk drives in your server, then the system is completely occupied. Higher numbers indicate that the system is overloaded and some threads are waiting. Lower numbers indicate spare capacity.
This is a loose rule, because it's possible that your CPUs are fully occupied while some disks are idle, or vice versa. As a rule of thumb, if your system doesn't use much disk I/O, then a "fully occupied" load average is equal to the number of CPU cores. If it does a lot of disk I/O, then add the number of disk drives to the number of CPU cores.
Most systems report three metrics: the 1-minute running load average, the 5-minute running average, and the 15-minute running average. Which of these you should monitor is a matter of taste, but if in doubt, try the 5-minute average. (The 1-minute average is sensitive to spikes; the 15-minute average can be slow to react to changes.)
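The load averages are easy to sample; for example (the numbers are illustrative):

    uptime
    # 17:02:01 up 34 days,  3:12,  1 user,  load average: 2.05, 1.87, 1.53
    #                                       (1-min,  5-min, 15-min)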
Most servers are not constrained by network bandwidth, but if you do happen to overuse your network, it can lead to confusing problems that are hard to track down to a root cause. To properly monitor network usage, you need to understand how much bandwidth you have available. If your server is primarily communicating over the public Internet, e.g. with browsers on your users' machines, then your effective limit is probably determined by your hosting provider -- not by the 1 Gbps Ethernet card in the server.
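On Linux, the cumulative byte counters are a reasonable starting point; sample them twice and divide the difference by the interval to get throughput (eth0 is a hypothetical interface name):

    ip -s link show eth0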
Running out of disk space is a common source of "mystery" system failures. Logs can grow rapidly during periods of high usage (or DDoS attacks) and if they eat up all available space, chaos can ensue.
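Checking is trivial; watching the growth trend over time is the important part:

    df -h                     # free space per filesystem
    du -sh /var/log/nginx     # how much of it the nginx logs are using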
Your servers live within a hosting provider, so you'll want to know about big connectivity or availability issues.
Don't make Hotmail's mistake! If users can't reach your site because of a DNS expiration, you're in trouble. Your domain registrar will probably start bugging you about renewal at the 60- and 30-day marks, but in case those alerts are missed (or go to someone else's email address), your monitoring tools can be a valuable backstop.
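One possible backstop is a scheduled whois check; the expiration field name varies by registrar and TLD, so treat this as a sketch (example.com is a placeholder):

    whois example.com | grep -i 'expir'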
An expired SSL certificate can also wreak havoc. Certificate authorities are not always as proactive with expiration alerts, so it's a good idea to use your monitoring platform as a backstop.
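You can extract the expiration date of a live certificate with openssl (example.com is a placeholder):

    echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null |
        openssl x509 -noout -enddate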
Monitoring availability of key pages and their corresponding user activity is the last and perhaps most important part of the picture. We suggest a two-pronged approach:
We recommend monitoring static pages (that only nginx responds to) as well as dynamic pages handled by upstream servers so you can better isolate any issues.
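An external probe can be as simple as a pair of curl checks run from cron or your monitoring service (the URLs and timeout are hypothetical):

    # Fail (non-zero exit) on HTTP >= 400 or if the response takes over 5 seconds
    curl -sf -o /dev/null --max-time 5 https://example.com/static/health.html
    curl -sf -o /dev/null --max-time 5 https://example.com/app/health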
One of the side effects of our layered approach is that, when there is a problem at a lower level of the system, alerts may trigger at multiple levels. For example, if your hosting provider fails, that will trigger alerts at the system, process, and application level. But by monitoring every level, you'll be able to home in on problems more quickly.
With these alerts in place, you can rest easy(er) knowing you've covered a majority of failure scenarios. Of course every environment is different, so your specific needs may vary (and batteries won't be included, etc. etc.)
Be sure to check out our In-Depth Guide to Nginx Metrics to learn about the complete list of measurable nginx metrics so you can customize your monitoring to suit.
Are there any metrics that you like to monitor that we've left out? Are there any nginx monitoring tips or tricks you think we should add? Let us know in the comments!
Scalyr offers a fast, powerful server monitoring, alerting, and log management service. We're a team of ex-Google engineers with years of DevOps experience and we know what it's like to be on call, get an alert, and not have enough information to track down the problem. So we decided to fix that. If you like what you've read on this site, you'll probably like using Scalyr.