I have a machine. I actually have many machines. This specific machine runs a daemon, let's call it Atlassian Confluence, just for fun. The daemon is run by a user, let's call it senhorcrap. This user is in a little jail, no ssh, no nothing.
I get a note from an enduser saying something to the effect of:
what the fark is going on with your farking website it is farking down.
I respond:
really?
Actually he said:
hey, i've got a 500 error and then a few minutes ago i saw this:
Service Temporarily Unavailable. The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
I responded with:
not again.
Not again. Before, I'd lazily restart the service and the world would be good. Not this time.
And a sick stack trace later...
Looking at the logs (we always look at the logs) I found it was an open file error. Too many of them were open. Interesting. Well. There are limits on these things to prevent system resource exhaustion.
# tail -f -n 30 /home/senhorcrap/senhorcraps-home/logs/atlassian-confluence.log
Then I tried to gracefully stop the service. Then I just killed it by sweeping it away with a script I have on this blog.
# killsomething
# ps aux |grep confluence
Not there. Nice.
# su - senhorcrap
# ulimit -aS | grep open
1024
# lsof |wc
2044
Uh.
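A side note on that comparison: the open files limit is per process, while lsof without arguments lists everything it can see. A minimal sketch for counting what one process holds, assuming Linux with /proc mounted. The count_fds name is mine, not a standard tool, and you need to be root or the owning user to read another process's fd directory.

```shell
#!/bin/sh
# Count the descriptors one process holds (Linux-only: relies on /proc).
# count_fds is a hypothetical helper name, not an existing command.
count_fds() {
    # $1 is a PID; every entry in /proc/PID/fd is one open descriptor.
    ls "/proc/$1/fd" | wc -l
}

# Example: the current shell's count next to its own soft limit.
echo "open: $(count_fds $$)  soft limit: $(ulimit -Sn)"
```

Swap $$ for the daemon's PID and you can watch the number creep toward the limit.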
As root... I edited /etc/security/limits.conf, /etc/pam.d/login, /etc/profile

/etc/security/limits.conf
senhorcrap soft nofile 1024
senhorcrap hard nofile 4096

/etc/pam.d/login
session required pam_limits.so

/etc/profile
if [ "$USER" = "senhorcrap" ]; then
    if [ "$SHELL" = "/bin/bash" ]; then
        ulimit -n 4096
    fi
fi

Once I su'd as senhorcrap I checked my limits, and all was well.
I started my daemon and the system was fine. Doing the "Windows refresh" wasn't required.
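If you want to see what a running process actually got, rather than what your login shell got, Linux exposes it in /proc/PID/limits. A small sketch, again assuming Linux. The nofile_limits name is made up for this post.

```shell
#!/bin/sh
# Print "soft hard" for the open-files limit of a running process.
# nofile_limits is a hypothetical helper; it only reads /proc (Linux).
nofile_limits() {
    # The line looks like: Max open files  1024  4096  files
    awk '/^Max open files/ { print $4, $5 }' "/proc/$1/limits"
}

# Example: the limits of the current shell.
nofile_limits $$
```

Point it at the daemon's PID after a restart to confirm the new values reached it.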
...
What I did not write was that it took me a goodly long time to figure out that I needed both the soft and hard limits in limits.conf for this to work. And that those limits, in my case anyway, had to be divisible by 1024. And that the new limits only take effect for processes (daemons) started after the change; thus I had to kill Confluence. But, we don't talk about that. A note before you start to sneeze bs all over me: YES, hard alone should work. In this instance, it did not. And I got mad. Well, only as mad as a sysadmin can be, which is not really mad at all.
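The "new processes only" part is easy to see for yourself. A rough sketch, assuming a POSIX shell whose ulimit accepts -S and -n (bash and dash both do). The value 64 is arbitrary. A limit changed in one process is inherited by its children and never leaks back to the parent or sideways to anything already running.

```shell
#!/bin/sh
# Limits are per-process and inherited when a child is started.
before=$(ulimit -Sn)

# Command substitution runs in a subshell: lower the soft limit there,
# then start a child shell, which inherits the lowered value.
child=$( ulimit -Sn 64; sh -c 'ulimit -Sn' )

after=$(ulimit -Sn)

echo "parent before: $before"
echo "child:         $child"
echo "parent after:  $after"
```

The parent's number does not move, which is why a daemon that is already running keeps its old limit until it is restarted.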