Monday, September 24, 2012

ulimits & confluence


I have a machine.  I actually have many machines.  This specific machine runs a daemon, let's call it Atlassian Confluence, just for fun.  The daemon is run by a user, let's call it senhorcrap. This user is in a little jail, no ssh, no nothing.

I get a note from an enduser saying something to the effect of:
what the fark is going on with your farking website it is farking down.

I respond:
really?

Actually he said:
hey, i've got a 500 error and then a few minutes ago i saw this:

Service Temporarily Unavailable. The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.


I responded with:
not again.

Not again. Before, I'd lazily restart the service and the world would be good. Not this time.

And a sick stack trace later...

Looking at the logs (we always look at the logs), I found an open-file error: too many files were open. Interesting.  Well.  There are limits on these things to prevent system resource exhaustion.
# tail -f -n 30 /home/senhorcrap/senhorcraps-home/logs/atlassian-confluence.log
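Rather than eyeballing the tail, a quick count of the offending lines can confirm the diagnosis. This is only a sketch: the function name count_fd_errors is made up for illustration, and the search string 'Too many open files' is an assumption about how the error is worded in the log (it is the usual system message for running out of descriptors, but check your own log).

```shell
#!/bin/sh
# Count log lines that mention descriptor exhaustion.
# $1 is the log file; the search string is an assumed wording, adjust as needed.
count_fd_errors() {
    grep -c 'Too many open files' "$1"
}

log=/home/senhorcrap/senhorcraps-home/logs/atlassian-confluence.log
if [ -r "$log" ]; then
    # grep -c exits non-zero when nothing matches, so fall back to 0.
    hits=$(count_fd_errors "$log") || hits=0
    echo "$hits matching lines in $log"
fi
```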
Then I tried to gracefully stop the service. Then I just killed it by sweeping it away with a script I have on this blog.
# killsomething
# ps aux |grep confluence
Not there.  Nice.
# su - senhorcrap
# ulimit -Sn
1024

# lsof | wc -l
2044
Uh.
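One caveat on comparing those two numbers: the nofile limit applies per process, while lsof lists open files across every process it can see. A rough way to compare like with like, assuming a Linux /proc filesystem, is to count one process's descriptors and read that same process's limit. The function names fd_count and fd_soft_limit are hypothetical helpers, and the current shell's PID stands in for the daemon's.

```shell
#!/bin/sh
# Count the file descriptors a process currently holds (Linux /proc assumed).
fd_count() {
    # $1 is a PID; each entry in /proc/PID/fd is one open descriptor.
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Read the soft "Max open files" limit the kernel enforces for that process.
fd_soft_limit() {
    # Columns in /proc/PID/limits: name (three words), soft, hard, units.
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

pid=$$   # the current shell, as a stand-in for the daemon's PID
echo "pid $pid: $(fd_count "$pid") open of $(fd_soft_limit "$pid") allowed"
```

Pointing this at the daemon's real PID (as root, or as the user that owns it) shows how close the process is to its ceiling.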

As root, I edited /etc/security/limits.conf, /etc/pam.d/login, and /etc/profile.

/etc/security/limits.conf:
senhorcrap      soft    nofile          1024
senhorcrap      hard    nofile          4096
/etc/pam.d/login:
session    required   pam_limits.so
/etc/profile:
if [ "$USER" = "senhorcrap" ]; then
        if [ "$SHELL" = "/bin/bash" ]; then
              # With neither -S nor -H, bash sets both the soft and hard limits.
              ulimit -n 4096
        fi
fi
Once I su'd as senhorcrap I checked my limits, and all was well.
I started my daemon and the system was fine. Doing the "Windows refresh" wasn't required.
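For the record, the "checked my limits" step can be scripted. A minimal sketch, run as the daemon's user after a fresh login; nofile_at_least is a hypothetical helper name, and 4096 is the value from the configuration in this post.

```shell
#!/bin/sh
# Succeed when the current soft nofile limit is at least $1.
nofile_at_least() {
    soft=$(ulimit -Sn)
    # "unlimited" satisfies any request; otherwise compare numerically.
    [ "$soft" = "unlimited" ] || [ "$soft" -ge "$1" ]
}

echo "soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
if nofile_at_least 4096; then
    echo "nofile limit is 4096 or higher"
else
    echo "nofile limit is still below 4096"
fi
```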

...

What I did not write was that it took me a goodly long time to figure out I needed both the soft and hard limits in limits.conf for this to work.  And that, as far as I could tell, those limits had to be divisible by 1024 (limits.conf documents no such rule, but that is what worked here).  And that the new limits only take effect for processes (daemons) started afterward; thus I had to kill confluence. But, we don't talk about that. A note before you start to sneeze bs all over me: YES, hard alone should work. In this instance, it did not. And I got mad. Well, only as mad as a sysadmin can be, which is not really mad at all.
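The "new processes only" part is easy to see for yourself, since a limit belongs to a process and is handed down to its children when they start. A small sketch, assuming the hard limit is at least 256 (the value 256 is arbitrary, picked for illustration):

```shell
#!/bin/sh
before=$(ulimit -Sn)

# Change the soft limit inside a subshell, then start a child from it.
# The child inherits the subshell's limit; the parent shell never sees it.
child=$( ulimit -Sn 256; sh -c 'ulimit -Sn' )

after=$(ulimit -Sn)
echo "parent before=$before child=$child parent after=$after"
```

The child reports the changed value while the parent's limit stays put, which is why a daemon that is already running keeps its old limit until it is restarted.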
