Debugging with RPM packages

Posted on June 30, 2017
Tags:

With one of our internal web applications based on Ruby on Rails, we’ve discovered a file descriptor leak in one of the delayed job worker processes. The worker leaked descriptors whenever it invoked a message being send to the message bus using qpid-messaging.

Since we’re using gems compiled as C++ and C extensions, in order to find the root cause, I used the packages provided through the package manager and gdb.

Big thanks to Dan Callaghan who walked me through most of the process and then found the leak in the C++ sources.

TL;DR;
  • identify the leaking descriptors and reproduce it with lsof
  • attach strace to the process and identify file descriptors which are not being closed
  • install debuginfo packages for all dependencies
  • use gdb to figure out what is going on

Reproducer

I’ve used lsof and a friend wrote a small script to quickly monitor the worker process. Looking at the opened files of the process revealed a long list which looked like half closed sockets. It turned out later, that it wasn’t the same problem since the sockets were created, but never bound/connected.

I was unable to reproduce the problem on my local development environment, but found away to do it on our staging environment which resembles production much closer. So whenever I invoked an action in the UI which resulted in a message being sent, I was able to see another file descriptor leak with lsof.

Strace the process

With the reproducer at hand, I started to strace the process:
# Note we're not filtering system calls with -e here.
# Weirdly CLOSE was not reported when just filtering network calls
strace -s 1000 -p  -o strace_output_log.strace
Dan helped me looking through the produced log output, which revealed that the system under investigation created a socket and called getpeername right after it, without binding it resulting in a leaked file descriptor.
10971 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 35
10971 getpeername(35, 0x7fffae712a90, [112]) = -1 ENOTCONN (Transport endpoint is not connected)

Install debuginfo packages and use gdb

In order to debug the system, we need debuginfo packages installed, otherwise you wont be able to step through the sources using gdb. When you attach gdb to the process it will tell you what packages it is missing, for example:
Missing separate debuginfos, use: debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64
You then go install those (be mindful that you the repositories configured e.g. section name fedora-debuginfo):
debuginfo-install qpid-proton-c-0.10-3.fc25.x86_64

and basically start debugging.

Our first suspicion was the qpid messaging library and we check if it’s invocation of getpeername was leaking the file descriptors. I’ve added a break point at the point of the source code we thought was suspicious and in a separate terminal used lsof to see which file descriptor number is leaked. For example:
# I've used a watch, which executes the lsof every 2 seconds by
# default. The grep filters some of the files I'm not interested in
$ watch "lsof -p  | grep -v REG"

The lsof output will show you the leaked file descriptor number in column 4 by default. With that you can check in gdb if the file descriptor being handled in the source code is the one which leaked.

Since that achieved no results, we used gdb to break on invocations of the getpeername identifier and used backtrace to pin point in the sources where the leak occurred.