Troubleshooting (Web) Application Performance Issues with Strace


What do you do when your Linux web app starts acting out? You troubleshoot it, of course! Me too. And today I would like to talk to you about one tool that works really well with all sorts of applications and application servers, ranging from the Apache HTTP Server running modules (mod_php, mod_perl, etc.) to applications running in a FastCGI setup, or even simple single-threaded programs.

The tool I’m talking about is strace, and if you’re not familiar with it, it will be my pleasure to introduce the two of you. I guess it’s worth noting that strace is not a universal troubleshooting tool that will help you nail down just about any sort of performance issue (yes, you still need to learn and use the other 27 performance troubleshooting tools), but when it helps, it helps big time and with little effort.

And the good news is that it’s really easy to tell upfront whether it’s going to be helpful or not.

When you should not be using strace

Let’s start by looking at anti-patterns:

  • Applications that are (or can easily be) profiled properly. I mean, if you have New Relic or some other type of application analytics, it should be a no-brainer: just look at the dashboard and you know what to focus on. Strace, on the other hand, can be used with applications that were built with no profiling in mind and that you don’t know well enough (or at all).
  • Strace as a general best-practice tool. While strace is a very handy quick-and-dirty tool that helps me a lot in my day-to-day life as a performance consultant (troubleshooting random systems I see for the first time), if you’re consistently working on the same project, do get proper application analytics set up. It will pay off in the long run, I promise.
  • Finally, strace is not about profiling where user CPU time is spent. You can usually spot whether this is the case pretty easily: just look at vmstat 1 or mpstat -P ALL 1 for a few seconds, and if you see that the CPUs are maxed out and most of the activity is in %usr, then you have to profile your application (or simply add more CPU power if your application got there due to increased load). A minimal version of this check is sketched right after this list.
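
Here is that quick CPU check as a minimal sketch (mpstat comes with the sysstat package):

# five one-second samples; if the "us" column (vmstat) or %usr (mpstat)
# is pegged across all CPUs, strace is the wrong tool; profile the app instead
vmstat 1 5
mpstat -P ALL 1 5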

When does strace really help

Most applications these days are built around microservices, meaning that besides the logic inside the application itself, there are lots of calls to other (micro)services. Sometimes these calls go to remote servers you don’t even have access to. Other times, they could be calls to application modules on different cloud instances, database requests, full-text searches, cache checks and so on.

This is where strace becomes really handy.

So how do you know you have a problem strace can help you with? Well, say your application normally runs 20-30 processes and all of a sudden the number of processes jumps through the roof, often reaching the hard limits you’ve set (e.g. MaxClients) and, as a consequence, consuming large amounts of memory.
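
A couple of one-liners I like to keep running in another terminal while this is going on (a sketch; adjust httpd to whatever your server processes are called):

# how many worker processes are there right now?
watch -n1 'ps -C httpd --no-headers | wc -l'

# total resident memory used by all of them
ps -C httpd -o rss= | awk '{n++; s+=$1} END {printf "%d processes, %.0f MB RSS\n", n, s/1024}'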

You look at the CPU stats, and it’s clearly not even close to being saturated. That’s a good sign that some remote service is blocking your application.

(It could also be blocked on a number of internal resources, such as I/O calls, mutexes, semaphores and whatnot, and strace is good at identifying those too.)

So what is that strace anyway?

In a nutshell, strace is a system call and signal tracer. You attach it to a running program and all of a sudden you can see all the connect(), poll(), read(), write() and many other system calls that the program is making. And most importantly, strace can tell you how long it took to return from any given call (as well as the arguments that were used).
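
If you have never used it before, the quickest way to get a feel for the output is to trace a throwaway command first (a sketch, assuming curl is installed; any command will do):

# -T appends the time spent in each syscall, -e trace=network keeps only the network calls
strace -T -e trace=network curl -s -o /dev/null http://example.com/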

I’ve used strace successfully to troubleshoot all sorts of weirdnesses, from applications running hundreds of database queries instead of one to API calls nobody knew existed.

And I want to show you one such case I was troubleshooting just a couple of weeks ago.

Before we go on though, I’d like to mention two important things about strace:

  1. Strace can (and often will) slow down your application, which is why it’s all the more important to check that the CPU is not already saturated before you run strace.
  2. Sometimes (although very rarely) strace can leave your program in a stopped/traced state (T state in ps), so you may need to unfreeze it yourself (see the sketch below).

I haven’t had this happen for the last 2 or 3 years though, so whatever was causing it is probably fixed on most systems by now.
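
If you ever do hit it, the cure is simply to send the stopped processes SIGCONT; something along these lines (a sketch, assuming the target is httpd):

# resume any httpd processes left in T (stopped/traced) state
ps -C httpd -o pid=,stat= | awk '$2 ~ /^T/ {print $1}' | xargs -r kill -CONT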

Where do we start

We start by identifying the processes that we need to trace. This is the easy part; it’s what alerted us in the first place, after all.

Say it’s Apache (as was the case in my last session), so we start by attaching to the PID of the main Apache process (the one started by the “root” user):

# ps auwx | grep http
apache    6247  7.4  0.1 776404 202836 ?       S    05:03   0:37 /usr/sbin/httpd -k start
apache    6344  8.2  0.1 766936 195480 ?       S    05:04   0:36 /usr/sbin/httpd -k start
...
root     22540  0.0  0.0 613424 81804 ?        Ss   Jan25   0:58 /usr/sbin/httpd -k start

# strace -p 22540
Process 22540 attached
select(0, NULL, NULL, NULL, {0, 913767}) = 0 (Timeout)
wait4(-1, 0x7fff4338463c, WNOHANG|WSTOPPED, NULL) = 0
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2be96bcab0) = 8672
wait4(-1, 0x7fff4338463c, WNOHANG|WSTOPPED, NULL) = 0
select(0, NULL, NULL, NULL, {1, 0})     = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=6247, si_status=0, si_utime=3603, si_stime=195} ---
select(0, NULL, NULL, NULL, {0, 664861}) = 0 (Timeout)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 6247
...

Hm, not very helpful, is it?

The thing is, in the case of the prefork MPM (often used with non-thread-safe modules such as mod_php), tracing the main process is useless, as it only forks child processes that do the actual work.
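
(If you only cared about the workers that already exist, you could attach to all of them at once; recent strace versions accept a whitespace-separated list of PIDs, as in the sketch below, but that misses any children forked after you attach.)

# attach to every httpd process that exists right now (won't follow future children)
strace -p "$(pidof httpd)"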

Luckily, instead of tracing the main process only (or only the child processes), we can ask strace to trace child processes as they are created:

# strace -fp 22540
Process 22540 attached
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
wait4(-1, 0x7fff4338463c, WNOHANG|WSTOPPED, NULL) = 0
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
wait4(-1, 0x7fff4338463c, WNOHANG|WSTOPPED, NULL) = 0
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2be96bcab0) = 9380
Process 9380 attached
[pid  9380] set_robust_list(0x7f2be96bcac0, 24 <unfinished ...>
[pid 22540] wait4(-1,  <unfinished ...>
[pid  9380] <... set_robust_list resumed> ) = 0
[pid 22540] <... wait4 resumed> 0x7fff4338463c, WNOHANG|WSTOPPED, NULL) = 0
[pid  9380] rt_sigaction(SIGHUP, {0x7f2be9712e90, [], SA_RESTORER|SA_INTERRUPT, 0x7f2be7fb0790},  <unfinished ...>
[pid 22540] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[pid  9380] <... rt_sigaction resumed> {0x7f2be9712060, [HUP USR1], SA_RESTORER, 0x7f2be7fb0790}, 8) = 0
[pid  9380] rt_sigaction(SIGTERM, {0x7f2be9712e90, [], SA_RESTORER|SA_INTERRUPT, 0x7f2be7fb0790}, {0x7f2be9712030, [], SA_RESTORER, 0x7f2be7fb0790}, 8) = 0
[pid  9380] rt_sigaction(SIGUSR1, {0x7f2be97126f0, [], SA_RESTORER|SA_INTERRUPT, 0x7f2be7fb0790}, {0x7f2be9712060, [HUP USR1], SA_RESTORER, 0x7f2be7fb0790}, 8) = 0
[pid  9380] geteuid()                   = 0
[pid  9380] setgid(48)                  = 0
[pid  9380] open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 20
[pid  9380] read(20, "65536\n", 31)     = 6
[pid  9380] close(20)                   = 0
...

That’s better. Still, data from different processes is now mixed in a single stream and there’s no timing information.

Therefore, let’s create a folder and ask strace to store the data for each process in a separate file (-ff) named tr1.PID (-o tr1). Additionally, we want Unix timestamps with microsecond resolution (-ttt) and the time spent in each syscall (-T). Finally, to get a bit more visibility into the strings being passed around, let’s increase the printed length (-s 100).

Here’s what our final command looks like:


# strace -tttT -s 100 -ff -o tr1 -p <PID>

For the customer I was working with, this generated a couple of gigabytes of data within 10 minutes. If the issue you are seeing is not as intermittent, you may only need to run strace for a few minutes to capture enough data; I had to wait a while until the issue occurred.
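
If disk space is a concern, one way to shrink the output considerably (at the cost of losing everything else) is to restrict tracing to the syscall classes you care about, e.g. network and file-descriptor activity:

# same as before, but only record network- and descriptor-related syscalls
strace -tttT -s 100 -ff -o tr1 -e trace=network,desc -p <PID>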

Now, let’s look at a small sample of the data we’re getting:

1453386823.452132 connect(4, {sa_family=AF_INET, sin_port=htons(9306), sin_addr=inet_addr("192.168.91.137")}, 16) = 0 <0.000155>
1453386823.452346 setsockopt(4, SOL_SOCKET, SO_RCVTIMEO, "\2003\341\1\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 <0.000048>
1453386823.452441 setsockopt(4, SOL_SOCKET, SO_SNDTIMEO, "\2003\341\1\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0 <0.000050>
1453386823.452534 setsockopt(4, SOL_IP, IP_TOS, [8], 4) = 0 <0.000020>
1453386823.452582 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 <0.000019>
1453386823.452636 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 <0.000023>
1453386823.452698 read(4, "?\0\0\0\n2.0.2-dev (r2919)\0\1...", 16384) = 67 <0.000015>
1453386823.452751 write(4, "R\0\0\1\217\242...", 86) = 86 <0.000025>

So every line has a wall-clock timestamp with microsecond resolution, followed by a system call (syscall) with all its arguments, the result, and the actual time spent in the syscall in angle brackets.

In this example you can see the program connect()s to a Sphinx server at 192.168.91.137:9306 on file descriptor 4, sets various socket options with setsockopt(), and then starts read()ing and write()ing.

It’s worth mentioning that the time between the timestamps won’t necessarily match the time spent in syscalls. In fact, it will almost never match, as your program will be executing user code between the system calls. Plus, it takes some time before the process is scheduled to run again, especially if the CPU is rather busy.
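
Since the timestamps are just numbers, you can also use them to spot big gaps between consecutive syscalls, which (minus the previous call’s own duration) is roughly the time spent in user code. A sketch, run against a single per-process file:

# flag points where the next syscall started more than 0.5s after the previous one
awk '{if (prev && $1 - prev > 0.5) printf "%.3fs gap before: %s\n", $1 - prev, $0; prev = $1}' tr1.PID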

Analysing the data

Now it’s time to make sense of all the data strace has captured. What exactly are we looking for? Essentially, syscalls that took a long time to return. Here’s a handy awk one-liner that helps us find the offenders (in this case, calls taking more than 0.5s; note that the three-argument match() requires GNU awk):

awk '{match($0, /<([0-9]+.[0-9]+)>$/, a)} a[1]>0.5 {print $0}' FILE

So in order to find the files you are interested in, you can run this loop:

for f in tr1.*; do
  n=$(awk '{match($0, /<([0-9]+.[0-9]+)>$/, a)} a[1]>0.5 {print $0}' $f | wc -l)
  [ $n -gt 0 ] && echo "$f - $n"
done

# And get an output like:
tr1.1029 - 18
tr1.1034 - 22
tr1.1035 - 4
tr1.1043 - 24

Alternatively, we can get the offending calls straight away:

for f in tr1.*; do
  echo $f ":"
  awk '{match($0, /<([0-9]+.[0-9]+)>$/, a)} a[1]>0.5 {print $0}' $f
done

# With an output like:
tr1.1021 :
tr1.1025 :
tr1.1029 :
1453386646.638896 read(23, "\2\0\0\1!..."..., 16384) = 1448 <6.759897>
1453386663.204526 read(23, "\2\0\0\1!..."..., 16384) = 1448 <1.055826>
1453386688.662411 read(23, "\2\0\0\1\..."..., 16384) = 2896 <3.869555>
1453386790.492792 epoll_wait(32, {{EPOLLIN, {u32=538913112, u64=140398724852056}}}, 2, 60000) = 1 <1.471661>
1453386821.039989 epoll_wait(29, {{EPOLLIN, {u32=533641080, u64=140398719580024}}}, 2, 60000) = 1 <1.394888>
1453386827.980080 epoll_wait(30, {{EPOLLIN, {u32=539741032, u64=140398725679976}}}, 2, 60000) = 1 <1.229154>
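
If you would rather see the single worst calls ranked across all files, a small variation of the same one-liner does the trick (again GNU awk):

# top 20 slowest syscalls across all trace files, slowest first
awk '{match($0, /<([0-9]+\.[0-9]+)>$/, a)} a[1] > 0.5 {print a[1], FILENAME ":", $0}' tr1.* | sort -rn | head -20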

So now we finally have something that can help us understand what is going on. The issue you see above is slow reads from the Sphinx search daemon which, as I discovered later, were caused by completely saturated disks on the Sphinx server. This was already a well-known issue for the customer.

This, on the other hand, was a completely different story:

1453387583.344145 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 24 <0.000014>
1453387583.344255 ioctl(24, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fff77cfdd90) = -1 EINVAL (Invalid argument) <0.000008>
1453387583.344289 lseek(24, 0, SEEK_CUR)  = -1 ESPIPE (Illegal seek) <0.000010>
1453387583.344317 ioctl(24, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fff77cfdd90) = -1 EINVAL (Invalid argument) <0.000009>
1453387583.344343 lseek(24, 0, SEEK_CUR)  = -1 ESPIPE (Illegal seek) <0.000008>
1453387583.344367 fcntl(24, F_SETFD, FD_CLOEXEC) = 0 <0.000009>
1453387583.344419 fcntl(24, F_GETFL)      = 0x2 (flags O_RDWR) <0.000008>
1453387583.344444 fcntl(24, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000008>
1453387583.344479 connect(24, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("REMOTE_IP")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000020>
1453387583.344560 select(32, NULL, [24], [24], {180, 0}) = 1 (left {179, 964827}) <0.035201>
1453387583.379835 getsockopt(24, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000012>
1453387583.379903 fcntl(24, F_GETFL)      = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000012>
1453387583.379940 fcntl(24, F_SETFL, O_RDWR) = 0 <0.000011>
1453387583.380058 fcntl(24, F_GETFL)      = 0x2 (flags O_RDWR) <0.000014>
1453387583.380098 fcntl(24, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000011>
1453387583.380352 write(24, "GET /remotecall.json?q=query-string"..., 159) = 159 <0.000021>
1453387583.380446 select(32, [24], NULL, NULL, {180, 0}) = 1 (in [24], left {175, 672759}) <4.327342>
1453387587.707993 read(24, "HTTP/1.1 200 OK\r\nCache-Control: "..., 1024) = 1024 <0.000027>

The issue I discovered here was a remote call to a server nobody on the team even knew existed (and it took a while to locate it!). Apparently everyone who had worked on (or knew about) this part of the application had left the company. And so did the know-how (and the know-what).

Let’s review this strace sample together.

When you’re investigating, you naturally read it the other way around: you start from the bottom (the syscall that’s taking too long for your taste) and work your way back up to understand what it was all about.

In line 17, you can see a select() call that took 4.33s to return. Right before it, in line 16, there’s an HTTP request being written to file descriptor 24 (the first argument to write()). So the last task for us is to find out what fd 24 is all about.

And it doesn’t take very long to get to line 9, where we find the connection attempt and the server to which the connection is made. (The call “fails” with EINPROGRESS because the socket is non-blocking, so it is followed by select() and getsockopt() calls to determine whether connect() completed successfully.)
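
A small trick that speeds up this walk back up the trace is to grep the file for the descriptor itself, so you see its whole lifetime in one go (a sketch; fd numbers get reused, so keep an eye on the timestamps). If your strace is recent enough, the -yy option will also annotate socket descriptors with the peer address inline.

# everything that happened on fd 24 in this trace file
grep -nE '\(24[,)]|= 24 <' tr1.PID | less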

A few quick ab checks against http://REMOTE_IP/remotecall.json?q=query-string confirmed that the service was flaky. And it only took a few minutes to get it fixed now that we knew what to fix. Without strace, nobody would have guessed it, since nobody even knew about this server or this API call.

Final notes

Before we wrap up, let me give you a few quick tips that would have saved me hours when I first started using strace back in 2004 (and had no clue about Linux system calls, the kernel and many other internals):

  1. Some calls are bound to take a long time, so before you dig too deep, make sure you’re looking at the right thing. Good examples are epoll_wait() and poll() calls that could simply be waiting for HTTP requests to come in.
  2. Some slow syscalls are symptoms rather than the root cause, e.g. semop() taking a long time could mean a lot of contention between many processes rather than an issue with memory or the semaphores themselves.
  3. Documentation for the calls is super easy to find; you don’t even need Google. In fact, it’s best not to use Google if you want to learn how a specific call works on your system. For example, to learn more about semop(), just run “man 2 semop” on your system.
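
One more mode worth knowing before you dive into full traces: strace -c doesn’t print individual calls at all, but collects a per-syscall summary of counts, errors and total time, which can hint at where to look first (run it only briefly, since the slowdown mentioned earlier applies here too):

# attach for a short while, then Ctrl-C to get the per-syscall summary
strace -c -f -p <PID>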

Well, have fun playing with strace. In the meantime, I need to figure out how to get the same results with tcpdump rather than strace. Sure, strace is nice in that it shows you local issues too (e.g. slow disk reads), but for microservice-based applications, tcpdump plus something like pt-tcp-model could be a good, less intrusive way to reach very similar conclusions.

If you know how to do it with tcpdump, I’m all ears!




