Tracing and Troubleshooting NGINX, OpenResty, and Their Backends ☺{{#author|yichun@openresty.com}}☺ {{#author|Yichun Zhang (@agentzh)}} {{img src="images/openresty2.png" width="128" height="128"}} {{#date|2018.10}} ---- Your nginx workers are eating up too much {{#ci|CPU}}? Even when the traffic is light? ---- {{img src="images/top-nginx-100-cpu.gif" width="1062" height="848"}} ---- Your nginx workers can {{#ci|never}} max out a CPU core? No matter how hard you try? ---- {{img src="images/nginx-cpu-not-100.gif" width="1074" height="850"}} ---- Your nginx workers are taking up too much {{#ci|memory}}? ---- {{img src="images/nginx-big-res-mem.gif" width="1060" height="840"}} ---- Your nginx workers keep {{#ci|eating}} memory? ---- {{img src="images/nginx-mem-grow.gif" width="1074" height="850"}} ---- Seeing your nginx workers {{#ci|crash}}? ---- [alert] 118955#118955: worker process 118956 exited on signal 11 (core dumped) ---- [414307.007494] nginx[67628]: segfault at 42409000 ip 00007f962f781120 sp 00007ffcd79d6c50 error 4 in ... ---- Seeing weird nginx upstream {{#ci|timeout}} errors? ---- [error] 19670#19670: *145 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: foo.com, ... ---- Percentage of the requests served within a certain time (ms) 50% 5 66% 5 75% 6 80% 6 90% 7 95% 7 98% 7 99% 7 100% {{#x|1008}} (longest request) ---- Stop guessing; let's trace it! ---- Every running process {{#ci|is}} a database. Every running system {{#ci|is}} a database. ---- {{img src="images/query-db.png" width="782" height="222"}} ---- We need {{#ci|debuginfo}} to make sense of the binary world. 
---- {{img src="images/dwarf-guide-new.jpg" width="899" height="721"}} ---- {{#cm|# CentOS}} debuginfo-install hello {{#cm|# Fedora}} dnf debuginfo-install hello {{#cm|# Ubuntu/Debian}} apt-get install hello-dbgsym apt-get install hello-dbg ---- Challenges • Hard to reproduce • Only reproducible online • No disruption to online services while debugging • Binaries are usually heavily optimized by compilers ---- Requirements • Noninvasive, with no side effects in the target processes at all • No add-ons or plugins required in the target process space • No restart or reload required for the target processes • No recompilation of the target software • Low overhead • Postmortem debugging • No cooperation required from the target software • Safe from bugs in the tracing tools themselves ---- Common Debugging/Tracing Tools • GDB • SystemTap • DTrace • BCC/eBPF • Mozilla rr • VMware VProbes • Linux perf ---- Common solutions & causes ---- some_tool -p PID ---- some_tool -p PID -p PID2 -p PID3 ... ---- # whole system some_tool ---- # trace all processes run from an executable some_tool --exe PATH ---- CPU is too high? ➥ {{#ci|CPU}} flame graphs! 
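----
The idea behind a CPU flame graph: sample the call stack at a fixed rate, fold identical backtraces together, and size each frame by its sample count. A minimal sketch of the folding step (the function name and the sample data are hypothetical; real stacks would come from a sampling profiler such as perf or SystemTap):

```python
from collections import Counter

def fold_stacks(samples):
    # Collapse identical backtraces (root-first frame lists) into the
    # "folded" lines consumed by flamegraph.pl: "root;child;leaf count".
    counts = Counter(";".join(stack) for stack in samples)
    return ["%s %d" % (bt, n) for bt, n in sorted(counts.items())]

# Hypothetical stack samples, as if captured at a fixed interval:
samples = [
    ["main", "ngx_http_request_handler", "ngx_http_gzip_body_filter", "deflate"],
    ["main", "ngx_http_request_handler", "ngx_http_gzip_body_filter", "deflate"],
    ["main", "ngx_process_events_and_timers"],
]
for line in fold_stacks(samples):
    print(line)
```

The wider a frame ends up in the rendered graph, the more samples (and thus CPU time) its subtree accounts for.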
---- {{img src="images/on-cpu-sketch2.png" width="600" height="280"}} ---- upstream backend { server 127.0.0.1; keepalive 10; } server { server_name "foo.com"; listen 1234; location / { gzip on; gzip_comp_level 9; proxy_http_version 1.1; proxy_set_header Connection ""; proxy_pass http://backend; } } ---- {{img src="images/gzip-hot9-grow.gif" width="1366" height="592"}} ---- {{img src="images/nginx-gzip9.gif" width="1366" height="592"}} ---- gzip_comp_level 1; ---- {{img src="images/gzip-hot1.gif" width="1366" height="592"}} ---- {{img src="images/proxy-short-conns.gif" width="1366" height="628"}} ---- upstream backend { server 127.0.0.1; keepalive 10; keepalive_requests 100; } server { server_name "foo.com"; listen 1234; location / { proxy_http_version 1.1; proxy_set_header Connection ""; proxy_pass http://backend; } } ---- {{img src="images/proxy-long-conn.gif" width="1366" height="558"}} ---- {{img src="images/proxy-cache-no-open-file-cache.gif" width="1366" height="608"}} ---- {{#kw|open_file_cache}} max=300000 inactive=1d; ---- {{img src="images/proxy-with-open-file-cache.gif" width="1366" height="608"}} ---- {{#ci|Lua}}-land CPU Flame Graphs ---- {{img src="images/lua-pow-cpu.gif" width="1366" height="158"}} ---- {{#ci|Perl}}-land CPU Flame Graphs ---- {{img src="images/perl-fan-cpu.gif" width="1270" height="812"}} ---- {{#ci|Python}}-land CPU Flame Graphs ---- {{img src="images/py-hello-cpu.gif" width="1366" height="410"}} ---- PHP/Ruby/Erlang/NodeJS CPU Flame Graphs... ---- {{img src="images/redis-cpu.gif" width="1366" height="412"}} ---- • Spin-locks • Crazily backtracking regexes • Recompiling regexes • Heavy computations (image processing, data compression, encryption) • Dynamic allocations and garbage collection • Excessive system calls (connect, gettimeofday, ...) • Interpreted execution instead of running JIT-compiled code ---- CPU is always too low? ➥ {{#ci|off-CPU}} flame graphs! 
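----
An off-CPU flame graph sizes each backtrace by the time threads spend blocked rather than by sample counts. A toy sketch of the accounting (the event list and function are hypothetical; a real tool would hook the kernel's sched_switch tracepoint and capture the backtrace at sched-out):

```python
# Hypothetical scheduler events: (timestamp_us, event, tid, backtrace).
# In a real tool these would come from kernel sched_switch tracepoints.
events = [
    (100, "sched_out", 1, "main;read_file;sys_read"),
    (150, "sched_out", 2, "main;epoll_wait"),
    (900, "sched_in", 1, None),
    (950, "sched_in", 2, None),
]

def off_cpu_times(events):
    # Sum the time each backtrace spends blocked (off-CPU): remember the
    # timestamp and stack at sched-out, charge the elapsed time at sched-in.
    blocked_since = {}   # tid -> (timestamp, backtrace at sched-out)
    totals = {}          # backtrace -> total off-CPU microseconds
    for ts, ev, tid, bt in events:
        if ev == "sched_out":
            blocked_since[tid] = (ts, bt)
        elif tid in blocked_since:
            t0, bt0 = blocked_since.pop(tid)
            totals[bt0] = totals.get(bt0, 0) + (ts - t0)
    return totals

print(off_cpu_times(events))
# {'main;read_file;sys_read': 800, 'main;epoll_wait': 800}
```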
---- {{img src="images/off-cpu-sketch2.png" width="600" height="225"}} ---- {{img src="images/perfect-nginx-off-cpu.gif" width="1366" height="246"}} ---- location / { content_by_lua_block { local function foo() local f = assert(io.popen("whoami")) local data = f:read("*a") f:close() ngx.print(data) end foo() } } ---- {{img src="images/blocking-io-popen.gif" width="1366" height="462"}} ---- location / { content_by_lua_block { -- based on the nonblocking ngx.pipe API: local shell = require "resty.shell" local function foo() local timeout = 1000  -- ms local ok, stdout, stderr, reason, status = shell.run('whoami', nil, timeout, 128) ngx.print(stdout) end foo() } } ---- {{img src="images/ngx-pipe-off-cpu.gif" width="1366" height="244"}} ---- • Blocking I/O operations (disk I/O, blocking network I/O) • Locks and semaphores • Overloaded systems with heavy process preemption ---- ...when off-CPU flame graphs look like CPU flame graphs ---- Disk I/O is going crazy? ➥ VFS latency & data volume flame graphs ---- Memory footprint is too big? ➥ Memory usage analyses in the process space! 
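----
One way to break a big footprint down is to walk the allocator's live objects and aggregate sizes per object type, which is what the GC-walking tools on the following slides print. A toy aggregation over hypothetical (type, size) pairs (the function and the input data are made up for illustration):

```python
def summarize(objs):
    # Aggregate (type, size) pairs into per-type count/max/avg/min/sum,
    # mirroring the per-type lines printed by tools like lj-gc-objs.
    stats = {}
    for typ, size in objs:
        s = stats.setdefault(typ, {"n": 0, "max": size, "min": size, "sum": 0})
        s["n"] += 1
        s["sum"] += size
        s["max"] = max(s["max"], size)
        s["min"] = min(s["min"], size)
    # Biggest total consumers first:
    return ["%d %s objects: max=%d, avg=%d, min=%d, sum=%d"
            % (s["n"], t, s["max"], s["sum"] // s["n"], s["min"], s["sum"])
            for t, s in sorted(stats.items(), key=lambda kv: -kv[1]["sum"])]

# Hypothetical live objects (type, size in bytes):
for line in summarize([("table", 6176), ("table", 32),
                       ("string", 47), ("string", 18)]):
    print(line)
```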
---- $ lj-gc-objs.sxx -x 5686 Start tracing 5686 (/opt/nginx/sbin/nginx) main machine code area size: 65536 bytes C callback machine code size: 4096 bytes GC total size: 922648 bytes GC state: sweep 4713 table objects: max=6176, avg=90, min=32, sum=428600 (in bytes) 3341 string objects: max=2965, avg=47, min=18, sum=159305 (in bytes) 677 function objects: max=144, avg=29, min=20, sum=20224 (in bytes) 563 userdata objects: max=8895, avg=82, min=24, sum=46698 (in bytes) 306 proto objects: max=34571, avg=541, min=78, sum=165557 (in bytes) 287 upvalue objects: max=24, avg=24, min=24, sum=6888 (in bytes) 102 trace objects: max=928, avg=337, min=160, sum=34468 (in bytes) 8 cdata objects: max=24, avg=17, min=16, sum=136 (in bytes) 7 thread objects: max=1648, avg=1493, min=568, sum=10456 (in bytes) JIT state size: 6920 bytes global state tmpbuf size: 2948 bytes C type state size: 2520 bytes My GC walker detected for total 922648 bytes. 5782 microseconds elapsed in the probe handler. ---- (gdb) lgcstat 15172 str objects: max=2956, avg = 51, min=18, sum=779126 987 upval objects: max=24, avg = 24, min=24, sum=23688 104 thread objects: max=1648, avg = 1622, min=528, sum=168784 431 proto objects: max=226274, avg = 2234, min=78, sum=963196 952 func objects: max=144, avg = 30, min=20, sum=28900 446 trace objects: max=23400, avg = 1857, min=160, sum=828604 2965 cdata objects: max=4112, avg = 17, min=12, sum=51576 18961 tab objects: max=24608, avg = 207, min=32, sum=3943256 9 udata objects: max=176095, avg = 39313, min=32, sum=353822 sizeof strhash 65536 sizeof g->tmpbuf 512 sizeof ctype_state 8664 sizeof jit_state 53792 total sz 7274672 g->strnum 15172, g->gc.total 7274672 ---- (gdb) lgcpath 100000 tab path 000:[registry] ->Tab["_LOADED"] ->Tab["ffi"] ->Tab["gc"] ->cfunc ->env ->Tab sz:196640 (GCobj*)0x40784f58 ->END path 001:[registry] ->Tab[tv=0x4132e470] -> Tab sz:524328 (GCobj*)0x40783108 ->END ---- A catch in shared memory: demand paging ---- A catch in allocators: they 
may not release freed memory to the OS. • Nginx's memory pool never frees small blocks • sbrk/brk never jumps over used blocks to free space • glibc does not free mmap'd memory in some cases ---- Memory usage is growing forever? ➥ Memory leak flame graphs ---- Filter out the noise ---- Some requests taking too long to respond? ➥ Latency decomposition over the request's lifetime ---- # assuming an nginx worker process's pid is 27327 {{#v|$}} ./samples/ngx-single-req-latency.sxx -x 27327 Start tracing process 27327 (/opt/nginx/sbin/nginx)... POST /api_json total: 143596us, accept() ~ header-read: 43048us, rewrite: 8us, pre-access: 7us, access: 6us, content: 100507us upstream: connect=29us, time-to-first-byte=99157us, read=103us ---- Connection timeout? ➥ Check the TCP state for a socket fd. ---- sock = sockfd_lookup(fd) $*sock := @cast(sock, "socket", "kernel") printf(", sock->state:%d", $*sock->state) state = $*sock->sk->__sk_common->skc_state printf(", sock->sk->sk_state:%d (%s)\n", state, tcp_sockstate_str(state)) ---- Got core dumps? ➥ Analyze core dumps with GDB tools ---- {{#v|(gdb)}} {{#x|lvmst}} current VM state: C code from interpreted Lua ---- {{#v|(gdb)}} {{#x|lbt full}} C:ngx_http_lua_socket_tcp_receive @/home/agentzh/git/lua-resty-mysql/lib/resty/mysql.lua:191 local "self": table (0x40f181a8) local "sock": table (0x40f181b0) @/home/agentzh/git/lua-resty-mysql/lib/resty/mysql.lua:530 ... ---- Mozilla rr: records and replays {{#ci|nondeterminism}} ---- Intel CPU bugs? ---- Behavioral bugs? 
➥ Function call tracing ➥ Dynamic metrics ---- -> C:lj_cf_package_require("jit") at test.lua:1 -> C:lj_cf_jit_off() at test.lua:1 -> test.lua:3(3, 5, 6, 7) at test.lua:11 <- test.lua:3 return 3, 5, 6, 7 at test.lua:4 -> test.lua:7("a", 3, 5, 6, 7) at test.lua:11 -> C:lj_cf_select("#", "a", 3, 5, 6, 7) at test.lua:8 ---- -> Foo::Bar::Baz::bar@a.pl:8 (32, 3.140000, "world", \"hello") Foo::Bar::Baz::bar@a.pl:17 -> Foo::Bar::Baz::foo@a.pl:4 (32, 3.140000, "world", \"hello") Foo::Bar::Baz::foo@a.pl:8 Foo::Bar::Baz::bar@a.pl:17 <- Foo::Bar::Baz::foo@a.pl:4 ret [], 33, undef, "world!" in void ctx <- Foo::Bar::Baz::bar@a.pl:8 in void ctx -> Foo::Bar::Baz::blah@a.pl:12 () Foo::Bar::Baz::blah@a.pl:19 <- Foo::Bar::Baz::blah@a.pl:12 ret 1012 in void ctx ---- -> bar@a.py:4 (5, 'hello', 3, 7.100000, dog=32, cat=True) File "a.py", line 4 in bar File "a.py", line 7 in <module> -> foo@a.py:1 (5, 'hello') File "a.py", line 1 in foo File "a.py", line 5 in bar File "a.py", line 7 in <module> <- foo@a.py:2 ret ([1, 'hi'], 6, None, 'hello!', False, None, 3.140000) <- bar@a.py:5 ret ([1, 'hi'], 6, None, 'hello!', False, None, 3.140000) ---- RDBMS query plan flame graphs ---- {{img src="images/pg-execnode.gif" width="1366" height="198"}} ---- {{img src="images/pg-execnode-vfs.gif" width="1366" height="172"}} ---- SystemTap ---- {{img src="images/stap-uprobes.png" width="972" height="524"}} ---- Pros: • Provides a powerful scripting language with loops, conditionals, user-defined functions, etc. • Multiple backends: kernel, dyninst, eBPF • The C-like scripting language is good for debugging C programs. • Powerful DWARF debuginfo integration (with some limitations, though) • Very fast runtime performance once compiled and loaded. Cons: • Minimal type system with only strings, longs (64-bit signed integers), arrays, and aggregates. Very limited tracee C type support in certain expressions. 
• Slow startup (requires building a kernel module with the gcc toolchain) • Cannot analyze core dumps. • Not portable; Linux only. • No floating-point number support (patches already tested) ---- DTrace ---- {{img src="images/dtrace.png" width="1062" height="784"}} ---- Pros: • Provides a C-like scripting language. • Provides a rich set of C-like data types. • Full debuginfo support in the kernel (userland probing still seems to require manual type declarations). • Fast tracer program building and loading, thanks to the in-kernel VM. • Supports multiple operating systems like Solaris, FreeBSD, and Mac OS X, with varying levels of compatibility. Cons: • The scripting language lacks looping constructs. • Linux support is not yet good enough. ---- BCC/eBPF ---- {{img src="images/bcc.png" width="992" height="764"}} ---- Pros: • Provides the C language syntax (but still requires the user to write Python or Lua wrappers to make the tools complete). Cons: • Almost no DWARF debuginfo support: you are on your own to play with binary addresses and offsets, or must define all the C types used by the target programs yourself. • Requires the heavyweight LLVM toolchain to compile. • No looping allowed unless the loops can be unrolled at compile time. • No user-defined function support unless everything is inlined. • The eBPF VM in the Linux kernel is very restrictive, preventing complex tracing tools from running. ---- GDB ---- {{img src="images/gdb-py.png" width="1012" height="644"}} ---- Pros: • Runs almost everywhere. • Provides full scripting support via Python, Guile, and other general-purpose languages. • Full C type computation for the target programs. • Can analyze both live and dead processes (i.e., core dumps). Cons: • Using Python to write C-level logic is horrible. • No native function return probe support; requires emulation. • Does not follow C implicit type conversion rules like integer promotion. 
• Runs very slowly (more than 10x slower than SystemTap for complex tools) • Based on the ptrace system call, which is fragile and error-prone. ---- Perf ---- Pros: • Lightweight and fast. • Gets simple things done quickly. • Good DWARF debuginfo support. Cons: • Lacks a scripting language and often outputs a lot of data for later processing. Very limited in programmability. • Linux only. ---- A {{#ci|universal}} debugging language? ➥ The Y language (or ylang) ---- {{img src="images/ylang-backends.png" width="707" height="698"}} ---- {{#cm|✓}} Full C language syntax, with loops, macros, etc. {{#cm|✓}} With scripting language elements like @array, %hash, aggregates, built-in strings, etc. {{#cm|✓}} Full C type checking using DWARF info. {{#cm|✓}} Strictly follows the target program's C compiler behaviors (standard or nonstandard). {{#cm|✓}} Emits code for different tracing toolchains (GDB Python and SystemTap are supported; more are coming). ---- {{#kw|#include}} "lj21.y" {{#kw|long}} MAX_SAMPLES = 3000; {{#kw|_agg}} %bts{_str}; {{#kw|long}} samples = 0; {{#kw|_probe}} _timer.profile { {{#kw|_str}} bt = lj_dump_bt2(); %bts{bt} <<< 1; {{#kw|if}} (++samples >= MAX_SAMPLES) {{#kw|_exit}}(); } {{#kw|_probe}} _end { {{#kw|_foreach}} %bts -> {{#kw|_str}} bt, {{#kw|_agg}} agg { {{#kw|printf}}("%s\t%d\n", bt, {{#kw|_count}}(agg)); } } ---- WARNING: len > 512: comparison between signed and unsigned integer expressions (have 'size_t' and 'int') at lj21.y line 86 ---- {{#cm|// file lj21.y}} {{#kw|#define}} "lj21_macros.y" ... 
{{#kw|return}} tabref(L->env); ---- {{#cm|// file lj21_macros.y, generated by the dw2macros.pl tool}} {{#cm|/* lj_obj.h, line 498 */}} {{#kw|#ifndef}} tabref {{#kw|#define}} tabref(r) (&gcref((r))->tab) {{#kw|#endif}} {{#cm|/* lj_obj.h, line 75 */}} {{#kw|#ifndef}} gcref {{#kw|#define}} gcref(r) ((GCobj *)(uintptr_t)(r).gcptr32) {{#kw|#endif}} ---- {{#cm|# file lj21.py, generated by the ylang compiler}} {{#kw|return}} ((stash["var197_L"]['env']['gcptr32']).cast( gdb.lookup_type('long unsigned int'))).cast( gdb.lookup_type('union GCobj').pointer())['tab'].address ---- {{#cm|// file lj21.stp, generated by the ylang compiler}} ... {{#kw|return}} (&({{#kw|@cast}}({{#kw|@cast}}(var194_L, "struct lua_State", "/opt/openresty/luajit/bin/luajit")->env->gcptr32, "union GCobj", "/opt/openresty/luajit/bin/luajit")->tab)); ---- Virtualization & Tracing ---- • Docker • Qemu/KVM • User Mode Linux • VirtualBox • VMware ---- {{#x|☺}} OpenResty Trace ---- {{img src="images/or-trace-arch.png" width="972" height="630"}} ---- When to use which tools, to get which info, to reach which conclusions? ---- {{#cm|✓}} Pattern matching {{#cm|✓}} Machine reasoning {{#cm|✓}} Knowledge base {{#cm|✓}} Fact & rule discovery ---- {{https://github.com/openresty/openresty-gdb-utils}} {{https://github.com/openresty/stapxx}} {{https://github.com/openresty/openresty-systemtap-toolkit}} ---- ☺ {{#c|Thank you!}} ☻ yichun@openresty.com