The OpenResty EC2 Test Cluster
The OpenResty EC2 test cluster (we'll call it the "test cluster" from now on) is used by OpenResty core developers to run most of OpenResty's software components' test suites in various test modes on Amazon EC2. The latest official test report can always be browsed on the qa.openresty.org website.
Over the years, this tool chain has successfully captured various obscure regressions in new NGINX cores, new LuaJIT cores, as well as our own components.
At the highest level, it works like this:
1. The user spawns the EC2 test cluster using her AWS API credentials locally in a terminal by invoking the command-line tool, dispatcher, provided by the opsboy GitHub repository. The user specifies the test running time constraint and the test tasks.
2. The dispatcher command-line tool spawns one or more EC2 VM instances according to the test tasks and the time constraint specified by the user. It tries to distribute the test jobs evenly across a minimum number of EC2 VM instances while meeting the total running time constraint on a best-effort basis. The VM instances keep running; upon job completion or job failure, they automatically upload the raw test run logs as tarballs to the server qa-data.openresty.org. See the Spawn the test cluster section for more details.
3. The user waits for a period of time approximately equal to the time constraint she specified, until all the EC2 VM instances are terminated (by checking the EC2 console, for example).
4. After all the VM instances are terminated, the user invokes the make pull command under the misc/logs/ directory of her local clone of the opsboy repository. This command automatically downloads the raw test run logs as tarballs from the server qa-data.openresty.org into a local sub-directory.
5. The user can now run the command make gen under the misc/logs/ directory to generate a new HTML report under the html/ sub-directory. This command also prints out the newly unpacked test run logs under ortest/. Sometimes the user may want to check out the raw test run log data in those directories directly.
6. The user can open the html/index.html file locally with her favorite web browser to view the HTML version of the test report. For the best results, however, one should set up a local nginx server to avoid the browsers' security constraints on opening local web pages directly.
Because AWS EC2 is not a free service (though it offers free-tier instances to new customers), it is important to use some tricks to minimize the monthly bill.
To prepare for a new OpenResty release, all projects should be tested in all test modes using the (new) version of the nginx core that the new OpenResty release is going to use. But usually we first only run ./dispatcher with the t target alone, that is, testing all projects using the "normal test mode" (which is explained in detail in the Test job names section below). If there are big problems with the opsboy script or with some projects themselves, we can then fix them quickly, without paying too much money for too many EC2 instances. If t is green, we then try tv and so on, one by one, or a few modes at a time. The easiest way to waste money is to run all test modes for all projects, only to find a silly mistake somewhere.
For security reasons, not everyone can directly perform the workflow explained above. The user must have access to our EC2 AMI images with names like ortest-x64-2019-02-06. Our AMI images always follow this naming convention: the date suffix indicates on which date the image was created, and the x64 infix indicates the CPU architecture (we also have i386 images). The user can request permission by sending emails to Yichun Zhang.
The user may also prepare her own AMI images if she wants to, but she must make them work with the run-node script in the ops GitHub repository.
Also, the user needs to set up the Python-based AWS CLI tool chain locally, using the user's own AWS access key and secret access key. Make sure the aws command-line tool is visible via your PATH system environment variable so that the run-node tool can invoke it.
Then clone the opsboy GitHub repository like this:
git clone git://github.com/openresty/opsboy.git
If you are a committer of that repository, change the clone URL accordingly.
Then always switch the current working directory to the misc/logs/ directory of this local clone.
Finally, you need to prepare a shell script called sync (the name is important!) with content like the one below in the current working directory:

rsync -av qa-data.openresty.org:~/ortest/ ortest/ || exit 1
mkdir -p ortest/
cd ortest/ || exit 1
for tar in *.tar.gz; do
    # assumed: each tarball unpacks into a directory of the same base name
    log=${tar%.tar.gz}
    if [ ! -d "$log" ]; then
        tar -xzf "$tar" || exit 1
    fi
done
The user can change the server name qa-data.openresty.org in this script to her own server if her AMI images upload the test run log files somewhere else.
Spawning a test cluster on EC2 is easy with the dispatcher tool provided by opsboy. Usually we run the command like this:
./dispatcher -r -f -t 230 -a 'linux x86_64' t-lua-resty-redis to-lua-resty-websocket
The -f option passes the force=1 variable to the opsboy script for the EC2 test cluster, which effectively forces a clean rebuild of all the NGINX C modules (like lua-nginx-module). The -t 230 option constrains the total running time to 230 minutes. This is just a soft limit and the actual running time may exceed it a bit if the job dispatcher cannot find a better way. The -a 'linux x86_64' option specifies that we use the "Linux x86_64" architecture for this test run. Another supported architecture is linux i386. The supported architectures are defined internally, and each architecture name corresponds to an AMI image (by its ID). The -r option actually invokes the AWS CLI tool to spawn EC2 instances. Usually we first omit this option to do a "dry run", to see how many EC2 instances will be needed and how the test jobs will be distributed among them. Below is an example of a "dry run":
$ ./dispatcher -f -t 230 -a 'linux x86_64' t-lua-resty-redis ngx_lua
Requires at least 3 machines.
bucket 1: tl-ngx_lua thv-ngx_lua force=1 (228 min)
bucket 2: twv-ngx_lua tv-ngx_lua tr-ngx_lua to-ngx_lua force=1 (228 min)
bucket 3: trv-ngx_lua tw-ngx_lua t-ngx_lua th-ngx_lua t-lua-resty-redis
force=1 (183 min)
Here we will need 3 machines, each taking 228 or 183 minutes to complete.
At this point you may wonder how the dispatcher knows how long each job will run. It actually uses the metadata produced by previous job runs (generated by the make gen command). Initially, it simply assumes each job will take 5 minutes.
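The dispatcher's real packing logic lives in its Perl source; the following Python sketch is only an illustration of the idea (the greedy strategy and function names are my own, not the actual implementation): jobs with unknown durations default to 5 minutes, and jobs are packed into as few "buckets" (EC2 instances) as the soft time limit allows.

```python
# Illustrative greedy bin-packing of test jobs into EC2 "buckets".
# NOT the dispatcher's real algorithm -- just a sketch of the idea.

def pack_jobs(jobs, durations, limit, default=5):
    """Distribute jobs into buckets whose total time stays under `limit`."""
    # Packing the longest jobs first makes greedy packing behave well.
    jobs = sorted(jobs, key=lambda j: durations.get(j, default), reverse=True)
    buckets = []  # each bucket is [total_minutes, [job, ...]]
    for job in jobs:
        minutes = durations.get(job, default)
        # Put the job into the fullest bucket that still has room.
        candidates = [b for b in buckets if b[0] + minutes <= limit]
        if candidates:
            bucket = max(candidates, key=lambda b: b[0])
        else:
            bucket = [0, []]
            buckets.append(bucket)
        bucket[0] += minutes
        bucket[1].append(job)
    return buckets

durations = {"tv-ngx_lua": 120, "t-ngx_lua": 60, "th-ngx_lua": 50}
buckets = pack_jobs(["tv-ngx_lua", "t-ngx_lua", "th-ngx_lua",
                     "t-lua-resty-redis"], durations, limit=230)
print(len(buckets))  # -> 2
```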
The metadata generated by each run is stored in the ast.json file in the misc/logs/ directory. Usually it is the make gen command which automatically updates it by analyzing the raw test run log files. The HTML report is also rendered from this data file.
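The exact schema of ast.json is not documented here; assuming, purely for illustration, that the metadata maps test job names to their last observed running times in minutes, consuming it might look like:

```python
import json

# Hypothetical metadata in the spirit of misc/logs/ast.json; the real
# schema is produced by `make gen` and may well differ from this sketch.
raw = '{"t-ngx_lua": 61, "tv-ngx_lua": 118, "th-ngx_lua": 52}'
durations = json.loads(raw)

def estimated_minutes(job, durations, default=5):
    """Return the last observed running time, or 5 minutes for new jobs."""
    return durations.get(job, default)

print(estimated_minutes("tv-ngx_lua", durations))        # -> 118
print(estimated_minutes("t-lua-resty-redis", durations)) # -> 5
```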
The user can skip particular test jobs by passing the -e PATTERN option. This option takes a Perl regex, so it can be quite powerful. The regex always works on expanded test job names, like t-ngx_lua.
By default, the dispatcher tool uses a pre-defined EC2 instance type to run the Linux x86_64 architecture jobs and c3.large to run the i386 architecture jobs. It is also possible to change the EC2 instance type by passing the -i INSTANCE-TYPE option to the dispatcher tool.
The -c option is also occasionally used to enforce the use of lua-resty-core when running the Lua-related projects' tests (including ngx_lua). It also injects require('resty.core.base').set_string_buf_size(1) and require('resty.core.regex').set_buf_grow_ratio(1) into the init_by_lua* phase. It essentially passes the use_lua_resty_core=1 variable setting to the opsboy script.
By default, nginx 1.15.8 is used (though this default changes with new OpenResty releases). One can choose a different nginx core version by specifying the -v VERSION option of the ./dispatcher tool, like -v 1.13.6. Not all nginx version numbers will work; make sure there is an OpenResty version which uses that version of the nginx core.
You can see more options by typing
./dispatcher -h and/or by directly
looking at its Perl source code.
On the dispatcher command line, we use test job names like ngx_lua. They deserve a separate section to explain. Basically, the test job names follow the format <test-mode>-<project-name>, or a <test-mode> or <project-name> component alone.
The <test-mode> component can take the values t, tr, tw, tv, trv, twv, th, thv, tl, and to. They correspond to the test modes shown on the HTML test report, like the one on qa.openresty.org.
Still, we explain them one by one below:
- t: Normal test mode.
- tr: Mockeagain reading test mode.
- tw: Mockeagain writing test mode.
- tv: Valgrind test mode.
- trv: Combination of the tr and tv modes.
- twv: Combination of the tw and tv modes.
- th: HUP reload test mode.
- thv: Combination of the th and tv modes.
- tl: Leak check test mode.
- to: Builds with aggressive gcc optimization levels (-O3) and other options.
Except for to, all the test modes above are implemented by the Test::Nginx test scaffold. You can read this free e-book section to learn more details about these Test::Nginx test modes.
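As background, these modes correspond to toggles of the Test::Nginx scaffold. The environment variable names below come from the Test::Nginx documentation; the cluster's exact invocations may differ, and normally you would enable one mode at a time (they are shown together here only for illustration):

```shell
# How Test::Nginx-based test modes are typically toggled (a sketch, not
# the cluster's exact commands).
export TEST_NGINX_USE_VALGRIND=1   # tv: run nginx under valgrind
export TEST_NGINX_USE_HUP=1        # th: HUP reload test mode
export TEST_NGINX_CHECK_LEAK=1     # tl: leak check test mode
# The mockeagain modes (tr/tw) preload the mockeagain library instead:
#   LD_PRELOAD=/path/to/mockeagain.so MOCKEAGAIN=r prove -r t/  # reading
#   LD_PRELOAD=/path/to/mockeagain.so MOCKEAGAIN=w prove -r t/  # writing
```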
The <project-name> component is the target open source project name. NGINX modules follow a naming convention like ngx_set_misc, Lua libraries follow a naming convention like lua-resty-websocket, and so on. There are also some special project names. When in doubt, you can always copy the project names from a rendered HTML page, like the one on the qa.openresty.org site.
All the test modes and project names are internally defined in Perl scripts in the opsboy repository. Use of unknown test modes or project names will lead to errors in those scripts.
If the user specifies the test mode part alone, then all the projects will be tested under that test mode. For example, specifying t alone will result in all projects' test suites being run using the normal test mode, while to alone will lead to all projects being tested under the aggressive optimization build mode. Therefore, to run all the projects under all test modes, one can simply run a command like the one below:
./dispatcher -r -f -t 230 -a 'linux x86_64' t tr tw tv trv twv th thv tl
Similarly, if the user specifies a project name alone, then all test modes will be run for that particular project. For example,
./dispatcher -r -f -t 230 -a 'linux x86_64' ngx_lua
is equivalent to
./dispatcher -r -f -t 230 -a 'linux x86_64' t-ngx_lua tr-ngx_lua tw-ngx_lua \
tv-ngx_lua trv-ngx_lua twv-ngx_lua th-ngx_lua thv-ngx_lua tl-ngx_lua to-ngx_lua
The former is just much more concise and much less error-prone.
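The expansion rule above can be sketched as follows; the mode and project lists here are abbreviated for illustration, since the real definitions live in the opsboy repository's Perl scripts:

```python
# Sketch of how shorthand dispatcher job names expand.
MODES = ["t", "tr", "tw", "tv", "trv", "twv", "th", "thv", "tl", "to"]
PROJECTS = ["ngx_lua", "lua-resty-redis"]   # abbreviated for illustration

def expand(name):
    """Expand a dispatcher job name into full <test-mode>-<project> names."""
    if name in MODES:                       # mode alone: all projects
        return [f"{name}-{p}" for p in PROJECTS]
    if name in PROJECTS:                    # project alone: all modes
        return [f"{m}-{name}" for m in MODES]
    return [name]                           # already fully qualified

print(expand("ngx_lua")[:3])  # -> ['t-ngx_lua', 'tr-ngx_lua', 'tw-ngx_lua']
print(expand("t"))            # -> ['t-ngx_lua', 't-lua-resty-redis']
print(expand("tv-ngx_lua"))   # -> ['tv-ngx_lua']
```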
We maintain a white-list of expected test failures for particular test modes on particular architectures in the parse-logs Perl script. We can adjust existing white-list entries or add new entries by editing the Perl hash %white_list in that script file. Test results with expected failures only are rendered in orange in the HTML test report (test results without any failures are rendered in green).
Special care must be taken when white-listing test failures. The patterns should be as specific as possible, and they should only mark true false positives that are not easy to handle on the side of the test case itself. Overly permissive white-list entries would hide real problems, defeating the whole purpose of this test cluster. Don't add such entries unless there is no other way.
For valgrind false positives, we update the valgrind.suppress file in each project's repository instead of white-listing them in parse-logs. The same caveat applies to valgrind suppression rules: overly aggressive rules may hide true memory issues in the future.
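For reference, entries in a valgrind.suppress file use valgrind's standard suppression syntax. The entry below is a minimal sketch; the suppression name and frame names are purely illustrative:

```
{
   illustrative-false-positive
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   fun:ngx_alloc
   ...
}
```

The `...` frame wildcard and the specific `fun:` frames keep the rule narrow, per the caveat above.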
To add a new project to the test cluster, the user needs to patch several
files under the
misc/logs/ directory of the opsboy source tree. The easiest
way is to check how existing projects were specified:
$ grep websocket `ls -p |grep -v /|grep -v \~|grep -v json`
gen-report: 'lua-resty-websocket' => ['openresty', 'lua-resty-websocket'],
gen-user-data: 'resty-websocket', 'resty-lock',
index.tt: 'lua-resty-websocket' => 'github.com/openresty/lua-resty-websocket',
parse-logs: 'lua-resty-websocket' => [
We can see that lua-resty-websocket has entries in the files gen-report, gen-user-data, index.tt, and parse-logs. The last one, parse-logs, is not strictly necessary, since it is only for the white-list of expected failures (see the The expected test failure white-list section for details). So we can simply add similar entries for the new project, in the same way as lua-resty-websocket, to the same places (or similar places).
We will also need to update the samples/ortest-ec2.ob.tt2 file to set up the testing logic.
The VM environment initialization, project building, and test suite running in each EC2 VM instance are all done by a Perl script (openresty-tester.pl) which is automatically generated by the opsboy compiler tool chain from the specification file named samples/ortest-ec2.ob.tt2. The opsboy tool chain always initializes the system incrementally.
From time to time, we'll need to edit this file to accommodate new changes in either the operating system or the target projects to be tested.
To change this ortest-ec2.ob.tt2 file, we need to
- switch the current working directory to the root directory of our local clone of the opsboy repository,
- edit the samples/ortest-ec2.ob.tt2 file for our needs,
- run make test to update the samples/ortest-ec2.ob file using the Perl TT2 tool chain, generate a Perl script using the opsboy compiler, and finally run the Perl script in the dry-run mode, and
- commit the newly updated .tt2 and ortest-ec2.ob files to your own branch and create a GitHub pull request for review and merge.
Under the hood, the dispatcher script invokes gen-user-data to generate the final test job specification strings from the user's command-line arguments (see Test job names). This final test job specification is passed to the new EC2 VM instances as plain-text "user data". The dispatcher calls the run-node script, which in turn runs the aws tool provided by the AWS CLI tool chain to spawn the EC2 instances with the specified "user data" from our pre-defined EC2 AMI images. Our EC2 AMI image registers the openresty-tester script as a symlink under the VM's /etc/init.d/ directory, such that new changes to this file in the GitHub repository do not require updates to the AMI image. This openresty-tester script in turn calls the openresty-tester-wrapper script. The openresty-tester-wrapper script does the heavy lifting of running the openresty-tester.pl script, which is automatically generated from the ortest-ec2.ob.tt2 script template file using the Perl TT2 template tool chain and the opsboy compiler tool chain. Finally, upon completion, it invokes the ~/ortest-upload script bundled in the AMI image to upload the test run log file tarball to the remote server (by default, qa-data.openresty.org).
There are plans to migrate this tool chain over to the upcoming OpenResty Ops and OpenResty CI platforms developed by OpenResty Inc., so that it would be much easier to use, extend, and maintain.
This document was written by Yichun Zhang, the creator of this OpenResty EC2 test cluster tool chain. Feel free to contact the author by sending emails to the email@example.com address.
The source of this document resides in the openresty/openresty.org GitHub repository. You are welcome to create issues and/or pull requests if you have questions or edits for this document.
Translations to other languages are also welcome.
General feedback can be sent to the author of this document.