SystemTap is a tracing and probing tool that allows users to study and
 monitor the activities of the operating system (particularly, the 
kernel) in fine detail. It provides information similar to the output of
 tools like netstat, ps, top, and iostat; however, SystemTap is designed to provide more filtering and analysis options for collected information.
	
		For system administrators, SystemTap can be used as a performance 
monitoring tool for Red Hat Enterprise Linux 5 or later. It is most 
useful when other similar tools cannot precisely pinpoint a bottleneck 
in the system, requiring a deep analysis of system activity. In the same
 manner, application developers can also use SystemTap to monitor, in 
finer detail, how their application behaves within the Linux system.
	
			SystemTap provides the infrastructure to monitor the running Linux 
system for detailed analysis. This can assist administrators and 
developers in identifying the underlying cause of a bug or performance 
problem.
		
			Without SystemTap, monitoring the activity of a running kernel would 
require a tedious instrument, recompile, install, and reboot sequence. 
SystemTap is designed to eliminate this, allowing users to gather the 
same information by simply running user-written SystemTap scripts.
		
			However, SystemTap was initially designed for users with intermediate
 to advanced knowledge of the kernel. This makes SystemTap less useful 
to administrators or developers with limited knowledge of and experience
 with the Linux kernel. Moreover, much of the existing SystemTap 
documentation is also aimed at knowledgeable and experienced users,
 which makes the tool correspondingly difficult to learn.
		
			To lower these barriers, the SystemTap Beginners Guide was written with the following goals:
		
					To introduce users to SystemTap, familiarize them with its 
architecture, and provide setup instructions for all kernel types.
				
					To provide pre-written SystemTap scripts for monitoring detailed 
activity in different components of the system, along with instructions 
on how to run them and analyze their output.
				
1.2. SystemTap Capabilities
			SystemTap was originally developed to provide functionality for Red 
Hat Enterprise Linux similar to previous Linux probing tools such as dprobes
 and the Linux Trace Toolkit. SystemTap aims to supplement the existing 
suite of Linux monitoring tools by providing users with the 
infrastructure to track kernel activity. In addition, SystemTap combines
 this capability with two attributes:
		
					Flexibility: SystemTap's framework allows users to develop simple 
scripts for investigating and monitoring a wide variety of kernel 
functions, system calls, and other events that occur in kernel-space. 
With this, SystemTap is not so much a tool as it is a system that allows you to develop your own kernel-specific forensic and monitoring tools.
				
					Ease of use: as mentioned earlier, SystemTap allows users to probe 
kernel-space events without resorting to the lengthy process of instrumenting, 
recompiling, installing, and rebooting the kernel.
				
			Most of the SystemTap scripts enumerated in Chapter 4, Useful SystemTap Scripts demonstrate system forensics and monitoring capabilities not natively available with other similar tools (such as top, oprofile, or ps). These scripts are provided to give readers extensive examples of the application of SystemTap, which in turn will educate them further on the capabilities they can employ when writing their own SystemTap scripts.
		
Chapter 2. Using SystemTap
		This chapter explains how to install SystemTap and introduces how to run SystemTap scripts.
	
2.1. Installation and Setup
		To deploy SystemTap, SystemTap packages along with the corresponding set of -devel, -debuginfo and -debuginfo-common-arch
 packages for the kernel need to be installed. To use SystemTap on more 
than one kernel where a system has multiple kernels installed, install 
the -devel and -debuginfo packages for each of those kernel versions.
	
		These procedures will be discussed in detail in the following sections.
	
			Many users confuse -debuginfo with -debug. Remember that the deployment of SystemTap requires the installation of the -debuginfo package of the kernel, not the -debug version of the kernel.
		
2.1.1. Installing SystemTap
			To deploy SystemTap, install the following RPMs:
		
					systemtap
				
					systemtap-runtime
				
			Assuming that yum is installed on the system, these two RPMs can be installed with yum install systemtap systemtap-runtime. Install the required kernel information RPMs before using SystemTap.
		
2.1.2. Installing Required Kernel Information RPMs
			SystemTap needs information about the kernel in order to place 
instrumentation in it (i.e. probe it). This information, which allows 
SystemTap to generate the code for the instrumentation, is contained in 
the matching -devel, -debuginfo, and -debuginfo-common-arch packages for the kernel. The necessary -devel and -debuginfo packages for the ordinary "vanilla" kernel are as follows:
		
					kernel-debuginfo
				
					kernel-debuginfo-common-arch
				
					kernel-devel
				
			Likewise, the necessary packages for the PAE kernel would be kernel-PAE-debuginfo, kernel-PAE-debuginfo-common-arch, and kernel-PAE-devel.
		
			To determine what kernel your system is currently using, use:
		
uname -r
			For example, if you wish to use SystemTap on kernel version 2.6.32-53.el6 on an i686 machine, then you would need to download and install the following RPMs:
		
					kernel-debuginfo-2.6.32-53.el6.i686.rpm
				
					kernel-debuginfo-common-i686-2.6.32-53.el6.i686.rpm
				
					kernel-devel-2.6.32-53.el6.i686.rpm
				
				The version, variant, and architecture of the -devel, -debuginfo and -debuginfo-common-arch  packages must match the kernel to be probed with SystemTap exactly.
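
				The naming pattern in the list above can be sketched as a small shell function. This is only an illustration: the function name kernel_info_rpms is hypothetical, it follows the package naming given in this section, and the variant, version, and architecture are passed in by hand rather than detected.

```shell
# kernel_info_rpms VARIANT VERSION ARCH
# Prints the three kernel information RPM file names for one kernel.
# VARIANT is "kernel" for the ordinary kernel.
# (Hypothetical helper: it only formats names and installs nothing.)
kernel_info_rpms() {
    variant=$1
    version=$2
    arch=$3
    printf '%s-debuginfo-%s.%s.rpm\n' "$variant" "$version" "$arch"
    printf '%s-debuginfo-common-%s-%s.%s.rpm\n' "$variant" "$arch" "$version" "$arch"
    printf '%s-devel-%s.%s.rpm\n' "$variant" "$version" "$arch"
}

# The example from this section: kernel 2.6.32-53.el6 on an i686 machine.
kernel_info_rpms kernel 2.6.32-53.el6 i686
```

				For the example kernel, the function prints the same three file names listed above. Because all three names are built from one version and one architecture, a mismatch between the packages is harder to introduce by accident.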
			
			The easiest way to install the required kernel information packages is through yum install and debuginfo-install. The debuginfo-install utility is included with later versions of the yum-utils package (for example, version 1.1.10). Note that debuginfo-install requires an appropriate yum repository from which to download and install the -debuginfo and -debuginfo-common-arch packages.
		
			Most required kernel packages can be found at ftp://ftp.redhat.com/pub/redhat/linux/enterprise/; navigate there until the appropriate Debuginfo directory for the system is found. Configure yum accordingly by adding a new "debug" yum repository file under /etc/yum.repos.d containing the following lines:
		
[rhel-debuginfo]
name=Red Hat Enterprise Linux $releasever - $basearch - Debug
baseurl=ftp://ftp.redhat.com/pub/redhat/linux/enterprise/$releasever/en/os/$basearch/Debuginfo/
enabled=1
			After configuring yum with the appropriate repository, install the required -devel, -debuginfo, and -debuginfo-common-arch packages for the kernel by running the following commands:
		
yum install kernelname-devel-version
debuginfo-install kernelname-version
			Replace kernelname with the appropriate kernel variant name (for example, kernel-PAE), and version with the target kernel's version. For example, to install the required kernel information packages for the kernel-PAE-2.6.32-53.el6 kernel, run:
		
yum install kernel-PAE-devel-2.6.32-53.el6
debuginfo-install kernel-PAE-2.6.32-53.el6
			If yum and yum-utils
 are not installed (and cannot be installed), manually download and 
install the required kernel information packages. To generate the URL 
from which to download the required packages, use the following script:
		
			Once the required packages have been manually downloaded to the machine, install the RPMs by running rpm --force -ivh package_names.
		
			If the kernel to be probed with SystemTap is currently being used, it
 is possible to immediately test whether the deployment was successful. 
If a different kernel is to be probed, reboot and load the appropriate 
kernel.
		
			To start the test, run the command stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'. This command simply instructs SystemTap to print read performed
 then exit properly once a virtual file system read is detected. If the 
SystemTap deployment was successful, you should get output similar to 
the following:
		
Pass 1: parsed user script and 45 library script(s) in 340usr/0sys/358real ms.
Pass 2: analyzed script: 1 probe(s), 1 function(s), 0 embed(s), 0 global(s) in 290usr/260sys/568real ms.
Pass 3: translated to C into "/tmp/stapiArgLX/stap_e5886fa50499994e6a87aacdc43cd392_399.c" in 490usr/430sys/938real ms.
Pass 4: compiled C into "stap_e5886fa50499994e6a87aacdc43cd392_399.ko" in 3310usr/430sys/3714real ms.
Pass 5: starting run.
read performed
Pass 5: run completed in 10usr/40sys/73real ms.
			The last three lines of the output (i.e. beginning with Pass 5)
 indicate that SystemTap was able to successfully create the 
instrumentation to probe the kernel, run the instrumentation, detect the
 event being probed (in this case, a virtual file system read), and 
execute a valid handler (print text then close it with no errors).
		
2.2. Generating Instrumentation for Other Computers
		Normally, SystemTap scripts can only be run on systems where SystemTap is deployed (as in Section 2.1, “Installation and Setup”). This could mean that to run SystemTap on ten systems, SystemTap needs to be deployed on all those systems. In some cases, this may be neither feasible nor desired.
 For instance, corporate policy may prohibit an administrator from 
installing RPMs that provide compilers or debug information on specific 
machines, which will prevent the deployment of SystemTap.
	
		To work around this, use cross-instrumentation.
 Cross-instrumentation is the process of generating SystemTap 
instrumentation modules from a SystemTap script on one computer to be 
used on another computer. This process offers the following benefits:
	
				The kernel information packages for various machines can be installed on a single host machine.
			
				Each target machine only needs one RPM to be installed to use the generated SystemTap instrumentation module: systemtap-runtime.
			
			For the sake of simplicity, the following terms will be used throughout this section:
		
					
					 
					 instrumentation module: the kernel module built from a SystemTap script; i.e. the SystemTap module is built on the host system, and will be loaded on the target kernel of the target system.
				
					 host system: the system on which the instrumentation modules (from SystemTap scripts) are compiled, to be loaded on target systems.
				
					 target system: the system for which the instrumentation module is being built (from SystemTap scripts).
				
					 target kernel: the kernel of the target system. This is the kernel that loads and runs the instrumentation module.
				
 
Procedure 2.1. Configuring a Host System and Target Systems
				Install the systemtap-runtime RPM on each target system.
			
				Determine the kernel running on each target system by running uname -r on each target system.
			
				Install SystemTap on the host system. The instrumentation module will be built for the target systems on the host system. For instructions on how to install SystemTap, refer to Section 2.1.1, “Installing SystemTap”.
			
				Using the target kernel version determined earlier, install the target kernel and related RPMs on the host system by the method described in Section 2.1.2, “Installing Required Kernel Information RPMs”. If multiple target systems use different target kernels, repeat this step for each different kernel used on the target systems.
			
		To build the instrumentation module, run the following command on the host system (be sure to specify the appropriate values):
	
stap -r kernel_version script -m module_name -p4
		Here, kernel_version refers to the version of the target kernel (the output of uname -r on the target machine), script refers to the script to be converted into an instrumentation module, and module_name is the desired name of the instrumentation module.
	
			To determine the architecture notation of a running kernel, run uname -m.
		
		Once the instrumentation module is compiled, copy it to the target system and then load it using:
	
staprun module_name.ko
		For example, to create the instrumentation module simple.ko for the target kernel 2.6.32-53.el6 from a one-line SystemTap script passed with -e, use the following command:
	
		stap -r 2.6.32-53.el6 -e 'probe vfs.read {exit()}' -m simple -p4
	
		This will create a module named simple.ko. To use the instrumentation module simple.ko, copy it to the target system and run the following command (on the target system):
	
		staprun simple.ko
	
			The host system must be the same architecture and running the same distribution of Linux as the target system in order for the built instrumentation module to work.
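
			The whole round trip described in this section can be laid out as a dry run. The sketch below only prints the commands and executes nothing. The function name cross_instrument_plan, the host name target.example.com, the /tmp destination, and the use of scp for the copy step are all assumptions made for illustration; any method of copying the module to the target system works.

```shell
# cross_instrument_plan KERNEL_VERSION SCRIPT MODULE TARGET
# Dry run: prints the commands for one cross-instrumentation round trip.
# (Hypothetical helper; TARGET and the /tmp path are placeholders.)
cross_instrument_plan() {
    kernel_version=$1   # output of `uname -r` on the target system
    script=$2           # SystemTap script to compile on the host
    module=$3           # desired module name, without the .ko suffix
    target=$4           # placeholder name of the target system

    # Build on the host; -p4 stops after compiling the module.
    printf 'host$   stap -r %s %s -m %s -p4\n' "$kernel_version" "$script" "$module"
    # Copy to the target; scp is one possible method (an assumption here).
    printf 'host$   scp %s.ko %s:/tmp/\n' "$module" "$target"
    # Load the module on the target, which needs only systemtap-runtime.
    printf 'target$ staprun /tmp/%s.ko\n' "$module"
}

cross_instrument_plan 2.6.32-53.el6 simple.stp simple target.example.com
```

			Printing the plan first makes it easy to confirm that the kernel version passed to -r is the one reported by uname -r on the target system, not the one running on the host.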
		
2.3. Running SystemTap Scripts
			SystemTap scripts are run through the command stap. stap can run SystemTap scripts from standard input or from a file.
		
			Running stap and staprun
 requires elevated privileges to the system. However, not all users can 
be granted root access just to run SystemTap. In some cases, for 
instance, a non-privileged user may need to run SystemTap 
instrumentation on their machine.
		
			To allow ordinary users to run SystemTap without root access, add them to one of these user groups:
		
- stapdev
 
						Members of this group can use stap to run SystemTap scripts, or staprun to run SystemTap instrumentation modules.
					
						Running stap involves compiling 
SystemTap scripts into kernel modules and loading them into the kernel. 
This requires elevated privileges to the system, which are granted to stapdev members. Unfortunately, such privileges also grant effective root access to stapdev members. As such, only grant stapdev group membership to users who can be trusted with root access.
					
- stapusr
 
						Members of this group can only use staprun to run SystemTap instrumentation modules. In addition, they can only run those modules from /lib/modules/kernel_version/systemtap/. Note that this directory must be owned only by the root user, and must only be writable by the root user.
					
			Below is a list of commonly used stap options:
		
- -v
 
						Makes the output of the SystemTap session more verbose. This option can be repeated (for example, stap -vvv script.stp) to provide more details on the script's execution. It is particularly useful if errors are encountered when running the script.
					
- -o 
filename 
						Sends the standard output to file (filename).
					
- -S 
size,count 
						Limit files to size megabytes and limit the number of files kept around to count. The file names will have a sequence number suffix. This option implements logrotate operations for SystemTap.
					
						When used with -o, the -S will limit the size of log files.
					
- -x 
process ID 
						Sets the SystemTap handler function 
target() to the specified process ID. For more information about 
target(), refer to 
SystemTap Functions.
					
- -c 
command 
						Sets the SystemTap handler function 
target() to the specified command. The full path to the specified command must be used; for example, instead of specifying 
cp, use 
/bin/cp (as in 
stap script -c /bin/cp). For more information about 
target(), refer to 
SystemTap Functions.
					
- -e '
script' 
						Uses the script string rather than a file as input for the SystemTap translator.
					
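- -F
 
						Uses SystemTap's flight recorder mode and runs the script as a background process. For more information, refer to Section 2.3.1, “SystemTap Flight Recorder Mode”.
					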
			stap can also be instructed to run scripts from standard input using the switch -. To illustrate:
		
Example 2.1. Running Scripts From Standard Input
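echo "probe timer.s(1) {exit()}" | stap -
			Any stap options should be placed before the - switch. For example, to make the session more verbose:
		
echo "probe timer.s(1) {exit()}" | stap -v -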
		
			For more information about stap, refer to man stap.
		
				The stap options -v and -o also work for staprun. For more information about staprun, refer to man staprun.
			
2.3.1. SystemTap Flight Recorder Mode
				SystemTap's flight recorder mode allows a SystemTap script to be run for long periods while keeping only the most recent output. The flight recorder mode (the -F option) limits the amount of output generated. There are two variations of the flight recorder mode: in-memory and file mode. In both cases the SystemTap script runs as a background process.
			
2.3.1.1. In-memory Flight Recorder
					When flight recorder mode (the -F option) is used without a file name, SystemTap uses a buffer in kernel memory to store the output of the script. The SystemTap instrumentation module is loaded and the probes start running; the instrumentation then detaches and is put in the background. When the interesting event occurs, the instrumentation can be reattached, and the recent output in the memory buffer and any continuing output can be seen. The following command starts a script using the flight recorder in-memory mode:
				
stap -F /usr/share/doc/systemtap-version/examples/io/iotime.stp
					Once the script starts, a message that provides the command to reconnect to the running script will appear:
				
Disconnecting from systemtap module.
To reconnect, type "staprun -A stap_5dd0073edcb1f13f7565d8c343063e68_19556"
					When the interesting event occurs, reattach to the currently 
running script and output the recent data in the memory buffer, then get
 the continuing output with the following command:
				
staprun -A stap_5dd0073edcb1f13f7565d8c343063e68_19556
					By default, the kernel buffer is 1MB in size, but it can be increased with the -s option, which specifies the buffer size in megabytes (rounded up to the next power of 2). For example, -s2 on the SystemTap command line would specify 2MB for the buffer.
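
					The rounding rule for the -s option can be worked through with a few values. The sketch below only illustrates rounding up to a power of 2 as described in this section; the function name round_up_pow2 is hypothetical, and this is not the code SystemTap itself uses.

```shell
# round_up_pow2 N: print the smallest power of 2 that is >= N (for N >= 1).
# (Hypothetical helper that mirrors the rounding rule described above.)
round_up_pow2() {
    n=$1
    p=1
    while [ "$p" -lt "$n" ]; do
        p=$((p * 2))
    done
    echo "$p"
}

# Show the resulting buffer size for a few requested sizes.
for mb in 1 2 3 5 8; do
    echo "-s$mb -> $(round_up_pow2 "$mb")MB buffer"
done
```

					Under this rule, requests of 1, 2, and 8 are already powers of 2 and stay unchanged, while 3 becomes 4 and 5 becomes 8.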
				
2.3.1.2. File Flight Recorder
					The flight recorder mode can also store data to files. The number and size of the files kept is controlled by the -S option followed by two numerical arguments separated by a comma. The first argument is the maximum size in megabytes for each output file. The second argument is the number of recent files to keep. The file name is specified by the -o option followed by the name. SystemTap adds a number suffix to the file name to indicate the order of the files. The following command starts SystemTap in file flight recorder mode, with the output going to files named /tmp/pfaults.log.[0-9]+, each file 1MB or smaller, and only the latest two files kept:
				
stap -F -o /tmp/pfaults.log -S 1,2  pfaults.stp
					The number printed by the command is the process ID. Sending a SIGTERM to the process will shut down the SystemTap script and stop the data collection. For example, if the previous command listed 7590 as the process ID, the following command would shut down the SystemTap script:
				
kill -s SIGTERM 7590
					Only the two most recent files generated by the script are kept; older files are removed. Thus, ls -sh /tmp/pfaults.log.* shows only two files:
				
1020K /tmp/pfaults.log.5    44K /tmp/pfaults.log.6
					One can look at the highest number file for the latest data, in this case /tmp/pfaults.log.6.
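
					Picking the highest-numbered file by eye works for a handful of files, but a plain ls sorts names as text, so a suffix of 10 is listed before 5. A minimal sketch of selecting the latest file by numeric suffix follows; the function name latest_log is hypothetical.

```shell
# latest_log PREFIX: print the file PREFIX.N with the largest numeric N.
# (Hypothetical helper; prints nothing if no such file exists.)
latest_log() {
    prefix=$1
    for f in "$prefix".*; do
        [ -e "$f" ] || continue              # skip an unmatched glob pattern
        printf '%s %s\n' "${f##*.}" "$f"     # "suffix path", one per line
    done | sort -n | tail -n 1 | cut -d' ' -f2-
}

# With the file name used in this section:
latest_log /tmp/pfaults.log
```

					Sorting on the suffix as a number, rather than on the whole name as text, keeps the result correct once the sequence number grows past one digit.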
				
Chapter 3. Understanding How SystemTap Works
		SystemTap allows users to write and reuse simple scripts to deeply 
examine the activities of a running Linux system. These scripts can be 
designed to extract data, filter it, and summarize it quickly (and 
safely), enabling the diagnosis of complex performance (or even 
functional) problems.
	
		The essential idea behind a SystemTap script is to name events, and to give them handlers.
 When SystemTap runs the script, SystemTap monitors for the event; once 
the event occurs, the Linux kernel then runs the handler as a quick 
sub-routine, then resumes.
	
		There are several kinds of events: entering or exiting a function, timer 
expiration, session termination, and so on. A handler is a series of script 
language statements that specify the work to be done whenever the event 
occurs. This work normally includes extracting data from the event 
context, storing it in internal variables, and printing results.
	
		For the most part, SystemTap scripts are the foundation of each 
SystemTap session. SystemTap scripts instruct SystemTap on what type of 
information to collect, and what to do once that information is 
collected.
	
		As stated in Chapter 3, Understanding How SystemTap Works, SystemTap scripts are made up of two components: events and handlers. Once a SystemTap session is underway, SystemTap monitors the operating system for the specified events and executes the handlers as they occur.
	
			An event and its corresponding handler are collectively called a probe. A SystemTap script can have multiple probes.
		
			A probe's handler is commonly referred to as a probe body.
		
		In terms of application development, using events and handlers is 
similar to instrumenting the code by inserting diagnostic print 
statements in a program's sequence of commands. These diagnostic print 
statements allow you to view a history of commands executed once the 
program is run.
	
		SystemTap scripts allow insertion of the instrumentation code without 
recompilation of the code and allow more flexibility with regard to 
handlers. Events serve as the triggers for handlers to run; handlers can
 be specified to record specified data and print it in a certain manner.
	
probe	event {statements}
		SystemTap supports multiple events per probe; multiple events are delimited by a comma (,).
 If multiple events are specified in a single probe, SystemTap will 
execute the handler when any of the specified events occur.
	
		Each probe has a corresponding statement block. This statement block is enclosed in braces ({ })
 and contains the statements to be executed per event. SystemTap 
executes these statements in sequence; special separators or terminators
 are generally not necessary between multiple statements.
	
			Statement blocks in SystemTap scripts follow the same syntax and 
semantics as the C programming language. A statement block can be nested
 within another statement block.
		
		SystemTap allows you to write functions to factor out code to be used 
by a number of probes. Thus, rather than repeatedly writing the same 
series of statements in multiple probes, you can just place the 
instructions in a function, as in:
	
function function_name(arguments) {statements}
probe event {function_name(arguments)}
		The statements in function_name are executed when the probe for event executes. The arguments are optional values passed into the function.
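
		The two ideas above, several events sharing one probe and a function shared by handlers, can be combined in one short script. The sketch below writes such a script to a scratch directory and prints it; it does not run stap. The function name report, the file name, and the five-second timer are choices made for this illustration, while the syscall.open and syscall.close events and the name variable are the ones described in this guide.

```shell
# Write the SystemTap script to a scratch directory (location is arbitrary).
dir=$(mktemp -d)
script="$dir/shared-handler.stp"

cat > "$script" <<'EOF'
# report() is a hypothetical helper name chosen for this sketch.
function report(what) {
  printf("%s(%d) %s\n", execname(), pid(), what)
}

# One probe, two events: the handler runs when either event occurs.
probe syscall.open, syscall.close {
  report(name)
}

# End the session after five seconds.
probe timer.s(5) {
  exit()
}
EOF

# Show the script and how to run it. Running it requires a working
# SystemTap deployment and suitable privileges.
cat "$script"
echo "Run with: stap $script"
```

		Keeping the print logic in one function means the output format can be changed in a single place, however many probes call it.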
	
			SystemTap events can be broadly classified into two types: synchronous and asynchronous.
		
			Examples of synchronous events include:
		
- syscall.
system_call 
						The entry to the system call system_call. To monitor the exit of the system call instead, append .return to the event. For example, to specify the entry and exit of the system call close, use syscall.close and syscall.close.return respectively.
					
- vfs.
file_operation 
						The entry to the file_operation event for Virtual File System (VFS). As with the syscall event, appending .return to the event monitors the exit of the file_operation operation.
					
- kernel.function("
function") 
						The entry to the kernel function function. For example, kernel.function("sys_open") refers to the "event" that occurs when the kernel function sys_open is called by any thread in the system. To specify the return of the kernel function sys_open, append the return string to the event statement; i.e. kernel.function("sys_open").return.
					
						When defining probe events, you can use asterisk (*) for wildcards. You can also trace the entry or exit of a function in a kernel source file. Consider the following example:
					
Example 3.1. wildcards.stp
probe kernel.function("*@net/socket.c") { }
probe kernel.function("*@net/socket.c").return { }
						In the previous example, the first probe's event specifies the entry of ALL functions in the kernel source file net/socket.c.
 The second probe specifies the exit of all those functions. Note that 
in this example, there are no statements in the handler; as such, no 
information will be collected or displayed.
					
- kernel.trace("
tracepoint") 
						The static probe for tracepoint.
 Recent kernels (2.6.30 and newer) include instrumentation for specific 
events in the kernel. These events are statically marked with 
tracepoints. One example of a tracepoint available in SystemTap is kernel.trace("kfree_skb"), which fires each time a network buffer is freed in the kernel.
					
- module("
module").function("function") 
						Allows you to probe functions within modules. For example:
					
Example 3.2. moduleprobe.stp
probe module("ext3").function("*") { }
probe module("ext3").function("*").return { }
						A system's kernel modules are typically located in /lib/modules/kernel_version, where kernel_version refers to the currently loaded kernel version. Modules use the file name extension .ko.
					
			Examples of asynchronous events include:
		
- begin
 
						The startup of a SystemTap session; i.e. as soon as the SystemTap script is run.
					
- end
 
						The end of a SystemTap session.
					
- timer events
 
						An event that specifies a handler to be executed periodically. For example:
					
Example 3.3. timer-s.stp
probe timer.s(4)
{
  printf("hello world\n")
}
						Example 3.3, “timer-s.stp” is an example of a probe that prints 
hello world every 4 seconds. Note that you can also use the following timer events:
					
 
								timer.ms(milliseconds)
							
								timer.us(microseconds)
							
								timer.ns(nanoseconds)
							
								timer.hz(hertz)
							
								timer.jiffies(jiffies)
							
						When used in conjunction with other probes that collect 
information, timer events allow you to print periodic updates 
and see how that information changes over time.
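
						A minimal sketch of this pattern follows: one probe collects a count, and a timer probe reports it at a fixed interval. The script is written to a scratch directory and printed, not run. The variable name reads, the file name, and the five-second interval are choices made for this illustration; the counter is declared with global so that both probes can use it.

```shell
# Write the SystemTap script to a scratch directory (location is arbitrary).
dir=$(mktemp -d)
script="$dir/periodic.stp"

cat > "$script" <<'EOF'
# Shared counter; "global" makes it visible to both probes.
global reads

# Collect: count every VFS read on the system.
probe vfs.read {
  reads++
}

# Report: every 5 seconds, print the count and start over.
probe timer.s(5) {
  printf("%d VFS reads in the last 5 seconds\n", reads)
  reads = 0
}
EOF

# Show the script and how to run it. The script has no exit() call,
# so a session runs until it is interrupted with Ctrl+C.
cat "$script"
echo "Run with: stap $script"
```

						Resetting the counter in the timer handler makes each line report only the most recent interval rather than a running total.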
					
				SystemTap supports the use of a large collection of probe events. For more information about supported events, refer to man stapprobes. The SEE ALSO section of man stapprobes also contains links to other man pages that discuss supported events for specific subsystems and components.
			
3.2.2. SystemTap Handler/Body
			Consider the following sample script:
		
Example 3.4. helloworld.stp
probe begin
{
  printf ("hello world\n")
  exit ()
}
			In 
Example 3.4, “helloworld.stp”, the event 
begin (i.e. the start of the session) triggers the handler enclosed in 
{ }, which simply prints 
hello world followed by a new-line, then exits.
		
				SystemTap scripts continue to run until the exit() function executes. To stop a script manually, interrupt it with Ctrl+C.
			
		printf ("format string\n", arguments)
			The 
format string specifies how 
arguments should be printed. The format string of 
Example 3.4, “helloworld.stp” simply instructs SystemTap to print 
hello world, and contains no format specifiers.
		
			You can use the format specifiers %s (for strings) and %d
 (for numbers) in format strings, depending on your list of arguments. 
Format strings can have multiple format specifiers, each matching a 
corresponding argument; multiple arguments are delimited by a comma (,).
		
				The SystemTap printf function is very similar to its C language counterpart: the syntax and format described above are the same as those of the C-style printf.
			
			To illustrate this, consider the following probe example:
		
Example 3.5. variables-in-printf-statements.stp
probe syscall.open
{
  printf ("%s(%d) open\n", execname(), pid())
}
			Example 3.5, “variables-in-printf-statements.stp” instructs SystemTap to probe all entries to the system call 
open; for each event, it prints the current 
execname() (a string with the executable name) and 
pid() (the current process ID number), followed by the word 
open. A snippet of this probe's output would look like:
		
vmware-guestd(2206) open
hald(2360) open
hald(2360) open
hald(2360) open
df(3433) open
df(3433) open
df(3433) open
hald(2360) open
			The following is a list of commonly-used SystemTap functions:
		
- tid()
 
						The ID of the current thread.
					
- uid()
 
						The ID of the current user.
					
- cpu()
 
						The current CPU number.
					
- gettimeofday_s()
 
						The number of seconds since UNIX epoch (January 1, 1970).
					
- ctime()
 
						Convert number of seconds since UNIX epoch to date.
					
- pp()
 
						A string describing the probe point currently being handled.
					
- thread_indent()
 
						This particular function is quite useful, providing you with a way
 to better organize your print results. The function takes one argument,
 an indentation delta, which indicates how many spaces to add or remove 
from a thread's "indentation counter". It then returns a string with 
some generic trace data along with an appropriate number of indentation 
spaces.
					
						The generic data included in the returned string includes a timestamp (number of microseconds since the first call to thread_indent()
 by the thread), a process name, and the thread ID. This allows you to 
identify what functions were called, who called them, and the duration 
of each function call.
					
						If call entries and exits immediately precede each other, it is 
easy to match them. However, in most cases, after a first function call 
entry is made several other call entries and exits may be made before 
the first call exits. The indentation counter helps you match an entry 
with its corresponding exit by indenting the next function call if it is
 not the exit of the previous one.
					
						Consider the following example on the use of thread_indent():
					
Example 3.6. thread_indent.stp
probe kernel.function("*@net/socket.c") 
{
  printf ("%s -> %s\n", thread_indent(1), probefunc())
}
probe kernel.function("*@net/socket.c").return 
{
  printf ("%s <- %s\n", thread_indent(-1), probefunc())
}
						Example 3.6, “thread_indent.stp” produces output similar to the following:
					
0 ftp(7223): -> sys_socketcall
1159 ftp(7223):  -> sys_socket
2173 ftp(7223):   -> __sock_create
2286 ftp(7223):    -> sock_alloc_inode
2737 ftp(7223):    <- sock_alloc_inode
3349 ftp(7223):    -> sock_alloc
3389 ftp(7223):    <- sock_alloc
3417 ftp(7223):   <- __sock_create
4117 ftp(7223):   -> sock_create
4160 ftp(7223):   <- sock_create
4301 ftp(7223):   -> sock_map_fd
4644 ftp(7223):    -> sock_map_file
4699 ftp(7223):    <- sock_map_file
4715 ftp(7223):   <- sock_map_fd
4732 ftp(7223):  <- sys_socket
4775 ftp(7223): <- sys_socketcall
						This sample output contains the following information:
					
								The time (in microseconds) since the initial thread_indent() call for the thread (included in the string from thread_indent()).
							
								The process name (and its corresponding ID) that made the function call (included in the string from thread_indent()).
							
								An arrow signifying whether the call was an entry (->) or an exit (<-); the indentations help you match specific function call entries with their corresponding exits.
							
								The name of the function called by the process.
							
- name
 
						Identifies the name of a specific system call. This variable can only be used in probes that use the event syscall.system_call.
					
- target()
 
						Used in conjunction with stap script -x process ID or stap script -c command. If you want to specify a script to take an argument of a process ID or command, use target() as the variable in the script to refer to it. For example:
					
Example 3.7. targetexample.stp
probe syscall.* {
  if (pid() == target())
    printf("%s\n", name)
}
						When 
Example 3.7, “targetexample.stp” is run with the argument 
-x process ID, it watches all system calls (as specified by the event 
syscall.*) and prints out the name of all system calls made by the specified process.
					
						This has the same effect as specifying if (pid() == process ID) each time you wish to target a specific process. However, using target()
 makes it easier for you to re-use the script, giving you the ability to
 simply pass a process ID as an argument each time you wish to run the 
script (e.g. stap targetexample.stp -x process ID).
					
			For more information about supported SystemTap functions, refer to man stapfuncs.
		
3.3. Basic SystemTap Handler Constructs
		SystemTap supports the use of several basic constructs in handlers. 
The syntax of most of these handler constructs is based on C 
and awk syntax. This section describes 
several of the most useful SystemTap handler constructs, which should 
provide you with enough information to write simple yet useful SystemTap
 scripts.
	
3.3.1. Variables
	
			Variables can be used freely throughout a handler; simply choose a 
name, assign a value from a function or expression to it, and use it in 
an expression. SystemTap automatically identifies whether a variable 
should be typed as a string or integer, based on the type of the values 
assigned to it. For instance, if you set the variable foo to gettimeofday_s() (as in foo = gettimeofday_s()), then foo is typed as a number and can be printed in a printf() with the integer format specifier (%d).
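		
			As a minimal sketch of this type inference (the variable names now and who and the use of a begin probe are illustrative assumptions, not part of the guide's example set), the following script assigns an integer and a string to two local variables and prints each with the matching format specifier:

```systemtap
probe begin
{
  # now is inferred as an integer from the return type of gettimeofday_s()
  now = gettimeofday_s()
  # who is inferred as a string from the return type of execname()
  who = execname()
  printf("%s ran this probe at %d seconds since the epoch\n", who, now)
  exit()
}
```

			If the two format specifiers were swapped, SystemTap should reject the script with a type mismatch error while translating it, before any probe runs.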
		
			Note, however, that by default variables are local to the probe 
they are used in. This means that variables are initialized, used, and 
disposed of at each probe handler invocation. To share a variable between 
probes, declare the variable name using global outside of the probes. Consider the following example:
		
Example 3.8. timer-jiffies.stp
global count_jiffies, count_ms
probe timer.jiffies(100) { count_jiffies ++ }
probe timer.ms(100) { count_ms ++ }
probe timer.ms(12345)
{
  hz=(1000*count_jiffies) / count_ms
  printf ("jiffies:ms ratio %d:%d => CONFIG_HZ=%d\n",
    count_jiffies, count_ms, hz)
  exit ()
}
			Example 3.8, “timer-jiffies.stp” computes the 
CONFIG_HZ setting of the kernel by counting how often a jiffies-based timer and a millisecond-based timer fire over the same period, then dividing the jiffies counted by the seconds elapsed. The 
global statement allows the variables 
count_jiffies and 
count_ms (each set in its own probe) to be shared with 
probe timer.ms(12345).
		
 
				The 
++ notation in 
Example 3.8, “timer-jiffies.stp” (i.e. 
count_jiffies ++ and 
count_ms ++) is used to increment the value of a variable by 1. In the following probe, 
count_jiffies is incremented by 1 every 100 jiffies:
			
probe timer.jiffies(100) { count_jiffies ++ }
				In this instance, SystemTap understands that count_jiffies is an integer. Because no initial value was assigned to count_jiffies, its initial value is zero by default.
			
3.3.2. Conditional Statements
			In some cases, the output of a SystemTap script may be too big. To 
address this, you need to further refine the script's logic in order to 
delimit the output into something more relevant or useful to your probe.
		
			You can do this by using conditionals in handlers. SystemTap accepts the following types of conditional statements:
		
- If/Else Statements
 
						Format:
					
if (condition)
  statement1
else
  statement2
						The statement1 is executed if the condition expression is non-zero. The statement2 is executed if the condition expression is zero. The else clause (else statement2) is optional. Both statement1 and statement2 can be statement blocks.
					
Example 3.9. ifelse.stp
global countread, countnonread
probe kernel.function("vfs_read"),kernel.function("vfs_write")
{
  if (probefunc()=="vfs_read") 
    countread ++ 
  else 
    countnonread ++
}
probe timer.s(5) { exit() }
probe end 
{
  printf("VFS reads total %d\n VFS writes total %d\n", countread, countnonread)
}
						Example 3.9, “ifelse.stp” is a script that counts how many virtual file system reads (
vfs_read) and writes (
vfs_write) the system performs within a 5-second span. When run, the script increments the value of the variable 
countread by 1 if the name of the function it probed matches 
vfs_read (as noted by the condition 
if (probefunc()=="vfs_read")); otherwise, it increments 
countnonread (
else countnonread ++).
					
- While Loops
 
						Format:
					
while (condition)
  statement
						As long as condition is non-zero, the block of statements in statement is executed. The statement is often a statement block, and it must change a value so that condition eventually becomes zero.
					
- For Loops
 
						Format:
					
for (initialization; conditional; increment) statement
						The for loop is simply shorthand for a while loop. The following is the equivalent while loop:
					
initialization
while (conditional) {
   statement
   increment
}
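						As a minimal sketch of the two loop forms (the variables i and j and the printed text are illustrative assumptions), the following script runs the same five iterations first as a for loop and then as the equivalent while loop:

```systemtap
probe begin
{
  # for loop: initialization, conditional, and increment on one line
  for (i = 0; i < 5; i++)
    printf("for:   %d squared is %d\n", i, i * i)

  # equivalent while loop: the increment moves into the statement block
  j = 0
  while (j < 5) {
    printf("while: %d squared is %d\n", j, j * j)
    j++
  }
  exit()
}
```

						SystemTap limits how many statements a single handler invocation may execute, so loops in handlers are best kept short and clearly bounded.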
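						In addition to ==, the following operators can be used in the condition of any of these statements:
					
- >=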
 
						Greater than or equal to
					
- <=
 
						Less than or equal to
					
- !=
 
						Is not equal to
					
3.3.3. Command-Line Arguments
			You can also allow a SystemTap script to accept simple command-line arguments using a $ or @ immediately followed by the number of the argument on the command line. Use $ if you are expecting the user to enter an integer as a command-line argument, and @ if you are expecting a string.
		
Example 3.10. commandlineargs.stp
probe kernel.function(@1) { }
probe kernel.function(@1).return { }
			Example 3.10, “commandlineargs.stp” is similar to 
Example 3.1, “wildcards.stp”, except that it allows you to pass the kernel function to be probed as a command-line argument (as in 
stap commandlineargs.stp kernel function). You can also specify the script to accept multiple command-line arguments, noting them as 
@1, 
@2, and so on, in the order they are entered by the user.
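		
			The following sketch combines both argument types. It assumes a hypothetical script name (argdemo.stp) and a hypothetical argument order: a kernel function name as the first argument, read as a string with @1, and a duration in seconds as the second, read as an integer with $2:

```systemtap
# Hypothetical invocation: stap argdemo.stp vfs_read 5
global calls
# @1 is replaced by the first argument as a string
probe kernel.function(@1) { calls ++ }
# $2 is replaced by the second argument as an integer
probe timer.s($2)
{
  printf("%s was called %d times in %d seconds\n", @1, calls, $2)
  exit()
}
```

			Because calls is updated in one probe and read in another, it is declared as global.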
		
 
3.4. Associative Arrays
		SystemTap also supports the use of associative arrays. While an 
ordinary variable represents a single value, associative arrays can 
represent a collection of values. Simply put, an associative array is a 
collection of unique keys; each key in the array has a value associated 
with it.
	
		Since associative arrays are normally processed in multiple probes (as we will demonstrate later), they should be declared as global variables in the SystemTap script. The syntax for accessing an element in an associative array is similar to that of awk, and is as follows:
	
array_name[index_expression]
		Here, the array_name is any arbitrary name the array uses. The index_expression is used to refer to a specific unique key in the array. To illustrate, let us try to build an array named foo that specifies the ages of three people (i.e. the unique keys): tom, dick, and harry. To assign them the ages (i.e. associated values) of 23, 24, and 25 respectively, we'd use the following array statements:
	
Example 3.11. Basic Array Statements
foo["tom"] = 23
foo["dick"] = 24
foo["harry"] = 25
		You can specify up to nine index expressions in an array statement, each one delimited by a comma (
,). This is useful if you wish to have a key that contains multiple pieces of information. The following line from 
disktop.stp
 uses 5 elements for the key: process ID, executable name, user ID, 
parent process ID, and string "W". It associates the value of 
devname with that key.
	
device[pid(),execname(),uid(),ppid(),"W"] = devname
			All associative arrays must be declared as global, regardless of whether the associative array is used in one or multiple probes.
		
3.5. Array Operations in SystemTap
		This section enumerates some of the most commonly used array operations in SystemTap.
	
3.5.1. Assigning an Associated Value
			Use = to set an associated value to indexed unique pairs, as in:
		
array_name[index_expression] = value
			Example 3.11, “Basic Array Statements”
 shows a very basic example of how to set an explicit associated value 
to a unique key. You can also use a handler function as both your 
index_expression and 
value.
 For example, you can use arrays to set a timestamp as the associated 
value to a thread ID (which you wish to use as your unique key), as 
in:
		
 Example 3.12. Associating Timestamps to Process Names
foo[tid()] = gettimeofday_s()
			Whenever an event invokes the statement in 
Example 3.12, “Associating Timestamps to Process Names”, SystemTap returns the appropriate 
tid() value (i.e. the ID of a thread, which is then used as the unique key). At the same time, SystemTap also uses the function 
gettimeofday_s() to set the corresponding timestamp as the associated value to the unique key defined by the function 
tid(). This creates an array composed of key pairs containing thread IDs and timestamps.
		
			In this same example, if tid() returns a value that is already defined in the array foo, the assignment discards the original associated value and replaces it with the current timestamp from gettimeofday_s().
		
3.5.2. Reading Values From Arrays
			You can also read values from an array the same way you would read the value of a variable. To do so, include the array_name[index_expression] statement as an element in a mathematical expression. For example:
		
Example 3.13. Using Array Values in Simple Computations
delta = gettimeofday_s() - foo[tid()]
			The construct in 
Example 3.13, “Using Array Values in Simple Computations” computes a value for the variable 
delta by subtracting the value associated with the key 
tid() from the current 
gettimeofday_s(). The construct does this by 
reading the value stored under the key 
tid()
 from the array. This particular construct is useful for determining the
 time between two events, such as the start and completion of a read 
operation.
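		
			A minimal sketch of that use case follows. It is an illustration, not one of the guide's scripts: the array name start is an assumption, and it uses gettimeofday_us() instead of gettimeofday_s() because most reads finish in far less than a second. The membership test and the delete operator it relies on are described in Section 3.5.6 and Section 3.5.5.

```systemtap
global start
probe vfs.read
{
  # record when this thread entered the read, keyed by thread ID
  start[tid()] = gettimeofday_us()
}
probe vfs.read.return
{
  # only compute a delta if an entry timestamp exists for this thread
  if ([tid()] in start) {
    delta = gettimeofday_us() - start[tid()]
    printf("%s (tid %d): read took %d us\n", execname(), tid(), delta)
    # remove the entry so the array does not grow without bound
    delete start[tid()]
  }
}
```

			This script prints one line per VFS read, so on a busy system it is best run only briefly.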
		
3.5.3. Incrementing Associated Values
			Use ++ to increment the associated value of a unique key in an array, as in:
		
array_name[index_expression] ++
			Again, you can also use a handler function for your index_expression.
 For example, if you wanted to tally how many times a specific process 
performed a read to the virtual file system (using the event vfs.read), you can use the following probe:
		
Example 3.14. vfsreads.stp
probe vfs.read
{
  reads[execname()] ++
}
			In 
Example 3.14, “vfsreads.stp”, the first time that the probe returns the process name 
gnome-terminal (i.e. the first time 
gnome-terminal performs a VFS read), that process name is set as the unique key 
gnome-terminal with an associated value of 1. The next time that the probe returns the process name 
gnome-terminal, SystemTap increments the associated value of 
gnome-terminal by 1. SystemTap performs this operation for 
all process names as the probe returns them.
		
3.5.4. Processing Multiple Elements in an Array
			Once you've collected enough information in an array, you will need 
to retrieve and process all elements in that array to make it useful. 
Consider 
Example 3.14, “vfsreads.stp”:
 the script collects information about how many VFS reads each process 
performs, but does not specify what to do with it. The obvious means for
 making 
Example 3.14, “vfsreads.stp” useful is to print the key pairs in the array 
reads, but how?
		
			The best way to process all key pairs in an array (as an iteration) is to use the foreach statement. Consider the following example:
		
Example 3.15. cumulative-vfsreads.stp
global reads
probe vfs.read
{ 
  reads[execname()] ++
}
probe timer.s(3)
{
  foreach (count in reads)
    printf("%s : %d \n", count, reads[count])
}
			In the second probe of 
Example 3.15, “cumulative-vfsreads.stp”, the 
foreach statement uses the variable 
count to reference each iteration of a unique key in the array 
reads. The 
reads[count] array statement in the same probe retrieves the associated value of each unique key.
		
			Given what we know about the first probe in 
Example 3.15, “cumulative-vfsreads.stp”,
 the script prints VFS-read statistics every 3 seconds, displaying names
 of processes that performed a VFS-read along with a corresponding 
VFS-read count.
		
			Now, remember that the 
foreach statement in 
Example 3.15, “cumulative-vfsreads.stp” prints 
all
 iterations of process names in the array, and in no particular order. 
You can instruct the script to process the iterations in a particular 
order by using 
+ (ascending) or 
- (descending). In addition, you can also limit the number of iterations the script needs to process with the 
limit value option.
		
			For example, consider the following replacement probe:
		
probe timer.s(3)
{
  foreach (count in reads- limit 10)
    printf("%s : %d \n", count, reads[count])
}
			This foreach statement instructs the script to process the elements in the array reads in descending order (of associated value). The limit 10 option instructs the foreach to only process the first ten iterations (i.e. print the first 10, starting with the highest value).
		
3.5.5. Clearing/Deleting Arrays and Array Elements
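			Example 3.15, “cumulative-vfsreads.stp” reports counts that keep growing for as long as the script runs; it does not show how many VFS reads each process makes per 3-second period. To do that, you will need to clear the values accumulated by the array. You can accomplish this using the delete operator to delete elements in an array, or an entire array. Consider the following example: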
		
Example 3.16. noncumulative-vfsreads.stp
global reads
probe vfs.read
{ 
  reads[execname()] ++
}
probe timer.s(3)
{
  foreach (count in reads)
    printf("%s : %d \n", count, reads[count])
  delete reads	
}
			In 
Example 3.16, “noncumulative-vfsreads.stp”, the second probe prints the number of VFS reads each process made 
within the probed 3-second period only. The 
delete reads statement clears the 
reads array within the probe.
		
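				You can also have multiple array operations within the same probe. The following script tracks the number of VFS reads each process makes per 3-second period and also tallies the cumulative VFS reads of those same processes:
			
global reads, totalreads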
probe vfs.read
{
  reads[execname()] ++
  totalreads[execname()] ++
}
probe timer.s(3)
{
  printf("=======\n")
  foreach (count in reads-) 
    printf("%s : %d \n", count, reads[count])
  delete reads
}
probe end
{
  printf("TOTALS\n")
  foreach (total in totalreads-)
    printf("%s : %d \n", total, totalreads[total])
}
				In this example, the arrays reads and totalreads track the same information, and are printed out in a similar fashion. The only difference here is that reads is cleared every 3-second period, whereas totalreads keeps growing.
			
3.5.6. Using Arrays in Conditional Statements
			You can also use associative arrays in if
 statements. This is useful if you want to execute a subroutine once a 
value in the array matches a certain condition. Consider the following 
example:
		
Example 3.17. vfsreads-print-if-1kb.stp
global reads
probe vfs.read
{
  reads[execname()] ++
}
probe timer.s(3)
{
  printf("=======\n")
  foreach (count in reads-)
    if (reads[count] >= 1024)
      printf("%s : %dkB \n", count, reads[count]/1024)
    else
      printf("%s : %dB \n", count, reads[count])
}
			Every three seconds, 
Example 3.17, “vfsreads-print-if-1kb.stp”
 prints out a list of all processes, along with how many times each 
process performed a VFS read. If the associated value of a process name 
is equal to or greater than 1024, the 
if statement in the script converts and prints it out in 
kB.
		
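			You can also test whether a specific unique key is a member of an array, using the following format:
		
if([index_expression] in array_name) statement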
			To illustrate this, consider the following example:
		
Example 3.18. vfsreads-stop-on-stapio2.stp
global reads
probe vfs.read
{
  reads[execname()] ++
}
probe timer.s(3)
{
  printf("=======\n")
  foreach (count in reads+) 
    printf("%s : %d \n", count, reads[count])
  if(["stapio"] in reads) {
    printf("stapio read detected, exiting\n")
    exit()
  }
}
			The if(["stapio"] in reads) statement instructs the script to print stapio read detected, exiting once the unique key stapio is added to the array reads.
		
3.5.7. Computing for Statistical Aggregates
			Statistical aggregates are used to collect statistics on numerical 
values where it is important to accumulate new data quickly and in large
 volume (i.e. storing only aggregated stream statistics). Statistical 
aggregates can be used in global variables or as elements in an array.
		
			To add value to a statistical aggregate, use the operator <<< value.
		
Example 3.19. stat-aggregates.stp
global reads	
probe vfs.read
{
  reads[execname()] <<< count
}
			In 
Example 3.19, “stat-aggregates.stp”, the operator 
<<< count stores the amount returned by 
count to the associated value of the corresponding 
execname() in the 
reads array. Remember, these values are 
stored;
 they are not added to the associated values of each unique key, nor are
 they used to replace the current associated values. In a manner of 
speaking, think of each unique key (
execname()) as having multiple associated values, accumulating with each probe handler run.
		
			To extract data collected by statistical aggregates, use the syntax format @extractor(variable/array index expression). extractor can be any of the following integer extractors:
		
- count
 
						Returns the number of all values stored into the variable/array index expression. Given the sample probe in 
Example 3.19, “stat-aggregates.stp”, the expression 
@count(reads[execname()]) will return 
how many values are stored in each unique key in array 
reads.
					
- sum
 
						Returns the sum of all values stored into the variable/array index expression. Again, given the sample probe in 
Example 3.19, “stat-aggregates.stp”, the expression 
@sum(reads[execname()]) will return 
the total of all values stored in each unique key in array 
reads.
					
- min
 
						Returns the smallest among all the values stored in the variable/array index expression.
					
- max
 
						Returns the largest among all the values stored in the variable/array index expression.
					
- avg
 
						Returns the average of all values stored in the variable/array index expression.
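					
			As a sketch of how the extractors fit together, the following script reuses the probe from Example 3.19, “stat-aggregates.stp” unchanged and adds a reporting probe. The reporting probe, its 5-second interval, and the loop variable proc are illustrative assumptions:

```systemtap
global reads
probe vfs.read
{
  # same statement as in Example 3.19: store one value per read
  reads[execname()] <<< count
}
probe timer.s(5)
{
  # apply each integer extractor to the aggregate stored under every key
  foreach (proc in reads)
    printf("%s: count=%d sum=%d min=%d max=%d avg=%d\n", proc,
           @count(reads[proc]), @sum(reads[proc]),
           @min(reads[proc]), @max(reads[proc]), @avg(reads[proc]))
  exit()
}
```

			Every extractor returns an integer, so each result is printed with %d.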
					
			When using statistical aggregates, you can also build array 
constructs that use multiple index expressions (to a maximum of 5). This
 is helpful in capturing additional contextual information during a 
probe. For example:
		
Example 3.20. Multiple Array Indexes
global reads
probe vfs.read
{
  reads[execname(),pid()] <<< 1
}
probe timer.s(3)
{
  foreach([var1,var2] in reads)
    printf("%s (%d) : %d \n", var1, var2, @count(reads[var1,var2]))
}
			In 
Example 3.20, “Multiple Array Indexes”,
 the first probe tracks how many times each process performs a VFS read.
 What makes this different from earlier examples is that this array 
associates a performed read to both a process name 
and its corresponding process ID.
		
			The second probe in 
Example 3.20, “Multiple Array Indexes” demonstrates how to process and print the information collected by the array 
reads. Note how the 
foreach statement uses the same number of variables (i.e. 
var1 and 
var2) contained in the first instance of the array 
reads from the first probe.
		
			Tapsets are scripts that form a library of
 pre-written probes and functions to be used in SystemTap scripts. When a
 user runs a SystemTap script, SystemTap checks the script's probe 
events and handlers against the tapset library; SystemTap then loads the
 corresponding probes and functions before translating the script to C 
(refer to 
Section 3.1, “Architecture” for information on what transpires in a SystemTap session).
		
 
			Like SystemTap scripts, tapsets use the file name extension .stp. The standard library of tapsets is located in /usr/share/systemtap/tapset/
 by default. However, unlike SystemTap scripts, tapsets are not meant 
for direct execution; rather, they constitute the library from which 
other scripts can pull definitions.
		
			Simply put, the tapset library is an abstraction layer designed to 
make it easier for users to define events and functions. In a manner of 
speaking, tapsets provide useful aliases for functions that users may 
want to specify as an event; knowing the proper alias to use is, for the
 most part, easier than remembering specific kernel functions that might
 vary between kernel versions.
		
Chapter 4. Useful SystemTap Scripts
		This chapter enumerates several SystemTap scripts you can use to 
monitor and investigate different subsystems. All of these scripts are 
available at /usr/share/systemtap/testsuite/systemtap.examples/ once you install the systemtap-testsuite RPM.
	
			The following sections showcase scripts that trace network-related functions and build a profile of network activity.
		
		This section describes how to profile network activity. 
nettop.stp provides a glimpse into how much network traffic each process is generating on a machine.
	
			
#! /usr/bin/env stap
global ifxmit, ifrecv
global ifmerged
probe netdev.transmit
{
  ifxmit[pid(), dev_name, execname(), uid()] <<< length
}
probe netdev.receive
{
  ifrecv[pid(), dev_name, execname(), uid()] <<< length
}
function print_activity()
{
  printf("%5s %5s %-7s %7s %7s %7s %7s %-15s\n",
         "PID", "UID", "DEV", "XMIT_PK", "RECV_PK",
         "XMIT_KB", "RECV_KB", "COMMAND")
  foreach ([pid, dev, exec, uid] in ifrecv) {
    ifmerged[pid, dev, exec, uid] += @count(ifrecv[pid,dev,exec,uid]);
  }
  foreach ([pid, dev, exec, uid] in ifxmit) {
    ifmerged[pid, dev, exec, uid] += @count(ifxmit[pid,dev,exec,uid]);
  }
  foreach ([pid, dev, exec, uid] in ifmerged-) {
    n_xmit = @count(ifxmit[pid, dev, exec, uid])
    n_recv = @count(ifrecv[pid, dev, exec, uid])
    printf("%5d %5d %-7s %7d %7d %7d %7d %-15s\n",
           pid, uid, dev, n_xmit, n_recv,
           n_xmit ? @sum(ifxmit[pid, dev, exec, uid])/1024 : 0,
           n_recv ? @sum(ifrecv[pid, dev, exec, uid])/1024 : 0,
           exec)
  }
  print("\n")
  delete ifxmit
  delete ifrecv
  delete ifmerged
}
probe timer.ms(5000), end, error
{
  print_activity()
}
		 
		Note that function print_activity() uses the following expressions:
	
n_xmit ? @sum(ifxmit[pid, dev, exec, uid])/1024 : 0
n_recv ? @sum(ifrecv[pid, dev, exec, uid])/1024 : 0
		These expressions are conditional (ternary) expressions, which behave like compact if/else statements. The second expression is 
simply a more concise way of writing the following pseudocode:
	
if n_recv != 0 then
  @sum(ifrecv[pid, dev, exec, uid])/1024
else
  0
		nettop.stp tracks which processes are generating network traffic on the system, and provides the following information about each process:
	
 
				PID — the ID of the listed process.
			
				UID — user ID. A user ID of 0 refers to the root user.
			
				DEV — which ethernet device the process used to send / receive data (e.g. eth0, eth1)
			
				XMIT_PK — number of packets transmitted by the process
			
				RECV_PK — number of packets received by the process
			
				XMIT_KB — amount of data sent by the process, in kilobytes
			
				RECV_KB — amount of data received by the process, in kilobytes
			
Example 4.1. nettop.stp Sample Output
[...]
  PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND        
    0     0 eth0          0       5       0       0 swapper        
11178     0 eth0          2       0       0       0 synergyc       
  PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND        
 2886     4 eth0         79       0       5       0 cups-polld     
11362     0 eth0          0      61       0       5 firefox        
    0     0 eth0          3      32       0       3 swapper        
 2886     4 lo            4       4       0       0 cups-polld     
11178     0 eth0          3       0       0       0 synergyc       
  PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND        
    0     0 eth0          0       6       0       0 swapper        
 2886     4 lo            2       2       0       0 cups-polld     
11178     0 eth0          3       0       0       0 synergyc       
 3611     0 eth0          0       1       0       0 Xorg           
  PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND        
    0     0 eth0          3      42       0       2 swapper        
11178     0 eth0         43       1       3       0 synergyc       
11362     0 eth0          0       7       0       0 firefox        
 3897     0 eth0          0       1       0       0 multiload-apple
[...]
4.1.2. Tracing Functions Called in Network Socket Code
		This section describes how to trace functions called from the kernel's net/socket.c file. This task helps you identify, in finer detail, how each process interacts with the network at the kernel level.
	
[...]
0 Xorg(3611): -> sock_poll
3 Xorg(3611): <- sock_poll
0 Xorg(3611): -> sock_poll
3 Xorg(3611): <- sock_poll
0 gnome-terminal(11106): -> sock_poll
5 gnome-terminal(11106): <- sock_poll
0 scim-bridge(3883): -> sock_poll
3 scim-bridge(3883): <- sock_poll
0 scim-bridge(3883): -> sys_socketcall
4 scim-bridge(3883):  -> sys_recv
8 scim-bridge(3883):   -> sys_recvfrom
12 scim-bridge(3883):-> sock_from_file
16 scim-bridge(3883):<- sock_from_file
20 scim-bridge(3883):-> sock_recvmsg
24 scim-bridge(3883):<- sock_recvmsg
28 scim-bridge(3883):   <- sys_recvfrom
31 scim-bridge(3883):  <- sys_recv
35 scim-bridge(3883): <- sys_socketcall
[...]
4.1.3. Monitoring Incoming TCP Connections
		This section illustrates how to monitor incoming TCP connections. This
 task is useful in identifying any unauthorized, suspicious, or 
otherwise unwanted network access requests in real time.
	
		While 
tcp_connections.stp is running, it will print out the following information about any incoming TCP connections accepted by the system in real time:
	
				Current UID
			
				CMD - the command accepting the connection
			
				PID of the command
			
				Port used by the connection
			
				IP address from which the TCP connection originated
			
UID            CMD    PID   PORT        IP_SOURCE
0             sshd   3165     22      10.64.0.227
0             sshd   3165     22      10.64.0.227
4.1.4. Monitoring Network Packet Drops in the Kernel
		
		 The network stack in Linux can discard packets for various reasons. Some Linux kernels include a tracepoint, 
kernel.trace("kfree_skb"), which easily tracks where packets are discarded. 
dropwatch.stp uses 
kernel.trace("kfree_skb") to trace packet discards; every five seconds, the script summarizes which locations discarded packets.
	
 
		The kernel.trace("kfree_skb") tracepoint identifies the places in the kernel where network packets are dropped. It has two arguments: a pointer to the buffer being freed ($skb) and the location in kernel code where the buffer is being freed ($location).
	
Example 4.4. dropwatch.stp Sample Output
Monitoring for dropped packets
51 packets dropped at location 0xffffffff8024cd0f
2 packets dropped at location 0xffffffff8044b472
51 packets dropped at location 0xffffffff8024cd0f
1 packets dropped at location 0xffffffff8044b472
97 packets dropped at location 0xffffffff8024cd0f
1 packets dropped at location 0xffffffff8044b472
Stopping dropped packet monitor
		To make the location of packet drops more meaningful, refer to the 
/boot/System.map-`uname -r` file. This file lists the starting addresses for each function, allowing you to map the addresses in the output of 
Example 4.4, “dropwatch.stp Sample Output” to a specific function name. Given the following snippet of the 
/boot/System.map-`uname -r` file, the address 0xffffffff8024cd0f maps to the function 
unix_stream_recvmsg and the address 0xffffffff8044b472 maps to the function 
arp_rcv:
	
[...]
ffffffff8024c5cd T unlock_new_inode
ffffffff8024c5da t unix_stream_sendmsg
ffffffff8024c920 t unix_stream_recvmsg
ffffffff8024cea1 t udp_v4_lookup_longway
[...]
ffffffff8044addc t arp_process
ffffffff8044b360 t arp_rcv
ffffffff8044b487 t parp_redo
ffffffff8044b48c t arp_solicit
[...]
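		The same lookup can also be attempted inside the script. The following sketch is not the dropwatch.stp listing itself. It assumes that the tapset function symname(), which returns the name of the kernel symbol associated with an address, is available in your SystemTap version; the array name locations and the loop variable l are illustrative.

```systemtap
global locations
probe begin { printf("Monitoring for dropped packets\n") }
probe kernel.trace("kfree_skb")
{
  # count one drop per freed buffer, keyed by the freeing location
  locations[$location] <<< 1
}
probe timer.s(5)
{
  # print the busiest locations first, resolved to symbol names
  foreach (l in locations-)
    printf("%d packets dropped at %s\n", @count(locations[l]), symname(l))
  delete locations
}
```

		If symbol information for an address is unavailable, the name may not resolve; in that case the System.map lookup described above still applies.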
			The following sections showcase scripts that monitor disk and I/O activity.
		
4.2.1. Summarizing Disk Read/Write Traffic
		This section describes how to identify which processes are performing the heaviest disk reads/writes to the system.
	
			
#!/usr/bin/stap
#
# Copyright (C) 2007 Oracle Corp.
#
# Get the status of reading/writing disk every 5 seconds,
# output top ten entries 
#
# This is free software,GNU General Public License (GPL);
# either version 2, or (at your option) any later version.
#
# Usage:
#  ./disktop.stp
#
global io_stat,device
global read_bytes,write_bytes
probe vfs.read.return {
  if ($return>0) {
    if (devname!="N/A") {/*skip read from cache*/
      io_stat[pid(),execname(),uid(),ppid(),"R"] += $return
      device[pid(),execname(),uid(),ppid(),"R"] = devname
      read_bytes += $return
    }
  }
}
probe vfs.write.return {
  if ($return>0) {
    if (devname!="N/A") { /*skip update cache*/
      io_stat[pid(),execname(),uid(),ppid(),"W"] += $return
      device[pid(),execname(),uid(),ppid(),"W"] = devname
      write_bytes += $return
    }
  }
}
probe timer.ms(5000) {
  /* skip non-read/write disk */
  if (read_bytes+write_bytes) {
    printf("\n%-25s, %-8s%4dKb/sec, %-7s%6dKb, %-7s%6dKb\n\n",
           ctime(gettimeofday_s()),
           "Average:", ((read_bytes+write_bytes)/1024)/5,
           "Read:",read_bytes/1024,
           "Write:",write_bytes/1024)
    /* print header */
    printf("%8s %8s %8s %25s %8s %4s %12s\n",
           "UID","PID","PPID","CMD","DEVICE","T","BYTES")
  }
  /* print top ten I/O */
  foreach ([process,cmd,userid,parent,action] in io_stat- limit 10)
    printf("%8d %8d %8d %25s %8s %4s %12d\n",
           userid,process,parent,cmd,
           device[process,cmd,userid,parent,action],
           action,io_stat[process,cmd,userid,parent,action])
  /* clear data */
  delete io_stat
  delete device
  read_bytes = 0
  write_bytes = 0  
}
probe end{
  delete io_stat
  delete device
  delete read_bytes
  delete write_bytes
}
		 
				UID — user ID. A user ID of 0 refers to the root user.
			
				PID — the ID of the listed process.
			
				PPID — the process ID of the listed process's parent process.
			
				CMD — the name of the listed process.
			
				DEVICE — which storage device the listed process is reading from or writing to.
			
				T — the type of action performed by the listed process; W refers to write, while R refers to read.
			
				BYTES — the amount of data read from or written to disk.
			
		The time and date in the output of 
disktop.stp is returned by the functions 
ctime() and 
gettimeofday_s(). 
gettimeofday_s() returns the number of seconds that have passed since the Unix epoch (January 1, 1970). 
ctime() converts that 
count of seconds into a human-readable calendar date and time for the output.
	
		In this script, $return is a local variable that stores the actual number of bytes each process reads from or writes to the virtual file system. $return can only be used in return probes (e.g. vfs.read.return and vfs.write.return).
	
Example 4.5. disktop.stp Sample Output
[...]
Mon Sep 29 03:38:28 2008 , Average:  19Kb/sec, Read: 7Kb, Write: 89Kb
UID      PID     PPID                       CMD   DEVICE    T    BYTES
0    26319    26294                   firefox     sda5    W        90229
0     2758     2757           pam_timestamp_c     sda5    R         8064
0     2885        1                     cupsd     sda5    W         1678
Mon Sep 29 03:38:38 2008 , Average:   1Kb/sec, Read: 7Kb, Write: 1Kb
UID      PID     PPID                       CMD   DEVICE    T    BYTES
0     2758     2757           pam_timestamp_c     sda5    R         8064
0     2885        1                     cupsd     sda5    W         1678
4.2.2. Tracking I/O Time For Each File Read or Write
		This section describes how to monitor the amount of time it takes for 
each process to read from or write to any file. This is useful if you 
wish to determine what files are slow to load on a given system.
	
		iotime.stp tracks each time a system call opens, closes, reads from, and writes to a file. For each file any system call accesses, 
iotime.stp
 counts the number of microseconds it takes for any reads or writes to 
finish and tracks the amount of data (in bytes) read from or written to 
the file.
	
 
		iotime.stp also uses the local variable 
$count to track the amount of data (in bytes) that any system call 
attempts to read or write. Note that 
$return (as used in 
disktop.stp from 
Section 4.2.1, “Summarizing Disk Read/Write Traffic”) stores the 
actual amount of data read/written. 
$count can only be used on probes that track data reads or writes (e.g. 
syscall.read and 
syscall.write).
	
 Example 4.6. iotime.stp Sample Output
[...]
825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
[...]
117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
[...]
3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
3973744 2886 (sendmail) iotime /proc/loadavg time: 11
[...]
		Each line of the output contains the following data:

				A timestamp, in microseconds.
			
				Process ID and process name.
			
				An access or iotime flag.
			
				The file accessed.
			
		If a process was able to read or write any data, a pair of access and iotime lines should appear together. The timestamp on the access line is the time at which the process started accessing the file; the end of the line shows the amount of data read and written (in bytes). The iotime line shows how long (in microseconds) the process took to perform the read or write.
	
		If an access line is not followed by an iotime line, it simply means that the process did not read or write any data.
	
4.2.3. Track Cumulative IO
		This section describes how to track the cumulative amount of I/O to the system.
	
		traceio.stp prints the top ten executables generating I/O traffic over time. It also tracks the cumulative amount of I/O reads and writes done by those ten executables. This information is printed at 1-second intervals, in descending order.
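		A minimal sketch of this kind of cumulative tracking is shown below. It is not traceio.stp itself; the array names are assumptions, and the output format simply imitates the sample output.

```systemtap
# Sketch only: cumulative read/write totals per executable, printed every second.
global rd_total, wr_total, io_total   # hypothetical names, keyed by executable name

probe vfs.read.return {
  if ($return > 0) {
    rd_total[execname()] += $return
    io_total[execname()] += $return
  }
}

probe vfs.write.return {
  if ($return > 0) {
    wr_total[execname()] += $return
    io_total[execname()] += $return
  }
}

probe timer.s(1) {
  # Sort by combined I/O, largest first, and show the top ten.
  foreach (exe in io_total- limit 10)
    printf("%15s r: %8d KiB w: %8d KiB\n",
           exe, rd_total[exe] / 1024, wr_total[exe] / 1024)
  printf("\n")
}
```

		Because the arrays are never cleared, the figures keep growing for as long as the script runs, which is what makes the totals cumulative.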
	
 Example 4.7. traceio.stp Sample Output
[...]
           Xorg r:   583401 KiB w:        0 KiB
       floaters r:       96 KiB w:     7130 KiB
multiload-apple r:      538 KiB w:      537 KiB
           sshd r:       71 KiB w:       72 KiB
pam_timestamp_c r:      138 KiB w:        0 KiB
        staprun r:       51 KiB w:       51 KiB
          snmpd r:       46 KiB w:        0 KiB
          pcscd r:       28 KiB w:        0 KiB
     irqbalance r:       27 KiB w:        4 KiB
          cupsd r:        4 KiB w:       18 KiB
           Xorg r:   588140 KiB w:        0 KiB
       floaters r:       97 KiB w:     7143 KiB
multiload-apple r:      543 KiB w:      542 KiB
           sshd r:       72 KiB w:       72 KiB
pam_timestamp_c r:      138 KiB w:        0 KiB
        staprun r:       51 KiB w:       51 KiB
          snmpd r:       46 KiB w:        0 KiB
          pcscd r:       28 KiB w:        0 KiB
     irqbalance r:       27 KiB w:        4 KiB
          cupsd r:        4 KiB w:       18 KiB
4.2.4. I/O Monitoring (By Device)
		This section describes how to monitor I/O activity on a specific device.
	
		traceio2.stp takes one argument: the whole device number. To get this number, use stat -c "0x%D" directory, where directory is located on the device you wish to monitor.
	
 
		The usrdev2kerndev() function converts the whole device number into the format understood by the kernel. The output produced by usrdev2kerndev() is used in conjunction with the MKDEV(), MINOR(), and MAJOR() functions to determine the major and minor numbers of a specific device.
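		The conversion can be tried on its own with a short script such as the following sketch, which takes the whole device number as its only argument. The variable name kdev is an assumption made for illustration.

```systemtap
# Sketch only: converts a user-space device number and prints its parts.
probe begin {
  kdev = usrdev2kerndev($1)
  printf("kernel device 0x%x: major %d, minor %d\n",
         kdev, MAJOR(kdev), MINOR(kdev))
  # MKDEV() rebuilds the kernel device number from the major and minor numbers.
  printf("MKDEV(major, minor) = 0x%x\n", MKDEV(MAJOR(kdev), MINOR(kdev)))
  exit()
}
```

		If the script is saved under a name of your choice and run with 0x805 as its argument, the kernel device number it prints should correspond to the 0x800005 value shown in Example 4.8.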
	
		The output of traceio2.stp includes the name and ID of any process performing a read/write, the function it is performing (i.e. vfs_read or vfs_write), and the kernel device number.
	
		The following example is an excerpt from the full output of stap traceio2.stp 0x805, where 0x805 is the whole device number of /home. /home resides in /dev/sda5, which is the device we wish to monitor.
	
Example 4.8. traceio2.stp Sample Output
[...]
synergyc(3722) vfs_read 0x800005
synergyc(3722) vfs_read 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
[...]
4.2.5. Monitoring Reads and Writes to a File
		This section describes how to monitor reads from and writes to a file in real time.
	
		inodewatch.stp takes the following information about the file as arguments on the command line: the file's major device number, its minor device number, and its inode number.
	
 
		To get this information, use stat -c '%D %i' filename, where filename is an absolute path.
	
		For instance: if you wish to monitor /etc/crontab, run stat -c '%D %i' /etc/crontab first. This gives the following output:
	
805 1078319
		805 is the base-16 (hexadecimal) device number. The lower two digits are the minor device number and the upper digits are the major number. 1078319 is the inode number. To start monitoring /etc/crontab, run stap inodewatch.stp 0x8 0x05 1078319 (the 0x prefixes indicate base-16 values).
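		A script that accepts these three arguments could be sketched as follows. This is an illustration, not necessarily the inodewatch.stp shipped with SystemTap; it assumes that the vfs.read and vfs.write probe aliases supply the dev and ino variables.

```systemtap
# Sketch only: reports reads and writes on one file, identified by device and inode.
# $1 = major device number, $2 = minor device number, $3 = inode number
probe vfs.read, vfs.write {
  # dev and ino are assumed to be set by the vfs.read/vfs.write probe aliases.
  if (dev == MKDEV($1, $2) && ino == $3)
    printf("%s(%d) %s 0x%x/%u\n",
           execname(), pid(), probefunc(), dev, ino)
}
```

		Comparing against MKDEV($1, $2) means the major and minor numbers can be passed exactly as they appear in the stat output, without converting them by hand.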
	
		The output of this command contains the name and ID of any process performing a read/write, the function it is performing (i.e. vfs_read or vfs_write), the device number (in hex format), and the inode number. Example 4.9, “inodewatch.stp Sample Output” contains the output of stap inodewatch.stp 0x8 0x05 1078319 (when cat /etc/crontab is executed while the script is running):
	
cat(16437) vfs_read 0x800005/1078319
cat(16437) vfs_read 0x800005/1078319
4.2.6. Monitoring Changes to File Attributes
		This section describes how to monitor if any processes are changing the attributes of a targeted file, in real time.
	
chmod(17448) inode_setattr 0x800005/6011835 100777 500
chmod(17449) inode_setattr 0x800005/6011835 100666 500
			The following sections showcase scripts that profile kernel activity by monitoring function calls.
		
4.3.1. Counting Function Calls Made
		This section describes how to identify how many times the system 
called a specific kernel function in a 30-second sample. Depending on 
your use of wildcards, you can also use this script to target multiple 
kernel functions.
	
		functioncallcount.stp takes the
 targeted kernel function as an argument. The argument supports 
wildcards, which enables you to target multiple kernel functions up to a
 certain extent.
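		The counting approach can be sketched as follows. This is a simplified illustration, not functioncallcount.stp itself; the array name fn_calls is an assumption.

```systemtap
# Sketch only: counts calls to the kernel functions matched by the first argument.
global fn_calls   # hypothetical name: function name -> number of calls

probe kernel.function(@1).call {
  fn_calls[probefunc()]++
}

# End the sample after 30 seconds.
probe timer.s(30) {
  exit()
}

probe end {
  # List the functions alphabetically with their call counts.
  foreach (fn+ in fn_calls)
    printf("%s %d\n", fn, fn_calls[fn])
}
```

		As an example of a wildcard argument (chosen here for illustration), passing "*@mm/*.c" would target every function defined in the kernel's mm directory. Broad wildcards attach a large number of probes and can add noticeable overhead.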
	
 [...]
__vma_link 97
__vma_link_file 66
__vma_link_list 97
__vma_link_rb 97
__xchg 103
add_page_to_active_list 102
add_page_to_inactive_list 19
add_to_page_cache 19
add_to_page_cache_lru 7
all_vm_events 6
alloc_pages_node 4630
alloc_slabmgmt 67
anon_vma_alloc 62
anon_vma_free 62
anon_vma_lock 66
anon_vma_prepare 98
anon_vma_unlink 97
anon_vma_unlock 66
arch_get_unmapped_area_topdown 94
arch_get_unmapped_exec_area 3
arch_unmap_area_topdown 97
atomic_add 2
atomic_add_negative 97
atomic_dec_and_test 5153
atomic_inc 470
atomic_inc_and_test 1
[...]
4.3.2. Call Graph Tracing
		This section describes how to trace incoming and outgoing function calls.
	
		para-callgraph.stp takes two command-line arguments:

				The function(s) whose entry/exit you'd like to trace ($1).
			
				A second optional trigger function ($2),
 which enables or disables tracing on a per-thread basis. Tracing in 
each thread will continue as long as the trigger function has not exited
 yet.
			
		para-callgraph.stp uses thread_indent(); as such, its output contains the timestamp, process name, and thread ID of $1 (i.e. the probe function you are tracing). For more information about thread_indent(), refer to its entry in SystemTap Functions.
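		The role of thread_indent() can be seen in the following minimal sketch, which traces the entry and exit of the functions named by $1. It omits the optional trigger function and the parameter printing shown in the sample output, so it is not a replacement for para-callgraph.stp.

```systemtap
# Sketch only: indented entry/exit trace of the probe point passed as $1.
probe $1.call {
  # thread_indent(1) increases the nesting depth for the current thread.
  printf("%s->%s\n", thread_indent(1), probefunc())
}

probe $1.return {
  # thread_indent(-1) decreases the nesting depth again.
  printf("%s<-%s\n", thread_indent(-1), probefunc())
}
```

		The sketch expects a full probe point as its argument, for example 'kernel.function("*@fs/open.c")' (a file pattern chosen here purely for illustration).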
	
 
		The following example contains an excerpt from the output for stap para-callgraph.stp 'kernel.function("*@fs/*.c")' 'kernel.function("sys_read")':
	
[...]
   267 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
   269 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
     0 gnome-terminal(2921):->fput file=0xffff880111eebbc0
     2 gnome-terminal(2921):<-fput 
     0 gnome-terminal(2921):->fget_light fd=0x3 fput_needed=0xffff88010544df54
     3 gnome-terminal(2921):<-fget_light return=0xffff8801116ce980
     0 gnome-terminal(2921):->vfs_read file=0xffff8801116ce980 buf=0xc86504 count=0x1000 pos=0xffff88010544df48
     4 gnome-terminal(2921): ->rw_verify_area read_write=0x0 file=0xffff8801116ce980 ppos=0xffff88010544df48 count=0x1000
     7 gnome-terminal(2921): <-rw_verify_area return=0x1000
    12 gnome-terminal(2921): ->do_sync_read filp=0xffff8801116ce980 buf=0xc86504 len=0x1000 ppos=0xffff88010544df48
    15 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
    18 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
     0 gnome-terminal(2921):->fput file=0xffff8801116ce980
4.3.3. Determining Time Spent in Kernel and User Space
		This section illustrates how to determine the amount of time any given thread is spending in either kernel space or user space.
	
		thread-times.stp lists the top
 20 processes currently taking up CPU time within a 5-second sample, 
along with the total number of CPU ticks made during the sample. The 
output of this script also notes the percentage of CPU time each process
 used, as well as whether that time was spent in kernel space or user 
space.
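		The sampling idea behind this kind of report can be sketched as follows. This is an approximation for illustration, not thread-times.stp itself: the variable names are assumptions, and the percentages are truncated to whole numbers because the script uses integer arithmetic.

```systemtap
# Sketch only: samples each CPU on the profiling timer and attributes ticks to threads.
global u_ticks, k_ticks, all_ticks, total   # hypothetical names

probe timer.profile {
  t = tid()
  if (user_mode()) {
    u_ticks[t]++
  } else {
    k_ticks[t]++
  }
  all_ticks[t]++
  total++
}

probe timer.s(5) {
  if (total > 0) {
    printf("%5s %6s %7s (of %d ticks)\n", "tid", "%user", "%kernel", total)
    # Show the 20 threads that accumulated the most ticks.
    foreach (t in all_ticks- limit 20)
      printf("%5d %5d%% %6d%%\n", t,
             100 * u_ticks[t] / total, 100 * k_ticks[t] / total)
  }
  exit()
}
```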
	
   tid   %user %kernel (of 20002 ticks)
    0   0.00%  87.88%
32169   5.24%   0.03%
 9815   3.33%   0.36%
 9859   0.95%   0.00%
 3611   0.56%   0.12%
 9861   0.62%   0.01%
11106   0.37%   0.02%
32167   0.08%   0.08%
 3897   0.01%   0.08%
 3800   0.03%   0.00%
 2886   0.02%   0.00%
 3243   0.00%   0.01%
 3862   0.01%   0.00%
 3782   0.00%   0.00%
21767   0.00%   0.00%
 2522   0.00%   0.00%
 3883   0.00%   0.00%
 3775   0.00%   0.00%
 3943   0.00%   0.00%
 3873   0.00%   0.00%
4.3.4. Monitoring Polling Applications
		This section describes how to identify and monitor which applications 
are polling. Doing so allows you to track unnecessary or excessive 
polling, which can help you pinpoint areas for improvement in terms of 
CPU usage and power savings.
	
		timeout.stp tracks how many times each application used the following system calls over time:
	
 
				poll
			
				select
			
				epoll
			
				itimer
			
				futex
			
				nanosleep
			
				signal
			
		In some applications, these system calls are used excessively. As such, they are normally identified as "likely culprits" for polling applications. Note, however, that an application may be using a different system call to poll excessively; sometimes, it is useful to find out the top system calls used by the system (refer to Section 4.3.5, “Tracking Most Frequently Used System Calls” for instructions). Doing so can help you identify any additional suspects, which you can add to timeout.stp for tracking.
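		A reduced version of this idea, covering only four of the system calls listed above, might look like the following sketch. It is not timeout.stp; the array name poll_counts and the 10-second window are assumptions. The ? suffix marks each probe point as optional, so the script still loads on systems where one of these system calls is not available.

```systemtap
# Sketch only: counts a few polling-related system calls per executable.
global poll_counts   # hypothetical name: [execname, syscall name] -> count

probe syscall.poll ?, syscall.select ?, syscall.epoll_wait ?, syscall.nanosleep ? {
  # name is the system call name supplied by the syscall probe aliases.
  poll_counts[execname(), name]++
}

probe timer.s(10) {
  # Print the 20 busiest (executable, system call) pairs, then stop.
  foreach ([exe, sc] in poll_counts- limit 20)
    printf("%-16s %-12s %d\n", exe, sc, poll_counts[exe, sc])
  exit()
}
```

		Additional suspects can be tracked by appending their probe points to the list in the first probe.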
	
Example 4.14. timeout.stp Sample Output
  uid |   poll  select   epoll  itimer   futex nanosle  signal| process
28937 | 148793       0       0    4727   37288       0       0| firefox
22945 |      0   56949       0       1       0       0       0| scim-bridge
    0 |      0       0       0   36414       0       0       0| swapper
 4275 |  23140       0       0       1       0       0       0| mixer_applet2
 4191 |      0   14405       0       0       0       0       0| scim-launcher
22941 |   7908       1       0      62       0       0       0| gnome-terminal
 4261 |      0       0       0       2       0    7622       0| escd
 3695 |      0       0       0       0       0    7622       0| gdm-binary
 3483 |      0    7206       0       0       0       0       0| dhcdbd
 4189 |   6916       0       0       2       0       0       0| scim-panel-gtk
 1863 |   5767       0       0       0       0       0       0| iscsid
 2562 |      0    2881       0       1       0    1438       0| pcscd
 4257 |   4255       0       0       1       0       0       0| gnome-power-man
 4278 |   3876       0       0      60       0       0       0| multiload-apple
 4083 |      0    1331       0    1728       0       0       0| Xorg
 3921 |   1603       0       0       0       0       0       0| gam_server
 4248 |   1591       0       0       0       0       0       0| nm-applet
 3165 |      0    1441       0       0       0       0       0| xterm
29548 |      0    1440       0       0       0       0       0| httpd
 1862 |      0       0       0       0       0    1438       0| iscsid
		You can increase the sample time by editing the timer in the second probe (timer.s()). The output of timeout.stp contains the name and UID of the top 20 polling applications, along with how many times each application performed each polling system call (over time). Example 4.14, “timeout.stp Sample Output” contains an excerpt of the script's output.
	
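4.3.5. Tracking Most Frequently Used System Calls
		timeout.stp from Section 4.3.4, “Monitoring Polling Applications” helps you identify which applications are polling by tracking how often they use the following system calls: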
				poll
			
				select
			
				epoll
			
				itimer
			
				futex
			
				nanosleep
			
				signal
			
		However, in some systems, a different system call might be responsible for excessive polling. If you suspect that a polling application is using a different system call to poll, you first need to identify the top system calls used by the system. To do this, use topsys.stp.
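		The following sketch shows one way such a system-wide count could be gathered. It is an illustration under assumed names (sys_counts, a 5-second interval), not necessarily the topsys.stp distributed with SystemTap.

```systemtap
# Sketch only: prints the 20 most frequently used system calls every 5 seconds.
global sys_counts   # hypothetical name: syscall name -> count

probe syscall.* {
  # name is the system call name supplied by each syscall probe alias.
  sys_counts[name]++
}

probe timer.s(5) {
  printf("%25s %10s\n", "SYSCALL", "COUNT")
  foreach (sc in sys_counts- limit 20)
    printf("%25s %10d\n", sc, sys_counts[sc])
  # Start counting afresh for the next interval.
  delete sys_counts
}
```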
	
Example 4.15. topsys.stp Sample Output
--------------------------------------------------------------
                  SYSCALL      COUNT
             gettimeofday       1857
                     read       1821
                    ioctl       1568
                     poll       1033
                    close        638
                     open        503
                   select        455
                    write        391
                   writev        335
                    futex        303
                  recvmsg        251
                   socket        137
            clock_gettime        124
           rt_sigprocmask        121
                   sendto        120
                setitimer        106
                     stat         90
                     time         81
                sigreturn         72
                    fstat         66
--------------------------------------------------------------
4.3.6. Tracking System Call Volume Per Process
		This section illustrates how to determine which processes are performing the highest volume of system calls. In previous sections, we've described how to monitor the top system calls used by the system over time (Section 4.3.5, “Tracking Most Frequently Used System Calls”). We've also described how to identify which applications use a specific set of "polling suspect" system calls the most (Section 4.3.4, “Monitoring Polling Applications”). Monitoring the volume of system calls made by each process provides more data in investigating your system for polling processes and other resource hogs.
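		A per-process count of this kind can be sketched as follows. The array name proc_syscalls is an assumption, and the messages imitate the sample output; the script collects data until it is interrupted.

```systemtap
# Sketch only: counts system calls per process name until the script is stopped.
global proc_syscalls   # hypothetical name: execname -> number of system calls

probe begin {
  printf("Collecting data... Type Ctrl-C to exit and display results\n")
}

probe syscall.* {
  proc_syscalls[execname()]++
}

probe end {
  printf("%-10s %s\n", "#SysCalls", "Process Name")
  # Busiest processes first.
  foreach (exe in proc_syscalls-)
    printf("%-10d %s\n", proc_syscalls[exe], exe)
}
```

		Keying the array on pid() instead of execname() would report process IDs rather than process names.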
	
Example 4.16. syscalls_by_proc.stp Sample Output
Collecting data... Type Ctrl-C to exit and display results
#SysCalls  Process Name
1577       multiload-apple
692        synergyc
408        pcscd
376        mixer_applet2
299        gnome-terminal
293        Xorg
206        scim-panel-gtk
95         gnome-power-man
90         artsd
85         dhcdbd
84         scim-bridge
78         gnome-screensav
66         scim-launcher
[...]
		If you prefer the output to display the process IDs instead of the process names, use the following script instead.
	
		As indicated in the output, you need to manually exit the script in 
order to display the results. You can add a timed expiration to either 
script by simply adding a timer.s() probe; for example, to instruct the script to expire after 5 seconds, add the following probe to the script:
	
probe timer.s(5)
{
	exit()
}
4.4. Identifying Contended User-Space Locks
		This section describes how to identify contended user-space locks 
throughout the system within a specific time period. The ability to 
identify contended user-space locks can help you investigate hangs that 
you suspect may be caused by futex contentions.
	
		Simply put, a futex contention occurs 
when multiple processes are trying to access the same region of memory. 
In some cases, this can result in a deadlock between the processes in 
contention, thereby appearing as an application hang.
	
			
#! /usr/bin/env stap
# This script tries to identify contended user-space locks by hooking
# into the futex system call.
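global thread_thislock # short-lived: futex address each thread is waiting on
global thread_blocktime # short-lived: time (in microseconds) each thread began waiting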
global FUTEX_WAIT = 0 /*, FUTEX_WAKE = 1 */
global lock_waits # long-lived stats on (tid,lock) blockage elapsed time
global process_names # long-lived pid-to-execname mapping
probe syscall.futex {  
  if (op != FUTEX_WAIT) next # don't care about WAKE event originator
  t = tid ()
  process_names[pid()] = execname()
  thread_thislock[t] = $uaddr
  thread_blocktime[t] = gettimeofday_us()
}
probe syscall.futex.return {  
  t = tid()
  ts = thread_blocktime[t]
  if (ts) {
    elapsed = gettimeofday_us() - ts
    lock_waits[pid(), thread_thislock[t]] <<< elapsed
    delete thread_blocktime[t]
    delete thread_thislock[t]
  }
}
probe end {
  foreach ([pid+, lock] in lock_waits) 
    printf ("%s[%d] lock %p contended %d times, %d avg us\n",
            process_names[pid], pid, lock, @count(lock_waits[pid,lock]),
            @avg(lock_waits[pid,lock]))
}
		 
		futexes.stp needs to be manually stopped; upon exit, it prints the following information:
	
 
				Name and ID of the process responsible for a contention
			
				The region of memory it contested
			
				How many times the region of memory was contended
			
				Average time of contention throughout the probe
			
Example 4.17. futexes.stp Sample Output
[...]	
automount[2825] lock 0x00bc7784 contended 18 times, 999931 avg us
synergyc[3686] lock 0x0861e96c contended 192 times, 101991 avg us
synergyc[3758] lock 0x08d98744 contended 192 times, 101990 avg us
synergyc[3938] lock 0x0982a8b4 contended 192 times, 101997 avg us
[...]
Chapter 5. Understanding SystemTap Errors
		This chapter explains the most common errors you may encounter while using SystemTap.
	
5.1. Parse and Semantic Errors
			These types of errors occur while SystemTap attempts to parse and translate the script into C, prior to being converted into a kernel module. For example, type errors result from operations that assign invalid values to variables or arrays.
		
			The following invalid SystemTap script is missing its probe handlers:
		
			
probe vfs.read
probe vfs.write
		 
			It results in the following error message showing that the parser was expecting something other than the probe keyword in column 1 of line 2:
		
			
parse error: expected one of '. , ( ? ! { = +='
	saw: keyword at perror.stp:2:1
1 parse error(s).
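			For comparison, the following sketch shows the same two probes with handlers attached, which is the form the parser expects. The handler bodies are placeholders chosen for illustration.

```systemtap
# Each probe point is followed by a handler enclosed in braces.
probe vfs.read {
  printf("%s(%d) read\n", execname(), pid())
}

probe vfs.write {
  printf("%s(%d) write\n", execname(), pid())
}
```

			Probes that share a handler can also be written as a single comma-separated list, for example probe vfs.read, vfs.write followed by one handler block.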
		 
			If you are sure of the safety of any similar constructs in the script and are a member of the stapdev group (or have root privileges), run the script in "guru" mode by using the option -g (i.e. stap -g script).
		
Example 5.1. error-variable.stp
probe syscall.open
{
  printf ("%d(%d) open\n", execname(), pid())
}
			
probe begin { printf("x") = 1 }
		 
				SystemTap could not find a suitable kernel-debuginfo at all.
			
5.2. Run Time Errors and Warnings
			Runtime errors and warnings occur when the SystemTap instrumentation has been installed and is collecting data on the system.