Monday, September 22, 2014

SystemTap into Oracle for Fun and Profit

Topic: This post is about using SystemTap for investigating and troubleshooting Oracle RDBMS. In particular you will learn how to probe Oracle processes and their userspace functions. These techniques aim to be useful as well as fun to learn for those keen into peeking under the hood of the technology and improve their effectiveness in troubleshooting and performance investigations.

Introduction


Userspace probing is a very powerful technique that can be used to complement the available Oracle performance and profiling data, such as V$ views and 'event 10046' trace files, for advanced troubleshooting and investigations into the Oracle kernel. These type of investigations have been available for several years already mostly on Solaris with the introduction of DTrace (see for example Tanel's blog).
Dynamic tracing tools that allow kernel and userspace probing are also gaining momentum in the Linux word thanks to recent additions to the kernel. Several tools are available to provide the interface for collecting, aggregating and printing dynamic tracing data. SystemTap is one of the more mature tools, however more tools are now emerging, including 2 ports of DTrace. For a review of this topic see Brendan Gregg's presentation at LinuxCon 2014.
I have first learned about the possibility of using SystemTap for userspace tracing from Frits Hoogland and have since been quite excited by the possibilities that this opens for advanced troubleshooting, and in particular on how this can help investigating Oracle workloads for performance tuning and advanced troubleshooting.
I was was also quite happy to discover that the techniques described in this article work also under RHEL 6.5, this is definitely very good news for me as it means that I can start using SystemTap on many of the existing systems, without waiting for upgrades to the very latest kernels.

SystemTap documentation has additional details on the implementation of userspace probes: with Linux Kernels higher than 3.5 (such as when using RHEL 7.0, OL 7.0 or OL6.5 with UEK3) SystemTap will use the UPROBES module for userspace tracing, with older kernels SystemTap can use the UTRACE kernel extensions if available. SystemTap is able to provide userspace probing both with UTRACE and with UPROBES, more mother dynamic tracing tools seem to work only with UPROBES and therefore only with more recent versions of Linux. As reported by Frits in the above mentioned article you will need SystemTap 2.5 or higher for userspace tracing. As I write this SystemTap 2.6 has just come out.

This is how to check if UTRACE extensions are configured:
# grep CONFIG_UTRACE /boot/config-`uname -r`
CONFIG_UTRACE=y

This is how to check if UPROBES are available:
# grep CONFIG_UPROB /boot/config-`uname -r`
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENT=y

In the following you will find a few example, ideas and test code to get started with using SystemTap userspace probing with Oracle. The scripts discussed in this blog can be found at this link.


Deep Dive: SystemTap reads Oracle wait event interface data and X$ tables


You will see in this paragraph how to hook SystemTap into the Oracle wait event interface. This provides a way to link together the two types of instrumentation, SystemTap probe-based data and
Oracle-based wait event interface data, and therefore opens the way for more advanced investigations.
The main implementation detail is to create a SystemTap probe into the Oracle kernel at the 'right point' and make it read the information we want.
As discussed in previous work I have published about DTrace and Oracle, "kskthewt" is an Oracle kernel function that can be used for this purpose: it is called by the Oracle kernel at the end of each wait event presumably to update the relevant performance counters. Our first step is to create a probe that starts executing once the Oracle executable steps into the function kskthewt. The next step is make SystemTap read the information we want. What is available are CPU registers and memory areas. Some CPU registers are used to pass arguments to the function, as detailed in the x86 calling conventions. In other cases the CPU registers can contain useful information 'left over' from parent function(s) processing. Reading directly from memory areas and in particular from the SGA opens the way to most of Oracle instrumentation data.
Additional facts that I have found recently  in collaboration with Frits Hoogland and that are of help on building a probe on the Oracle function kskthewt are:
  • When execution jumps into the function kskthewt, the CPU register R13 turns out to point into the X$KSUSE memory structure (with an offset). R12 seems to point to the same address most of the time but I have seen some cases when it didn't so I prefer to use R13.
  • X$KSUSE is a base fixed table for V$SESSION and therefore carries a lot of very interesting information. X$KSUSE is allocated in memory as a segmented array in the SGA, see also Julian Dyke's presentation on SGA internals for more details on this interesting topic.
  • V$SESSION.SADDR, the base address of a given session's V$SESSION data and of X$KSUSE, can be derived from the value of R13 by subtracting an offset. The value of the offset depends of the Oracle version and port and can be found 'experimentally' by using SystemTap or gdb (see more details in the script directory: trace_oracle_events_debug.stp).
  • The fields of interest of X$KSUSE (i.e. V$SESSION column values) can be found from the base value of X$KSUSE with an offset calculated with a query from X$KQFCO and X$KQFTA (see the script: ksuse_find_offsets.sql).
  • The register RDI (which is mapped as arg1 in the SystemTap probe) is found to be set to the timestamp value. This is a timestamp of the wait event, the same value as what can be seen in a trace file for the event 10046, if activated.
  • The register RSI (that is arg2 in SystemTap probes) is found to be set to the wait event number.

If you put all of this together you can write a SystemTap script to collect and print wait event and selected v$session details. An example of such a script is demonstrated in the the figure here. The scripts can be downloaded from here as a zip file. Direct links: trace_oracle_events_11204.stp and trace_oracle_events_12102.stp and have been tested with Oracle 11.2.0.4 on RHEL6.5 and with Oracle 12.1.0.2 on OEL7 respectively.

Figure 1: Script and example output of a SystemTap probe that collects wait event and V$SESSION details from Oracle kernel functions and SGA memory.

Note: In the example above you can see that SystemTap output is piped to sed -f eventsname.sed. This is to transform event id numbers into strings. The file eventsname.sed is generated by running the script eventsname.sql.


Systemtap aggregations and filtering of wait event data


One of the strengths of SystemTap, similarly to the rest of dynamic tracing tools, is to do on-the-fly aggregations and filtering. This helps reducing the amount of data collected, therefore helping to minimize the footprint of the tracing operations, while still producing meaningful reports and real-time tracing. Building on the results discussed above on how to write SystemTap probes hooking into the wait event interface you can now add aggregations and filtering to compute wait event histograms with microsecond precision.
One simple and powerful result that you can reproduce easily is to write a script that effectively reproduces/back-ports the functionality of the 12c feature of microsecond wait event histograms. This can be very useful when studying events of short duration such as random disk reads from solid state devices. In addition to aggregation, with custom filtering you can aggregate wait event details for a given user or on a given file, or any other custom filter condition. Overall SystemTap allows for much freedom in programming our probes, which is quite empowering.
Two examples scripts that can be downloaded from here, direct links: histograms_oracle_events_11204.stp and histograms_oracle_events_12102.stp

Figure2: Script and example output of a SystemTap probe that computes microsecond precision wait event histograms.

Another and alternative way to build a SystemTap probe for gathering and optionally aggregating and filtering wait Oracle event latency data is described in the following. For this you need also an additional piece of research in Oracle kernel function calls: the function kews_update_wait_time is called just before kskthewt with the CPU register RSI set to the wait time in micro seconds (register RSI is also available in SystemTap probes as arg2). See also details reported in this previous work. The results is that you can build a probe computing event histograms that is independent of the Oracle version. An example script is: histograms_oracle_events_version_independent.stp

Figure 2: Script and example output of a SystemTap probe that computes microsecond precision wait event histograms and is independent of the Oracle version.


Tracing Oracle logical I/O with SystemTap userspace probes


You can use Systemtap userspace probes to trace Oracle logical I/O too. The main research underlying this probe comes from Tanel Poder (search for the DTrace-based script qer_trace.sh) and the extensive work of Alexander Anokhin.
By hooking on the Oracle kernel function kcbgtcr we can get the details of the logical reads, or rather cache consistent reads as explained by Alexander Anokhin, kcbgtcr =Kernel Cache Buffer Get Consistent Read and the first argument of the function call points to a memory structure with the details of the tablespace, relative file number, block number and object number of the consistent read. We can use all this to built a profiler probe trace_oracle_logical_io_basic.stp

probe process("oracle").function("kcbgtcr") {
    printf("tbs#=%d, rfile=%d, block#=%d, obj#=%d\n",user_int32(u64_arg(1)), user_int32(u64_arg(1)+4) >> 22 & 0x003FFFFF, user_int32(u64_arg(1)+4) & 0x003FFFFF, user_int32(u64_arg(1)+8))
}

From there you can also build probes that perform aggregates and compute statics of the logical read operations, see for example: trace_oracle_logical_io_count.stp

Figure 3: Script and example output of a SystemTap probe to capture and aggregate details of logical I/O (for consistent read) in Oracle.


Systemtap can integrate Oracle wait event data with OS tracing


SystemTap's strength and first use case is to trace system calls. You can use this to go deeper with Oracle tracing by combining Oracle wait event interface data (collected with SystemTap as shown above) with data coming from SystemTap OS probes. 
In the following you can see an example based on Oracle using ASM, when storage is made visible to the OS as block devices (asmlib is not used in this example). The I/O calls issued by Oracle processes in this case will typically be pread and pwrite for synchronous single-block I/O, io_submit and io_getevents for multi-block and/or asynchronous I/O. At a lower level all I/O to block devices can be traced as ioblock requests.
You can start exploring the relationship between Oracle wait events, I/O system calls and block I/O with this example starter script from: trace_oracle_iocalls_12102.stp

Figure 4: SystemTap probes for OS I/O tracing and Oracle wait event tracing. This allows to get an integrated view of the actions underlying an Oracle I/O event and relative OS actions.

In the Figure 5 here below you can see an example of tracing Oracle I/O where single-block (synchronous) I/O is performed via a pread call to the OS. This is a typical access path for Oracle random I/O. On the OS side the pread call is cascaded to the block device stack as io_block_request. On the Oracle wait event interface the I/O operation is accounted as db file sequential read wait, with indication of the file number and block number (p1 and p2 parameters)

Figure 5: Oracle performing single-block I/O profiled using SystemTap probes. This provides end-to-end tracing of the I/O operation: we can see the OS call details and the corresponding to the Oracle wait event data

Linux asynchronous I/O interface is often used by Oracle for multi-block I/O. Thisis mainly for performance reasons and can only be done when the underlying filesystem allows it (this is the case in the examples reported here as we are using ASM). A study of Oracle async I/O would take us too far from our main subject, we will limit the discussion to some main observations and an example of how you can use SystemTap probes to further investigate this topic.
With AIO the kernel is actually doing the IO, the user process is just asking for an IO to be done, and checks back to get the requested information. When investigating async I/O you need to trace at least two calls: one for submitting the I/O requests (io_submit) and one for reaping the results (io_getevents). I/O requests are then passed on to the block I/O layer. At this stage they can be split into more operations of smaller size, as it is the case of the examples of Figure 6. See also Martin Bach's post on the topic of "Increasing the maximum I/O size in Linux". More details on tracing Oracle I/O can be found in Frits Hooglands' excellent work on profiling Oracle with gdb and study on multiblock I/O.


Figure 6: SytemTap probe example showing the profiling Oracle I/O in a case of multi-block read when Oracle has chosen to perform asynchronous I/O. The I/O requests at the OS level are submitted using io_submit and reaped with io_getevents. SystemTap allows to see also what happens at the block device level when the I/Os are split into smaller chunks of maximum size 512 KB in the system used for this test.


Lab setup and documentation


Here below you can find some pointers on how to setup a test environment to experiment with the techniques and scripts discussed here above:

- Download and install Oracle Virtual Box
- Download OL 7.0 (or OL 6.x) for example from https://edelivery.oracle.com/linux
- Notably install also the kernel-devel rpm

If you want to use OS probes with SystemTap download and install the debuginfo and debuginfo-common packages for the kernel you are using. The rpms can be found from https://oss.oracle.com/ol7/debuginfo/ and https://oss.oracle.com/el6/debuginfo/ respectively for OEL 7 and OEL 6. Similarly install libaio-debuginfo and libaio-devel if you want to trace OS calls for asynchronous I/O.
Please note that if you want only to run probes for the Oracle userspace functions you can skip the installation of the kernel and libaio debuginfo packages.

The scripts discussed here require systemtap 2.5 or higher. Consider uninstalling systemtap if already installed with an older version.
Download the latest version of systemtap from https://sourceware.org/systemtap/ftp/releases/ (version 2.6 as I write this).
To install SystemTap from the downloaded source run:

./configure
# the output will tell you if additional OS packages are needed, in case install them and repeat
make
make install

As post-installation check run:
stap --help

A good point to start with the SystemTap documentation is the beginners guide, see also the language reference and tapset reference manual.

Download and install Oracle version 12.1.0.2 or 11.2.0.4.
The software can be downloaded from https://edelivery.oracle.com/
Pre-installed VMs with Oracle database are also available for download from OTN (for example at this link)
Installation guides, besides the Oracle documentation, are also available at http://www.oracle-base.com/articles/12c/articles-12c.php and in the racattack wiki book.


Conclusions


Dynamic tracing makes available userspace-level probes into the Oracle kernel and Linux probes at the OS kernel level, which can be used to complement Oracle's performance data, notably data from the wait event interface and the statistics counters. These techniques can considerably extend the amount of data available when investigating performance issues and for advanced troubleshooting, therefore helping the troubleshooter to use systematic approaches to problem solving and overall to be more successful and consistent.

You have seen in this short article a few pointers and get-started examples (download from this link or from GitGub). There is a lot of potential in applying and extending these techniques for investigating Oracle and for advanced troubleshooting of Oracle and other complex platforms. I hope to cover some more examples in future posts. For the time being, I hope you enjoyed going through the listed examples and I wish to all the readers who have been patient enough to make this far fun and success in trying these techniques on their systems!


Acknowledgements


I would like to thank three authors of many original ideas that have inspired the material I have discussed here: Brendan Gregg, Tanel Poder and Frits Hoogland. Additional thanks go to Frits Hoogland for reviewing this post.

4 comments:

  1. Nice writeup, thanks!
    If the Oracle database includes a header file that defines some of those k$foo struct types, stap can get at their fields without need for explicit pointer arithmetic via @cast(register(XX), "struct k$foo", "")->FIELD.
    Also, for mapping event numbers to names, why not use a stap array? (global oevents; probe begin { oevents[1]="foo" ...})

    ReplyDelete
    Replies
    1. Thank you Frank for stopping by and for the comments!
      Unfortunately Oracle does not currently publish a header file for the structure of interest here. But indeed you have a good point there on the use of cast instead of pointer arithmetic. On second thoughts I believe it should be possible to 'reconstruct' much of the underlying struct of interest from the contents of the internal tables X$KQFCO and X$KQFTA (much like I propose to do with the ksuse_find_offsets.sql mentioned in the blog post) .
      The use of an array to export the event names is also a good idea. As a note for future work I add that it is also possible to use Oracle's internal array of the event names: it’s address and details are available in the X$KSLED internal table (that is the table underlying GV$EVENT_NAME).

      Delete
  2. Hi Luca,
    this the most awesome post regarding Oracle performance analysis I've read in a long time.
    I have been doing some Oracle performance analysis myself, know some of the works of the Oracle specialists you mentioned in this post and I read and tested the systemtap parts of Brendan's book.
    So I'm really amazed that you, as a single person, can *combine* and *apply* such a deep knowledge from the Oracle- and Linuxexperts.
    Best regards.

    ReplyDelete