4.2. Disk
The following sections showcase scripts that monitor disk and I/O activity.
4.2.1. Summarizing Disk Read/Write Traffic
This section describes how to identify which processes are performing the heaviest disk reads/writes to the system.
disktop.stp
#!/usr/bin/env stap # # Copyright (C) 2007 Oracle Corp. # # Get the status of reading/writing disk every 5 seconds, # output top ten entries # # This is free software,GNU General Public License (GPL); # either version 2, or (at your option) any later version. # # Usage: # ./disktop.stp # global io_stat,device global read_bytes,write_bytes probe vfs.read.return { if ($return>0) { if (devname!="N/A") {/*skip read from cache*/ io_stat[pid(),execname(),uid(),ppid(),"R"] += $return device[pid(),execname(),uid(),ppid(),"R"] = devname read_bytes += $return } } } probe vfs.write.return { if ($return>0) { if (devname!="N/A") { /*skip update cache*/ io_stat[pid(),execname(),uid(),ppid(),"W"] += $return device[pid(),execname(),uid(),ppid(),"W"] = devname write_bytes += $return } } } probe timer.ms(5000) { /* skip non-read/write disk */ if (read_bytes+write_bytes) { printf("\n%-25s, %-8s%4dKb/sec, %-7s%6dKb, %-7s%6dKb\n\n", ctime(gettimeofday_s()), "Average:", ((read_bytes+write_bytes)/1024)/5, "Read:",read_bytes/1024, "Write:",write_bytes/1024) /* print header */ printf("%8s %8s %8s %25s %8s %4s %12s\n", "UID","PID","PPID","CMD","DEVICE","T","BYTES") } /* print top ten I/O */ foreach ([process,cmd,userid,parent,action] in io_stat- limit 10) printf("%8d %8d %8d %25s %8s %4s %12d\n", userid,process,parent,cmd, device[process,cmd,userid,parent,action], action,io_stat[process,cmd,userid,parent,action]) /* clear data */ delete io_stat delete device read_bytes = 0 write_bytes = 0 } probe end{ delete io_stat delete device delete read_bytes delete write_bytes }
disktop.stp outputs the top ten processes responsible for the heaviest reads/writes to disk. Example 4.4, “disktop.stp Sample Output” displays a sample output for this script, and includes the following data per listed process:
UID
— user ID. A user ID of0
refers to the root user.PID
— the ID of the listed process.PPID
— the process ID of the listed process's parent process.CMD
— the name of the listed process.DEVICE
— which storage device the listed process is reading from or writing to.T
— the type of action performed by the listed process;W
refers to write, whileR
refers to read.BYTES
— the amount of data read to or written from disk.
The time and date in the output of disktop.stp is returned by the functions
ctime()
and gettimeofday_s()
. ctime()
derives calendar time in terms of seconds passed since the Unix epoch (January 1, 1970). gettimeofday_s()
counts the actual number of seconds since Unix epoch, which gives a fairly accurate human-readable timestamp for the output.
In this script, the
$return
is a local variable that stores the actual number of bytes each process reads or writes from the virtual file system. $return
can only be used in return probes (e.g. vfs.read.return
and vfs.read.return
).
Example 4.4. disktop.stp Sample Output
[...] Mon Sep 29 03:38:28 2008 , Average: 19Kb/sec, Read: 7Kb, Write: 89Kb UID PID PPID CMD DEVICE T BYTES 0 26319 26294 firefox sda5 W 90229 0 2758 2757 pam_timestamp_c sda5 R 8064 0 2885 1 cupsd sda5 W 1678 Mon Sep 29 03:38:38 2008 , Average: 1Kb/sec, Read: 7Kb, Write: 1Kb UID PID PPID CMD DEVICE T BYTES 0 2758 2757 pam_timestamp_c sda5 R 8064 0 2885 1 cupsd sda5 W 1678
4.2.2. Tracking I/O Time For Each File Read or Write
This section describes how to monitor the amount of time it takes for each process to read from or write to any file. This is useful if you wish to determine what files are slow to load on a given system.
iotime.stp
global start global entry_io global fd_io global time_io function timestamp:long() { return gettimeofday_us() - start } function proc:string() { return sprintf("%d (%s)", pid(), execname()) } probe begin { start = gettimeofday_us() } global filenames global filehandles global fileread global filewrite probe syscall.open { filenames[pid()] = user_string($filename) } probe syscall.open.return { if ($return != -1) { filehandles[pid(), $return] = filenames[pid()] fileread[pid(), $return] = 0 filewrite[pid(), $return] = 0 } else { printf("%d %s access %s fail\n", timestamp(), proc(), filenames[pid()]) } delete filenames[pid()] } probe syscall.read { if ($count > 0) { fileread[pid(), $fd] += $count } t = gettimeofday_us(); p = pid() entry_io[p] = t fd_io[p] = $fd } probe syscall.read.return { t = gettimeofday_us(); p = pid() fd = fd_io[p] time_io[p,fd] <<< t - entry_io[p] } probe syscall.write { if ($count > 0) { filewrite[pid(), $fd] += $count } t = gettimeofday_us(); p = pid() entry_io[p] = t fd_io[p] = $fd } probe syscall.write.return { t = gettimeofday_us(); p = pid() fd = fd_io[p] time_io[p,fd] <<< t - entry_io[p] } probe syscall.close { if (filehandles[pid(), $fd] != "") { printf("%d %s access %s read: %d write: %d\n", timestamp(), proc(), filehandles[pid(), $fd], fileread[pid(), $fd], filewrite[pid(), $fd]) if (@count(time_io[pid(), $fd])) printf("%d %s iotime %s time: %d\n", timestamp(), proc(), filehandles[pid(), $fd], @sum(time_io[pid(), $fd])) } delete fileread[pid(), $fd] delete filewrite[pid(), $fd] delete filehandles[pid(), $fd] delete fd_io[pid()] delete entry_io[pid()] delete time_io[pid(),$fd] }
iotime.stp tracks each time a system call opens, closes, reads from, and writes to a file. For each file any system call accesses, iotime.stp counts the number of microseconds it takes for any reads or writes to finish and tracks the amount of data (in bytes) read from or written to the file.
iotime.stp also uses the local variable
$count
to track the amount of data (in bytes) that any system call attempts to read or write. Note that $return
(as used in disktop.stp from Section 4.2.1, “Summarizing Disk Read/Write Traffic”) stores the actual amount of data read/written. $count
can only be used on probes that track data reads or writes (e.g. syscall.read
and syscall.write
).
Example 4.5. iotime.stp Sample Output
[...] 825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0 825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9 [...] 117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0 117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7 [...] 3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0 3973744 2886 (sendmail) iotime /proc/loadavg time: 11 [...]
Example 4.5, “iotime.stp Sample Output” prints out the following data:
- A timestamp, in microseconds
- Process ID and process name
- An
access
oriotime
flag - The file accessed
If a process was able to read or write any data, a pair of
access
and iotime
lines should appear together. The access
line's timestamp refer to the time that a given process started accessing a file; at the end of the line, it will show the amount of data read/written (in bytes). The iotime
line will show the amount of time (in microseconds) that the process took in order to perform the read or write.
If an
access
line is not followed by an iotime
line, it simply means that the process did not read or write any data.
4.2.3. Track Cumulative IO
This section describes how to track the cumulative amount of I/O to the system.
traceio.stp
#! /usr/bin/env stap # traceio.stp # Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo@redhat.com> # Copyright (C) 2009 Kai Meyer <kai@unixlords.com> # Fixed a bug that allows this to run longer # And added the humanreadable function # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as # published by the Free Software Foundation. # global reads, writes, total_io probe vfs.read.return { reads[pid(),execname()] += $return total_io[pid(),execname()] += $return } probe vfs.write.return { writes[pid(),execname()] += $return total_io[pid(),execname()] += $return } function humanreadable(bytes) { if (bytes > 1024*1024*1024) { return sprintf("%d GiB", bytes/1024/1024/1024) } else if (bytes > 1024*1024) { return sprintf("%d MiB", bytes/1024/1024) } else if (bytes > 1024) { return sprintf("%d KiB", bytes/1024) } else { return sprintf("%d B", bytes) } } probe timer.s(1) { foreach([p,e] in total_io- limit 10) printf("%8d %15s r: %12s w: %12s\n", p, e, humanreadable(reads[p,e]), humanreadable(writes[p,e])) printf("\n") # Note we don't zero out reads, writes and total_io, # so the values are cumulative since the script started. }
traceio.stp prints the top ten executables generating I/O traffic over time. In addition, it also tracks the cumulative amount of I/O reads and writes done by those ten executables. This information is tracked and printed out in 1-second intervals, and in descending order.
Note that traceio.stp also uses the local variable
$return
, which is also used by disktop.stp from Section 4.2.1, “Summarizing Disk Read/Write Traffic”.
Example 4.6. traceio.stp Sample Output
[...] Xorg r: 583401 KiB w: 0 KiB floaters r: 96 KiB w: 7130 KiB multiload-apple r: 538 KiB w: 537 KiB sshd r: 71 KiB w: 72 KiB pam_timestamp_c r: 138 KiB w: 0 KiB staprun r: 51 KiB w: 51 KiB snmpd r: 46 KiB w: 0 KiB pcscd r: 28 KiB w: 0 KiB irqbalance r: 27 KiB w: 4 KiB cupsd r: 4 KiB w: 18 KiB Xorg r: 588140 KiB w: 0 KiB floaters r: 97 KiB w: 7143 KiB multiload-apple r: 543 KiB w: 542 KiB sshd r: 72 KiB w: 72 KiB pam_timestamp_c r: 138 KiB w: 0 KiB staprun r: 51 KiB w: 51 KiB snmpd r: 46 KiB w: 0 KiB pcscd r: 28 KiB w: 0 KiB irqbalance r: 27 KiB w: 4 KiB cupsd r: 4 KiB w: 18 KiB
4.2.4. I/O Monitoring (By Device)
This section describes how to monitor I/O activity on a specific device.
traceio2.stp
#! /usr/bin/env stap global device_of_interest, dev probe begin { /* The following is not the most efficient way to do this. One could directly put the result of usrdev2kerndev() into device_of_interest. However, want to test out the other device functions */ dev = usrdev2kerndev($1) device_of_interest = MKDEV(MAJOR(dev), MINOR(dev)) } probe vfs.write, vfs.read { if (dev == device_of_interest) printf ("%s(%d) %s 0x%x\n", execname(), pid(), probefunc(), dev) }
traceio2.stp takes 1 argument: the whole device number. To get this number, use
stat -c "0x%D" directory
, where directory
is located in the device you wish to monitor.
The
usrdev2kerndev()
function converts the whole device number into the format understood by the kernel. The output produced by usrdev2kerndev()
is used in conjunction with the MKDEV()
, MINOR()
, and MAJOR()
functions to determine the major and minor numbers of a specific device.
The output of traceio2.stp includes the name and ID of any process performing a read/write, the function it is performing (i.e.
vfs_read
or vfs_write
), and the kernel device number.
The following example is an excerpt from the full output of
stap traceio2.stp 0x805
, where 0x805
is the whole device number of /home
. /home
resides in /dev/sda5
, which is the device we wish to monitor.
Example 4.7. traceio2.stp Sample Output
[...] synergyc(3722) vfs_read 0x800005 synergyc(3722) vfs_read 0x800005 cupsd(2889) vfs_write 0x800005 cupsd(2889) vfs_write 0x800005 cupsd(2889) vfs_write 0x800005 [...]
4.2.5. Monitoring Reads and Writes to a File
This section describes how to monitor reads from and writes to a file in real time.
inodewatch-simple.stp
probe vfs.write, vfs.read { dev_nr = $file->f_dentry->d_inode->i_sb->s_dev inode_nr = $file->f_dentry->d_inode->i_ino if (dev_nr == ($1 << 20 | $2) # major/minor device && inode_nr == $3) printf ("%s(%d) %s 0x%x/%u\n", execname(), pid(), probefunc(), dev_nr, inode_nr) }
inodewatch-simple.stp takes the following information about the file as arguments on the command line:
- The file's major device number.
- The file's minor device number.
- The file's
inode
number.
To get this information, use
stat -c '%D %i' filename
, where filename
is an absolute path.
For instance: if you wish to monitor
/etc/crontab
, run stat -c '%D %i' /etc/crontab
first. This gives the following output:
805 1078319
805
is the base-16 (hexadecimal) device number. The lower two digits are the minor device number and the upper digits are the major number. 1078319
is the inode
number. To start monitoring /etc/crontab
, run stap inodewatch.stp 0x8 0x05 1078319
(The 0x
prefixes indicate base-16 values.
The output of this command contains the name and ID of any process performing a read/write, the function it is performing (i.e.
vfs_read
or vfs_write
), the device number (in hex format), and the inode
number. Example 4.8, “inodewatch-simple.stp Sample Output” contains the output of stap inodewatch.stp 0x8 0x05 1078319
(when cat /etc/crontab
is executed while the script is running) :
Example 4.8. inodewatch-simple.stp Sample Output
cat(16437) vfs_read 0x800005/1078319 cat(16437) vfs_read 0x800005/1078319
4.2.6. Monitoring Changes to File Attributes
This section describes how to monitor if any processes are changing the attributes of a targeted file, in real time.
inodewatch2-simple.stp
global ATTR_MODE = 1 probe kernel.function("inode_setattr") { dev_nr = $inode->i_sb->s_dev inode_nr = $inode->i_ino if (dev_nr == ($1 << 20 | $2) # major/minor device && inode_nr == $3 && $attr->ia_valid & ATTR_MODE) printf ("%s(%d) %s 0x%x/%u %o %d\n", execname(), pid(), probefunc(), dev_nr, inode_nr, $attr->ia_mode, uid()) }
Like inodewatch-simple.stp from Section 4.2.5, “Monitoring Reads and Writes to a File”, inodewatch2-simple.stp takes the targeted file's device number (in integer format) and
inode
number as arguments. For more information on how to retrieve this information, refer to Section 4.2.5, “Monitoring Reads and Writes to a File”.
The output for inodewatch2-simple.stp is similar to that of inodewatch-simple.stp, except that inodewatch2-simple.stp also contains the attribute changes to the monitored file, as well as the ID of the user responsible (
uid()
). Example 4.9, “inodewatch2-simple.stp Sample Output” contains shows the output of inodewatch2-simple.stp while monitoring /home/joe/bigfile
when user joe
executes chmod 777 /home/joe/bigfile
and chmod 666 /home/joe/bigfile
.
Example 4.9. inodewatch2-simple.stp Sample Output
chmod(17448) inode_setattr 0x800005/6011835 100777 500 chmod(17449) inode_setattr 0x800005/6011835 100666 500