4.2. Disk
4.2.1. Summarizing Disk Read/Write Traffic
Example 4.9. disktop.stp
#!/usr/bin/stap # # Copyright (C) 2007 Oracle Corp. # # Get the status of reading/writing disk every 5 seconds, # output top ten entries # # This is free software,GNU General Public License (GPL); # either version 2, or (at your option) any later version. # # Usage: # ./disktop.stp # global io_stat,device global read_bytes,write_bytes probe vfs.read.return { if ($return>0) { if (devname!="N/A") {/*skip read from cache*/ io_stat[pid(),execname(),uid(),ppid(),"R"] += $return device[pid(),execname(),uid(),ppid(),"R"] = devname read_bytes += $return } } } probe vfs.write.return { if ($return>0) { if (devname!="N/A") { /*skip update cache*/ io_stat[pid(),execname(),uid(),ppid(),"W"] += $return device[pid(),execname(),uid(),ppid(),"W"] = devname write_bytes += $return } } } probe timer.ms(5000) { /* skip non-read/write disk */ if (read_bytes+write_bytes) { printf("\n%-25s, %-8s%4dKb/sec, %-7s%6dKb, %-7s%6dKb\n\n", ctime(gettimeofday_s()), "Average:", ((read_bytes+write_bytes)/1024)/5, "Read:",read_bytes/1024, "Write:",write_bytes/1024) /* print header */ printf("%8s %8s %8s %25s %8s %4s %12s\n", "UID","PID","PPID","CMD","DEVICE","T","BYTES") } /* print top ten I/O */ foreach ([process,cmd,userid,parent,action] in io_stat- limit 10) printf("%8d %8d %8d %25s %8s %4s %12d\n", userid,process,parent,cmd, device[process,cmd,userid,parent,action], action,io_stat[process,cmd,userid,parent,action]) /* clear data */ delete io_stat delete device read_bytes = 0 write_bytes = 0 } probe end{ delete io_stat delete device delete read_bytes delete write_bytes }
UID
— user ID. A user ID of0
refers to the root user.PID
— the ID of the listed process.PPID
— the process ID of the listed process's parent process.CMD
— the name of the listed process.DEVICE
— which storage device the listed process is reading from or writing to.T
— the type of action performed by the listed process;W
refers to write, andR
refers to read.BYTES
— the amount of data read to or written from disk.
ctime()
and gettimeofday_s()
. ctime()
derives calendar time in terms of seconds passed since the start of the Unix time (January 1, 1970). gettimeofday_s()
counts the actual number of seconds since the start of the Unix time, which gives a fairly accurate human-readable timestamp for the output.
$return
is a local variable that stores the actual number of bytes each process reads or writes from the virtual file system. $return
can only be used in return probes (for example, vfs.read.return
and vfs.read.return
).
Example 4.10. Example 4.9, “disktop.stp” Sample Output
[...] Mon Sep 29 03:38:28 2008 , Average: 19Kb/sec, Read: 7Kb, Write: 89Kb UID PID PPID CMD DEVICE T BYTES 0 26319 26294 firefox sda5 W 90229 0 2758 2757 pam_timestamp_c sda5 R 8064 0 2885 1 cupsd sda5 W 1678 Mon Sep 29 03:38:38 2008 , Average: 1Kb/sec, Read: 7Kb, Write: 1Kb UID PID PPID CMD DEVICE T BYTES 0 2758 2757 pam_timestamp_c sda5 R 8064 0 2885 1 cupsd sda5 W 1678
4.2.2. Tracking I/O Time For Each File Read or Write
Example 4.11. iotime.stp
global start global entry_io global fd_io global time_io function timestamp:long() { return gettimeofday_us() - start } function proc:string() { return sprintf("%d (%s)", pid(), execname()) } probe begin { start = gettimeofday_us() } global filenames global filehandles global fileread global filewrite probe syscall.open { filenames[pid()] = user_string($filename) } probe syscall.open.return { if ($return != -1) { filehandles[pid(), $return] = filenames[pid()] fileread[pid(), $return] = 0 filewrite[pid(), $return] = 0 } else { printf("%d %s access %s fail\n", timestamp(), proc(), filenames[pid()]) } delete filenames[pid()] } probe syscall.read { if ($count > 0) { fileread[pid(), $fd] += $count } t = gettimeofday_us(); p = pid() entry_io[p] = t fd_io[p] = $fd } probe syscall.read.return { t = gettimeofday_us(); p = pid() fd = fd_io[p] time_io[p,fd] <<< t - entry_io[p] } probe syscall.write { if ($count > 0) { filewrite[pid(), $fd] += $count } t = gettimeofday_us(); p = pid() entry_io[p] = t fd_io[p] = $fd } probe syscall.write.return { t = gettimeofday_us(); p = pid() fd = fd_io[p] time_io[p,fd] <<< t - entry_io[p] } probe syscall.close { if (filehandles[pid(), $fd] != "") { printf("%d %s access %s read: %d write: %d\n", timestamp(), proc(), filehandles[pid(), $fd], fileread[pid(), $fd], filewrite[pid(), $fd]) if (@count(time_io[pid(), $fd])) printf("%d %s iotime %s time: %d\n", timestamp(), proc(), filehandles[pid(), $fd], @sum(time_io[pid(), $fd])) } delete fileread[pid(), $fd] delete filewrite[pid(), $fd] delete filehandles[pid(), $fd] delete fd_io[pid()] delete entry_io[pid()] delete time_io[pid(),$fd] }
$count
to track the amount of data (in bytes) that any system call attempts to read or write. Note that $return
(as used in Example 4.9, “disktop.stp” from Section 4.2.1, “Summarizing Disk Read/Write Traffic”) stores the actual amount of data read/written. $count
can only be used on probes that track data reads or writes (e.g. syscall.read
and syscall.write
).
Example 4.12. Example 4.11, “iotime.stp” Sample Output
[...] 825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0 825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9 [...] 117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0 117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7 [...] 3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0 3973744 2886 (sendmail) iotime /proc/loadavg time: 11 [...]
- A timestamp, in microseconds.
- Process ID and process name.
- An
access
oriotime
flag. - The file accessed.
access
and iotime
lines should appear together. The access
line's timestamp refers to the time that a given process started accessing a file; at the end of the line, it will show the amount of data read/written (in bytes). The iotime
line will show the amount of time (in microseconds) that the process took in order to perform the read or write.
access
line is not followed by an iotime
line, it simply means that the process did not read or write any data.
4.2.3. Track Cumulative IO
Example 4.13. traceio.stp
#! /usr/bin/env stap # traceio.stp # Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo@redhat.com> # Copyright (C) 2009 Kai Meyer <kai@unixlords.com> # Fixed a bug that allows this to run longer # And added the humanreadable function # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License version 2 as # published by the Free Software Foundation. # global reads, writes, total_io probe vfs.read.return { reads[pid(),execname()] += $return total_io[pid(),execname()] += $return } probe vfs.write.return { writes[pid(),execname()] += $return total_io[pid(),execname()] += $return } function humanreadable(bytes) { if (bytes > 1024*1024*1024) { return sprintf("%d GiB", bytes/1024/1024/1024) } else if (bytes > 1024*1024) { return sprintf("%d MiB", bytes/1024/1024) } else if (bytes > 1024) { return sprintf("%d KiB", bytes/1024) } else { return sprintf("%d B", bytes) } } probe timer.s(1) { foreach([p,e] in total_io- limit 10) printf("%8d %15s r: %12s w: %12s\n", p, e, humanreadable(reads[p,e]), humanreadable(writes[p,e])) printf("\n") # Note we don't zero out reads, writes and total_io, # so the values are cumulative since the script started. }
$return
, which is also used by Example 4.9, “disktop.stp” from Section 4.2.1, “Summarizing Disk Read/Write Traffic”.
Example 4.14. Example 4.13, “traceio.stp” Sample Output
[...] Xorg r: 583401 KiB w: 0 KiB floaters r: 96 KiB w: 7130 KiB multiload-apple r: 538 KiB w: 537 KiB sshd r: 71 KiB w: 72 KiB pam_timestamp_c r: 138 KiB w: 0 KiB staprun r: 51 KiB w: 51 KiB snmpd r: 46 KiB w: 0 KiB pcscd r: 28 KiB w: 0 KiB irqbalance r: 27 KiB w: 4 KiB cupsd r: 4 KiB w: 18 KiB Xorg r: 588140 KiB w: 0 KiB floaters r: 97 KiB w: 7143 KiB multiload-apple r: 543 KiB w: 542 KiB sshd r: 72 KiB w: 72 KiB pam_timestamp_c r: 138 KiB w: 0 KiB staprun r: 51 KiB w: 51 KiB snmpd r: 46 KiB w: 0 KiB pcscd r: 28 KiB w: 0 KiB irqbalance r: 27 KiB w: 4 KiB cupsd r: 4 KiB w: 18 KiB
4.2.4. I/O Monitoring (By Device)
Example 4.15. traceio2.stp
#! /usr/bin/env stap global device_of_interest probe begin { /* The following is not the most efficient way to do this. One could directly put the result of usrdev2kerndev() into device_of_interest. However, want to test out the other device functions */ dev = usrdev2kerndev($1) device_of_interest = MKDEV(MAJOR(dev), MINOR(dev)) } probe vfs.write, vfs.read { if (dev == device_of_interest) printf ("%s(%d) %s 0x%x\n", execname(), pid(), probefunc(), dev) }
stat -c "0x%D" directory
, where directory is located on the device you wish to monitor.
usrdev2kerndev()
function converts the whole device number into the format understood by the kernel. The output produced by usrdev2kerndev()
is used in conjunction with the MKDEV()
, MINOR()
, and MAJOR()
functions to determine the major and minor numbers of a specific device.
vfs_read
or vfs_write
), and the kernel device number.
stap traceio2.stp 0x805
, where 0x805
is the whole device number of /home
. /home
resides in /dev/sda5
, which is the device we wish to monitor.
Example 4.16. Example 4.15, “traceio2.stp” Sample Output
[...] synergyc(3722) vfs_read 0x800005 synergyc(3722) vfs_read 0x800005 cupsd(2889) vfs_write 0x800005 cupsd(2889) vfs_write 0x800005 cupsd(2889) vfs_write 0x800005 [...]
4.2.5. Monitoring Reads and Writes to a File
Example 4.17. inodewatch.stp
#! /usr/bin/env stap probe vfs.write, vfs.read { # dev and ino are defined by vfs.write and vfs.read if (dev == MKDEV($1,$2) # major/minor device && ino == $3) printf ("%s(%d) %s 0x%x/%u\n", execname(), pid(), probefunc(), dev, ino) }
- The file's major device number.
- The file's minor device number.
- The file's
inode
number.
stat -c '%D %i' filename
, where filename is an absolute path.
/etc/crontab
file, run stat -c '%D %i' /etc/crontab
first. This outputs the following output:
805 1078319
805
is the base-16 (hexadecimal) device number. The lower two digits are the minor device number, and the upper digits are the major number. 1078319
is the inode
number. To start monitoring /etc/crontab
, run stap inodewatch.stp 0x8 0x05 1078319
(The 0x
prefixes indicate base-16 values).
vfs_read
or vfs_write
), the device number (in hex format), and the inode
number. Example 4.18, “Example 4.17, “inodewatch.stp” Sample Output” contains the output of stap inodewatch.stp 0x8 0x05 1078319
(when cat /etc/crontab
is executed while the script is running):
Example 4.18. Example 4.17, “inodewatch.stp” Sample Output
cat(16437) vfs_read 0x800005/1078319 cat(16437) vfs_read 0x800005/1078319
4.2.6. Monitoring Changes to File Attributes
Example 4.19. inodewatch2-simple.stp
global ATTR_MODE = 1 probe kernel.function("inode_setattr") { dev_nr = $inode->i_sb->s_dev inode_nr = $inode->i_ino if (dev_nr == ($1 << 20 | $2) # major/minor device && inode_nr == $3 && $attr->ia_valid & ATTR_MODE) printf ("%s(%d) %s 0x%x/%u %o %d\n", execname(), pid(), probefunc(), dev_nr, inode_nr, $attr->ia_mode, uid()) }
inode
number as arguments. For more information on how to retrieve this information, see Section 4.2.5, “Monitoring Reads and Writes to a File”.
uid()
). Example 4.20, “Example 4.19, “inodewatch2-simple.stp” Sample Output” shows the output of Example 4.19, “inodewatch2-simple.stp” while monitoring /home/joe/bigfile
when user joe
executes chmod 777 /home/joe/bigfile
and chmod 666 /home/joe/bigfile
.
Example 4.20. Example 4.19, “inodewatch2-simple.stp” Sample Output
chmod(17448) inode_setattr 0x800005/6011835 100777 500 chmod(17449) inode_setattr 0x800005/6011835 100666 500