Multi-Process Correlation for Shared Memory¶
LiveRecorder can log accesses to shared memory from multiple processes. This log can be queried in UDB using the ublame command to determine which of the recorded processes modified a region of shared memory.
Using Multi-Process Correlation¶
Modify your program to record itself using LiveRecorder, as documented in using the LiveRecorder API.
After the program creates one or more shared memory maps, but before it forks its children, call
undolr_shmem_log_filename_set()
to set the file where shared memory accesses will be logged.In each process, call
undolr_start()
to start recording.In each process, save a recording by calling
undolr_save_on_termination()
orundolr_save()
.
Here’s an example (with simple error handling to keep it short):
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include "undolr.h"
int
main(void)
{
int e = undolr_shmem_log_filename_set("example.shmem");
if (e < 0) abort();
const int n_elements = 100;
int *shmem_data = mmap(NULL,
n_elements * sizeof *shmem_data,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED,
0, 0);
if (shmem_data == MAP_FAILED) abort();
pid_t pid = fork();
if (pid < 0) abort();
/* Start recording parent or child */
e = undolr_start(NULL);
if (e < 0) abort();
if (pid == 0)
{
/* Child process */
e = undolr_save_on_termination("child.undo");
if (e < 0) abort();
/* Access shmem_data. */
shmem_data[0] = 123;
}
else
{
/* Parent process. */
e = undolr_save_on_termination("parent.undo");
if (e < 0) abort();
/* Wait for the child process. */
waitpid(pid, NULL, 0);
/* Access shmem_data. */
printf("%d\n", shmem_data[0]);
shmem_data[1] = 456;
}
return 0;
}
This program creates three files:
parent.undo
— Undo recording containing the execution history of the parent process.child.undo
— Undo recording containing the execution history of the child process.example.shmem
— Log of shared memory accesses made by both processes.
The recordings can then be loaded into UDB.
Querying the shared memory log¶
Once a recording is loaded in UDB, use the ublame
command to query the associated shared memory log for the accesses to an
address which led to the current value in memory. For example, after loading
parent.undo
and going to the end of main()
, it’s possible to do:
84% 552> ublame &shmem_data[0]
Num PID Time in recording Type Address Len
----- ------- ------------------- ------ -------------- -----
2 3744516 195:0x55c54830e473 write 0x7f6f86168000 4
* 1 3744513 225:0x55c54830e4b0 read 0x7f6f86168000 4
This shows all accesses which led shmem_data[0] to acquire its current value.
ublame [options] address¶
Show accesses to address in shared memory by all recorded processes.
The program being debugged must use the
undolr_shmem_log_filename_set()
function of the LiveRecorder API to enable logging of accesses of shared memory.
- -l N, -limit N¶
Show only the N most recent accesses in the shared memory log.
- -r, -all-reads¶
Show read accesses for all inferiors (not just the current inferior).
- -w, -writes-only¶
Show only write accesses.
- -s SIZE, -size SIZE¶
The size in bytes of the value pointed to by address, for use if this can’t be determined automatically based on the type of address.
For example, when address has the type
int *
, the table includes any access in sizeof(int) bytes starting at address.To show only accesses to the first byte of an integer you can use either:
ublame -size 1 &int_variable
or:
ublame (char *) &int_variable
This option is also useful to look at a portion of an array:
ublame -size 100 &a_big_char_array[50]
The table shows accesses leading to the current contents of memory at address. This means that only accesses made prior to the current time of the current inferior are shown, and that no accesses are shown if address has not yet been accessed by the current inferior. The columns in the table are:
Num
Number identifying the inferior (that is, the recorded process) which executed the access to shared memory.
These numbers can be passed to the ugo inferior command to load the recording for that inferior.
Accesses made by the process corresponding to the currently loaded recording (that is, the current inferior) are marked with an asterisk (
*
).PID
Id of the process that executed the access to shared memory.
Note
It is possible for two different inferiors to have the same process id, due to process id reuse.
Time in recording
The time of the access, in the given inferior.
Note
Time grows monotonically in each inferior, but times for different inferiors are unrelated.
Type
Whether the access was a read, write or read/write operation.
Read/write operations correspond to machine instructions which both read and write a value, for instance the
cmpxchg
instruction on x86 architectures used to implement atomic operations.Address
The address accessed by the inferior.
Note
This can be different from the address passed to ublame, because the inferior might have accessed the memory through a pointer of a different type.
Len
The size of the access, in bytes.
ublame -show-command-line inferior time¶
Show a command line that will start another instance of UDB, load the Undo recording corresponding to inferior, and jump to time.
Querying the extent of the log¶
The shared memory log has a maximum size, and when this size is reached, old shared memory accesses are discarded in first-in, first-out order. This means that ublame may not be able to show the culprit of an update to a shared memory address if that update has been discarded from the log.
The size of the shared memory log can be changed using the
undolr_shmem_log_size_set()
API call.
To show the range of times for logged accesses to shared memory, use the info blame command.
info blame [-s size] [address]¶
Show range of times in which a region of shared memory was accessed.
The program being debugged must use the
undolr_shmem_log_filename_set()
function of the LiveRecorder API to enable logging of accesses of shared memory.
- -s SIZE, -size SIZE¶
The size in bytes of the value pointed to by address, for use if this can’t be determined automatically based on the type of address.
For example:
84% 552> info blame &shmem_data[0] Logged access in recorded range: [225 - 225]If no address or size is given, the region is the whole of address space. For example:
84% 552> info blame Logged access in recorded range: [225 - 552]To look at the memory accesses in more detail at a specific time in the recording, use the ugo time command to go to the time of interest, and the ublame command to show the accesses.
Debugging an example program¶
A simple example program is provided in the undolr/examples/
sub-directory
of the Undo release. See
undolr/examples/mpc_shmem_producer_consumer.c
for the source code.
This program forks into two processes, a producer and a consumer.
The producer generates several numbers in the [0, 100) open range and writes them in an array in shared memory.
The consumer reads the values from shared memory and uses them.
Both processes run until interrupted.
Eventually the program fails with an assertion:
$ ./mpc_shmem_producer_consumer
Will save shared memory log to: ./log-3744746.shmem
Will save recording for producer to: ./producer-3744746.undo
Will save recording for consumer to: ./consumer-3744746.undo
mpc_shmem_producer_consumer: undolr/examples/mpc_shmem_producer_consumer.c:100: consumer_run: Assertion `value >= 0 && value < 100' failed.
Aborted
The assertion failed in consumer_run()
, so load the consumer recording
and jump to the end of the recording (where the assertion failure is):
not running> uload -goto-end consumer-3744746.undo
0x000055878d55e43f in record_me (process_name=0x55878d5a2073 "consumer")
at undolr/examples/mpc_shmem_producer_consumer.c:41
41 int e = undolr_start(NULL);
Performing "ugo end" to jump to the end of the recording. Press ^C to
interrupt this and go back to the start.
end 4,315> backtrace
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1 0x00007f3672c6e40f in __pthread_kill_internal (signo=6, threadid=<optimized out>)
at ./nptl/pthread_kill.c:78
#2 0x00007f3672c1f4f2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007f3672c084ed in __GI_abort () at ./stdlib/abort.c:79
#4 0x00007f3672c08415 in __assert_fail_base (
fmt=0x7f3672d7fc08 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x55878d5a20ae "value >= 0 && value < 100",
file=file@entry=0x55878d5a2080 "undolr/examples/mpc_shmem_producer_consumer.c",
line=line@entry=100,
function=function@entry=0x55878d5a2160 <__PRETTY_FUNCTION__.0> "consumer_run")
at ./assert/assert.c:94
#5 0x00007f3672c18012 in __assert_fail (assertion=0x55878d5a20ae "value >= 0 && value < 100",
file=0x55878d5a2080 "undolr/examples/mpc_shmem_producer_consumer.c", line=100,
function=0x55878d5a2160 <__PRETTY_FUNCTION__.0> "consumer_run") at ./assert/assert.c:103
#6 0x000055878d55e5b8 in consumer_run () at undolr/examples/mpc_shmem_producer_consumer.c:100
#7 0x000055878d55e77e in main (argc=1, argv=0x7ffe1bdd43b8)
at undolr/examples/mpc_shmem_producer_consumer.c:194
The program is deep inside the call stack of the assert()
function. To
go back to the example code, use the reverse-finish
command, which executes in reverse until the caller is reached.
reverse-finish
0x00007f3672c1f4ed in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
26 ../sysdeps/posix/raise.c: No such file or directory.
99% 4,310> reverse-finish
0x00007f3672c084e8 in __GI_abort () at ./stdlib/abort.c:79
79 ./stdlib/abort.c: No such file or directory.
99% 4,309> reverse-finish
__assert_fail_base (fmt=0x7f3672d7fc08 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x55878d5a20ae "value >= 0 && value < 100",
file=file@entry=0x55878d5a2080 "undolr/examples/mpc_shmem_producer_consumer.c",
line=line@entry=100,
function=function@entry=0x55878d5a2160 <__PRETTY_FUNCTION__.0> "consumer_run")
at ./assert/assert.c:94
94 ./assert/assert.c: No such file or directory.
99% 4,308> reverse-finish
0x00007f3672c1800d in __assert_fail (assertion=0x55878d5a20ae "value >= 0 && value < 100",
file=0x55878d5a2080 "undolr/examples/mpc_shmem_producer_consumer.c", line=100,
function=0x55878d5a2160 <__PRETTY_FUNCTION__.0> "consumer_run") at ./assert/assert.c:103
103 in ./assert/assert.c
39% 1,698> reverse-finish
0x000055878d55e5b3 in consumer_run () at undolr/examples/mpc_shmem_producer_consumer.c:100
100 assert(value >= 0 && value < 100);
Now that the program is inside the consumer_run()
function, print the
value of the variable which caused the assertion to fail:
37% 1,630> print value
$1 = 100
The value is invalid, but what caused it? Set a watchpoint on the
value
variable and execute in reverse to the most recent time when
the variable was updated:
37% 1,630> watch value
Hardware watchpoint 1: value
37% 1,630> reverse-continue
Continuing.
Hardware watchpoint 1: value
Was = 100
Now = 69
0x000055878d55e581 in consumer_run () at undolr/examples/mpc_shmem_producer_consumer.c:97
97 int value = shmem_data_array[index];
This suggests that the bad value came from shared memory and was generated by another process. At this point it would to possible to load the recording for the producer and look for the time when the bad value was generated, but, in a more complex program than the producer-consumer example, this would be a tedious process.
With Multi-Process Correlation for Shared Memory, use the ublame command to look at the last few accesses to the location in shared memory:
37% 1,630> ublame -l 2 &shmem_data_array[index]
Num PID Time in recording Type Address Len
----- ------- ---------------------- ------ -------------- -----
2 3744746 450,143:0x55878d55e60c write 0x7f3672de00a8 4
* 1 3744749 1,630:0x55878d55e57f read 0x7f3672de00a8 4
The output shows that the current process (marked with an asterisk) read the value, but the value was originally written by the other process (the producer). Switch to the other recording at the specified time using the ugo inferior command, and continue the investigation from there:
37% 1,630> ugo inferior 2 450,143:0x55878d55e60c
[Switching to inferior 2 [<null>] (<noexec>)]
Loading recording for inferior 2.
[New Thread 3744746.3744746]
0x000055878d55e60c in producer_run () at undolr/examples/mpc_shmem_producer_consumer.c:120
120 shmem_data_array[index] = new_value;
Now use another watchpoint to find the source of the wrong value:
4% 450,143> print new_value
$2 = 100
4% 450,143> watch new_value
Hardware watchpoint 2: new_value
4% 450,143> reverse-continue
Continuing.
Thread 2.1 "mpc_shmem_produ" hit Hardware watchpoint 2: new_value
Was = 100
Now = 93
0x000055878d55e5f2 in producer_run () at undolr/examples/mpc_shmem_producer_consumer.c:119
119 int new_value = random_in_range(0, 100);
This is the source of the bug! The random_in_range()
function generates
a number in a closed range, so that it returns values in the range 0 to 100
(inclusive), while the consumer only accepts values in the range 0 to 99.
Working with multiple inferiors¶
When loading an Undo recording for which Multi-Process Correlation for Shared Memory was enabled, UDB automatically looks for the shared memory log and for other associated recordings. In order for this to work:
You must have enabled the feature using
undolr_shmem_log_filename_set()
.All recordings and the shared memory log must be in the same directory.
All recordings must have a
.undo
extension.The shared memory log must have a
.shmem
extension.
Use the ublame command to find the processes which accessed a shared memory location, and ugo inferior to switch to the recording of another process (specifying a time as well). List all the available inferiors using the info inferiors command.
info inferiors [inferior1 inferior2…]¶
Describe the specified inferiors (or all available inferiors).
For instance, using this command with the previous example would list the two available recordings:
4% 450,143> info inferiors Num PID Recording ----- ------- --------------------- 1 3744749 consumer-3744746.undo * 2 3744746 producer-3744746.undo
ugo inferior inferior[bbcount[:pc]]¶
Load the Undo recording corresponding to inferior and optionally jump to a time in its execution history.
The inferior can be specified by its number in the output of the info inferiors or ublame command, or by the recording filename.
Displaying the status of shared memory¶
UDB cannot know when a different, and potentially not recorded, process
accesses shared memory, so its understanding of the content of shared
memory is lazily updated for a region of memory only when the recorded program
accesses that memory. This means that when replaying a recording, inspecting a
location in shared memory which has not yet been accessed (for example, using
the print
or x
commands) cannot report the value that was in that
location at the corresponding time in record mode.
Similarly, when accessing shared memory after the recorded program read it, it is possible that another process has already overwritten the location, but UDB cannot know that this happened until the next time the recorded program accesses that location.
To clarify the status of shared memory, UDB has a pretty-printer which provides extra information:
12% 3,456> print &struct_in_shared_memory->value_we_never_read
$1 = <not logged>
12% 3,456> print &struct_in_shared_memory->value_we_read_a_while_ago
$2 = 42 <stale>
<not logged>
This means that there are no entries in the shared memory log for this memory location up to the current time in the recording, so its value is not available.
This means that either the memory location was not accessed by the program, the shared memory log overflowed and entries relating to this memory location were discarded, or the recording does not include the whole of the program’s execution.
Use the
print/r
command (orp/r
; “r” is short for “raw”) to bypass the pretty-printer and see the actual values in memory.
… <stale>
The memory location was accessed at an earlier time, but another process may have already overwritten it.
Limitation¶
Multi-Process Correlation for Shared Memory only supports accesses to shared
memory that is mapped at the same address in all recorded processes. In
practice, this means that all processes should be forked from the same initial
process and that shared memory must be created by calling mmap()
before
calling fork()
.