# Multi-Process Correlation for Shared Memory¶

Live Recorder can record accesses to shared memory from multiple processes to a single shared memory log file. This can be later used to track which process modified a region of shared memory.

First of all, set up your program to record itself using Live Recorder, as documented in using the Live Recorder API.

## Enabling shared memory logging¶

Multi-Process Correlation for Shared Memory is currently used in the following way:

• A process creates one or more shared maps.
• It calls undolr_shmem_log_filename_set to set the file where shared memory accesses are going to be logged.
• It forks some child processes which share the previously created shared memory maps.
• All the processes call undolr_start to record themselves.
• All the processes save their recordings by either calling undolr_save_on_termination or undolr_save.

At its simplest, this feature can be used (omitting all error checking for simplicity) as follows:

#include "undolr.h"

#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
undolr_shmem_log_filename_set("example.shmem");

const int n_elements = 100;
int *shmem_data = mmap(NULL,
sizeof(*shmem_data) * n_elements,
MAP_ANONYMOUS | MAP_SHARED,
0, 0);

pid_t pid = fork();
if (pid == 0)
{
// Child.
undolr_start(NULL);
undolr_save_on_termination("child.undo");

// Do some work and access shmem_data.
shmem_data[0] = 42;
shmem_data[1] = 123;
}
else
{
// Parent.
undolr_start(NULL);
undolr_save_on_termination("parent.undo");

// Do some work and potentially access shmem_data.
// ...

// Wait for the child process.
waitpid(pid, NULL, 0);

// Do some more work.
shmem_data[0] = 100;
}

return 0;
}


This example program will generate three files:

parent.undo
The recording containing the execution history for the main (parent) process.
child.undo
The recording containing the execution history for the child process.
example.shmem
Containing the details of the shared memory accesses made by both processes.

The recording files (which should have a .undo extension) can then be loaded into UndoDB.

Once a recording is loaded, the ublame command can be used to print the list of accesses to the specified address which lead to the current value in memory. For instance, after loading parent.undo and going at the end of the main function, it’s possible to do:

(udb) ublame &shmem_data[0]
Num    PID  Time in recording    Type    Address           Len
-----  -----  -------------------  ------  --------------  -----
2    130  354:0x55eeabe09804   write   0x7f894cb6c000      4
*     1    126  360:0x55eeabe09848   write   0x7f894cb6c000      4


This shows all accesses which lead shmem_data[0] to acquire its current value in the current process, ordered by access time to the memory.

The columns in the table printed by ublame are:

Num

The number identifying the inferior (that is, the recorded process) which executed the access to shared memory.

UndoDB associates such numbers with each separate inferior, similar to how different threads are identified by a thread number.

PID

The process identifier for the process which executed the access to shared memory.

The operating system may reuse a process identifier after the corresponding process terminated, so there’s a chance of two identical process identifiers for different recording IDs. Because of this, the PID column is provided only for convenience and the recording ID value should be used to uniquely identify processes and recordings.

Time in recording

The time in the history of the specified recording at which the access happened. For details on the format see the time notation documentation.

Note that times grow monotonically for each recording, but times for different recordings are unrelated. That is, each recording has its own time and times from different recordings cannot be directly compared.

Type

Read/write operations correspond to machine instructions which both read and write a value, for instance the cmpxchg instruction on x86 architectures used to implement atomic operations.

The start address at which the access was executed.

This doesn’t necessarily match the argument passed to the ublame command as the same memory can be accessed through pointers to variables of different sizes. For instance, using ublame on a pointer to uint8_t could also report accesses made to more bytes using a pointer to an overlapping larger integer.

Len

The size of the access.

Accesses made by the process corresponding to the currently loaded recording (i.e. the current inferior) are marked with an asterisk (*).

The ublame command only shows accesses leading to the current value of a certain memory area. This means that:

• No accesses will be reported if the memory being blamed was never (or not yet) accessed by the current recording.
• Accesses made at a later time in the recording are not shown.

## Debugging an example program¶

A simple example program is provided in the examples/ directory of UndoDB. The source is in shmem_producer_consumer.c and executables are provided for the supported platforms.

• The example program forks into two processes, a producer and a consumer.
• The producer generates several numbers in the [0, 100) open range and writes them in an array in shared memory.
• The consumer reads the values from shared memory and uses them.

The example program often succeeds, but it occasionally fails and aborts:

$./shmem_producer_consumer Will save shared memory log to: ./log-133.shmem Will save recording for producer to: ./producer-133.undo Will save recording for consumer to: ./consumer-133.undo shmem_producer_consumer: shmem_producer_consumer.c:99: consumer_run: Assertion value >= 0 && value < 100' failed. Child process failed.  The crash happened in the consumer, so we can load its recording in UndoDB: $ udb consumer-133.undo
...

No more reverse-execution history.
0x00005618d2d15a65 in record_me (process_name=0x5618d2db1c0b "consumer") at shmem_producer_consumer.c:43
43      int e = undolr_start(NULL);


The current point in history is at the start of the recording, but the crash happens at the very end, so we can go to the end and check what happened:

(udb) ugo end
udb: Have reached end of recorded history.

Program received signal SIGTRAP, Trace/breakpoint trap.
udb: Have reached end of recorded history.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(udb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f61b01bd801 in __GI_abort () at abort.c:79
#2  0x00007f61b01ad39a in __assert_fail_base (fmt=0x7f61b03347d8 "%s%s%s:%u: %s%sAssertion %s' failed.\n%n", assertion=assertion@entry=0x5618d2db1c2e "value >= 0 && value < 100",
file=file@entry=0x5618d2db1c14 "shmem_producer_consumer.c", line=line@entry=99, function=function@entry=0x5618d2db1d08 <__PRETTY_FUNCTION__.4403> "consumer_run") at assert.c:92
#3  0x00007f61b01ad412 in __GI___assert_fail (assertion=0x5618d2db1c2e "value >= 0 && value < 100", file=0x5618d2db1c14 "shmem_producer_consumer.c", line=99, function=0x5618d2db1d08 <__PRETTY_FUNCTION__.4403> "consumer_run")
at assert.c:101
#4  0x00005618d2d15bef in consumer_run () at shmem_producer_consumer.c:99
#5  0x00005618d2d15da7 in main (argc=1, argv=0x7ffe6a2c91e8) at shmem_producer_consumer.c:176


We are deep inside the call stack of the assert function. To debug the problem, we need to go back to the example code. This can be done by using reverse-finish multiple times. Each time the command is used, execution is continued backwards until the calling point of the current function.

(udb) reverse-finish
79  abort.c: No such file or directory.
(udb) reverse-finish
92  assert.c: No such file or directory.
(udb) reverse-finish
101 in assert.c
(udb) reverse-finish
99          assert(value >= 0 && value < 100);


Now that we are back inside the consumer_run function, we can print the value of the variable which caused the assertion to fail:

(udb) print value
$1 = 100  The value is invalid, but what caused it? It’s possible to use a watchpoint on the value variable and continue execution backwards until the source of that value is found: (udb) watch value Hardware watchpoint 1: value (udb) reverse-continue Continuing. Hardware watchpoint 1: value Old value = 100 New value = 83 0x00005618d2d15bc1 in consumer_run () at shmem_producer_consumer.c:96 96 int value = shmem_data_array[index];  Unfortunately the bad value comes from shared memory and was generated by another process. At this point we could load the recording for the producer and look for the time when the bad value was generated, but, in a less trivial program, it may be difficult to find the exact moment at which it happened. With Multi-Process Correlation for Shared Memory, it’s possible to use the ublame command to find out when the bad value was generated: (udb) ublame &shmem_data_array[index] Num PID Time in recording Type Address Len ----- ----- -------------------- ------ -------------- ----- 2 133 2,822:0x5618d2d15c61 write 0x7f61b0794118 4 * 1 137 3,040:0x5618d2d15bbf read 0x7f61b0794118 4  From the output we can see that the current process (marked with an asterisk) read the value at time 3,040:0x5618d2d15bbf (relative to the currently loaded recording), but the value was originally written by another process, with PID 133 and inferior number 2, at time 2,822:0x5618d2d15c61 (relative to that recording, not the current one). We can switch to the other recording at the specified time to continue the investigation: (udb) ugo inferior 2 2,822:0x5618d2d15c61 ... 0x00005618d2d15c61 124 shmem_data_array[index] = new_value;  Now we can use a watch point again to find the source of the wrong value: (udb) print new_value$2 = 100
(udb) watch new_value
Hardware watchpoint 2: new_value
(udb) reverse-continue
Continuing.

Hardware watchpoint 2: new_value

Old value = 100
New value = 95
0x00005618d2d15c47 in producer_run () at shmem_producer_consumer.c:123
123         int new_value = random_in_range(0, 100);


This is the source of the bug!

Looking at the random_in_range function we can see it accepts a closed range as arguments, so it will return values between 0 and 100 (both included), while the consumer only accepts values in the range 0 to 99.

## Working with multiple inferiors¶

When loading a recording for which Multi-Process correlation for shared memory was enabled, UndoDB will automatically look for the shared memory log and for other associated recordings. For this to work:

• You must have enabled the feature with undolr_shmem_log_filename_set.
• All recordings and the shared memory log must be in the same directory.
• All recordings must have a .undo extension.
• The shared memory log must have a .shmem extension.
• You can rename both recordings and shared memory log as long as you maintain the correct extension.

As shown in the previous sections, you can use the ublame command to look for processes which accessed a shared memory location, and ugo inferior to switch to the recording of another process (optionally specifying a time as well).

Additionally, you can list all the inferiors using the uinfo inferiors command. For instance, using this command with the previous example would list the two available recordings, consumer-133.undo (the one we manually loaded in UndoDB) and producer-133.undo:

(udb) uinfo inferiors
Num    PID  Recording
-----  -----  -----------------
*     1    137  consumer-133.undo
2    133  producer-133.undo


## APIs¶

Note

See the undolr.h header file for the most detailed information about available APIs.

• undolr_shmem_log_filename_set - Set the path where to save the log for shared memory access. If this function is not called, shared memory access across processes is not saved. This needs to be called before recording starts (before undolr_start is called).
• undolr_shmem_log_filename_get - Get the path where to save the log for shared memory access, that is the value previously set with undolr_shmem_log_filename_set.
• undolr_shmem_log_size_set - Set the maximum size (in bytes) for the shared memory log. If not called, a suitable default will be picked. Once the maximum size is reached, old entries are discared and overwritten with new ones.
• undolr_shmem_log_size_get - Get the current maximum size (in bytes) for the shared memory log.

## Displaying the status of shared memory¶

UndoDB cannot know when a different, and potentially not recorded, process accesses shared memory, so its local understanding of the content of shared memory is lazily updated only when the recorded process accesses a region of memory, and only for that accessed area of memory. Due to this, printing at replay time (for instance using the print command) an area of shared memory which was not accessed earlier cannot report the correct value.

Similarly, accessing memory some time after the recorded process read it, may mean that another process has already overwritten that value, but UndoDB cannot know about it until the next time the recorded program accesses it.

To make this situation clearer, UndoDB prints extra information about variables in shared memory:

(udb) print &struct_in_shared_memory->value_we_never_read
$1 = <never accessed> (udb) print &struct_in_shared_memory->value_we_read_a_while_ago$2 = 42 <stale>

<never accessed>
At the current time in the recording, the memory the user is trying to print has not been accessed, so its value is not available.
… <stale>
The memory the user is trying to print was accessed at an earlier time, but, in the meantime, another process may have already overwritten it.

## Current limitations¶

Currently, only memory mapped at the same address in all processes is supported. In practice, this means that all processes should be forked from the same initial process and that shared memory must be set up before calls to fork.