22

I have a bunch of commands that I start with a start.sh script, which stores their PIDs. Later, I want to stop them with a stop.sh script, run at the user's convenience.

Here is the trap:

  1. I run start.sh today, which stores the PIDs 15000, 15001, and 15002 in a file.
  2. I forget to stop my processes, and a week later I reboot my computer.
  3. I now run the stop.sh script. It blindly reads the PIDs 15000, 15001, and 15002 from the file and attempts to kill those tasks.
    => If any tasks happen to have these PIDs on my freshly rebooted system, they are no longer the ones started by my start.sh script, and killing them will put my system into an unknown state.

When I first capture the PID of a process with $$ in a Linux script, how can I gather other information to ensure there is no confusion with another task that might get the same PID in the future?

Gathering the PPID, for example, or the start date/time, or something that ensures some kind of "universal uniqueness", if I can put it that way?

How do you gather process information and how do you kill it without confusion?

11
  • 1
    You could start each process/script with its own variable: ENVVAR="somethingspecific" processname arguments, then later show the environment variables in ps (e.g. ps auxeww, on some OSes) and kill the processes that have that specific combination. Children may also inherit that variable, which you can either overwrite (when starting them) or keep as is.
    Commented Apr 17, 2023 at 10:43
  • Do they need to detach? Instead of command &, you may want to run screen command; you then have access to the running command, e.g. to send Ctrl-C to stop it later on.
    – allo
    Commented Apr 18, 2023 at 9:42
  • 2
    Does this answer your question? How to avoid killing a wrong process when using PID number for killing?
    – allo
    Commented Apr 18, 2023 at 9:43
  • @allo I think this thread now has more precise and complete answers than your link.
    Commented Apr 18, 2023 at 10:37
  • 7
    If you're the immediate parent of the process, the PID will never be reused until you reap the dead process's tombstone from the process table by using waitpid(). This is why competently designed process supervision systems don't have this problem: they run as the parent of the services they supervise, get an immediate SIGCHLD when the process exits, and then call waitpid() to reap the zombie process-table entry that the dead process leaves behind. Before they call waitpid(), nothing else can be assigned that process ID.
    Commented Apr 18, 2023 at 20:42

6 Answers

40

How do you gather process information and how do you kill it without confusion?

You don’t.

Instead, you ensure the process runs in a context that you can reliably reference later to find it again.

The correct way to handle this is to use your target platform’s service management system (usually systemd on Linux systems these days). Such systems handle things correctly in the majority of cases and are specifically designed for this kind of task.

Alternatives, in decreasing order of preference, are:

  • Use a cgroup with a specific name. This approach is Linux-specific, but has a number of distinct benefits, such as being able to reliably and atomically kill all the children of the processes you start. Cgroups inherently solve the lifetime issue because the processes themselves, rather than their PIDs, are explicitly associated with the cgroup.
  • Use a supervisor system such as runit, s6, or daemontools. These solve the issue by utilizing a process that can be easily and reliably located to act as a parent for the process you want to monitor.
  • Put your PID files in /run, where they should be. The issue you point out of PID reuse across system reboots is a known issue that’s been reliably solved for decades by simply putting PID files in a directory that gets wiped each time the system reboots. /run is the standard location for this on Linux systems. This still has issues with PID reuse (because PIDs are only unique for the lifetime of the associated processes, so one of your processes dying unexpectedly and leaving a PID file behind may still run into a reuse problem).
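
As a rough sketch of the service-manager route, assuming a Linux system with systemd and a running user session: systemd-run can start a command as a transient service, which can later be stopped by name rather than by PID. The unit name my-job and the helper names start_job and stop_job are made up for illustration.

```shell
# Hypothetical unit name; anything unique to your job works.
UNIT=my-job

# Start the given command as a transient user service named $UNIT.
# --collect lets systemd discard the unit afterwards even if it failed.
start_job() {
    systemd-run --user --unit="$UNIT" --collect "$@"
}

# Stop the service by name. systemd tracks the unit's processes in a
# cgroup, so no PID file is involved and PID reuse does not matter.
stop_job() {
    systemctl --user stop "$UNIT.service"
}

# Example (not run here):
#   start_job sleep 1000
#   stop_job
```

Because the unit name is the handle, running stop_job after a reboot simply reports that the unit is not loaded instead of signalling an unrelated process.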
2
  • 2
    I think you have the wisest answer, but in case of dire need, other solutions are interesting to know.
    Commented Apr 17, 2023 at 2:45
  • A portable alternative to /run is to put PID files in /var/run, which is a symlink to somewhere under /run.
    – Joshua
    Commented Apr 17, 2023 at 16:04
12

When I had to launch all kinds of diagnostic scripts across a large network, I had all my scripts accept (and ignore) a --tag=... option. (Obviously, you cannot do this with standard commands, but you can wrap them in a parent shell.)

A typical --tag would contain (at least) the hostname from which it was launched, a random number, and the time of launch (down to nanosecond accuracy). For a remote task, you may not even know the pid on the remote system.

The ps command can show the args so you can grep a specific process. You could even have a cron job that produced a report of latent processes at intervals, and weeded out from your list those that had terminated.

1
  • 1
    You could write a saferkill PID TAG command that checks whether the PID has a specific tag and, if so, kills it. It won't protect against really fast PID replacement, though.
    Commented Apr 17, 2023 at 16:47
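
A minimal sketch of such a saferkill, assuming Linux's /proc/PID/cmdline is available. The helper name cmdline_of and the exact tag format are assumptions, not something from the thread. As the comment notes, the check and the kill are still two separate steps, so a very fast PID replacement can slip in between.

```shell
# Hypothetical helper: print the command line of PID, with the NUL
# separators from /proc/PID/cmdline turned into spaces.
cmdline_of() {
    [ -r "/proc/$1/cmdline" ] || return 1
    tr '\0' ' ' < "/proc/$1/cmdline"
}

# saferkill PID TAG: send SIGTERM only if the command line of PID
# contains TAG; otherwise leave the process alone and return 1.
saferkill() {
    case "$(cmdline_of "$1")" in
        *"$2"*) kill "$1" ;;
        *) echo "saferkill: PID $1 does not carry tag $2" >&2
           return 1 ;;
    esac
}

# Example (not run here): wrap a job in a parent shell so that the tag
# shows up in that shell's command line, then stop it by PID and tag.
#   tag="--tag=$(hostname)-$$-$(date +%s)"
#   sh -c 'while :; do sleep 1; done' job "$tag" &
#   saferkill "$!" "$tag"
```

The wrapper shell carries the tag in its own argument list, which is why standard commands that know nothing about --tag can still be tracked this way.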
11

Store the PID information on a tmpfs filesystem, so after reboot, the files do not exist.

/run is usually tmpfs, or /tmp on some distros is tmpfs.

Or mount your own:

# mount tmpfs /path/to/your/mountpoint -t tmpfs 
5
  • Does some kind of unique session ID exist on Linux, a number generated at boot time that I could use to create a unique temp file in /tmp, such as /tmp/<session-id>/<pid>? It would avoid the need to be root/sudo to create that file.
    Commented Apr 16, 2023 at 15:41
  • @MarcLeBihan Maybe just a timestamp?
    Commented Apr 16, 2023 at 18:43
  • 7
    This is not sufficient, because PIDs can be reused even without rebooting the system. The question specifically states that the requirement is to avoid "confusion with another task of same pid that could appear in the future"; this approach does not achieve that.
    – D.W.
    Commented Apr 16, 2023 at 23:04
  • It does achieve it in the OP's scenario though. Modify the stop.sh script to also remove the file, in case you stop them and forget.
    – Steve
    Commented Apr 16, 2023 at 23:20
  • 2
    Perhaps one thing it does not protect against is the process crashing and leaving the file behind.
    – Steve
    Commented Apr 16, 2023 at 23:20
10

You could gather additional information on the processes, but strictly speaking, it's hard to be absolutely sure the process you're going to kill is the same as the one you think it is.

Even disregarding reboots, it's possible that just before you kill the process, it dies for an unrelated reason, and another process starts at just the right time to get the same PID again. Any checks would still be racy, as it's possible for the process to be replaced with another between the check and the sending of the terminating signal.

However, if you take care to start the process under a UID that is used for nothing else, and send the terminating signal from a process with that same UID (and not root), you'll know you won't succeed in killing anything else. (The PID could still be reused by some unrelated process, but then you simply wouldn't be able to kill the new one.)

If you do go the route of checking whether the process is the same (and can ignore the chance of a race), you should probably at least check the process name / full command line of the process you're going to kill. E.g. run something like pgrep somecmd, or pgrep -f 'somecmd with some args' to find processes with the given name / command line, and see if your PID is listed in the output.

9

Have start.sh bind-mount /proc/15000 to some other directory somewhere. (Preferably this would be done by the parent process before it exits or waits to avoid a race condition.) In stop.sh, try to open the directory you bind-mounted. If the bind-mount is gone, then the system got rebooted. If opening the directory fails with ESRCH (No such process), then the process exited. (This will happen even if a new process with the same PID has since started running.) If opening the directory succeeds and the bind-mount is still there, then it's still your process and it's safe to kill. (To avoid yet another race condition, prefer pidfd_send_signal to kill it.)

6

You also need to deal with PID rollover. I believe the default maximum on Linux is 32768, so if you start enough processes, the PIDs will eventually roll over.

Note: The rollover skips any PIDs that are still in use, such as init (PID 1).

However, if you store both the PID and the start time of the process, that should be sufficient to identify the process uniquely on this machine.

If you need uniqueness across machines you will also need to store something that uniquely identifies the machine.
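
A minimal sketch of the PID-plus-start-time idea, assuming Linux's /proc layout: field 22 of /proc/PID/stat is the process start time in clock ticks since boot, and /proc/sys/kernel/random/boot_id holds an identifier that changes on every boot. The function names and the "boot_id pid starttime" record format are made up for illustration.

```shell
# Print the start time of PID (field 22 of /proc/PID/stat).
proc_starttime() {
    [ -r "/proc/$1/stat" ] || return 1
    stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    # The command name (field 2) may contain spaces or parentheses,
    # so drop everything up to the last ") " before splitting.
    rest=${stat##*") "}
    set -- $rest
    # $1 is now field 3 (state), so field 22 is ${20}.
    echo "${20}"
}

# Print the current boot ID, or "unknown" if it cannot be read.
current_boot() {
    cat /proc/sys/kernel/random/boot_id 2>/dev/null || echo unknown
}

# record_process PID FILE: append "boot_id pid starttime" to FILE.
record_process() {
    start=$(proc_starttime "$1") || return 1
    echo "$(current_boot) $1 $start" >> "$2"
}

# same_process BOOT_ID PID STARTTIME: succeed only if PID still refers
# to the process that was recorded with these values.
same_process() {
    [ "$(current_boot)" = "$1" ] && [ "$(proc_starttime "$2")" = "$3" ]
}

# stop_recorded FILE: kill every recorded process that still matches.
stop_recorded() {
    while read -r boot_id pid start; do
        if same_process "$boot_id" "$pid" "$start"; then
            kill "$pid"
        fi
    done < "$1"
}
```

In this sketch, start.sh would call record_process "$!" on each background job and stop.sh would call stop_recorded on the same file. The check and the kill remain separate steps, so a small race window is still there; the sketch only rules out stale records from an earlier boot or an earlier process with the same PID.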

9
  • pids in Linux haven't been signed 16 bit in at least 15 years. Probably more.
    – RonJohn
    Commented Apr 16, 2023 at 8:20
  • 3
    You're correct: 32768 is still the default on 32-bit systems, but the default on 64-bit systems is 2^22.
    – DavidT
    Commented Apr 16, 2023 at 8:37
  • 2^22 is (metaphorically) an odd number to choose as the maximum pid.
    – RonJohn
    Commented Apr 16, 2023 at 8:40
  • 1
    man 5 proc - On 32-bit platforms, 32768 is the maximum value for pid_max. On 64-bit systems, pid_max can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million).
    – DavidT
    Commented Apr 16, 2023 at 8:42
  • 1
    @RonJohn I would imagine the kernel packs the rest of that int (or long) with a bunch of other information, like an index into a struct, and maybe privilege levels and states. 22 bits is 4 million processes, each with a stack, read-only segment and code segment. So other system limits (like page tables) limit process numbers too.
    Commented Apr 16, 2023 at 14:25
