This project was inspired by an article about how to write a thread watchdog in C. After reading it I thought "this would be a nice Ada project!"
So, here it is. This post is about my experience in writing it. My main motivation was to do an "exercise" in programming, but maybe it can be useful somewhere.
Task watchdog and how I did it
The problem is to monitor different tasks in a multi-task program and raise an alarm if a task stops working. A task proves that it is still alive by calling a specific procedure, I_Am_Alive. If it fails to call it regularly, it is considered dead and an alarm is raised.
Three ingredients are involved in this:
- The watcher itself, that is, the task that regularly checks whether the other tasks are still alive.
- A connection to the watchdog, used by the controlled task to tell the watchdog "Hey, I am still alive!"
- An alarm handler that does something when the watchdog raises an alarm.
Let's examine the three ingredients.
The watcher
Let's check the package interface. The key ingredient is the watcher:
package Watchdogs.Connections is
   --
   -- Type representing the watcher, that is, the object that wakes up
   -- and checks if its tasks are still alive.
   --
   type Watcher_Type is private;

   --
   -- Create a new watcher type specifying an alarm handler and a sampling
   -- time, that is, the time interval between successive wake ups.
   --
   function Create (Alarm_Handler : Alarm_Handlers.Alarm_Handler_Access;
                    Sampling      : Duration := 1.0)
                    return Watcher_Type;

   -- Other stuff ...
end Watchdogs.Connections;
Watcher_Type represents the object that does the "dirty work." The task to be controlled opens a connection to it and uses that connection to communicate with the watcher.
A watcher must be created with the function Create, which expects as its first parameter an access (a pointer, in C jargon) to an alarm handler.
The alarm handler
The definition of an alarm handler is the following:
package Watchdogs.Alarm_Handlers is
   --
   -- Interface for an alarm handler. Every handler you want to implement
   -- must descend from this and implement Task_Exited and Task_Unresponsive.
   --
   type Alarm_Handler_Interface is limited interface;

   type Alarm_Handler_Access is
     access all Alarm_Handler_Interface'Class;
For those not experienced in Ada: interface means that Alarm_Handler_Interface is an abstract type. You cannot create variables of this type; it works like an interface template from which you derive new concrete types. limited means that you can derive limited types, that is, types that cannot be assigned (more on this in the following). Finally, Alarm_Handler_Interface'Class is a catch-all (class-wide) type that includes Alarm_Handler_Interface and every type derived from it. In other words, Alarm_Handler_Access can point to values of any type derived from Alarm_Handler_Interface.
The interface Alarm_Handler_Interface requires that any non-abstract descendant implements two procedures: Task_Unresponsive (called when a task does not respond anymore, maybe because it is stuck somewhere) and Task_Exited (called when a task exits):
   --
   -- Called when a task is unresponsive. It receives the task identification
   -- (name and/or ID) and the latest registered checkpoint
   --
   procedure Task_Unresponsive (Handler    : in out Alarm_Handler_Interface;
                                ID         : Task_Identification.Task_Id;
                                Name       : String;
                                Checkpoint : Checkpoint_Type)
   is abstract
     with Pre'Class =>
       (Name /= "" or ID /= Task_Identification.Null_Task_Id);

   --
   -- Called when a task exits. It receives the task identification
   -- (name and/or ID)
   --
   procedure Task_Exited (Handler : in out Alarm_Handler_Interface;
                          ID      : Task_Identification.Task_Id;
                          Name    : String)
   is abstract
     with Pre'Class =>
       (Name /= "" or ID /= Task_Identification.Null_Task_Id);
end Watchdogs.Alarm_Handlers;
Both procedures expect as parameters an identification of the task, namely its Task_Id and/or a name. Either the ID or the name (but not both) can be empty. The procedure Task_Unresponsive also expects a Checkpoint value that can be used to know where the task got stuck; more about this later.
This approach allows users to implement their own alarm handler that can do anything. For convenience, package Watchdogs.Alarm_Handlers.To_Stderr defines a ready-to-wear (prêt-à-porter) alarm handler that just prints a message to the standard error.
This is an example (from main.adb) of how to create a watcher:
--
-- Get a watcher
--
Watcher : constant Connections.Watcher_Type :=
            Connections.Create (Alarm_Handler => new To_Stderr.Handler_Type);
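To give an idea of what writing your own handler looks like, here is a minimal, self-contained sketch. Note that it does not use the real library: the nested package Alarm_Handlers is a simplified stand-in (without the Pre'Class contracts) so that the example compiles as a single file, and Counting_Handler and Custom_Handler_Demo are hypothetical names invented for the illustration.

```ada
with Ada.Text_IO;
with Ada.Task_Identification;

procedure Custom_Handler_Demo is
   package Task_Identification renames Ada.Task_Identification;

   --  Simplified stand-in for Watchdogs.Alarm_Handlers (an assumption:
   --  the real package also carries the Pre'Class contracts)
   package Alarm_Handlers is
      type Checkpoint_Type is mod 2 ** 16;

      type Alarm_Handler_Interface is limited interface;
      type Alarm_Handler_Access is access all Alarm_Handler_Interface'Class;

      procedure Task_Unresponsive
        (Handler    : in out Alarm_Handler_Interface;
         ID         : Task_Identification.Task_Id;
         Name       : String;
         Checkpoint : Checkpoint_Type) is abstract;

      procedure Task_Exited
        (Handler : in out Alarm_Handler_Interface;
         ID      : Task_Identification.Task_Id;
         Name    : String) is abstract;
   end Alarm_Handlers;

   --  Hypothetical concrete handler that counts the alarms it receives
   package Counting_Handlers is
      type Counting_Handler is
        limited new Alarm_Handlers.Alarm_Handler_Interface with
         record
            Alarms : Natural := 0;
         end record;

      overriding procedure Task_Unresponsive
        (Handler    : in out Counting_Handler;
         ID         : Task_Identification.Task_Id;
         Name       : String;
         Checkpoint : Alarm_Handlers.Checkpoint_Type);

      overriding procedure Task_Exited
        (Handler : in out Counting_Handler;
         ID      : Task_Identification.Task_Id;
         Name    : String);
   end Counting_Handlers;

   package body Counting_Handlers is
      overriding procedure Task_Unresponsive
        (Handler    : in out Counting_Handler;
         ID         : Task_Identification.Task_Id;
         Name       : String;
         Checkpoint : Alarm_Handlers.Checkpoint_Type) is
      begin
         Handler.Alarms := Handler.Alarms + 1;
         Ada.Text_IO.Put_Line
           (Name & " stuck after checkpoint"
            & Alarm_Handlers.Checkpoint_Type'Image (Checkpoint));
      end Task_Unresponsive;

      overriding procedure Task_Exited
        (Handler : in out Counting_Handler;
         ID      : Task_Identification.Task_Id;
         Name    : String) is
      begin
         Handler.Alarms := Handler.Alarms + 1;
         Ada.Text_IO.Put_Line (Name & " exited");
      end Task_Exited;
   end Counting_Handlers;

   Handler : constant Alarm_Handlers.Alarm_Handler_Access :=
     new Counting_Handlers.Counting_Handler;
begin
   --  Dispatching calls through the class-wide access, as a watcher would do
   Handler.Task_Unresponsive
     (ID         => Task_Identification.Null_Task_Id,
      Name       => "worker",
      Checkpoint => 42);
   Handler.Task_Exited
     (ID => Task_Identification.Null_Task_Id, Name => "worker");
end Custom_Handler_Demo;
```

With the real library you would presumably derive from Watchdogs.Alarm_Handlers.Alarm_Handler_Interface instead and pass the allocated handler to Create, just like To_Stderr.Handler_Type above.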
The connection
The type for a connection to the watcher is Watchdog_Connection. Its definition is:
--
-- A watchdog connection allows a task to communicate with the
-- watcher
--
type Watchdog_Connection (<>) is limited private;
If you have no experience with Ada you may find the syntax above a bit obscure. The (<>) means that Watchdog_Connection can have some unknown discriminants. Without entering into technical details, this prevents the user from declaring a variable of this type without initializing it. The limited part means that you cannot copy a value of type Watchdog_Connection: a value is born, lives and dies in the same variable. This is useful for values that carry an "external connection," which it makes no sense to copy.
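If you want to see the effect of (<>) and limited in isolation, the following is a small sketch on a toy type. Everything in it (the package Tokens, the type Token, the functions Make and Label_Of) is hypothetical and has nothing to do with the watchdog library; it only reproduces the same kind of declaration.

```ada
with Ada.Text_IO;

procedure Limited_Demo is

   --  Hypothetical package, unrelated to the watchdog library
   package Tokens is
      --  Unknown discriminants + limited, like Watchdog_Connection
      type Token (<>) is limited private;

      function Make (Label : String) return Token;
      function Label_Of (Item : Token) return String;
   private
      type Token (Length : Natural) is limited
         record
            Label : String (1 .. Length);
         end record;
   end Tokens;

   package body Tokens is
      function Make (Label : String) return Token is
      begin
         --  A limited value is built directly in the caller's object
         return (Length => Label'Length, Label => Label);
      end Make;

      function Label_Of (Item : Token) return String is
      begin
         return Item.Label;
      end Label_Of;
   end Tokens;

   --  Legal: the object is initialized by a function call
   Good : constant Tokens.Token := Tokens.Make ("foo");

   --  Both declarations below are rejected by the compiler:
   --  Bad_1 : Tokens.Token;          --  no initialization
   --  Bad_2 : Tokens.Token := Good;  --  copy of a limited value
begin
   Ada.Text_IO.Put_Line (Tokens.Label_Of (Good));
end Limited_Demo;
```

The only way to get a Token is to call Make, in the same way that the only way to get a Watchdog_Connection is to open one.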
A connection is created with the function Open:
--
-- Open a connection with the watcher. The task needs to introduce itself
-- with a name or a Task_ID, possibly both. Those values will be passed
-- to the Alarm_Handler if the task becomes unresponsive.
--
function Open (Watchdog  : Watcher_Type;
               Task_Name : String := "";
               ID        : Task_Identification.Task_Id := Task_Identification.Null_Task_Id)
               return Watchdog_Connection
  with
    Pre => (Task_Name /= "" or ID /= Task_Identification.Null_Task_Id);
The function expects the watcher to connect to and a way to identify the task: a name, the task ID, or both. At least one of them must be present, as specified by the pre-condition.
Again, if you have no experience with Ada you may wonder what the part with Pre => ... is. It is a pre-condition, a condition that must be satisfied when you call the function. It can be considered part of the documentation, but it has the advantage that the compiler (if instructed to do so) can add code that checks the pre-condition at run time and raises an exception if it is not satisfied. A powerful bug trap...
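Here is a minimal sketch of the bug trap in action. The function Describe is hypothetical (it only mimics the shape of the contract of Open), and pragma Assertion_Policy (Check) is one way of instructing the compiler to generate the run-time checks; your build setup may enable them differently.

```ada
pragma Assertion_Policy (Check);

with Ada.Assertions;
with Ada.Text_IO;

procedure Precondition_Demo is

   --  Hypothetical function with the same kind of contract as Open:
   --  at least one of the two parameters must be given
   function Describe (Task_Name : String  := "";
                      Number    : Natural := 0) return String
     with Pre => (Task_Name /= "" or Number /= 0);

   function Describe (Task_Name : String  := "";
                      Number    : Natural := 0) return String is
   begin
      return "task '" & Task_Name & "' number" & Natural'Image (Number);
   end Describe;

begin
   --  The pre-condition holds: a name is given
   Ada.Text_IO.Put_Line (Describe (Task_Name => "foo"));

   --  Neither a name nor a number: the pre-condition fails and the
   --  call raises Assertion_Error before the body is executed
   Ada.Text_IO.Put_Line (Describe);
exception
   when Ada.Assertions.Assertion_Error =>
      Ada.Text_IO.Put_Line ("pre-condition violated");
end Precondition_Demo;
```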
When the task ends, the connection is automatically closed (and Task_Exited is called) by the destructor of the connection.
After opening a connection, the task must declare that it is alive; it does so by calling I_Am_Alive:
type Checkpoint_Type is mod 2 ** 16;

--
-- Let the watcher know that we are still alive. If this procedure is
-- called at different points of the task, it is possible to distinguish
-- the different calls via the Checkpoint parameter. The reason for having
-- it is just to know which was the latest call of I_Am_Alive
-- before the task crashed. Its value will be given to the
-- alarm handler.
--
procedure I_Am_Alive (Connection : in out Watchdog_Connection;
                      Checkpoint : Checkpoint_Type := 0);
The procedure I_Am_Alive accepts an optional Checkpoint parameter (a 16-bit unsigned integer, actually) that the task can use to distinguish between different calls to I_Am_Alive. If the task gets stuck, the latest Checkpoint is given to the alarm handler together with the task identification (ID or name), making it possible to identify where the task got stuck.
An example
This is a very simple example of how a connection is used, a simplified version of what you find in main.adb:
task body Foo is
   Connection : Connections.Watchdog_Connection :=
                  Connections.Open (Watchdog  => Watcher,
                                    Task_Name => "my name is foo, task foo",
                                    ID        => Task_Identification.Current_Task);
   Sleep_Time : Duration := 0.1;
begin
   --
   -- Now we are connected with the watcher that will check that we
   -- call I_Am_Alive regularly
   --
   loop
      --
      -- At every iteration we increase the Sleep_Time so that sooner
      -- or later it will exceed the wake up time of the watcher
      --
      delay Sleep_Time;
      Sleep_Time := Sleep_Time + 0.2;

      --
      -- Tell the watcher we are alive
      --
      Connections.I_Am_Alive (Connection);
   end loop;
end Foo;
Digging in the internals
The user API is nice and cool, but you want a few gory details about the implementation, right? OK, so let's check out the private definition of the watcher from package Watchdogs.Connections:
private
   type Watcher_Type is access Watchers.Watchdog_Core;
Uh?!? That's it? Just an access to a "core type"? That's cheating...
Well, let's check the definition of Watchdog_Core in Watchdogs.Watchers:
private package Watchdogs.Watchers is
   --
   -- Object doing all the work. This exports an interface similar
   -- to the user visible Watcher_Type. This object is multitask safe
   -- (with Ada it is just too easy...)
   --
   type Watchdog_Core is limited private;
private
   --
   -- Other stuff...
   --
   type Watchdog_Core is limited
      record
         Doa_Table : Task_Table_Access;
         Watcher   : Watchdog_Task_Access;
         Handler   : Alarm_Handlers.Alarm_Handler_Access;
      end record;
end Watchdogs.Watchers;
Several comments are in order.
First, do you see the keyword private before package? This means that Watchdogs.Watchers is a private package and it cannot be made visible outside the hierarchy of Watchdogs. In particular, the library user (i.e., the programmer who uses the library) cannot directly access the resources provided by Watchdogs.Watchers, but only through Watchdogs.Connections, which imports Watchdogs.Watchers with the context clause
private with Watchdogs.Watchers;
The keyword private before with says "Listen, I need the resources in Watchdogs.Watchers, but I promise, cross my heart, that I will never ever let the user see them." Indeed, if you check watchdogs-connections.ads you'll see that Watchdogs.Watchers is referred to only in the private part of the package, out of reach of the prying hands of the user...
Second, the definition of Watchdog_Core looks simple, just three fields. The last one is the access to the alarm handler (this is easy), but what are the other two fields? Here the hard stuff lies... ;-)
Let's begin with the easy stuff: the field Watcher is a Watchdog_Task_Access, which we guess is an access to Watchdog_Task, but what is the latter? Well, a task:
task type Watchdog_Task is
   --
   -- This task is the real watchdog: it wakes up every now and then,
   -- checks the task table for dead tasks and, if necessary, calls
   -- the alarm handler
   --
   entry Init (Sampling : Duration;
               Table    : Task_Table_Access;
               Handler  : Alarm_Handlers.Alarm_Handler_Access);
end Watchdog_Task;

type Watchdog_Task_Access is access Watchdog_Task;
Since we declared it as a task type, Watchdog_Task behaves as a type and we can, for example, declare variables of this type. Declaring a variable of type Watchdog_Task starts a new task that runs in parallel. In Ada, synchronization is traditionally done by message passing, via calls to a task entry. In this case the entry is just used to give the task a few parameters. The task will wake up every Sampling seconds, check for unresponsive tasks and, if necessary, call the handler.
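If you have never seen a task entry used this way, here is a tiny self-contained sketch of the same pattern: a task type that waits on an Init entry for its parameters and then goes on by itself. The names (Ticker, Entry_Demo, Period, Count) are hypothetical; the real Watchdog_Task receives a table and a handler instead.

```ada
with Ada.Text_IO;

procedure Entry_Demo is

   --  Hypothetical task type, configured through an entry
   --  in the same way as Watchdog_Task
   task type Ticker is
      entry Init (Period : Duration; Count : Positive);
   end Ticker;

   task body Ticker is
      My_Period : Duration;
      My_Count  : Positive;
   begin
      --  Block here until someone calls Init, then copy the parameters
      accept Init (Period : Duration; Count : Positive) do
         My_Period := Period;
         My_Count  := Count;
      end Init;

      for I in 1 .. My_Count loop
         delay My_Period;
         Ada.Text_IO.Put_Line ("tick" & Integer'Image (I));
      end loop;
   end Ticker;

   --  Declaring the variable starts the task; it waits at the accept
   T : Ticker;
begin
   T.Init (Period => 0.1, Count => 3);
   --  The main procedure completes only after T has terminated
end Entry_Demo;
```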
OK, cool, and what about Table in the parameter list and the field Doa_Table in the definition of Watchdog_Core? Well, here is where most of the complexity is hidden. A Task_Table_Access is an access to a Task_Table, which in turn has the following definition:
--
-- The protected object Task_Table is the core data structure.
-- It keeps track of which tasks are still alive and which ones did not
-- confirm that they are alive.
--
protected type Task_Table is
   -- Register that the task associated with the connection
   -- just went by the checkpoint
   procedure Mark_Alive (Connection : Connection_ID;
                         Checkpoint : Checkpoint_Type);

   -- Get the set of tasks that did not declare themselves alive
   procedure Get_Dead_Tasks (Set : out Connection_To_Checkpoint_Tables.Map);

   -- Reset the state, setting all tasks as "to be confirmed alive"
   procedure Reset;

   -- Delete a task
   procedure Delete (Connection : Connection_ID);

   -- Allocate a new connection ID to a task
   procedure Get_New_Id (Connection : out Connection_ID;
                         Name       : String;
                         ID         : Task_Identification.Task_Id);

   function ID_Of (Connection : Connection_ID)
                   return Task_Identification.Task_Id;

   function Name_Of (Connection : Connection_ID)
                     return String;
private
   --
   -- It works in this way: we keep two sets of "tasks": Alive (the
   -- tasks that declared to be alive) and Dead (the tasks that still
   -- have to declare to be alive). At timeout we read the Dead list
   -- and raise a warning for the tasks in the list; then we copy
   -- (with a Reset) Alive to Dead, restarting the iteration
   --
   Alive   : Connection_To_Checkpoint_Tables.Map;
   Dead    : Connection_To_Checkpoint_Tables.Map;
   Next_Id : Connection_ID := Connection_ID'First;

   -- Keep name and ID of the tasks associated with a connection
   Connection_Table : Connection_To_Task_Tables.Map;
end Task_Table;
A Task_Table is the object that stores the state of the monitored tasks: whether they are dead or alive, and their identification. It is a protected type, which means that it is accessed according to a reader/writer model: many tasks can call its functions (which only read) at the same time, but its procedures (which can write) have exclusive access. The compiler will take care of inserting the required synchronization code.
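To make the Alive/Dead bookkeeping more concrete, here is a stripped-down sketch of the idea, not the actual implementation. It is a single protected object with only the two maps; the names Table, Register, Get_Dead and Checkpoint_Maps are my assumptions for the example, and the handling of task names and IDs is left out.

```ada
with Ada.Text_IO;
with Ada.Containers.Ordered_Maps;

procedure Table_Demo is
   type Connection_ID is new Positive;
   type Checkpoint_Type is mod 2 ** 16;

   package Checkpoint_Maps is new Ada.Containers.Ordered_Maps
     (Key_Type => Connection_ID, Element_Type => Checkpoint_Type);

   --  Simplified stand-in for Task_Table (hypothetical operations)
   protected Table is
      procedure Register (Connection : Connection_ID);
      procedure Mark_Alive (Connection : Connection_ID;
                            Checkpoint : Checkpoint_Type);
      procedure Get_Dead (Set : out Checkpoint_Maps.Map);
      procedure Reset;
   private
      Alive : Checkpoint_Maps.Map;
      Dead  : Checkpoint_Maps.Map;
   end Table;

   protected body Table is
      procedure Register (Connection : Connection_ID) is
      begin
         --  A new connection still has to prove that it is alive
         Dead.Include (Connection, 0);
      end Register;

      procedure Mark_Alive (Connection : Connection_ID;
                            Checkpoint : Checkpoint_Type) is
      begin
         Dead.Exclude (Connection);
         Alive.Include (Connection, Checkpoint);
      end Mark_Alive;

      procedure Get_Dead (Set : out Checkpoint_Maps.Map) is
      begin
         Set := Dead;
      end Get_Dead;

      procedure Reset is
      begin
         --  Everybody goes back to "to be confirmed alive", keeping
         --  the latest checkpoint seen
         for Pos in Alive.Iterate loop
            Dead.Include (Checkpoint_Maps.Key (Pos),
                          Checkpoint_Maps.Element (Pos));
         end loop;
         Alive.Clear;
      end Reset;
   end Table;

   Late : Checkpoint_Maps.Map;
begin
   Table.Register (1);
   Table.Register (2);
   Table.Mark_Alive (1, Checkpoint => 7);  --  only connection 1 reports

   Table.Get_Dead (Late);
   for Pos in Late.Iterate loop
      Ada.Text_IO.Put_Line
        ("unresponsive connection:"
         & Connection_ID'Image (Checkpoint_Maps.Key (Pos)));
   end loop;

   Table.Reset;  --  start a new round
end Table_Demo;
```

Here only connection 2 ends up in the Late set. All the operations of this sketch are procedures, so every call gets exclusive access to the two maps; read-only queries like ID_Of and Name_Of can be functions precisely because they do not modify the state.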
This object is manipulated mainly by the watcher task, whose body is:
task body Watchdog_Task is
   Task_Table      : Task_Table_Access;
   Alarm_Handler   : Alarm_Handlers.Alarm_Handler_Access;
   Sampling_Period : Duration;
   Dead_Tasks      : Connection_To_Checkpoint_Tables.Map;

   use Connection_To_Checkpoint_Tables;
begin
   -- Accept calls to the Init entry
   accept Init (Sampling : Duration;
                Table    : Task_Table_Access;
                Handler  : Alarm_Handlers.Alarm_Handler_Access)
   do
      Sampling_Period := Sampling;
      Task_Table := Table;
      Alarm_Handler := Handler;
   end Init;

   loop
      delay Sampling_Period; -- get some sleep

      --
      -- Extract from the task table the tasks that did not
      -- claim to be alive
      --
      Task_Table.Get_Dead_Tasks (Dead_Tasks);

      --
      -- Iterate over the list of dead tasks
      --
      for Pos in Dead_Tasks.Iterate loop
         declare
            Connection : constant Connection_ID := Key (Pos);
            Checkpoint : constant Checkpoint_Type := Element (Pos);
         begin
            --
            -- Call the alarm handler with the task data
            --
            Alarm_Handler.Task_Unresponsive
              (ID         => Task_Table.Id_Of (Connection),
               Name       => Task_Table.Name_Of (Connection),
               Checkpoint => Checkpoint);

            --
            -- Remove the task from the table
            --
            Task_Table.Delete (Connection);
         end;
      end loop;

      Dead_Tasks.Clear;

      --
      -- All the tasks that were declared alive get marked as
      -- dead. Let them prove that they are alive! :-)
      --
      Task_Table.Reset;
   end loop;
end Watchdog_Task;
Conclusion
As I said, I wrote this for the fun of it and, indeed, fun it was (Yoda-style). I hope you found this interesting.