On Thu, 21 Jul 2005, gabriele renzi wrote:

> Ara.T.Howard wrote:
>> plus dirwatch is really designed to setup a processing system
>> which runs external programs on files as they arrive in directories vs.
>> running a ruby block or some such.
>
> I don't know the internals nor the api for dirwatch, but could you explain
> where the difference would be ?

well, dirwatch is an application vs. an api.  so you don't have something
like

   open('directory').on('created') do |file|
     puts "#{ file } created"
   end

or however you might imagine an api for watching directory events...


with dirwatch, which is a command line tool, you'd do something like this to
setup a watch

   ~ > dirwatch some_directory create

this initializes an sqlite database, config files, log files, generates sample
scripts, etc.  all this will end up in ./some_directory/.dirwatch/.  example:

   jib:~ > mkdir some_directory

   jib:~ > dirwatch some_directory/ create
   ---
   /home/ahoward/some_directory:
     dirwatch_dir : /home/ahoward/some_directory/.dirwatch
     db           : /home/ahoward/some_directory/.dirwatch/db
     logs_dir     : /home/ahoward/some_directory/.dirwatch/logs
     config       : /home/ahoward/some_directory/.dirwatch/dirwatch.conf
     commands_dir : /home/ahoward/some_directory/.dirwatch/commands


if we peeked in dirwatch.conf we'd see something like
   ...
   ...
   ...
     actions:
       updated :
         -
           command: simple.sh
           type: simple
           pattern: ^.*$
           timing: sync
         -
           command: yaml.rb
           type: simple
           pattern: ^.*$
           timing: sync
   ...
   ...
   ...

(did i mention i love yaml? ;-) )

the 'actions' section is where you setup what to do on certain events.  the
possible events are 'created', 'modified', 'deleted', or 'existing' (all of
which are pretty obvious) and the action 'updated' which is the union of
'created' and 'modified'.  so this config is saying that, whenever a file is
updated, we'll run two commands: 'simple.sh' and 'yaml.rb'.  note that a list of
commands can be specified - they will be run in that order.  each command is
configured with a few parameters

   command:

     the command to run.  the .dirwatch/commands/ directory (commands_dir) is
     pre-pended to PATH when running commands so it's convenient to put them
     there.  the example/auto-generated commands are in that directory.

   type:

     this is the calling convention.  for example simple commands are called
     like

       simple.sh file_that_was_updated mtime_of_that_file

     and are called once for each file.  yaml commands are called like

       yaml.rb < (list of __every__ updated file and its mtime on stdin in yaml format)

     there are two other types but essentially you just have a choice - your
     script is run once for every file or it gets all the files at once on
     stdin.

   pattern:

     only files matching this regex will get passed to this command.  dirwatch
     itself has a --pattern option which causes it to see only files matching
     that pattern but that affects everything.  this is on a per command basis.
     so you might see

       updated :
         -
           command: gif2png
           type: simple
           pattern: ^.*\.gif$
           timing: sync
         -
           command: png2ps
           type: simple
           pattern: ^.*\.png$
           timing: sync

   timing:

     whether we wait for each command to finish or just spawn it in the
     background and collect exit_status later.  the latter, async, is extremely
     dangerous on systems that could update 1,000,000 files at once.
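for what it's worth, a 'yaml' type command can be tiny.  the sketch below
assumes the stdin format is what --help describes (an array of hashes with the
keys 'path' and 'mtime').  the method name is made up for illustration and this
is not the auto-generated yaml.rb.

```ruby
require 'yaml'
require 'date'

# hypothetical helper, not part of dirwatch: turn the yaml list a 'yaml' type
# command is described as receiving on stdin into [path, mtime] pairs.
def updated_entries(io)
  # assumption: the document is an array of hashes keyed by 'path' and 'mtime'.
  # Time and Date are permitted in case the mtimes are emitted unquoted.
  entries = YAML.safe_load(io.read, permitted_classes: [Time, Date]) || []
  entries.map { |entry| [entry['path'], entry['mtime']] }
end
```

a real command would call updated_entries($stdin) and loop over the pairs.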



next you'd simply start dirwatch using

   jib:~ > dirwatch some_directory/ watch
   I, [2005-07-21T09:04:48.668571 #27750]  INFO -- : ** STARTED **
   I, [2005-07-21T09:04:48.669050 #27750]  INFO -- : config </home/ahoward/some_directory/.dirwatch/dirwatch.conf>
   I, [2005-07-21T09:04:48.669252 #27750]  INFO -- : flat <false>
   I, [2005-07-21T09:04:48.669324 #27750]  INFO -- : files_only <false>
   I, [2005-07-21T09:04:48.682278 #27750]  INFO -- : no_follow <false>
   I, [2005-07-21T09:04:48.682358 #27750]  INFO -- : pattern <>
   I, [2005-07-21T09:04:48.682461 #27750]  INFO -- : n_loops <>
   I, [2005-07-21T09:04:48.682629 #27750]  INFO -- : interval <00:05:00>
   I, [2005-07-21T09:04:48.683028 #27750]  INFO -- : lockfile </home/ahoward/some_directory/.dirwatch.lock>
   I, [2005-07-21T09:04:48.683147 #27750]  INFO -- : tmpwatch[all] <false>
   I, [2005-07-21T09:04:48.683213 #27750]  INFO -- : tmpwatch[nodirs] <false>
   I, [2005-07-21T09:04:48.683278 #27750]  INFO -- : tmpwatch[force] <true>
   I, [2005-07-21T09:04:48.683454 #27750]  INFO -- : tmpwatch[age] <30 days> == <2592000.0s>
   I, [2005-07-21T09:04:48.683530 #27750]  INFO -- : tmpwatch[rm] <rm_rf>
   ...
   ...
   ...

now, if i dropped a file into some_directory/ in another terminal:

   jib:~/some_directory > touch a

i'd see this in the terminal running dirwatch

   I, [2005-07-21T09:06:13.721967 #27839]  INFO -- : ACTION.UPDATED.0.0 - cmd : simple.sh '/home/ahoward/some_directory/a' '2005-07-21 15:05:38.000000'
   I, [2005-07-21T09:06:13.795296 #27839]  INFO -- : ACTION.UPDATED.0.0 - exit_status : 0

the 'ACTION.UPDATED.0.0' is a unique tag that makes finding the exit_status easy
in the event that the command was run 'async' and its exit_status ends up in
the log 4000 lines later...
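to show why the tag is handy, here's a rough sketch that pairs each command
with its exit_status by tag.  it assumes log lines shaped like the two quoted
above; the constant and method names are made up and nothing here ships with
dirwatch.

```ruby
# hypothetical helper: group dirwatch log lines by their action tag so a
# command line can be matched with the exit_status logged for it later.
# assumption: lines look like "... -- : TAG - cmd : ..." and
# "... -- : TAG - exit_status : N".
TAGGED_LINE = /-- : (\S+) - (cmd|exit_status) : (.*)\z/

def collect_by_tag(lines)
  lines.each_with_object({}) do |line, by_tag|
    match = TAGGED_LINE.match(line.chomp)
    next unless match # skip lines that carry no action tag

    tag, field, value = match.captures
    (by_tag[tag] ||= {})[field] = value
  end
end
```

feeding it File.foreach(logfile) would give a hash keyed by tag, each entry
holding the 'cmd' and (once logged) the 'exit_status'.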


when running from the console like this the stdout of the command run shows
too, so i also saw this - the output of running simple.sh - in the terminal
running dirwatch:

   dirwatch_dir: </home/ahoward/some_directory>
   dirwatch_action: <updated>
   dirwatch_type: <simple>
   dirwatch_n_paths: <1>
   dirwatch_path_idx: <0>
   dirwatch_path: </home/ahoward/some_directory/a>
   dirwatch_mtime: <2005-07-21 15:05:38.000000>
   dirwatch_pid: <27839>
   dirwatch_id: <ACTION.UPDATED.0.0>
   command_line: </home/ahoward/some_directory/a 2005-07-21 15:05:38.000000>
   path: </home/ahoward/some_directory/a>
   mtime: <2005-07-21 15:05:38.000000>


simple.sh basically just prints its environment and the argv it was called
with, here's the whole script:

   jib:~/some_directory > cat .dirwatch/commands/simple.sh
   #!/bin/sh
   echo "dirwatch_dir: <$DIRWATCH_DIR>"
   echo "dirwatch_action: <$DIRWATCH_ACTION>"
   echo "dirwatch_type: <$DIRWATCH_TYPE>"
   echo "dirwatch_n_paths: <$DIRWATCH_N_PATHS>"
   echo "dirwatch_path_idx: <$DIRWATCH_PATH_IDX>"
   echo "dirwatch_path: <$DIRWATCH_PATH>"
   echo "dirwatch_mtime: <$DIRWATCH_MTIME>"
   echo "dirwatch_pid: <$DIRWATCH_PID>"
   echo "dirwatch_id: <$DIRWATCH_ID>"
   echo "command_line: <$@>"
   path=$1
   mtime=$2
   echo "path: <$path>"
   echo "mtime: <$mtime>"

you'll notice quite a bit of information is passed via the environment and
that the mtime is also passed in on the command line.  typical programs won't
use all this - but it's there.  'dirwatch --help' explains the meaning of
these environment variables.
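if you'd rather write commands in ruby, the same thing might look like the
sketch below.  the method name is made up, and env and argv are passed in as
parameters only so it can be tried without dirwatch running.

```ruby
# hypothetical ruby take on simple.sh: report what a 'simple' type command was
# handed.  env is anything that answers [] (ENV or a plain hash) and argv is
# the argument list (path first, then mtime, as in the log line shown earlier).
def describe_call(env, argv)
  path, mtime = argv
  names = %w[DIRWATCH_DIR DIRWATCH_ACTION DIRWATCH_TYPE DIRWATCH_N_PATHS
             DIRWATCH_PATH_IDX DIRWATCH_PATH DIRWATCH_MTIME DIRWATCH_PID
             DIRWATCH_ID]
  lines = names.map { |name| "#{name.downcase}: <#{env[name]}>" }
  lines << "path: <#{path}>" << "mtime: <#{mtime}>"
end
```

in an actual command script you'd finish with puts describe_call(ENV, ARGV).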


so, normally you don't run like that (from the console) and instead have
something like this in your crontab to maintain an 'immortal' daemon

   */15 * * * * dirwatch /home/ahoward/some_directory watch --daemon

this does NOT start a daemon every fifteen minutes.  the daemon always sets up
a lockfile and refuses to start if one is already running.  so, this just
makes sure exactly one daemon is running at all times - even after machine
reboots or if some bug causes dirwatch to crash.  this may seem a bit odd but
those of you that don't have root on all your boxes in the office will
understand why it works like that - you can setup robust daemons without
any special privileges.  of course you can start it from init.d and it
supports 'start', 'stop', and 'restart' arguments too so this is trivial.
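the 'refuse to start if one is already running' idea can be sketched in a few
lines.  note this is NOT how dirwatch does it - dirwatch has its own lockfile
handling - the sketch swaps in File#flock just to show the principle, and the
method name is made up.

```ruby
# hypothetical sketch of "only one watcher at a time" using File#flock.
# returns the open lock file on success, or nil if another holder has the lock.
# the caller must keep the returned file open for as long as it wants the lock.
def acquire_lock(path)
  file = File.open(path, File::RDWR | File::CREAT, 0644)
  if file.flock(File::LOCK_EX | File::LOCK_NB)
    file
  else
    # someone else holds the lock: give up quietly instead of blocking
    file.close
    nil
  end
end
```

a cron-started process would simply exit when acquire_lock returns nil, which
is what makes the every-fifteen-minutes entry harmless.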

so that's it basically.  dirwatch simply scans a directory, compares what it
finds to what's in its database (sqlite), runs the appropriate actions in the
way you've configured it to, and then sleeps for a while.  it never stops,
automatically rolls logs, and does some other stuff too.  there's a whole lot
of options like recursing into subdirectories, ignoring anything that's not a
file, a tmpwatch like facility built-in, etc.  but you can read about that
with --help.
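the scan-and-compare step boils down to something like the sketch below.  it
is only an illustration: plain hashes of path => mtime stand in for the sqlite
database, comparing mtimes is my assumption about how 'modified' is detected,
and the method name is made up.

```ruby
# hypothetical sketch: classify paths by comparing the stored inventory with a
# fresh scan.  both arguments are hashes of path => mtime.
def classify(stored, scanned)
  actions = { 'created' => [], 'modified' => [], 'existing' => [] }
  scanned.each do |path, mtime|
    if !stored.key?(path)
      actions['created'] << path   # not in the database yet
    elsif stored[path] != mtime
      actions['modified'] << path  # assumption: a changed mtime means modified
    else
      actions['existing'] << path  # still there and untouched
    end
  end
  actions['deleted'] = stored.keys - scanned.keys
  # 'updated' is the union of 'created' and 'modified'
  actions['updated'] = actions['created'] + actions['modified']
  actions
end
```

each list would then be handed to whatever commands are configured for that
action.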

cheers.

btw.  i inlined the output of --help below.  note that i just did a massive
re-write so some of this is a little off, but it's close.


-a
-- 
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| My religion is very simple.  My religion is kindness.
| --Tenzin Gyatso
===============================================================================

NAME
   dirwatch v0.9.0

SYNOPSIS
   dirwatch [ options ]+ mode [ directory = ./ ]

DESCRIPTION
   dirwatch is a tool used to rapidly build processing systems from file system
   events.

   dirwatch manages an sqlite database that mirrors the state of a directory and
   then triggers user definable event handlers for certain filesystem activities
   such as file creation, modification, deletion, etc.  dirwatch can also implement
   a tmpwatch like behaviour to ensure files of a certain age are removed from
   the directory being watched.  dirwatch normally runs as a daemon process by
   first synchronizing the database inventory with that of the directory and then
   firing appropriate triggers as they occur.

   -----------------------------------------------------------------------------
   the following actions may have triggers configured for them
   -----------------------------------------------------------------------------

   created  -> a file was detected that was not already in the database
   modified -> a file in the database was detected as being modified
   updated  -> a file was created or modified (union of these two actions)
   deleted  -> a file in the database is no longer in the directory
   existing -> a file in the database still exists in the directory and has not
               been modified

   -----------------------------------------------------------------------------
   the command line 'mode' must be one of the following
   -----------------------------------------------------------------------------

   create   (c) -> initialize the database and supporting files
   watch    (w) -> monitor directory and trigger actions in the foreground
   start    (S) -> spawn a daemon watcher in the background
   restart  (R) -> (re)spawn a daemon watcher in the background
   stop     (H) -> stop/halt any currently running watcher
   status   (T) -> determine if any watcher is currently running
   truncate (D) -> truncate/delete all entries from the database
   archive  (a) -> create a tar.gz archive of a watch's directory contents
   list     (l) -> dump database to stdout in silky smooth yaml format

   for all modes the command line argument must be the name of the directory to
   which to apply the operation - which defaults to the current directory.

   -----------------------------------------------------------------------------
   mode: create (c)
   -----------------------------------------------------------------------------

   initializes a storage directory with all required database files, logs,
   command directories, sample configuration, sample programs, etc.

   by default the storage dir will be a subdirectory of the directory specified
   as the 'directory' command line argument, eg:

     directory/.dirwatch/

   the --dirwatch_dir option can be used to specify an alternate location.  this
   is particularly important to use if you, for instance, have an external
   program like tmpwatch running which might delete this directory!

   when a dirwatch storage directory is created a few files and directories are
   created underneath it.  the hierarchy is

     directory/.dirwatch/
                         commands/
                         logs/
                         db
                         dirwatch.conf
                         dirwatch.pid

   where

     commands/     -> any programs placed here will be automatically found as
                      this location is added to PATH
     logs/         -> logs are kept here and are auto-rolled so no scrubbing is needed
     db            -> this is an sqlite database file
     dirwatch.conf -> a yaml configuration file used to configure which commands
                      to trigger for which actions
     dirwatch.pid  -> a file containing the pid of the daemon process

   examples:

     0) initialize the directory incoming_data/ to be dirwatched using all
        defaults

       ~ > dirwatch create incoming_data/

     1) initialize the directory incoming_data/ to be dirwatched storing all
        metadata in /usr/local/dirwatch/incoming_data

       ~ > dirwatch create incoming_data/ --dirwatch_dir=/usr/local/dirwatch/incoming_data/

   -----------------------------------------------------------------------------
   mode: start (S)
   -----------------------------------------------------------------------------

   dirwatch is normally run in daemon mode.  the start mode is equivalent to
   running in 'watch' mode with the '--daemon' and '--quiet' flags.

   examples:

     ~ > dirwatch start incoming_data/

   -----------------------------------------------------------------------------
   mode: restart (R)
   -----------------------------------------------------------------------------

   'restart' mode checks a watcher's pidfile and either restarts the currently
   running watcher or starts a new one as in 'start' mode.  this is equivalent to
   sending SIGHUP to the watcher daemon process.

   examples:

     ~ > dirwatch restart incoming_data/

   -----------------------------------------------------------------------------
   mode: stop (H)
   -----------------------------------------------------------------------------

   'stop' mode checks for any process watching the specified directory and kills
   this process if it exists.  this is equivalent to sending TERM to the watcher
   daemon process.  the process will not exit immediately but will do so at the
   first possible safe opportunity.  do not kill -9 the daemon process.

   examples:

     ~ > dirwatch stop incoming_data/

   -----------------------------------------------------------------------------
   mode: status (T)
   -----------------------------------------------------------------------------

   'status' mode reports whether or not a watcher is running for the given
   directory.

   examples:

     ~ > dirwatch status incoming_data/

   -----------------------------------------------------------------------------
   mode: truncate (D)
   -----------------------------------------------------------------------------

   'truncate' (delete) mode atomically empties the database of all state.

   examples:

     ~ > dirwatch truncate incoming_data/

   -----------------------------------------------------------------------------
   mode: archive (a)
   -----------------------------------------------------------------------------

   archive mode is used to atomically create a tgz file of the storage
   directory for a given directory while respecting the locking subsystem.

   examples:

     ~ > dirwatch archive incoming_data/

   essentially this is useful for making hot backups.  your system must have the
   tar command for this to operate.

   -----------------------------------------------------------------------------
   mode: watch (w)
   -----------------------------------------------------------------------------

   this is the biggie.

   dirwatch is designed to run as a daemon, updating the database inventory at
   the interval specified by the '--interval' option (5 minutes by default) and
   firing appropriate trigger commands.  two watchers may not watch the same
   dir simultaneously and attempting to start a second watcher will fail when
   the second watcher is unable to obtain the pid lockfile.  it is a non-fatal
   error to attempt to start another watcher when one is running and this failure
   can be made silent by using the '--quiet' option.  the reason for this is to
   allow a crontab entry to be used to make the daemon 'immortal'.  for example,
   the following crontab entry

     */15 * * * * dirwatch directory --daemon --dbdir=0 \
                                     --files_only --flat \
                                     --interval=10minutes --quiet

   or (same but shorter)

     */15 * * * * dirwatch directory -D -d0 -f -F -i10m -q

   will __attempt__ to start a daemon watching 'directory' every fifteen minutes.
   if the daemon is not already running one will be started, otherwise dirwatch will
   simply fail silently (no cron email sent due to stderr).

   this feature allows a normal user to setup daemon processes that not only will
   run after machine reboot, but which will also come back after a crash or any
   other terminal program behaviour.

   the meaning of the options in the above crontab entry are as follows

     --daemon     -> become a child of init and run forever
     --dbdir      -> the storage directory, here the default is specified
     --files_only -> inventory files only (default is files and directories)
     --flat       -> do not recurse into subdirectories (default recurses)
     --interval   -> generate inventory, at minimum, every 10 minutes
     --quiet      -> be quiet when failing due to another daemon already watching

   as the watcher runs and maintains the inventory it is noted when
   files/directories (entries) have been created, modified, updated, deleted, or
   are existing.  these entries are then handled by user definable triggers as
   specified in the config file.  the config file is of the format

     ...
     actions :
       created :
         commands :
           ...
       updated :
         commands :
           ...
       ...
     ...

   where the commands to be run for each trigger type are enumerated.  each
   command entry is of the following format:
         ...
         -
           command : command to run
           type    : calling convention
           pattern : filter files further by this pattern
           timing  : synchronous or asynchronous execution
         ...

   the meaning of each field is as follows:

     command: this is the program to run.  the search path for the program is
              determined dynamically by the action run.  for instance, when a
              file is discovered to be 'modified' the search path for the
              command will be

                dbdir/commands/modified/ + dbdir/commands/ + $PATH

              this dynamic path setting simply allows for short pathnames if
              commands are stored in the dbdir/commands/* subdirectories.

     type:    there are four types of commands.  the type merely indicates the
              calling convention of the program.  when commands are run there
              are two pieces of information which must be passed to the
              program, the file in question and the mtime of that file.  the
              mtime is less important but programs may use it to know if the file
              has been changed since they were spawned.  mtime will probably be
              ignored for most commands.  the four types of commands fall into
              two categories: those commands called once for each file and those
              types of commands called once with __all__ files

              each file:

                simple:  the command will be called with three arguments: the file
                         in question, the mtime date, and the mtime time. eg:

                           command foobar.txt 2002-11-04 01:01:01.1234

                expanded: the command will have the strings '@file' and
                          '@mtime' replaced with appropriate values. eg:

                            command '@file' '@mtime'

                          expands to (and is called as)

                            command 'foobar.txt' '2002-11-04 01:01:01.1234'

              all at once:

                filter:  the stdin of the program will be given a list where each
                         line contains three items, the file, the mtime date, and
                         the mtime time.

                yaml:    the stdin of the program will be given a list where each
                         entry contains two items, the file and the mtime.  the
                         format of the list is valid yaml and the schema is an
                         array of hashes with the keys 'path' and 'mtime'.

     pattern: all the files for a given action are filtered by this pattern,
              and only those files matching pattern will have triggers fired.


     timing:  if timing is asynchronous the command will be run and not waited
              for before starting the next command.  asynchronous commands may
              yield better performance but may also result in many commands
              being run at once.  asynchronous commands should not load the
              system heavily unless one is looking to freeze a machine.
              synchronous commands are spawned and waited for before the next
              command is started.  a side effect of synchronous commands is
              that the time spent waiting may sum to an amount of time greater
              than the interval ('--interval' option) specified - if the amount
              of time running commands exceeds the interval the next inventory
              simply begins immediately with no pause.  because of this one
              should think of the interval used as a minimum bound only,
              especially when synchronous commands are used.
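
the 'minimum bound' behaviour can be pictured with the little sketch below.
it is only my reading of the paragraph above, not code from dirwatch, and the
method name is made up: the watcher sleeps for whatever part of the interval
the commands did not already use up.

```ruby
# hypothetical sketch of "interval is a minimum bound": given how long the
# inventory plus commands took, how long should the watcher pause before the
# next inventory?  never negative - an overrun means start again right away.
def pause_after(elapsed_seconds, interval_seconds)
  remaining = interval_seconds - elapsed_seconds
  remaining > 0 ? remaining : 0
end
```

so with the default five minute interval a loop that took one minute would be
followed by a four minute pause, and one that took six minutes by none.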


   note that sample commands of each type are auto-generated in the
   dbdir/commands directory.  reading these should answer any questions regarding
   the calling conventions of any of the four types.  for other questions see
   the sample config, which is also auto-generated.
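
as a rough illustration of the 'expanded' convention, the substitution could
be done as in the sketch below.  the method name is made up and this is not
taken from the dirwatch source, it just reproduces the '@file'/'@mtime' example
given above.

```ruby
# hypothetical sketch of the 'expanded' calling convention: replace the
# '@file' and '@mtime' placeholders in a command template with real values.
# the hash form of gsub inserts the values literally, so backslashes in a
# file name are not treated as replacement escapes.
def expand_command(template, file, mtime)
  template.gsub(/@file|@mtime/, '@file' => file, '@mtime' => mtime)
end
```

for instance expand_command("command '@file' '@mtime'", 'foobar.txt',
'2002-11-04 01:01:01.1234') gives the expanded command line from the example.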


   -----------------------------------------------------------------------------
   mode: list (l)
   -----------------------------------------------------------------------------

   dump the contents of the database in yaml format for easy viewing/parsing


ENVIRONMENT

   for dirwatch itself:

     export SLDB_DEBUG=1     -> cause sldb library actions (sql) to be logged
     export LOCKFILE_DEBUG=1 -> cause lockfile library actions to be logged

   for programs run by dirwatch the following environment variables will be set:

     DIRWATCH_DIR      -> the directory being watched
     DIRWATCH_ACTION   -> action type, one of 'instance', 'created', 'modified',
                          'updated', 'deleted', or 'existing'
     DIRWATCH_TYPE     -> command type, one of 'simple', 'expanded', 'filter', or
                          'yaml'
     DIRWATCH_N_PATHS  -> the total number of paths for this action.  the paths
                          themselves will be passed to the program in a different
                          way depending on DIRWATCH_TYPE, for instance on the
                          command line or on stdin, but this number will always
                          be the total number of paths the program should expect.
     DIRWATCH_PATH_IDX -> for some command types, like 'simple', the program will
                          be run more than once to handle all paths since calling
                          convention only allows the program to be called with
                          one path at a time.  this number is the index of the
                          current path in such cases.  for instance, a 'simple'
                          program may only be called with one path at a time so
                          if 10 files were created in the directory that would
                          result in the program being called 10 times.  in each
                          case DIRWATCH_N_PATHS would be 10 and DIRWATCH_PATH_IDX
                          would range from 0 to 9 for each of the 10 calls to the
                          program.  in the case of 'filter' and 'yaml' command
                          types, where every path is given at once on stdin this
                          value will be equal to DIRWATCH_N_PATHS
     DIRWATCH_PATH     -> for 'simple' and 'expanded' command types, which are
                          called once for each path, this will contain the path
                          the program is being called with.  in the case of
                          'filter' or 'yaml' command types the variable contains
                          the string 'stdin' implying that all paths are
                          available on stdin.
     DIRWATCH_MTIME    -> for 'simple' and 'expanded' command types, which are
                          called once for each path, this will contain the mtime
                          the program is being called with.  in the case of
                          'filter' or 'yaml' command types the variable contains
                          the string 'stdin' implying that all mtimes are
                          available on stdin.
     DIRWATCH_PID      -> the pid of dirwatch watcher process
     DIRWATCH_ID       -> an identifier for this action that will be unique for
                          any given run of a dirwatch watcher process.
                          restarting the watcher resets the generator.  this
                          identifier is logged in the dirwatch watcher logs so it is
                          useful to match program logs with dirwatch logs
     PATH              -> the normal shell path.  for each program run the PATH
                          is modified to contain the commands dir of the dirwatch
                          watcher process.  normally this is
                          $DIRWATCH_DIR/.dirwatch/commands/


FILES
   directory/.dirwatch/              -> dirwatch data files
   directory/.dirwatch/dirwatch.conf -> default configuration file
   directory/.dirwatch/commands/     -> default location for triggers
   directory/.dirwatch/db            -> sldb/sqlite database
   directory/.dirwatch/dirwatch.pid  -> default pidfile
   directory/.dirwatch/logs/         -> automatically rolled log files

DIAGNOSTICS
   success -> $? == 0
   failure -> $? != 0


AUTHOR
   ara.t.howard / noaa.gov


BUGS
   1 < bugno && bugno < 42

OPTIONS
   --help, -h
         this message
   --log=path, -l
         set log file - (default stderr)
   --verbosity=verbosity, -v
         0|fatal < 1|error < 2|warn < 3|info < 4|debug - (default info)
   --config=path
         valid path - specify config file (default nil)
   --template=[path]
         valid path - generate a template config file in path (default stdout)
   --dirwatch_dir=dirwatch_dir
         specify dirwatch storage dir
   --daemon, -d
         specify daemon mode
   --quiet, -q
         be wery wery quiet
   --flat, -F
         do not recurse into subdirectories
   --files_only, -f
         consider only files
   --no_follow, -n
         do not follow links
   --pattern=pattern, -p
         consider only entries that match pattern
   --n_loops=n_loops, -N
         loop only this many times before exiting
   --interval=interval, -i
         sleep at least this long between loops
   --lockfile=[lockfile], -k
         specify a lockfile path
   --show_input, -s
         show input to all commands run