Following discussion with slurm-llnl's maintainer, here's a testing setup.

Create 3 VMs:
 * one with {{{slurmd}}} (work node, 2 CPUs)
 * one with {{{slurmctld}}}
 * one with {{{slurmdbd}}}

The hostnames are the services they run (populate {{{/etc/hostname}}} and {{{/etc/hosts}}} accordingly). {{{slurm.conf}}} and {{{slurmdbd.conf}}} are below.
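A minimal {{{/etc/hosts}}} shared by the three VMs could look like this (the IP addresses below are placeholders; use whatever your VM network actually assigns):
{{{
192.168.122.10  slurmd
192.168.122.11  slurmctld
192.168.122.12  slurmdbd
}}}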
They all share the same {{{/etc/munge/munge.key}}} file. Make sure munged is running everywhere ({{{update-rc.d munge enable}}}).

{{{/etc/slurm-llnl/slurm.conf}}}:
{{{
ControlMachine=slurmctld
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageEnforce=association
AccountingStorageHost=slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/linux
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
NodeName=slurmd CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=slurmd Default=YES MaxTime=INFINITE State=UP
}}}

{{{/etc/slurm-llnl/slurmdbd.conf}}}:
{{{
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=3
StorageHost=localhost
StorageLoc=slurm
StoragePass=shazaam
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm
ArchiveDir=/var/log/slurm-llnl/
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=1hour
PurgeJobAfter=1hour
PurgeResvAfter=1hour
PurgeStepAfter=1hour
PurgeSuspendAfter=1hour
}}}

On {{{slurmdbd}}} create a MySQL database called {{{slurm}}} with write permission for user {{{slurm}}} with password {{{shazaam}}}:
{{{
CREATE DATABASE slurm;
GRANT ALL PRIVILEGES ON slurm.* TO 'slurm' IDENTIFIED BY 'shazaam';
}}}

With {{{sacctmgr}}} (package {{{slurm-client}}}) add a cluster, an account and a user:
{{{
sacctmgr -i add cluster cluster
sacctmgr -i add account oliva Cluster=cluster
sacctmgr -i add user oliva Account=oliva
}}}

Then run a couple of jobs as user {{{oliva}}} with {{{srun}}} or {{{sbatch}}}: you can see them in the cluster history with {{{sacct}}}.
{{{
# nodes status
slurmctld# sinfo
# send job
slurmctld# srun -l /bin/hostname
# list jobs
slurmctld# sacct
# reset node (e.g. stuck in 'alloc' state)
slurmctld# scontrol update NodeName=slurmd State=down reason=x
slurmctld# scontrol update NodeName=slurmd State=resume
}}}

Given the settings in the {{{slurmdbd.conf}}} above, job information is purged at the beginning of the hour after the job has run and is stored in two files named:
{{{
cluster_job_archive_2019-12-09T01:00:00_2019-12-09T01:59:59
cluster_step_archive_2019-12-09T01:00:00_2019-12-09T01:59:59
}}}
(with the current date) under {{{/var/log/slurm-llnl/}}}.

CVE-2019-12838 note: to reproduce, try to reload the files with the command:
{{{
sacctmgr archive load file=/var/log/slurm-llnl/...
}}}

See also https://slurm.schedmd.com/quickstart.html and https://slurm.schedmd.com/troubleshoot.html
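After reloading an archive, the restored job records should show up again in the accounting output. A quick check could look like this (the start date is only illustrative, adjust it to when the test jobs ran):
{{{
# list the archive files produced by the hourly purge
slurmctld# ls /var/log/slurm-llnl/cluster_*_archive_*
# show accounting records for all users since that date
slurmctld# sacct --allusers --starttime=2019-12-09
}}}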