[mephi-hpc] (no subject)

anikeev anikeev at ut.mephi.ru
Thu Mar 9 14:01:57 MSK 2017


On Thu, 2017-03-09 at 10:05 +0000, Курельчук Ульяна Николаевна wrote:
> Hello! I submit jobs, but they do not show up in the queue and never
> start running. 20339, 20340.
>

Good afternoon!

These jobs have already finished, with an error:

anikeev at master.cherenkov ~ $ tracejob 20339
...
03/09/2017 12:48:45  S    job exit status 1 handled
...
03/09/2017 12:48:45  S    dequeuing from long, state COMPLETE
...

anikeev at master.cherenkov ~ $ tracejob 20340
...
03/09/2017 12:48:50  S    job exit status 1 handled
...
03/09/2017 12:48:50  S    dequeuing from long, state COMPLETE
...
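
As a rough sketch, the exit status can be pulled out of tracejob output
with a small shell helper. The function name job_exit_status is
hypothetical, and the pattern assumes the line format shown above.

```shell
# Hypothetical helper: read tracejob-style text on stdin and print the
# last reported job exit status. Assumes lines of the form
# "... job exit status <N> handled", as in the output above.
job_exit_status() {
    sed -n 's/.*job exit status \(-\{0,1\}[0-9][0-9]*\) handled.*/\1/p' | tail -n 1
}

# Usage on the cluster (needs read access to the PBS logs):
#   tracejob 20339 | job_exit_status
```

If the function prints nothing, no exit status line was found in the
input, which usually means the job is still queued or running.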

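As a signal number, 1 would mean SIGHUP (hangup detected on controlling
terminal or death of controlling process). As a PBS exit status,
however, a value below 128 is normally just the exit code of the job
script, and here it matches the errorcode 1 passed to MPI_ABORT in the
log below. For a list of PBS exit codes, see: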

http://support.ersa.edu.au/hpc/pbs-exit-codes.html
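
To make the reading of such codes concrete, here is a minimal sketch of
how a PBS exit status is commonly interpreted. It assumes the usual
convention: a negative value is a scheduler-side failure, a value up to
128 is the exit code of the job script, and a value above 128 is 128
plus the number of the terminating signal. Some systems use 256 as the
offset instead, so treat the threshold as an assumption. The function
name describe_exit_status is hypothetical.

```shell
# Sketch: interpret a batch job exit status under the assumed
# "128 + signal number" convention described above.
describe_exit_status() {
    code=$1
    if [ "$code" -lt 0 ]; then
        echo "scheduler-side failure (status $code)"
    elif [ "$code" -gt 128 ]; then
        # kill -l <n> prints the name of signal number <n>
        echo "killed by signal $(kill -l $((code - 128)))"
    else
        echo "job script exited with code $code"
    fi
}

describe_exit_status 1
describe_exit_status 137
```

For the two jobs above the first branch that applies is the last one:
the script itself returned 1, so the cause has to be looked up in the
job's own error output.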

Detailed information about the errors is in the jobs' stderr files
(*.e<job id>):

master.cherenkov anikeev # locate *.e20339
/home/unk/qe/work/low/1fcc.sh.e20339
/opt/environments/debian8-i386/home/unk/qe/work/low/1fcc.sh.e20339
master.cherenkov anikeev # locate *.e20340
/home/unk/qe/work/low/1cube.sh.e20340
/opt/environments/debian8-i386/home/unk/qe/work/low/1cube.sh.e20340
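
Without root access or an up-to-date locate database, the same files
can be found with find. The sketch below assumes the default PBS naming
of the output files, <job name>.e<job id> for stderr and
<job name>.o<job id> for stdout; the function name find_job_logs is
hypothetical. The patterns are quoted so that the shell does not expand
them before find sees them.

```shell
# Hypothetical helper: list the stderr/stdout files of a job under a
# directory (default: current directory), assuming default PBS naming.
find_job_logs() {
    jobid=$1
    dir=${2:-.}
    find "$dir" -type f \( -name "*.e$jobid" -o -name "*.o$jobid" \)
}

# Usage:
#   find_job_logs 20340 "$HOME"
```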

master.cherenkov anikeev # cat /home/unk/qe/work/low/1cube.sh.e20340
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 21174 on
node n219 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
...
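
When a job produces many such files, it can help to see at a glance
which ranks called MPI_ABORT. The following is only a sketch: the
function name aborting_ranks is hypothetical, and the pattern assumes
the Open MPI message format shown above.

```shell
# Hypothetical helper: print the distinct ranks that invoked MPI_ABORT,
# as reported in an Open MPI stderr file given as the first argument.
aborting_ranks() {
    sed -n 's/^MPI_ABORT was invoked on rank \([0-9][0-9]*\) in communicator.*/\1/p' "$1" \
        | sort -n | uniq
}

# Usage:
#   aborting_ranks /home/unk/qe/work/low/1cube.sh.e20340
```

The rank only tells where the abort was called; the reason is normally
printed by the application itself, before the Open MPI banner or in its
own output file.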

Information on debugging distributed MPI applications can be found here:

https://www.open-mpi.org/faq/?category=debugging#serial-debuggers

-- 
Best regards,
postgraduate student, Department 4, MEPhI,
engineer, Unix Technologies Department,
Аникеев Артём.
Tel.: 8 (495) 788-56-99, ext. 8998

