[mephi-hpc] Mpi fork problem

Курельчук Ульяна Николаевна UNKurelchuk at mephi.ru
Mon Feb 6 17:34:03 MSK 2017


Hello! I am running calculations in Quantum ESPRESSO 6.0 and keep running into the following problem:

unk at master.cherenkov /home/cherenkov/unk/pool/1/qe/work $ sh rx.sh

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          master (PID 32167)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
#0  0x7FEFD1118407
#1  0x7FEFD1118A1E
#2  0x7FEFD04160DF
#3  0x7FEFD197F372
#4  0x7FEFD18C547F
#5  0x7FEF7B23B7B6
#6  0x7FEF7B243501
#7  0x7FEFD18D290A
#8  0x7FEFD1C1C722
#9  0x841ABE in fftw_import_wisdom
#10  0x636837
#11  0x54C0C5
#12  0x535B75
#13  0x536BA0
#14  0x40C2A2
#15  0x411DF4
#16  0x4CA217
#17  0x408F73
#18  0x408C0C
#19  0x7FEFD0402B44
#20  0x408C35
#21  0xFFFFFFFFFFFFFFFF
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 32167 on node master exited on signal 24 (CPU time limit exceeded).

Here is my script:
#!/bin/sh
#
#PBS -l nodes=16, walltime=24:00:00
mpirun -np 16 /usr/bin/pw.x < 100.in > 100.out  

(The -np is not a mistake: with Espresso I have to specify it, otherwise it runs on 1 process. Incidentally, the problem occurs in that case too.)

The program output includes a resource estimate:

Estimated max dynamical RAM per process >      10.58Mb

Estimated total allocated dynamical RAM >     169.33Mb

That does not seem like much.

I tried running with mpirun --mca mpi_warn_on_fork 0, but this is what I get:

unk at master.cherenkov /home/cherenkov/unk/pool/1/qe/work $ sh rx.sh

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:

Program received signal SIGXCPU: CPU time limit exceeded.

Backtrace for this error:
#0  0x7F055FC39407
#1  0x7F055FC39A1E
#2  0x7F055EF370DF
#3  0x7F050AC21000
#4  0x7F05604A0349
#5  0x7F05603E5E37
#6  0x7F0509D6303D
#7  0x7F0509D633F4
#8  0x7F0509D5AC02
#9  0x7F0509F73864
#10  0x7F05603F3CCC
#11  0x7F056073D7A3
#12  0x7F0563B220C7
#13  0x7F0563B24BEF
#14  0x7F0563B2E900
#15  0x84F1F0 in fftw_import_wisdom
#16  0x636928
#17  0x54C0C5
#18  0x535B75
#19  0x536BA0
#20  0x40C2A2
#21  0x411DF4
#22  0x4CA217
#23  0x408F73
#24  0x408C0C
#25  0x7F055EF23B44
#26  0x408C35
#27  0xFFFFFFFFFFFFFFFF
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 8519 on node master exited on signal 24 (CPU time limit exceeded).
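
Signal 24 (SIGXCPU) is sent when a process exceeds the soft CPU-time limit of its environment, and both logs show the processes running on node master. One check that might help narrow this down is sketched below. It is only an illustration, not something from the runs above: it assumes a shell that supports `ulimit -t` (bash and dash do) and that the batch system exports PBS_JOBID inside jobs, as PBS/Torque normally does. The function name in_pbs_job is made up for this example.

```shell
#!/bin/sh
# Hypothetical diagnostic sketch; not part of the original rx.sh.

# Soft CPU-time limit (in seconds, or "unlimited") of the current shell.
# Processes started from this shell inherit it and receive SIGXCPU
# once they have used that much CPU time.
cpu_limit=$(ulimit -t)
echo "soft CPU time limit: ${cpu_limit}"

# Assumption: PBS/Torque sets PBS_JOBID inside a batch job. If it is
# empty, the script was started directly (e.g. with "sh rx.sh").
in_pbs_job() {
    [ -n "${PBS_JOBID:-}" ]
}

if in_pbs_job; then
    echo "running inside PBS job ${PBS_JOBID}"
else
    echo "not running inside a PBS job"
fi
```

If the limit printed on master is a finite number while it is "unlimited" inside a job submitted with qsub, that would be consistent with the error above, though it would not prove it is the cause.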

Could you please advise what might be causing the problem? I have not found any bug reports for version 6.0 or reports of similar problems.

