[mephi-hpc] error in calculation

Богданович Ринат Бекирович RBBogdanovich at mephi.ru
Mon Dec 25 15:18:01 MSK 2017


Good afternoon. Errors have started to occur in my calculations (everything had been running normally for the past year, up until Thursday).
Could you please tell me whether this is a temporary problem?

Error 1.

MCU Step: state input
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 25327 on
node n121 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.


Error 2.


 MCU Step: state input
Warning: state input has already been finished. Restored.

 MCU Step: state calculation

  WARNINGS in initial data of MCU:           0
  ERRORS   in initial data of MCU:           0

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[54185,1],31]
  Exit code:    2



[n121][[54185,1],15][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[n108][[54185,1],95][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[n113][[54185,1],63][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
At line 17112 of file MCUmpi.F90 (unit = 20, file = '/mnt/pool/2/rynatb/MCUPTR_10/PIN-GAP_BASOV/c2m6_62.039--16-BASOV_PG.MCU_P31')
Fortran runtime error: Operation now in progress


Best regards,
Ринат

--
Ринат Богданович
Rynat Bahdanovich

Postgraduate student, assistant
National Research Nuclear University "MEPhI"
Department of Theoretical and Experimental Physics of Nuclear Reactors (№5)
Moscow, Russia, +7 (495) 788 56 99 (ext. 9364), +7 (925) 846 28 14
RBBogdanovich at mephi.ru

