[mephi-hpc] calculation error
anikeev
anikeev at ut.mephi.ru
Mon Dec 25 17:50:51 MSK 2017
On Mon, 2017-12-25 at 12:18 +0000, Богданович Ринат Бекирович wrote:
> Good afternoon. Errors have started occurring in my calculations (until
> Thursday everything ran normally, as it had throughout the past year).
Good evening!
> Could you tell me whether this is a temporary problem?
Could you tell me how I can reproduce the error without damaging your data?
> Error 1.
>
> MCU Step: state input
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 25327 on
> node n121 exiting improperly. There are three reasons this could
> occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
>
> You can avoid this message by specifying -quiet on the mpirun command line.
>
>
> Error 2.
>
>
> MCU Step: state input
> Warning: state input has already been finished. Restored.
>
> MCU Step: state calculation
>
> WARNINGS in initial data of MCU: 0
> ERRORS in initial data of MCU: 0
>
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
>
> Process name: [[54185,1],31]
> Exit code: 2
>
>
>
> [n121][[54185,1],15][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [n108][[54185,1],95][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [n113][[54185,1],63][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> At line 17112 of file MCUmpi.F90 (unit = 20, file = '/mnt/pool/2/rynatb/MCUPTR_10/PIN-GAP_BASOV/c2m6_62.039--16-BASOV_PG.MCU_P31')
> Fortran runtime error: Operation now in progress
>
>
> Best regards,
> Ринат
>
> --
> Ринат Богданович
> Rynat Bahdanovich
>
> Postgraduate student, assistant
> National Research Nuclear University "MEPhI"
> Department of Theoretical and Experimental Physics of Nuclear
> Reactors (№5)
> Moscow, Russia, +7 (495) 788 56 99 (ext. 9364), +7 (925) 846 28 14
> RBBogdanovich at mephi.ru
>
>
> _______________________________________________
> hpc mailing list
> hpc at lists.mephi.ru
> https://lists.mephi.ru/listinfo/hpc
--
Best regards,
Engineer, Unix Technologies Department, MEPhI,
Аникеев Артём.
Tel.: 8 (495) 788-56-99, ext. 8998