[mephi-hpc] error in calculation

anikeev anikeev at ut.mephi.ru
Mon Dec 25 17:50:51 MSK 2017


On Mon, 2017-12-25 at 12:18 +0000, Богданович Ринат Бекирович wrote:
> Good afternoon. My calculations have started failing with errors (until
> Thursday everything had been running normally for the past year).

Good evening!

> Could you tell me, please, whether this is a temporary issue?

Could you tell me how I can reproduce the error without damaging your data?

> Error 1.
>  
> MCU Step: state input
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 25327 on
> node n121 exiting improperly. There are three reasons this could occur:
>  
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>  
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>  
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
>  
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
>  
> You can avoid this message by specifying -quiet on the mpirun command
> line.
>  
>  
> Error 2.
>  
>  
>  MCU Step: state input
> Warning: state input has already been finished. Restored.
>  
>  MCU Step: state calculation
>  
>   WARNINGS in initial data of MCU:           0
>   ERRORS   in initial data of MCU:           0
>  
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>  
>   Process name: [[54185,1],31]
>   Exit code:    2
>  
>  
>  
> [n121][[54185,1],15][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [n108][[54185,1],95][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [n113][[54185,1],63][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> At line 17112 of file MCUmpi.F90 (unit = 20, file =
> '/mnt/pool/2/rynatb/MCUPTR_10/PIN-GAP_BASOV/c2m6_62.039--16-
> BASOV_PG.MCU_P31')
> Fortran runtime error: Operation now in progress
>  
>  
> Best regards,
> Ринат
> 
> -- 
> Ринат Богданович
> Rynat Bahdanovich 
>  
> Postgraduate student, assistant 
> National Research Nuclear University "MEPhI"
> Department of Theoretical and Experimental Physics of Nuclear
> Reactors (№5)
> Moscow, Russia, +7 (495) 788 56 99 (ext. 9364), +7 (925) 846 28 14
> RBBogdanovich at mephi.ru
>  
>  
> _______________________________________________
> hpc mailing list
> hpc at lists.mephi.ru
> https://lists.mephi.ru/listinfo/hpc
-- 
Best regards,
Engineer, Unix Technologies Department, MEPhI,
Аникеев Артём.
Tel.: 8 (495) 788-56-99, ext. 8998

