[mephi-hpc] error in calculation
Богданович Ринат Бекирович
RBBogdanovich at mephi.ru
Mon Dec 25 21:26:12 MSK 2017
You need to run run11.sh or run10.sh in the MCUPTR_11 or MCUPTR_10 folder.
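
One possible way to follow the step above without touching the original data is to run the script on a scratch copy of the folder. This is only a minimal sketch: the helper name `run_in_copy` and the log file name `reproduce.log` are hypothetical, and it assumes the script uses only paths relative to its own folder. The Fortran error quoted below shows an absolute path under /mnt/pool/2/rynatb/, so that assumption may not hold for these jobs and should be checked first.

```shell
#!/bin/sh
# Hypothetical helper: run a job script on a scratch copy of its folder,
# so that files in the original folder are not modified.
# Assumption: the script refers only to paths relative to its own folder.
run_in_copy() {
    src_dir=$1      # folder to copy, e.g. a path to MCUPTR_10
    script=$2       # script inside that folder, e.g. run10.sh
    work_dir=$(mktemp -d) || return 1
    # Copy the folder contents (including dotfiles) into the scratch folder.
    cp -R "$src_dir"/. "$work_dir"/ || return 1
    # The subshell keeps the caller's working directory unchanged;
    # all output of the script is collected in reproduce.log.
    ( cd "$work_dir" && sh "./$script" ) > "$work_dir/reproduce.log" 2>&1
    status=$?
    # Print the scratch folder so the caller can inspect the results.
    echo "$work_dir"
    return "$status"
}
```

For example, `copy=$(run_in_copy /path/to/MCUPTR_10 run10.sh)` would leave the results and the log in the folder named by `$copy`; the path here is a placeholder.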
-----Original Message-----
From: hpc [mailto:hpc-bounces at lists.mephi.ru] On Behalf Of anikeev
Sent: Monday, December 25, 2017 5:51 PM
To: NRNU MEPhI HPC discussion list <hpc at lists.mephi.ru>
Subject: Re: [mephi-hpc] error in calculation
On Mon, 2017-12-25 at 12:18 +0000, Богданович Ринат Бекирович wrote:
> Good afternoon. Errors have started to appear in my calculations (until
> Thursday everything had been running fine, for the past year).
Good evening!
> Could you tell me whether this is a temporary problem?
Could you tell me how I can reproduce the error without damaging your data?
> Error 1.
>
> MCU Step: state input
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 25327 on node n121
> exiting improperly. There are three reasons this could
> occur:
>
> 1. this process did not call "init" before exiting, but others in the
> job did. This can cause a job to hang indefinitely while it waits for
> all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> 3. this process called "MPI_Abort" or "orte_abort" and the mca
> parameter orte_create_session_dirs is set to false. In this case, the
> run-time cannot detect that the abort call was an abnormal
> termination. Hence, the only error message you will receive is this
> one.
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
>
> You can avoid this message by specifying -quiet on the mpirun command
> line.
>
>
> Error 2.
>
>
> MCU Step: state input
> Warning: state input has already been finished. Restored.
>
> MCU Step: state calculation
>
> WARNINGS in initial data of MCU: 0
> ERRORS in initial data of MCU: 0
>
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned a non-zero
> exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero
> status, thus causing the job to be terminated. The first process to do
> so was:
>
> Process name: [[54185,1],31]
> Exit code: 2
>
>
>
> [n121][[54185,1],15][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [n108][[54185,1],95][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [n113][[54185,1],63][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> At line 17112 of file MCUmpi.F90 (unit = 20, file =
> '/mnt/pool/2/rynatb/MCUPTR_10/PIN-GAP_BASOV/c2m6_62.039--16-BASOV_PG.MCU_P31')
> Fortran runtime error: Operation now in progress
>
>
> Best regards,
> Ринат
>
> --
> Ринат Богданович
> Rynat Bahdanovich
>
> Postgraduate student, assistant
> National Research Nuclear University "MEPhI"
> Department of Theoretical and Experimental Physics of Nuclear Reactors
> (№5)
> Moscow, Russia, +7 (495) 788 56 99 (ext. 9364), +7 (925) 846 28 14
> RBBogdanovich at mephi.ru
>
>
> _______________________________________________
> hpc mailing list
> hpc at lists.mephi.ru
> https://lists.mephi.ru/listinfo/hpc
--
Best regards,
engineer at the MEPhI Unix Technologies Department,
Аникеев Артём.
Tel.: 8 (495) 788-56-99, ext. 8998