[mephi-hpc] calculation error

Tue Dec 26 14:35:48 MSK 2017

On Mon, 2017-12-25 at 18:26 +0000, Богданович Ринат Бекирович wrote:

Good afternoon!

> You need to run run11.sh or run10.sh in the MCUPTR_11 or MCUPTR_10
> folder.

Thank you!

Unfortunately, the mcu5_mpi_ptr application does not work correctly even
on the head node, and even without MPI:

rynatb@master.basov /mnt/pool/2/rynatb/MCUPTR_11 $ ./mcu5_mpi_ptr

 MCU Step: state input
rynatb@master.basov /mnt/pool/2/rynatb/MCUPTR_11 $ valgrind --leak-check=full ./mcu5_mpi_ptr
...
==6013== LEAK SUMMARY:
==6013==    definitely lost: 606 bytes in 8 blocks
==6013==    indirectly lost: 3,031 bytes in 7 blocks
==6013==      possibly lost: 288 bytes in 1 blocks
==6013==    still reachable: 3,046,734 bytes in 15,692 blocks
==6013==         suppressed: 0 bytes in 0 blocks
==6013== Reachable blocks (those to which a pointer was found) are not shown.
==6013== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==6013== 
==6013== For counts of detected and suppressed errors, rerun with: -v
==6013== ERROR SUMMARY: 8 errors from 8 contexts (suppressed: 283 from 18)

At the very least, the application has memory leaks. I would recommend
building the application with the compiler options -ggdb3 -O0. If the
application can be configured to emit additional debug output, it is
worth doing that as well. After that, you can debug it with valgrind,
gdb, and strace.
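
When comparing Valgrind runs before and after a rebuild, it may help to pull the LEAK SUMMARY counters out of the log automatically. The following is only a minimal sketch: the function name parse_leak_summary and the SAMPLE constant are made up for illustration, and the regular expression assumes the line layout shown in the output above.

```python
import re

# Matches LEAK SUMMARY lines such as
# "==6013==    definitely lost: 606 bytes in 8 blocks".
# Assumption: the layout is the one shown in the Valgrind output above.
LEAK_LINE = re.compile(
    r"^==\d+==\s+"
    r"(?P<kind>definitely lost|indirectly lost|possibly lost|"
    r"still reachable|suppressed):\s+"
    r"(?P<bytes>[\d,]+) bytes in (?P<blocks>[\d,]+) blocks"
)


def parse_leak_summary(valgrind_output):
    """Return {kind: (bytes, blocks)} for each leak counter found."""
    summary = {}
    for line in valgrind_output.splitlines():
        match = LEAK_LINE.match(line)
        if match:
            size = int(match.group("bytes").replace(",", ""))
            blocks = int(match.group("blocks").replace(",", ""))
            summary[match.group("kind")] = (size, blocks)
    return summary


# Sample copied from the Valgrind output quoted in this message.
SAMPLE = """\
==6013== LEAK SUMMARY:
==6013==    definitely lost: 606 bytes in 8 blocks
==6013==    indirectly lost: 3,031 bytes in 7 blocks
==6013==      possibly lost: 288 bytes in 1 blocks
==6013==    still reachable: 3,046,734 bytes in 15,692 blocks
==6013==         suppressed: 0 bytes in 0 blocks
"""

if __name__ == "__main__":
    for kind, (size, blocks) in parse_leak_summary(SAMPLE).items():
        print(f"{kind}: {size} bytes in {blocks} blocks")
```

Running the same function on the log of a -ggdb3 -O0 build would show whether the "definitely lost" counter changes between builds.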

There is another approach. You can take the input files from an old
job that worked and carry over the changes from the new, failing job
one at a time. This will narrow down where the problem is.
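
The one-change-at-a-time procedure can be sketched roughly as follows. This is an illustration only: it assumes the job inputs can be represented as a dictionary of parameters, and the names first_breaking_change, baseline, changes and works are all hypothetical. In practice works() would write the input files, launch the job, and check its exit code.

```python
def first_breaking_change(baseline, changes, works):
    """Apply changes one at a time on top of a known-good baseline.

    baseline: dict of parameters from the old, working job (assumption:
              the inputs can be modelled as key/value pairs).
    changes:  ordered list of (key, new_value) pairs from the new job.
    works:    hypothetical callable that runs the job on a parameter
              dict and returns True if the run succeeds.
    Returns the first (key, new_value) after which the job fails,
    or None if every change can be applied without a failure.
    """
    current = dict(baseline)  # work on a copy, leave the baseline intact
    if not works(current):
        raise ValueError("baseline does not work; nothing to compare against")
    for key, value in changes:
        current[key] = value
        if not works(current):
            return key, value
    return None


if __name__ == "__main__":
    # Toy example: the made-up "job" fails when "histories" is not positive.
    old_job = {"histories": 1000, "batches": 50}
    new_job_changes = [("batches", 100), ("histories", 0), ("seed", 7)]
    culprit = first_breaking_change(
        old_job, new_job_changes, lambda p: p["histories"] > 0
    )
    print(culprit)
```

Changes are applied cumulatively, so the returned pair is the change that tipped the job into failure given everything applied before it.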

The most recent changes on the Basov cluster were made on December 14.
The following packages were updated:

app-portage/elt-patches-20170826.1
dev-python/pyblake2-1.1.0
sys-apps/portage-2.3.16
dev-libs/libxml2-2.9.6
dev-python/lxml-4.1.1
app-portage/repoman-2.3.6
sci-physics/geant-4.10.03

I checked: your application is not linked against any of these packages:

rynatb@master.basov /mnt/pool/2/rynatb/MCUPTR_11 $ ldd ./mcu5_mpi_ptr
        linux-vdso.so.1 (0x00007ffff87b9000)
        /opt/intel/composerxe-2013.2.144/compiler/lib/intel64/libimf.so (0x00007f9b0f8e3000)
        libmpi_usempi.so.1 => /usr/lib64/libmpi_usempi.so.1 (0x00007f9b0f652000)
        libmpi_mpifh.so.2 => /usr/lib64/libmpi_mpifh.so.2 (0x00007f9b0f406000)
        libmpi.so.1 => /usr/lib64/libmpi.so.1 (0x00007f9b0f131000)
        libgfortran.so.3 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.2/libgfortran.so.3 (0x00007f9b0ee19000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f9b0eb1f000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.2/libgcc_s.so.1 (0x00007f9b0e909000)
        libquadmath.so.0 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.2/libquadmath.so.0 (0x00007f9b0e6cd000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9b0e4b0000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f9b0e10d000)
        libintlc.so.5 => /opt/intel/composerxe-2013.2.144/compiler/lib/intel64/libintlc.so.5 (0x00007f9b0deb7000)
        libopen-rte.so.7 => /usr/lib64/libopen-rte.so.7 (0x00007f9b0dc3b000)
        libopen-pal.so.6 => /usr/lib64/libopen-pal.so.6 (0x00007f9b0d991000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f9b0d78d000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f9b0d585000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00007f9b0d36e000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f9b0d16b000)
        libhwloc.so.5 => /usr/lib64/libhwloc.so.5 (0x00007f9b0cf31000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9b0fda6000)
        libudev.so.1 => /lib64/libudev.so.1 (0x00007f9b0cd0d000)
        libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007f9b0cb04000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f9b0c8ed000)

So the most likely cause of the problem is a change in the input files
or in the launch parameters.
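
The ldd check above was done by eye, but it could be automated along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the sample is a shortened copy of the ldd output above, and the prefixes "libxml2." and "libG4" are my guess at the shared-library names installed by dev-libs/libxml2 and sci-physics/geant (the other updated packages are Portage tools and Python modules).

```python
import re

# Matches "libfoo.so.1 => /path/libfoo.so.1 (0x...)" as well as the bare
# form "/path/libfoo.so (0x...)". Unresolved "=> not found" lines are
# deliberately not matched.
LDD_LINE = re.compile(r"^\s*(?:(\S+)\s+=>\s+)?(\S+)\s+\(0x[0-9a-f]+\)\s*$")


def linked_libraries(ldd_output):
    """Return the library file names listed in ldd output."""
    names = []
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            soname, path = match.groups()
            # Lines without "=>" carry only a path; use its last component.
            names.append(soname or path.rsplit("/", 1)[-1])
    return names


def affected_libraries(ldd_output, prefixes):
    """Return linked libraries whose names start with any given prefix."""
    return [
        name
        for name in linked_libraries(ldd_output)
        if name.startswith(tuple(prefixes))
    ]


# Shortened sample taken from the ldd output in this message.
SAMPLE = """\
        linux-vdso.so.1 (0x00007ffff87b9000)
        /opt/intel/composerxe-2013.2.144/compiler/lib/intel64/libimf.so (0x00007f9b0f8e3000)
        libmpi_usempi.so.1 => /usr/lib64/libmpi_usempi.so.1 (0x00007f9b0f652000)
        libmpi.so.1 => /usr/lib64/libmpi.so.1 (0x00007f9b0f131000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f9b0c8ed000)
"""

# Assumed library-name prefixes for the updated packages (not from the source).
UPDATED_PREFIXES = ("libxml2.", "libG4")

if __name__ == "__main__":
    print(affected_libraries(SAMPLE, UPDATED_PREFIXES))
```

An empty result is consistent with the conclusion above that the binary does not depend on the updated packages.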

> -----Original Message-----
> From: hpc [mailto:hpc-bounces at lists.mephi.ru] On Behalf Of anikeev
> Sent: Monday, December 25, 2017 5:51 PM
> To: NRNU MEPhI HPC discussion list <hpc at lists.mephi.ru>
> Subject: Re: [mephi-hpc] calculation error
> 
> On Mon, 2017-12-25 at 12:18 +0000, Богданович Ринат Бекирович wrote:
> > Good afternoon, errors have started occurring in the calculation
> > (until Thursday everything had been computing normally, for the
> > whole of the past year).
> 
> Good evening!
> 
> > Could you tell me whether this is temporary?
> 
> Please tell me how I can reproduce the error without damaging your data.
> 
> > Error 1.
> >  
> > MCU Step: state input
> > --------------------------------------------------------------------------
> > mpirun has exited due to process rank 0 with PID 25327 on node n121
> > exiting improperly. There are three reasons this could occur:
> >  
> > 1. this process did not call "init" before exiting, but others in the
> > job did. This can cause a job to hang indefinitely while it waits for
> > all processes to call "init". By rule, if one process calls "init",
> > then ALL processes must call "init" prior to termination.
> >  
> > 2. this process called "init", but exited without calling "finalize".
> > By rule, all processes that call "init" MUST call "finalize" prior to
> > exiting or it will be considered an "abnormal termination"
> >  
> > 3. this process called "MPI_Abort" or "orte_abort" and the mca
> > parameter orte_create_session_dirs is set to false. In this case, the
> > run-time cannot detect that the abort call was an abnormal
> > termination. Hence, the only error message you will receive is this
> > one.
> >  
> > This may have caused other processes in the application to be
> > terminated by signals sent by mpirun (as reported here).
> >  
> > You can avoid this message by specifying -quiet on the mpirun command
> > line.
> >  
> >  
> > Error 2.
> >  
> >  
> >  MCU Step: state input
> > Warning: state input has already been finished. Restored.
> >  
> >  MCU Step: state calculation
> >  
> >   WARNINGS in initial data of MCU:           0
> >   ERRORS   in initial data of MCU:           0
> >  
> > -------------------------------------------------------
> > Primary job  terminated normally, but 1 process returned a non-zero
> > exit code.. Per user-direction, the job has been aborted.
> > -------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun detected that one or more processes exited with non-zero
> > status, thus causing the job to be terminated. The first process to do
> > so was:
> >  
> >   Process name: [[54185,1],31]
> >   Exit code:    2
> >  
> >  
> >  
> > [n121][[54185,1],15][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > [n108][[54185,1],95][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > [n113][[54185,1],63][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > At line 17112 of file MCUmpi.F90 (unit = 20, file = '/mnt/pool/2/rynatb/MCUPTR_10/PIN-GAP_BASOV/c2m6_62.039--16-BASOV_PG.MCU_P31')
> > Fortran runtime error: Operation now in progress
> >  
> >  
> > Best regards,
> > Rynat
> > 
> > --
> > Ринат Богданович
> > Rynat Bahdanovich
> >  
> > Postgraduate student, assistant
> > National Research Nuclear University "MEPhI"
> > Department of Theoretical and Experimental Physics of Nuclear Reactors (№5)
> > Moscow, Russia, +7 (495) 788 56 99 (ext. 9364), +7 (925) 846 28 14
> > RBBogdanovich at mephi.ru
> >  
> >  
> > _______________________________________________
> > hpc mailing list
> > hpc at lists.mephi.ru
> > https://lists.mephi.ru/listinfo/hpc
> 
> --
> Best regards,
> Engineer, Unix Technologies Department, MEPhI,
> Аникеев Артём.
> Tel.: 8 (495) 788-56-99, ext. 8998
-- 
Best regards,
Engineer, Unix Technologies Department, MEPhI,
Аникеев Артём.
Tel.: 8 (495) 788-56-99, ext. 8998