<div dir="ltr">OK, I see, thank you very much!<div><br></div><div>I can't get the job to run and can't figure out why. It looks like the input file is not being found. The script contains the following:</div><div><br></div><div><p style="margin:0px">NUMPROC=160<br></p><p style="margin:0px">INPUT= TADEK.in</p>
<p style="margin:0px">cd /mnt/pool/3/phkorneev/TADEK_2p/</p><p style="margin:0px"><br></p><p style="margin:0px">mpirun -np $NUMPROC ./ipicls2d_mb/exe/ipicls2d < ./$INPUT >> outout.info<br></p><p style="margin:0px"><br></p><p style="margin:0px">Here is the error file:</p><p style="margin:0px"><br></p><p style="margin:0px">/var/spool/pbs/mom_priv/jobs/13422.master.SC: line 22: TADEK.in: command not found</p><p style="margin:0px">At line 459 of file input.f (unit = 5, file = 'stdin')</p><p style="margin:0px">Fortran runtime error: End of file</p><p style="margin:0px">[n219:15316] [[41479,0],0]-[[41479,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)</p><p style="margin:0px">[n211:24383] [[41479,0],6]-[[41479,1],96] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)</p><p style="margin:0px">[n209:04112] [[41479,0],8]-[[41479,1],128] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)</p><p style="margin:0px">[n213:18193] [[41479,0],4]-[[41479,1],64] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)</p><p style="margin:0px">[n214:06576] [[41479,0],3]-[[41479,1],48] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)</p><p style="margin:0px">[n208:17251] [[41479,0],9]-[[41479,1],144] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)</p><p style="margin:0px">[n217:29233] [[41479,0],1]-[[41479,1],16] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)</p><p style="margin:0px">--------------------------------------------------------------------------</p><p style="margin:0px">mpirun has exited due to process rank 0 with PID 15317 on</p><p style="margin:0px">node n219 exiting improperly. There are two reasons this could occur:</p><p style="margin:0px"><br></p><p style="margin:0px">1. this process did not call "init" before exiting, but others in</p><p style="margin:0px">the job did. 
This can cause a job to hang indefinitely while it waits</p><p style="margin:0px">for all processes to call "init". By rule, if one process calls "init",</p><p style="margin:0px">then ALL processes must call "init" prior to termination.</p><p style="margin:0px"><br></p><p style="margin:0px">2. this process called "init", but exited without calling "finalize".</p><p style="margin:0px">By rule, all processes that call "init" MUST call "finalize" prior to</p><p style="margin:0px">exiting or it will be considered an "abnormal termination"</p><p style="margin:0px"><br></p><p style="margin:0px">This may have caused other processes in the application to be</p><p style="margin:0px">terminated by signals sent by mpirun (as reported here).</p><p style="margin:0px">--------------------------------------------------------------------------</p><p style="margin:0px">[n215:06781] [[41479,0],2]->[[41479,1],32] mca_oob_tcp_msg_send_handler: writev failed: Connection reset by peer (104) [sd = 58]</p><p style="margin:0px">
</p><p style="margin:0px"><br></p><p style="margin:0px"><br></p><p style="margin:0px">Am I doing something wrong, or is this a system failure?</p><p style="margin:0px"><br></p><p style="margin:0px">Best regards,</p><p style="margin:0px">Ph. K.</p></div></div><div class="gmail_extra"><br><div class="gmail_quote">2016-12-21 18:59 GMT+03:00 Andrew A. Savchenko <span dir="ltr"><<a href="mailto:bircoph@ut.mephi.ru" target="_blank">bircoph@ut.mephi.ru</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, 21 Dec 2016 18:29:20 +0300 Phil Korneev wrote:<br>
> Thank you!<br>
> But the problem is still there:<br>
><br>
> Unable to copy file /var/spool/pbs/spool/13418.master.OU to<br>
> /mnt/pool/1/phkorneev/TADEK_2+/TADEK_2+.o13418<br>
> *** error from copy<br>
> /bin/cp: cannot create regular file<br>
> '/mnt/pool/1/phkorneev/TADEK_2+/TADEK_2+.o13418': No such file or directory<br>
> *** end error output<br>
> Output retained on that host in: /var/spool/pbs/undelivered/13418.master.OU<br>
><br>
> Unable to copy file /var/spool/pbs/spool/13418.master.ER to<br>
> /mnt/pool/1/phkorneev/TADEK_2+/TADEK_2+.e13418<br>
> *** error from copy<br>
> /bin/cp: cannot create regular file<br>
> '/mnt/pool/1/phkorneev/TADEK_2+/TADEK_2+.e13418': No such file or directory<br>
> *** end error output<br>
> Output retained on that host in: /var/spool/pbs/undelivered/13418.master.ER<br>
<br>
</span>This is a different problem: pools 1 and 2 are available only on the<br>
cherenkov head node; they are not available on the cherenkov compute<br>
nodes, because they are basov storage shelves and the link between basov<br>
and cherenkov is much slower than the link between cherenkov nodes. If we<br>
allowed pool/{1,2} to be used on the cherenkov compute nodes, the<br>
interconnect between cherenkov and basov would become a bottleneck and<br>
jobs would run very slowly.<br>
<br>
The same applies to basov: pools 1 and 2 are native there, while 3 and 4<br>
are available only on the head node (to make it easier to transfer data between the clusters).<br>
<br>
All of this was already stated this spring in the<br>
announcement mailing about the launch of cherenkov.<br>
<br>
Best regards,<br>
Andrew Savchenko<br>
<br>______________________________<wbr>_________________<br>
hpc mailing list<br>
<a href="mailto:hpc@lists.mephi.ru">hpc@lists.mephi.ru</a><br>
<a href="https://lists.mephi.ru/listinfo/hpc" rel="noreferrer" target="_blank">https://lists.mephi.ru/<wbr>listinfo/hpc</a><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">All the best, <br>Philipp K</div>
</div>