[mephi-hpc] Error when launching jobs on cherenkov
Nikolai Bukharskii
n.bukharskii at gmail.com
Wed Jan 3 11:44:46 MSK 2024
Good afternoon,
When I try to submit an sbatch script on cherenkov, the following error occurs:
"sbatch: error: Batch job submission failed: I/O error writing
script/environment to file"
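For reference, whether a relevant filesystem is simply out of space (or out of inodes) could be checked with something like the following; this is only a sketch, and the exact mount points to inspect are an assumption on my part:

    # check free space and free inodes on the node
    # (mount points below are hypothetical examples)
    df -h /tmp /var /home
    df -i /tmp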
In addition, when testing on the head node, without submitting to the slurm
queue, I noticed that the following error message appears:
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:
Directory: /tmp/ompi.cherenkov.1366/pid.10080
Error: No space left on device
Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
The full text is in the attached file. Could this be related to the first
error somehow?
Is there any way to fix this?
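If the cause is just that /tmp on the head node is full, one possible workaround (a sketch only, not verified on cherenkov) would be to point Open MPI's session directory at a different location, for example:

    # redirect Open MPI session/scratch files away from /tmp
    # (the target directory and program name are hypothetical)
    export TMPDIR=$HOME/tmp
    mkdir -p "$TMPDIR"
    mpirun --mca orte_tmpdir_base "$TMPDIR" -np 2 ./my_mpi_program

This only helps with the mkdir failure on the head node, of course; the sbatch submission error itself would still need to be looked at on the Slurm side.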
---
Best regards,
Nikolai Bukharskii
-------------- next part --------------
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:
Directory: /tmp/ompi.cherenkov.1366/pid.10080
Error: No space left on device
Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[cherenkov:10080] [[51734,0],0] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[cherenkov:10080] [[51734,0],0] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_session_dir failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[cherenkov:10079] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ../../../../../../orte/mca/ess/singleton/ess_singleton_module.c at line 716
[cherenkov:10079] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ../../../../../../orte/mca/ess/singleton/ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cherenkov:10079] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!