[mephi-hpc] Error when launching jobs on cherenkov

Nikolai Bukharskii n.bukharskii at gmail.com
Wed Jan 3 11:44:46 MSK 2024


Good afternoon!

When trying to submit an sbatch script on cherenkov, the following error occurs:
"sbatch: error: Batch job submission failed: I/O error writing
script/environment to file"
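
In case it helps with diagnosis: the message looks like the controller
failing to write the job script to disk. A minimal check of free space,
assuming shell access on the head node (the /var/spool/slurmctld path below
is only the common default for StateSaveLocation; the real path set in
slurm.conf may differ):

    # Check free space and free inodes where the job script would be written;
    # /var/spool/slurmctld is an assumed default, not a confirmed path.
    df -h /tmp /var/spool/slurmctld
    df -i /tmp /var/spool/slurmctld

    # Minimal submission that reproduces the error without any script of mine:
    sbatch --wrap="hostname"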

In addition, when testing directly on the head node, without submitting to
the slurm queue, I noticed that the following error message appears:
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:

  Directory: /tmp/ompi.cherenkov.1366/pid.10080
  Error:     No space left on device

Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
The full text is in the attached file. Could this be related to the first
error in some way?

Is there any way to fix this?
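
If it is a matter of /tmp being full, then as a temporary workaround on my
side, would it be acceptable to point the Open MPI session directories at a
writable location instead of /tmp? A sketch of what I mean (the directory
is just an example, not an existing path):

    # Redirect Open MPI's session directory (the orte_tmpdir_base MCA
    # parameter) away from the full /tmp; $HOME/ompi-tmp is an example path.
    mkdir -p "$HOME/ompi-tmp"
    export OMPI_MCA_orte_tmpdir_base="$HOME/ompi-tmp"
    mpirun -np 4 ./a.out   # ./a.out stands in for the actual binary

As far as I understand, though, this would not help with the sbatch
submission error itself, since that one seems to happen on the controller's
side.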

---
Best regards,
Nikolai Bukharskii
-------------- next part --------------
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:

  Directory: /tmp/ompi.cherenkov.1366/pid.10080
  Error:     No space left on device

Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[cherenkov:10080] [[51734,0],0] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[cherenkov:10080] [[51734,0],0] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[cherenkov:10079] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ../../../../../../orte/mca/ess/singleton/ess_singleton_module.c at line 716
[cherenkov:10079] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ../../../../../../orte/mca/ess/singleton/ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[cherenkov:10079] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

