Temp files in Google Cloud Dataflow
I'm trying to write temporary files on the workers executing Dataflow jobs, but it seems the files get deleted while the job is still running. If I SSH into the running VM, I can execute the exact same file-generating command and the files are not destroyed -- perhaps the cleanup applies only to the user the Dataflow runner runs as. Is it possible to use temp files, or is this a platform limitation?
Specifically, I'm attempting to write to the location returned by Files.createTempDir() (Guava), e.g. /tmp/someidentifier.

Edit: Not sure what was happening when I posted, but Files.createTempDirectory() works...
We make no explicit guarantee about the lifetime of files you write to local disk.
That said, writing a temporary file inside processElement should work. You can write and read a file within the same processElement. Similarly, files created in DoFn.startBundle will be visible in processElement and finishBundle.
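As a minimal sketch of that lifecycle, assuming the Dataflow Java SDK 1.x DoFn API (ScratchFileFn and the "scratch" file name are hypothetical):

    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;

    // A DoFn that allocates a scratch file per bundle and uses it per element.
    class ScratchFileFn extends DoFn<String, String> {
      private transient File scratch;

      @Override
      public void startBundle(Context c) throws Exception {
        // Created in startBundle; visible in processElement and finishBundle.
        scratch = File.createTempFile("scratch", ".txt");
      }

      @Override
      public void processElement(ProcessContext c) throws Exception {
        // Write and read back within the same bundle.
        Files.write(scratch.toPath(), c.element().getBytes(StandardCharsets.UTF_8));
        String roundTripped =
            new String(Files.readAllBytes(scratch.toPath()), StandardCharsets.UTF_8);
        c.output(roundTripped);
      }

      @Override
      public void finishBundle(Context c) throws Exception {
        // No lifetime guarantee beyond the bundle, so clean up here.
        scratch.delete();
      }
    }

Deleting the file in finishBundle is deliberate: since the platform makes no lifetime guarantee, the DoFn cleans up after itself rather than relying on the runner.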
You should avoid writing to /dataflow/logs/taskrunner/harness; writing files there might conflict with Dataflow's logging. We encourage you to use the standard Java APIs File.createTempFile() and Files.createTempDirectory() instead.
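For example (a small sketch; the "scratch" prefix is arbitrary), both calls allocate unique paths under java.io.tmpdir rather than under Dataflow's log directories:

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class TempPaths {
      public static void main(String[] args) throws IOException {
        // Each call creates a uniquely named entry under java.io.tmpdir.
        File tempFile = File.createTempFile("scratch", ".tmp");
        Path tempDir = Files.createTempDirectory("scratch");
        System.out.println(tempFile + " and " + tempDir);
      }
    }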
If you want to preserve data beyond finishBundle, you should write it to durable storage such as GCS. You can do this by emitting the data, possibly as a sideOutput, and using TextIO or one of the other writers. Alternatively, you can write to GCS directly from inside your DoFn.
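A sketch of the TextIO route, again assuming the 1.x SDK; for simplicity it emits on the main output rather than a sideOutput, and gs://my-bucket/output is a placeholder path:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Create;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class WriteToGcs {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(Create.of("a", "b", "c"))
         .apply(ParDo.of(new DoFn<String, String>() {
           @Override
           public void processElement(ProcessContext c) {
             // Emit whatever you want to keep; replace with real processing.
             c.output(c.element());
           }
         }))
         // TextIO writes the emitted strings to durable storage on GCS.
         .apply(TextIO.Write.to("gs://my-bucket/output"));
        p.run();
      }
    }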
Since Dataflow runs inside containers, you won't be able to see the files by SSH'ing into the VM. The container has some of the host VM's directories mounted, but /tmp is not one of them. To inspect the files you need to attach to the appropriate container, e.g. by running
docker exec -t -i <container id> /bin/bash
That command starts a shell inside the running container.
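For example, assuming a standard Docker setup on the worker VM (exact container names vary by Dataflow version), you can find the container id first and then attach:

    # List running containers to find the harness container's id.
    docker ps
    # Attach an interactive shell to that container.
    docker exec -t -i <container id> /bin/bash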