diff --git a/_layouts/file_avail.html b/_layouts/file_avail.html index 4817ab8f..be922b4e 100644 --- a/_layouts/file_avail.html +++ b/_layouts/file_avail.html @@ -9,9 +9,8 @@

Which Option is the Best for Your Files?

Input Sizes Output Sizes - Link to Guide File Location - How to Transfer + Syntax for transfer_input_files Availability, Security @@ -19,30 +18,18 @@

Which Option is the Best for Your Files?

0 - 100 MB per file, up to 500 MB per job 0 - 5 GB per job - Small Input/Output File Transfer via HTCondor /home - submit file; filename in transfer_input_files - CHTC, UW Grid, and OSG; works for your jobs + No special syntax + CHTC and external pools - - 100 MB - 1 GB per repeatedly-used file - Not available for output - Large Input File Availability Via Squid - /squid - submit file; http link in transfer_input_files - CHTC, UW Grid, and OSG; files are made *publicly-readable* via an HTTP address - - - 100 MB - TBs per job-specific file; repeatedly-used files > 1GB 4 GB - TBs per job - Large Input and Output File Availability Via Staging /staging - job executable; copy or move within the job - a portion of CHTC; accessible only to your jobs + osdf:/// or file:/// + all of CHTC/external pools or a subset of CHTC diff --git a/_redirects/file-avail-squid.md b/_redirects/file-avail-squid.md new file mode 100644 index 00000000..c4d9dfbb --- /dev/null +++ b/_redirects/file-avail-squid.md @@ -0,0 +1,5 @@ +--- +layout: redirect +redirect_url: /uw-research-computing/htc-job-file-transfer +permalink: /uw-research-computing/file-avail-squid +--- diff --git a/_redirects/file-availability.md b/_redirects/file-availability.md new file mode 100644 index 00000000..8456db53 --- /dev/null +++ b/_redirects/file-availability.md @@ -0,0 +1,5 @@ +--- +layout: redirect +redirect_url: /uw-research-computing/htc-job-file-transfer +permalink: /uw-research-computing/file-availability +--- diff --git a/_uw-research-computing/file-avail-squid.md b/_uw-research-computing/archived/file-avail-squid.md similarity index 99% rename from _uw-research-computing/file-avail-squid.md rename to _uw-research-computing/archived/file-avail-squid.md index f43bbb97..3dca9099 100644 --- a/_uw-research-computing/file-avail-squid.md +++ b/_uw-research-computing/archived/file-avail-squid.md @@ -2,8 +2,8 @@ highlighter: none layout: file_avail title: Transfer Large Input Files Via Squid +published: false guide: - 
category: Manage data tag: - htc --- diff --git a/_uw-research-computing/file-availability.md b/_uw-research-computing/archived/file-availability.md similarity index 99% rename from _uw-research-computing/file-availability.md rename to _uw-research-computing/archived/file-availability.md index f51d430e..2194fdc4 100644 --- a/_uw-research-computing/file-availability.md +++ b/_uw-research-computing/archived/file-availability.md @@ -3,8 +3,8 @@ highlighter: none layout: file_avail title: Small Input and Output File Availability Via HTCondor alt_title: Transfer Small Input and Output +published: false guide: - category: Manage data tag: - htc --- diff --git a/_uw-research-computing/check-quota.md b/_uw-research-computing/check-quota.md index db627df7..39045696 100644 --- a/_uw-research-computing/check-quota.md +++ b/_uw-research-computing/check-quota.md @@ -9,29 +9,25 @@ guide: --- The following commands will allow you to monitor the amount of disk -space you are using in your home directory on our (or another) submit node and to determine the -amount of disk space you have been allotted (your quota). - -If you also have a `/staging` directory on the HTC system, see our -[staging guide](file-avail-largedata.html#5-checking-your-quota-data-use-and-file-counts) for -details on how to check your quota and usage. -\ -The default quota allotment on CHTC submit nodes is 20 GB with a hard -limit of 30 GB (at which point you cannot write more files).\ -\ -**Note: The CHTC submit nodes are not backed up, so you will want to +space you are using in your home directory on the access point and to determine the +amount of disk space you have been allotted (your quota). + +The default quota allotment in your `/home` directory is 20 GB with a hard +limit of 30 GB (at which point you cannot write more files). 
+ +**Note: The CHTC access points are not backed up, so you should copy completed jobs to a secure location as soon as a batch completes, and then delete them on the submit node in order to make room for future -jobs.** If you need more disk space to run a single batch or concurrent -batches of jobs, please contact us ([Get Help!](get-help.html)). We have multiple ways of dealing with large disk space -requirements to make things easier for you. +jobs.** Disk space provided is intended for *active* calculations only, not permanent storage. +If you need more disk space to run a single batch or concurrent +batches of jobs, please contact us ([Get Help!](get-help.html)). We have multiple ways of dealing with large disk space requirements to make things easier for you. If you wish to change your quotas, please see [Request a Quota Change](quota-request). -**1. Checking Your User Quota and Usage** +**1. Checking Your `/home` Quota and Usage** ------------------------------------- -From any directory location within your home directory, type +From any directory location within your `/home` directory, use the command `quota -vs`. See the example below: ``` @@ -42,18 +38,39 @@ Disk quotas for user alice (uid 20384): ``` {:.term} -The output will list your total data usage under `blocks`, your soft +The output will list your total data usage under `space`, your soft `quota`, and your hard `limit` at which point your jobs will no longer -be allowed to save data. Each of the values given are in 1-kilobyte +be allowed to save data. Each value is given in 1-kilobyte blocks, so you can divide each number by 1024 to get megabytes (MB), and -again for gigabytes (GB). (It also lists information for ` files`, but -we don\'t typically allocate disk space by file count.) +again for gigabytes (GB). (It also lists information for number of `files`, but +we don't typically allocate disk space in `/home` by file count.) + +**2. 
Checking Your `/staging` Quota and Usage** +------------------------------------------------ +Users may have a `/staging` directory, meant for staging large files and data intended for +job submission. See our [Managing Large Data in HTC Jobs](file-avail-largedata) guide for +more information. + +To check your `/staging` quota, use the command `get_quotas /staging/username`. + +``` +[alice@submit]$ get_quotas /staging/alice +Path Quota(GB) Items Disk_Usage(GB) Items_Usage +/staging/alice 20 5 3.18969 5 +``` +{:.term} + +Your `/staging` directory has a disk and item quota. In the example above, the disk quota is +20 GB, and the items quota is 5 items. The current usage is printed in the following columns; +in the example, the user has used 3.19 GB and 5 items. + +To request a quota increase, [fill out our quota request form](quota-request). -**2. Checking the Size of Directories and Contents** +**3. Checking the Size of Directories and Contents** ------------------------------------------------ -Move to the directory you\'d like to check and type `du` . After several -moments (longer if you\'re directory contents are large), the command +Move to the directory you'd like to check and type `du` . After several +moments (longer if the contents of your directory are large), the command will add up the sizes of directory contents and output the total size of each contained directory in units of kilobytes with the total size of that directory listed last. 
See the example below: diff --git a/_uw-research-computing/file-avail-largedata-test.md b/_uw-research-computing/file-avail-largedata-test.md deleted file mode 100644 index 0522644d..00000000 --- a/_uw-research-computing/file-avail-largedata-test.md +++ /dev/null @@ -1,651 +0,0 @@ ---- -highlighter: none -layout: guide -title: Managing Large Data in HTC Jobs -published: false ---- - -# Data Transfer Solutions By File Size - -Due to the distributed nature of CHTC's High Throughput Computing (HTC) system, -your jobs will run on a server (aka an execute server) that is separate and -distinct from the server that your jobs are submitted from (aka the submit server). -This means that a copy of all the files needed to start your jobs must be -made available on the execute server. Likewise, any output files created -during the execution of your jobs, which are written to the execute server, -will also need to be transferred to a location that is accessible to you after your jobs complete. -**How input files are copied to the execute server and how output files are -copied back will depend on the size of these files.** This is illustrated via -the diagram below: - -![CHTC File Management Solutions](images/chtc-file-transfer.png) - -CHTC's data filesystem called "Staging" is a distinct location for -temporarily hosting files that are too large to be handled in a -high-throughput fashion via the default HTCondor file transfer -mechanism which is otherwise used for small files hosted in your `/home` -directory on your submit server. - -CHTC's `/staging` location is specifically intended for: - -- any individual input files >100MB -- input files totaling >500MB per job -- individual output files >4GB -- output files totaling >4GB per job - -This guide covers when and how to use `/staging` for jobs run in CHTC. 
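As a quick way to see whether your data crosses these thresholds, standard `find` can list the files that are individually too large for default HTCondor file transfer. This is an illustrative sketch (the directory argument is a placeholder), not a CHTC-provided tool:

```shell
#!/bin/sh
# List files under a directory that exceed the 100MB per-file
# threshold for HTCondor file transfer (directory is a placeholder).
INPUT_DIR="${1:-.}"
find "$INPUT_DIR" -type f -size +100M -exec du -h {} +
```

Files that appear in this listing are candidates for `/staging`; everything else can stay in `/home`.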
- -# Table of Contents - -- [Who Should Use Staging](#use) -- [Policies and User Responsibilities](#policies-and-user-responsibilities) -- [Quickstart Instructions](#quickstart-instructions) -- [Get Access To Staging](#access) -- [Use The Transfer Server To Move Files To/From Staging](#transfer) -- [Submit Jobs With Input Files in Staging](#input) -- [Submit Jobs That Transfer Output Files To Staging](#output) -- [Tips For Success When Using Staging](#tips) -- [Managing Staging Data and Quotas](#quota) - - -# Who Should Use `/staging` - -`/staging` is a location specifically for hosting individually larger input (>100MB) -and/or larger output (>4GB) files, or when a job needs 500MB or more of total input -or will produce 4GB or more of total output. Job input and output of these -sizes are too large to be managed by CHTC's other data movement methods. - -**Default CHTC account creation does not include access to `/staging`.** -Access to `/staging` is provided as needed for supporting your data management -needs. If you think you need access to `/staging`, or would -like to know more about managing your data needs, please contact us at -. - -Files hosted in `/staging` are only accessible to jobs running in the CHTC pool. -About 50% of CHTC execute servers have access to `/staging`. Users will get -better job throughput if they are able to break up their work into smaller jobs -that each use or produce input and output files that do not require `/staging`. - -# Policies and User Responsibilities - -**USERS VIOLATING ANY OF THE POLICIES IN THIS GUIDE WILL -HAVE THEIR DATA STAGING ACCESS AND/OR CHTC ACCOUNT REVOKED UNTIL CORRECTIVE -MEASURES ARE TAKEN. CHTC STAFF RESERVE THE RIGHT TO REMOVE ANY -PROBLEMATIC USER DATA AT ANY TIME IN ORDER TO PRESERVE PERFORMANCE** - -

Jobs should NEVER be submitted from -/staging. All HTCondor job submissions must be performed from your -/home directory on the submit server and job log, -error, and output files should never be -written to /staging.
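One way to make this rule hard to violate is a small guard at the top of a personal submission wrapper script. This is a hypothetical convenience, not an official CHTC tool:

```shell
#!/bin/sh
# Refuse to run if the current directory is inside /staging.
case "$PWD" in
  /staging/*)
    echo "ERROR: submit jobs from /home, never from /staging" >&2
    exit 1
    ;;
esac
# ... a condor_submit command for your job would follow here ...
```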

- -- **Backup your files**: As with all CHTC file spaces, CHTC does not back -up your files in `/staging`. - -- **Use bash script commands to access files in `/staging`**: Files placed in `/staging` -should **NEVER** be listed in the submit file, but rather accessed -via the job's executable (aka .sh) script. More details provided -in [Submit Jobs With Input Files in Staging](#input) -and [Submit Jobs That Transfer Output Files To Staging](#output). - -- **Use the transfer server**: We expect that users will only use our dedicated -transfer server, transfer.chtc.wisc.edu, instead of the submit server, -to upload and download appropriate files to and from `/staging`. Transferring -files to `/staging` with the submit server can negatively impact job submission for -you and other users. For more details, please see -[Use The Transfer Server To Move Files To/From Staging](#transfer) - -- **Quota control**:`/staging` directories include disk space and -items (i.e. directories and files) quotas. Quotas are necessary for -maintaning the stability and reliability of `/staging`. Quota changes can -be requested by emailing and -users can monitor quota settings and usage as described in -[Managing Staging Data and Quotas](#quota) - -- **Reduce file size and count**: We expect that users will use `tar` and -compression to reduce data size and file counts such that a single tarball -is needed and/or produced per job. More details provided in [Submit Jobs With Input Files in Staging](#input) -and [Submit Jobs That Transfer Output Files To Staging](#output). - -- **Shared group data**: `/staging` directories are owned by the user, -and only the user's own jobs can access these files. We can create shared group -`/staging` directories for sharing larger input and output files as needed. -[Contact us](mailto:chtc@cs.wisc.edu) to learn more. - -- **Remove data**: We expect that users will remove data from `/staging` as -soon as it is no longer needed for actively-running jobs. 
- -- CHTC staff reserve the right to remove data from `/staging` -(or any CHTC file system) at any time. - -# Quickstart Instructions - -1. Request access to `/staging`. - - * For more details, see [Get Access To Staging](#access) - -1. Review `/staging` [Policies and User Responsibilities](#policies-and-user-responsibilities) - -1. Prepare input files for hosting in `/staging`. - - * Compress files to reduce file size and speed up -file transfer. - - * If your jobs need multiple large input files, -use `tar` and `zip` to combine files so that only a single `tar` or `zip` -archive is needed per job. - -1. Use the transfer server, `transfer.chtc.wisc.edu`, to upload input -files to your `/staging` directory. - - * For more details, see [Use The Transfer Server To Move Files To/From Staging](#transfer). - - * For details, see [Submit Jobs With Input Files in Staging](#input). - -1. Create your HTCondor submit file. - - * Include the following submit detail to ensure that -your jobs will have access to your files in `/staging`: - - ``` {.sub} - requirements = (HasCHTCStaging =?= true) - ``` - -1. Create your executable bash script. - - * Use `cp` or `rsync` to copy large input -from `/staging` that is needed for the job. For example: - - ``` - cp /staging/username/my-large-input.tar.gz ./ - tar -xzf my-large-input.tar.gz - ``` - {:.file} - - * If the job will produce output >4GB, this output should -be compressed and moved to `/staging` before the job terminates. If multiple large output -files are created, use `tar` and `zip` to reduce file counts. For -example: - - ``` - tar -czf large_output.tar.gz output-file-1 output-file-2 output_dir/ - mv large_output.tar.gz /staging/username - ``` - {:.file} - - * Before the job completes, delete input copied from `/staging`, the -extracted large input file(s), and the uncompressed or untarred large output files.
For example: - - ``` - rm my-large-input.tar.gz - rm my-large-input-file - rm output-file-1 output-file-2 - ``` - {:.file} - - * For more details about job submission using input from `/staging` or for hosting -output in `/staging`, please see [Submit Jobs With Input Files in Staging](#input) and -[Submit Jobs That Transfer Output Files To Staging](#output). - -1. Remove large input and output files from `/staging` after jobs complete using -`transfer.chtc.wisc.edu`. - - - -# Get Access To `/staging` - -

- -CHTC accounts do not automatically include access to `/staging`. If you think -you need a `/staging` directory, please contact us at . So -we can process your request more quickly, please include details regarding -the number and size of the input and/or output files you plan to host in -`/staging`. You will also be granted access to our dedicated transfer -server upon creation of your `/staging` directory. - -*What is the path to my `/staging` directory?* -- Individual directories will be created at `/staging/username` -- Group directories will be created at `/staging/groups/group_name` - -*How much space will I have?* - -Your quota will be set based on your specific data needs. To see more -information about checking your quota and usage in staging, see the -end of this guide: [Managing Staging Data and Quotas](#quota) - -[Return to top of page](#data-transfer-solutions-by-file-size) - -

-
- - -# Use The Transfer Server To Move Files To/From `/staging` - -

- -![Use Transfer Server](images/use-transfer-staging.png) - -Our dedicated transfer server, `transfer.chtc.wisc.edu`, should be used to -upload and/or download your files to/from `/staging`. - -The transfer server is a separate server that is independent of the submit -server you otherwise use for job submission. By using the transfer server -for `/staging` data upload and download, you are helping to reduce network -bottlenecks on the submit server that could otherwise negatively impact -the submit server's performance and ability to manage and submit jobs. - -**Users should not use their submit server to upload or download files -to/from `staging` unless otherwise directed by CHTC staff.** - -*How do I connect to the transfer server?* -Users can access the transfer server the same way they access their -submit server (i.e. via Terminal app or PuTTY) using the following -hostname: `transfer.chtc.wisc.edu`. - -*How do I upload/download files to/from `staging`?* -Several options exist for moving data to/from `staging` including: - -- `scp` and `rsync` can be used from the terminal to move data -to/from your own computer or *another server*. For example: - - ``` - $ scp large.file username@transfer.chtc.wisc.edu:/staging/username/ - $ scp username@serverhostname:/path/to/large.file username@transfer.chtc.wisc.edu:/staging/username/ - ``` - {:.term} - - **Be sure to use the username assigned to you on the other submit server for the first - argument in the above example for uploading a large file from another server.** - -- GUI-based file transfer clients like WinSCP, FileZilla, and Cyberduck -can be used to move files to/from your personal computer. Be -sure to use `transfer.chtc.wisc.edu` when setting up the connection. - -- Globus file transfer can be used to transfer files to/from a Globus Endpoint. -See our guide [Using Globus To Transfer Files To and From CHTC](globus.html) -for more details. 
- -- `smbclient` is available for managing file transfers to/from file -servers that have `smbclient` installed, like DoIT's ResearchDrive. See our guide -[Transferring Files Between CHTC and ResearchDrive](transfer-data-researchdrive.html) -for more details. - -[Return to top of page](#data-transfer-solutions-by-file-size) - -
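After a large upload, it is worth confirming that the copy arrived intact. A sketch using `sha256sum` from GNU coreutils — the tarball here is generated purely for demonstration, and with real data you would run the `-c` check on the transfer server after uploading:

```shell
#!/bin/sh
# Create a demo tarball standing in for real large input.
echo "demo data" > file1.lrg
tar -czf large_input.tar.gz file1.lrg

# Record the checksum before uploading...
sha256sum large_input.tar.gz > large_input.tar.gz.sha256

# ...and run the check again on the receiving side after the
# transfer; a matching copy reports "large_input.tar.gz: OK".
sha256sum -c large_input.tar.gz.sha256
```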

-
- - -# Submit Jobs With Input Files in `/staging` - -

- -![Staging File Transfer](images/staging-file-transfer.png) - -`/staging` is a distinct location for temporarily hosting your -individually larger input files >100MB in size or in cases when jobs -will need >500MB of total input. First, a copy of -the appropriate input files must be uploaded to your `/staging` directory -before your jobs can be submitted. As a reminder, individual input files <100MB -in size should be hosted in your `/home` directory. - -Both your submit file and bash script -must include the necessary information to ensure successful completion of -jobs that will use input files from `/staging`. The sections below will -provide details for the following steps: - -1. Prepare your input before uploading to `/staging` -2. Prepare your submit files for jobs that will use large input -files hosted in `/staging` -3. Prepare your executable bash script to access and use large input -files hosted in `/staging`, and delete large input from the job - -## Prepare Large Input Files For `/staging` - -**Organize and prepare your large input such that each job will use a single -large input file, or as few as possible.** - -As described in our policies, data placed in `/staging` should be -stored in as few files as possible. Before uploading input files -to `/staging`, use file compression (`zip`, `gzip`, `bzip2`) and `tar` to reduce -file sizes and total file counts such that only a single, or as few as -possible, input file(s) will be needed per job. - -If your large input will be uploaded from your personal computer, -Mac and Linux users can create input tarballs by using the command `tar -czf` -within the Terminal. Windows users may also use a terminal if installed; -otherwise, several GUI-based `tar` applications are available, or ZIP can be used -in place of `tar`. 
- -The following examples demonstrate how to make a compressed tarball -from the terminal for two large input files named `file1.lrg` and -`file2.lrg` which will be used for a single job: - -``` -my-computer username$ tar -czf large_input.tar.gz file1.lrg file2.lrg -my-computer username$ ls -file1.lrg -file2.lrg -large_input.tar.gz -``` -{: .term} - -Alternatively, files can first be moved to a directory which can then -be compressed: - -``` -my-computer username$ mkdir large_input -my-computer username$ mv file1.lrg file2.lrg large_input/ -my-computer username$ tar -czf large_input.tar.gz large_input -my-computer username$ ls -F -large_input/ -large_input.tar.gz -``` -{: .term} - -After preparing your input, -use the transfer server to upload the tarballs to `/staging`. Instructions for -using the transfer server are provide in the above section -[Use The Transfer Server To Move Large Files To/From Staging](#transfer). - -## Prepare Submit File For Jobs With Input in `/staging` - -Not all CHTC execute servers have access to `/staging`. If your job needs access -to files in `/staging`, you must tell HTCondor to run your jobs on the approprite servers -via the `requirements` submit file attribute. **Be sure to request sufficient disk -space for your jobs in order to accomodate all job input and output files.** - -An example submit file for submitting a job that requires access to `/staging` -and which will transfer a smaller, <100MB, input file from `/home`: - -```{.sub} -# job with files in staging and input in home example - -log = my_job.$(Cluster).$(Process).log -error = my_job.$(Cluster).$(Process).err -output = my_job.$(Cluster).$(Process).out - -...other submit file details... 
- -# transfer small files from home -transfer_input_files = my_smaller_file - -requirements = (HasCHTCStaging =?= true) - -queue -``` - -**Remember:** If your job has any other requirements defined in -the submit file, you should combine them into a single `requirements` statement: - -```{.sub} -requirements = (HasCHTCStaging =?= true) && other requirements -``` - -## Use Job Bash Script To Access Input In `/staging` - -Unlike smaller, <100MB, files that are transferred from your home directory -using `transfer_input_files`, files placed in `/staging` should **NEVER** -be listed in the submit file. Instead, you must include additional -commands in the job's executable bash script that will copy (via `cp` or `rsync`) -your input in `/staging` to the job's working directory, then extract ("untar") and -uncompress the data. - -**Additional commands should be included in your bash script to remove -any input files copied from `/staging` before the job terminates.** -HTCondor will think the files copied from `/staging` are newly generated -output files and will likely transfer these files back -to your home directory with other, real output. This can cause your `/home` -directory to quickly exceed its disk quota, causing your jobs to -go on hold with all progress lost. - -Continuing our example, a bash script to copy and extract -`large_input.tar.gz` from `/staging`: - -``` -#!/bin/bash - -# copy tarball from staging to current working dir -cp /staging/username/large_input.tar.gz ./ - -# extract tarball -tar -xzf large_input.tar.gz - -...additional commands to be executed by job... - -# delete large input to prevent -# HTCondor from transferring back to submit server -rm large_input.tar.gz file1.lrg file2.lrg - -# END -``` -{:.file} - -As shown in the example above, \*both\* the original tarball, `large_input.tar.gz`, and -the extracted files are deleted as a final step in the script. 
If untarring -`large_input.tar.gz` instead creates a new subdirectory, then only the original tarball -needs to be deleted. - -

Want to speed up jobs with larger input data? -

- -If your job will transfer >20GB of input files, then using `rm` to remove these -files before the job terminates can take a while to complete, which will add -unnecessary runtime to your job. In this case, you can create a -subdirectory and move (`mv`) the large input to it - this will complete almost -instantaneously. Because these files will be in a subdirectory, HTCondor will -ignore them when determining which output files to transfer back to the submit server. - -For example: - -``` -# prevent HTCondor from transferring input file(s) back to submit server -mkdir ignore/ -mv large_input.tar.gz file1.lrg file2.lrg ignore/ -``` -{:.file} - -
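The effect of this trick can be simulated locally: after the `mv`, only genuine output remains at the top level of the job's working directory. All filenames below are stand-ins:

```shell
#!/bin/sh
# Simulate a job sandbox containing inputs and one real output file.
mkdir -p sandbox && cd sandbox
touch large_input.tar.gz file1.lrg file2.lrg output.txt

# Stash the inputs in a subdirectory instead of deleting them;
# HTCondor only considers top-level files for transfer back.
mkdir -p ignore
mv large_input.tar.gz file1.lrg file2.lrg ignore/

# Only output.txt (and the ignore/ directory) remain at the top level.
ls
```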

-
- -## Remove Files From `/staging` After Jobs Complete - -Files in `/staging` are not backed up and `/staging` should not -be used as a general purpose file storage service. As with all -CHTC file spaces, data should be removed from `/staging` as -soon as it is no longer needed for actively-running jobs. Even if it -will be used in the future, your data should be deleted and copied -back at a later date. Files can be taken off of `/staging` using similar -mechanisms as uploaded files (as above). - -[Return to top of page](#data-transfer-solutions-by-file-size) - -
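Finding candidates for removal can be scripted with `find`'s modification-time test — for example, anything untouched for 30 days. The directory argument is a placeholder, and this only lists files, so you can review the output before deleting anything:

```shell
#!/bin/sh
# List files not modified in the last 30 days; review before
# removing anything (directory argument is a placeholder).
STAGING_DIR="${1:-.}"
find "$STAGING_DIR" -type f -mtime +30 -print
```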

-
- - -# Submit Jobs That Transfer Output Files To `/staging` - -

- -![Staging File Transfer](images/staging-file-transfer.png) - -`/staging` is a distinct location for temporarily hosting -individual output files >4GB in size or in cases when >4GB -of output is produced by a single job. - -Both your submit file and job bash script -must include the necessary information to ensure successful completion of -jobs that will host output in `/staging`. The sections below will -provide details for the following steps: - -1. Prepare your submit files for jobs that will host output in `/staging` -2. Prepare your executable bash script to tar output and move to `/staging` - -## Prepare Submit File For Jobs That Will Host Output In `/staging` - -Not all CHTC execute servers have access to `/staging`. If your job -will host output files in `/staging`, you must tell HTCondor to run -your jobs on the appropriate servers via the `requirements` submit -file attribute: - -```{.sub} -# job that needs access to staging - -log = my_job.$(Cluster).$(Process).log -error = my_job.$(Cluster).$(Process).err -output = my_job.$(Cluster).$(Process).out - -...other submit file details... - -requirements = (HasCHTCStaging =?= true) - -queue -``` - -**Remember:** If your job has any other requirements defined in -the submit file, you should combine them into a single `requirements` statement: - -```{.sub} -requirements = (HasCHTCStaging =?= true) && other requirements -``` - -## Use Job Bash Script To Move Output To `/staging` - -Output generated by your job is written to the execute server -where the jobs run. For output that is large enough (>4GB) to warrant use -of `/staging`, you must include steps in the executable bash script of -your job that will package the output into a tarball and relocate it -to your `/staging` directory before the job completes. 
**This can be -achieved with a single `tar` command that directly writes the tarball -to your staging directory!** It is IMPORTANT that no other files be written -directly to your `/staging` directory during job execution except for -the below `tar` example. - -For example, if a job writes a larger amount of output to -a subdirectory `output_dir/` along with an additional -larger output file `output.lrg`, the following steps will -package all of the output into a single tarball that is -then moved to `/staging`. **Note:** `output.lrg` will still exist -in the job's working directory after creating the tarball and thus -must be deleted before the job completes. - -``` -#!/bin/bash - -# Commands to execute job - -... - -# create tarball located in staging containing >4GB output -tar -czf /staging/username/large_output.tar.gz output_dir/ output.lrg - -# delete any remaining large files -rm output.lrg - -# END -``` -{: .file} - -If a job generates a single large file that will not shrink much when -compressed, it can be moved directly to staging. If a job generates -multiple files in a directory, or the files can be made substantially smaller -by zipping them, the above example should be followed. - -``` -#!/bin/bash - -# Commands to execute job - -... - -# move single large output file to staging -mv output.lrg /staging/username/ - -# END -``` -{: .file} - -## Managing Larger `stdout` From Jobs - -Does your software produce a large amount of output that gets -saved to the HTCondor `output` file? Some software is written to -"stream" output directly to the terminal screen during interactive execution, but -when the software is executed non-interactively via HTCondor, the output is -instead saved in the `output` file designated in the HTCondor submit file. 
- -Because HTCondor will transfer `output` back to your home directory, if your -jobs produce HTCondor `output` files >4GB it is important to move this -data to `/staging` by redirecting the output of your job commands to a -separate file that gets packaged into a compressed tarball and relocated -to `/staging`: - -``` -#!/bin/bash - -# redirect standard output to a file in the working directory -./myprogram myinput.txt > large.stdout - -# create tarball located in staging containing >4GB output -tar -czf /staging/username/large.stdout.tar.gz large.stdout - -# delete large.stdout file -rm large.stdout - -# END -``` -{: .file} - -[Return to top of page](#data-transfer-solutions-by-file-size) - -
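A variation on the pattern above compresses the stream as it is produced, so the large plain-text file never hits the disk at all. Here `myprogram` is simulated by a shell function, and the final `mv` to `/staging` is left commented because the path is site-specific:

```shell
#!/bin/sh
# Stand-in for a program that writes a large volume of text to stdout.
myprogram() { seq 1 100000; }

# Pipe stdout through gzip instead of writing large.stdout first.
myprogram | gzip > large.stdout.gz

# Relocate the compressed file (illustrative path):
# mv large.stdout.gz /staging/username/
```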

-
- - -# Tips For Success When Using `/staging` - -In order to properly submit jobs that use `/staging` for managing larger -input and output files, always do the following: - -- **Submit from `/home`**: ONLY submit jobs from within your home directory - (`/home/username`), and NEVER from within `/staging`. - -- **No large data in the submit file**: Do NOT list any files from `/staging` in -your submit file and do NOT use `/staging` as a path for any submit file attributes -such as `executable, log, output, error, transfer_input_files`. -As described in this guide, all interaction with `/staging` will occur via -commands in the executable bash script. - -- **Request sufficient disk space**: Using `request_disk`, request an amount of disk -space that reflects the total of a) input data that each job will copy into -the job working directory from `/staging`, including the size of the tarball and the -extracted files, b) any input transferred via `transfer_input_files`, -and c) any output that will be created in the job working directory. - -- **Require access to `/staging`**: Tell HTCondor that your jobs need to run on -execute servers that can access `/staging` using the following submit file attribute: - - ```{.sub} - Requirements = (Target.HasCHTCStaging == true) - ``` - -[Return to top of page](#data-transfer-solutions-by-file-size) - - -# Managing `/staging` Data and Quotas - -Use the command `get_quotas` to see what disk -and items quotas are currently set for a given directory path. 
-This command will also let you see how much disk is in use and how many -items are present in a directory: - -``` -[username@transfer ~]$ get_quotas /staging/username -``` -{:.term} - -[Return to top of page](#data-transfer-solutions-by-file-size) diff --git a/_uw-research-computing/file-avail-largedata.md b/_uw-research-computing/file-avail-largedata.md index 3faa1cf9..845b5660 100644 --- a/_uw-research-computing/file-avail-largedata.md +++ b/_uw-research-computing/file-avail-largedata.md @@ -21,17 +21,25 @@ familiar with:** 1. Using the command-line to: navigate directories, create/edit/copy/move/delete files and directories, and run intended programs (aka "executables"). -2. CHTC's [Intro to Running HTCondor Jobs](helloworld.html) -3. CHTC's guide for [Typical File Transfer](file-availability.html) +2. CHTC's [Intro to Running HTCondor Jobs](htcondor-job-submission) +3. CHTC's guide for [Typical File Transfer](htc-job-file-transfer) {% capture content %} -1. [Policies and Intended Use](#1-policies-and-intended-use) -2. [Staging Large Data](#2-staging-large-data) -3. [Using Staged Files in a Job](#3-using-staged-files-in-a-job) - * [Accessing Large Input Files](#a-accessing-large-input-files) - * [Moving Large Output Files](#b-moving-large-output-files) -4. [Submit Jobs Using Staged Data](#4-submit-jobs-using-staged-data) -5. [Checking your Quota, Data Use, and File Counts](#5-checking-your-quota-data-use-and-file-counts) +- [1. Policies and Intended Use](#1-policies-and-intended-use) + * [A. Intended Use](#a-intended-use) + * [B. Access to Large Data Staging](#b-access-to-large-data-staging) + * [C. User Data Management Responsibilities](#c-user-data-management-responsibilities) + * [D. Data Access Within Jobs](#d-data-access-within-jobs) +- [2. Staging Large Data](#2-staging-large-data) + * [A. Get a Directory](#a-get-a-directory) + * [B. Reduce File Counts](#b-reduce-file-counts) + * [C. Use the Transfer Server](#c-use-the-transfer-server) + * [D. 
Remove Files After Jobs Complete](#d-remove-files-after-jobs-complete)
+- [3. Using Staged Files in a Job](#3-using-staged-files-in-a-job)
+  * [A. Transferring Large Input Files](#a-transferring-large-input-files)
+  * [B. Transferring Large Output Files](#b-transferring-large-output-files)
+- [4. Submit Jobs Using Staged Data](#4-submit-jobs-using-staged-data)
+- [5. Related Pages](#5-related-pages)
 {% endcapture %}
 {% include /components/directory.html title="Table of Contents" %}
@@ -47,13 +55,11 @@ familiar with:**

 Our large data staging location is only for input and output files that
 are individually too large to be managed by our other data movement
-methods, HTCondor file transfer or SQUID. This includes individual input files
+methods, HTCondor file transfer. This includes individual input files
 greater than 100MB and individual output files greater than 3-4GB.

 Users are expected to abide by this intended use expectation and follow the
-instructions for using `/staging` written in this guide (e.g. files placed
-in `/staging `should NEVER be listed in the submit file, but rather accessed
-via the job's executable (aka .sh) script).
+instructions for using `/staging` written in this guide.

 ## B. Access to Large Data Staging

@@ -82,11 +88,11 @@ location (or any CHTC file system) at any time.

 ## D. Data Access Within Jobs

-	Staged large data will
+Staged large data will
 be available only within the CHTC pool, on a subset of our total
 capacity.

-Staged data are owned by the user, and only the user's own
+Staged data are owned by the user and only the user's own
 jobs can access these files (unless the user specifically modifies unix
 file permissions to make certain files available for other users).

@@ -156,100 +162,27 @@ back at a later date. Files can be taken off of `/staging` using similar
 mechanisms as uploaded files (as above).

 # 3. Using Staged Files in a Job
-
-As shown above, the staging directory for large data is `/staging/username`.
-All interaction with files in this location should occur within your job's -main executable. - -## A. Accessing Large Input Files - -To use large data placed in the `/staging` location, add commands to your -job executable that copy input -from `/staging` into the working directory of the job. Program should then use -files from the working directory, being careful to remove the coiped -files from the working -directory before the completion of the job (so that they're not copied -back to the submit server as perceived output). - -Example, if executable is a shell script: - -``` -#!/bin/bash -# -# First, copy the compressed tar file from /staging into the working directory, -# and un-tar it to reveal your large input file(s) or directories: -cp /staging/username/large_input.tar.gz ./ -tar -xzvf large_input.tar.gz -# -# Command for myprogram, which will use files from the working directory -./myprogram large_input.txt myoutput.txt -# -# Before the script exits, make sure to remove the file(s) from the working directory -rm large_input.tar.gz large_input.txt -# -# END -``` -{: .file} - - -## B. Moving Large Output Files - -If jobs produce large (more than 3-4GB) output files, have -your executable write the output file(s) to a location within -the working directory, and then make sure to move this large file to -the `/staging` folder, so that it's not transferred back to the home directory, as -all other "new" files in the working directory will be. - -Example, if executable is a shell script: +## A. Transferring Large Input Files +Staged files should be specified in the job submit file using the `osdf:///` or `file:///` syntax, +depending on the size of the files to be transferred. [See this table for more information](htc-job-file-transfer#transferring-data-to-jobs-with-transfer_input_files). 
```
-#!/bin/bash
-#
-# Command to save output to the working directory:
-./myprogram myinput.txt output_dir/
-#
-# Tar and mv output to staging, then delete from the job working directory:
-tar -czvf large_output.tar.gz output_dir/ other_large_files.txt
-mv large_output.tar.gz /staging/username/
-rm other_large_files.txt
-#
-# END
+transfer_input_files = osdf:///chtc/staging/username/file1, file:///staging/username/file2, file3
```
-{: .file}
+{:.sub}

-## C. Handling Standard Output (if needed)

-In some instances, your software may produce very large standard output
-(what would typically be output to the command screen, if you ran the
-command for yourself, instead of having HTCondor do it). Because such
-standard output from your software will usually be captured by HTCondor
-in the submit file "output" file, this "output" file WILL still be
-transferred by HTCondor back to your home directory on the submit
-server, which may be very bad for you and others, if that captured
-standard output is very large.
+## B. Transferring Large Output Files

-In these cases, it is useful to redirect the standard output of commands
-in your executable to a file in the working directory, and then move it
-into `/staging` at the end of the job.
+By default, any new or changed files in the job's top-level directory are transferred back to your submit directory on `/home`. If you have large output files, this is undesirable.
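The default rule can be hard to visualize. The following is a rough, self-contained sketch (plain shell, not HTCondor code; all filenames are hypothetical) of which files the default output transfer would pick up:

```shell
#!/bin/bash
# Rough illustration only -- this is NOT HTCondor code, and the filenames
# are hypothetical. It mimics the default rule: new or modified files in
# the job's top-level working directory come back; subdirectories do not.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

touch .job_start                    # stand-in for the moment the job starts
sleep 1                             # so later files get a strictly newer mtime
echo "summary"  > results.txt       # top level: would be transferred back
mkdir -p data
echo "10GB ..." > data/large.dat    # subdirectory: would NOT be transferred

# Approximate the default rule: top-level files newer than the job start
find . -maxdepth 1 -type f -newer .job_start
```

Running this lists only `./results.txt`: the file in `data/` would stay on the execute node, which is why large outputs must be moved to the top level, tarred, or remapped to `/staging`.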
-Example, if "`myprogram`" produces very large standard output, and is
-run from a script (bash) executable:
+Large outputs should be transferred to staging using the same file transfer protocols, specified in HTCondor's `transfer_output_remaps` option:

```
-#!/bin/bash
-#
-# script to run myprogram,
-#
-# redirecting large standard output to a file in the working directory:
-./myprogram myinput.txt myoutput.txt > large_std.out
-#
-# tar and move large files to staging so they're not copied to the submit server:
-tar -czvf large_stdout.tar.gz large_std.out
-cp large_stdout.tar.gz /staging/username/subdirectory
-rm large_std.out large_stdout.tar.gz
-# END
+transfer_output_files = file1, file2
+transfer_output_remaps = "file1 = osdf:///chtc/staging/username/file1; file2 = file:///staging/username/file2"
```
-{: .file}
+{:.sub}

 # 4. Submit Jobs Using Staged Data

@@ -260,13 +193,9 @@ In order to properly submit jobs using staged large data, always do the followin

 In your submit file:

-- **No large data in the submit file**: Do NOT list any `/staging` files in any of the submit file
- lines, including: `executable, log, output, error, transfer_input_files`. Rather, your
- job's ENTIRE interaction with files in `/staging` needs to occur
- WITHIN each job's executable, when it runs within the job (as shown [above](#3-using-staged-files-in-a-job))
- **Request sufficient disk space**: Using `request_disk`, request an amount of disk
-space that reflects the total of a) input data that each job will copy into
- the job working directory from `/staging,` and b) any output that
+space that reflects the total of (a) input data that each job will copy into
+ the job working directory from `/staging`, and (b) any output that
 will be created in the job working directory.
- **Require access to `/staging`**: Include the CHTC specific attribute that requires servers with access to `/staging` @@ -284,8 +213,7 @@ log = myprogram.log output = $(Cluster).out error = $(Cluster).err -## Do NOT list the large data files here -transfer_input_files = myprogram +transfer_input_files = osdf:///chtc/staging/username/myprogram, file:///staging/username/largedata.tar.gz # IMPORTANT! Require execute servers that can access /staging Requirements = (Target.HasCHTCStaging == true) @@ -295,32 +223,8 @@ Requirements = (Target.HasCHTCStaging == true) queue ``` -> **Note: in no way should files on `/staging` be specified in the submit file, -> directly or indirectly!** For example, do not use the `initialdir` option ( -> [Submitting Multiple Jobs in Individual Directories](multiple-job-dirs.html)) -> to specify a directory on `/staging`. - -# 5. Checking your Quota, Data Use, and File Counts - -You can use the command `get_quotas` to see what disk -and items quotas are currently set for a given directory path. -This command will also let you see how much disk is in use and how many -items are present in a directory: - -``` -[username@transfer ~]$ get_quotas /staging/username -``` -{:.term} - -Alternatively, the `ncdu` command can also be used to see how many -files and directories are contained in a given path: - -``` -[username@transfer ~]$ ncdu /staging/username -``` -{:.term} +# 5. Related Pages -When `ncdu` has finished running, the output will give you a total file -count and allow you to navigate between subdirectories for even more -details. Type `q` when you\'re ready to exit the output viewer. 
More -info here: +* [Data Storage Locations on the HTC](htc-job-file-transfer) +* [Check Disk Quota and Usage](check-quota) +* [Request a /staging directory or quota change](quota-request) \ No newline at end of file diff --git a/_uw-research-computing/htc-job-file-transfer.md b/_uw-research-computing/htc-job-file-transfer.md new file mode 100644 index 00000000..b490cdcd --- /dev/null +++ b/_uw-research-computing/htc-job-file-transfer.md @@ -0,0 +1,102 @@ +--- +highlighter: none +layout: file_avail +title: Using and transferring data in jobs on the HTC system +guide: + order: 1 + category: Handling Data in Jobs + tag: + - htc +--- + +{% capture content %} +- [Data Storage Locations](#data-storage-locations) + * [/home](#home) + * [/staging](#staging) +- [Transferring Data to Jobs with `transfer_input_files`](#transferring-data-to-jobs-with-transfer_input_files) + * [Important Note: File Transfers and Caching with `osdf:///`](#important-note-file-transfers-and-caching-with-osdf) +- [Transferring Data Back from Jobs to `/home` or `/staging`](#transferring-data-back-from-jobs-to-home-or-staging) + * [Default Behavior for Transferring Output Files](#default-behavior-for-transferring-output-files) + * [Specify Which Output Files to Transfer with `transfer_output_files` and `transfer_output_remaps`](#specify-which-output-files-to-transfer-with-transfer_output_files-and-transfer_output_remaps) +- [Related pages](#related-pages) +{% endcapture %} +{% include /components/directory.html title="Table of Contents" %} + +## Data Storage Locations +The HTC system has two primary locations where users can place their files: +### /home +* The default location for files and job submission +* Efficiently handles many files +* Smaller input files (<100 MB) should be placed here + +### /staging +* Expandable storage system but cannot efficiently handle many small (few MB or less) files +* Larger input files (>100 MB) should be placed here, including container images (.sif) + +The data 
management mechanisms behind `/home` and `/staging` are different and are optimized to handle different file sizes and numbers of files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem.
+
+> If you need a `/staging` directory, [request one here](quota-request).
+
+
+## Transferring Data to Jobs with `transfer_input_files`
+
+In the HTCondor submit file, `transfer_input_files` should always be used to tell HTCondor what files to transfer to each job, regardless of whether that file originates from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your job will change depending on the file size.
+
+| Input Sizes | File Location | Submit File Syntax to Transfer to Jobs |
+| ----------- | ----------- | ----------- |
+| 0 - 100 MB | `/home` | `transfer_input_files = input.txt` |
+| 100 MB - 30 GB | `/staging` | `transfer_input_files = osdf:///chtc/staging/NetID/input.txt` |
+| > 100 MB - 100 GB | `/staging/groups` | `transfer_input_files = file:///staging/NetID/input.txt` |
+| > 30 GB | `/staging` | `transfer_input_files = file:///staging/NetID/input.txt` |
+| > 100 GB | | For larger datasets (100GB+ per job), contact the facilitation team about the best strategy to stage your data |
+
+Multiple input files and file transfer protocols can be specified and delimited by commas, as shown below:
+
+```
+# My job submit file
+
+transfer_input_files = file1, osdf:///chtc/staging/username/file2, file:///staging/username/file3, dir1, dir2/
+
+... other submit file details ...
+```
+{:.sub}
+
+Ensure you are using the correct file transfer protocol for efficiency. Failure to use the right protocol can result in slow file transfers or overloading the system.
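To see which row of the table applies before writing a submit file, you can check a file's size from the command line (for example with `ls -lh` or `du -h`). The helper below is a hypothetical sketch, not a CHTC-provided tool, applying the 100 MB and 30 GB thresholds from the table; `NetID` is a placeholder:

```shell
#!/bin/bash
# Hypothetical helper (not a CHTC tool): suggest a transfer_input_files
# syntax from a file's size, using the thresholds in the table above.
# "NetID" is a placeholder for your actual username.
suggest_transfer_syntax() {
    local file="$1" bytes
    # GNU stat first, BSD stat as a fallback
    bytes=$(stat -c %s "$file" 2>/dev/null || stat -f %z "$file")
    if [ "$bytes" -lt $((100 * 1024 * 1024)) ]; then
        echo "place in /home; transfer_input_files = $(basename "$file")"
    elif [ "$bytes" -lt $((30 * 1024 * 1024 * 1024)) ]; then
        echo "place in /staging; transfer_input_files = osdf:///chtc/staging/NetID/$(basename "$file")"
    else
        echo "place in /staging; transfer_input_files = file:///staging/NetID/$(basename "$file")"
    fi
}

# Example: a 1 MB scratch file falls in the first row of the table
tmp=$(mktemp)
head -c 1048576 /dev/zero > "$tmp"
suggest_transfer_syntax "$tmp"
rm -f "$tmp"
```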
+
+### Important Note: File Transfers and Caching with `osdf:///`
+The `osdf:///` file transfer protocol uses a [caching](https://en.wikipedia.org/wiki/Cache_(computing)) mechanism for input files to reduce file transfers over the network. This can affect users who refer to input files that are frequently modified.
+
+*If you are changing the contents of the input files frequently, you should rename the file or change its path to ensure the new version is transferred.*
+
+## Transferring Data Back from Jobs to `/home` or `/staging`
+
+### Default Behavior for Transferring Output Files
+When a job completes, by default, HTCondor will return **newly created or edited files only in the top-level directory** back to your `/home` directory. **Files in subdirectories are *not* transferred.** Ensure that the files you want are in the top-level directory by moving them or [creating tarballs](transfer-files-computer#c-transferring-multiple-files).
+
+### Specify Which Output Files to Transfer with `transfer_output_files` and `transfer_output_remaps`
+To transfer only *specific files* rather than everything eligible, use the following in your HTCondor submit file:
+```
+transfer_output_files = file1.txt, file2.txt, file3.txt
+```
+{:.sub}
+
+To transfer a file or folder back to `/staging`, you will need an additional line in your HTCondor submit file:
+```
+transfer_output_remaps = "file1.txt = file:///staging/NetID/output1.txt; file2.txt = /home/NetID/outputs/output2.txt"
+```
+{:.sub}
+
+In the example above, `file1.txt` is remapped to the staging directory using the `file:///` transfer protocol and simultaneously renamed `output1.txt`. In addition, `file2.txt` is renamed to `output2.txt` and will be transferred to a different directory on `/home`. Ensure you have the right file transfer syntax (`osdf:///` or `file:///`, depending on the anticipated file size).
+ +If you have multiple files or folders to transfer back to `/staging`, use a semicolon (;) to separate each object: +``` +transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt; output2.txt = file:///staging/NetID/output2.txt" +``` +{:.sub} + +Make sure to only include one set of quotation marks that wraps around the information you are feeding to `transfer_output_remaps`. + +## Related pages +- [Managing Large Data in HTC Jobs](/uw-research-computing/file-avail-largedata) +- [Transfer files between CHTC and your computer](/uw-research-computing/transfer-files-computer) +- [Transfer files between CHTC and ResearchDrive](/uw-research-computing/transfer-data-researchdrive) \ No newline at end of file diff --git a/_uw-research-computing/osdf-fileXfer-draft.md b/_uw-research-computing/osdf-fileXfer-draft.md deleted file mode 100644 index a8025ee6..00000000 --- a/_uw-research-computing/osdf-fileXfer-draft.md +++ /dev/null @@ -1,59 +0,0 @@ ---- -highlighter: none -layout: guide -title: HTC Data Storage Locations -guide: - order: 6 - category: FILL IN THIS BLANK - tag: - - htc ---- - -[toc] - -# Data Storage Locations -The HTC system has two primary locations where users can store files: `/home` and `/staging`. - -The mechanisms behind `/home` and `/staging` that manage data are different and are optimized to handle different file sizes. `/home` is more efficient at managing small files, while `/staging` is more efficient at managing larger files. It's important to place your files in the correct location, as it will improve the speed and efficiency at which your data is handled and will help maintain the stability of the HTC filesystem. - - -# Understand your file sizes -To know whether a file should be placed in `/home` or in `/staging`, you will need to know it's file size (also known as the amount of "disk space" a file uses). 
There are many commands to print out your file sizes, but here are a few of our favorite: - -## Use `ls` with `-lh` flags -The command `ls` stands for "list" and, by default, lists the files in your current directory. The flag `-l` stands for "long" and `-h` stands for "human-readable". When the flags are combined and passed to the `ls` command, it prints out the long metadata associated with the files and converts values such as file sizes into human-readable formats (instead of a computer readable format). - -``` -NetID@submit$ ls -lh -``` - -## Use `du -h` -Similar to `ls -lh`, `du -h` prints out the "disk usage" of directories in a human-readable format. - -``` -NetID@submit$ du -h -``` - - -# Transferring Data to Jobs -The HTCondor submit file `transfer_input_files =` line should always be used to tell HTCondor what files to transfer to each job, regardless of if that file is origionating from your `/home` or `/staging` directory. However, the syntax you use to tell HTCondor to fetch files from `/home` and `/staging` and transfer to your running job will change: - - -| Input Sizes | File Location | Submit File Syntax to Transfer to Jobs | -| ----------- | ----------- | ----------- | ----------- | -| 0-500 MB | /home | transfer_input_files = input.txt | -| 500-10GB | /staging | transfer_input_files = **osdf:///chtc/staging/NetID/input.txt | -| 10GB + | /staging | transfer_input_files = **file:///staging/NetID/input.txt | - - -## Transfer Data Back from Jobs to `/home` or `/staging` - -When a job completes, by default, HTCondor will return newly created or edited files on the top level directory back to your `/home` directory. - -To transfer files or folders back to `/staging`, in your HTCondor submit file, use -transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt", where `output1.txt` is the name of the output file or folder you would like transfered back to a /staging directory. 
- -If you have more than one file or folder to transfer back to `/staging`, use a semicolon (;) to seperate multiple files for HTCondor to transfer back like so: -transfer_output_remaps = "output1.txt = file:///staging/NetID/output1.txt; output2.txt = file:///staging/NetID/output2.txt" - -Make sure to only include one set of quotation marks that wraps around the information you are feeding to `transfer_output_remaps =`.