Command-line parameters

POST-INSTALL DATA AND CONFIGURATION

  • --data
    Creates the uorf4u_data folder in the current working directory. The folder contains adjustable configuration file templates, palettes, and tables, as well as the necessary sample.

  • --linux
    Linux users should run this once after installation.
    Replaces the tool paths in the premade config files from the macOS versions [default] with the Linux ones.

  • --blastp_path
    Updates the blastp path in the premade config files.
    Required for using local blastp databases with the -lbdb parameter.

MANDATORY ARGUMENTS

  • -an accession_number
    Protein's RefSeq accession number.

    OR

  • -hl accession_number1 [accession_number2, ...]
    Space-separated list of protein accession numbers to be used as the list of homologues.

    OR

  • -hlf file.txt
    Path to a file with a list of accession numbers. File format: one accession number per line, no header.

    OR

  • -fa file.fa
    Path to a fasta file with upstream sequences.

  • -c bacteria|eukaryotes|file.cfg
    Name of a premade configuration (bacteria or eukaryotes) or path to a configuration file [default: internal].
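
The input options above are alternatives joined by OR, so a wrapper script can check that exactly one of them is set before launching a run. The following is a minimal sketch, not part of uorf4u: it assumes the executable is invoked as `uorf4u`, the helper name `build_uorf4u_command` is hypothetical, and the accession values are placeholders. Whether uorf4u itself rejects multiple input sources is not stated in this section.

```python
import shlex


def build_uorf4u_command(an=None, hl=None, hlf=None, fa=None, config="bacteria"):
    """Assemble an argument list with exactly one input source plus -c.

    `hl` must be a list of accession numbers; the other inputs are single strings.
    """
    sources = {"-an": an, "-hl": hl, "-hlf": hlf, "-fa": fa}
    given = [(flag, value) for flag, value in sources.items() if value]
    if len(given) != 1:
        raise ValueError("provide exactly one of -an, -hl, -hlf, -fa")
    flag, value = given[0]
    # -hl takes a space-separated list; the other flags take a single value.
    values = list(value) if flag == "-hl" else [value]
    # Assumption: the command name is "uorf4u".
    return ["uorf4u", flag, *values, "-c", config]


# Placeholder accession numbers, for illustration only.
cmd = build_uorf4u_command(hl=["ACC_1", "ACC_2"], config="bacteria")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting problems with file paths.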

OPTIONAL ARGUMENTS

  • -bdb refseq_select|refseq_protein
    Online blastp database to search for homologues.
    [default: from config; refseq_select for bacteria, refseq_protein for eukaryotes]

  • -lbdb path to a database
    Local blastp database to search for homologues.
    Note: you must set the path to blastp with the --blastp_path option before using this argument.

  • -bh number_of_hits
    Maximum number of blastp hits in the homologue search.

  • -bid identity_cutoff [0-1]
    Identity cutoff for blastp hits relative to your query protein.

  • -mna number_of_assemblies
    Maximum number of assemblies to include in the analysis for each protein. If the identical protein database holds more than this number, random sampling is used.

  • -al path_to/assemblies_list.tsv
    Path to an assemblies list file. During each run of uorf4u, a tsv table with information about the assemblies (from the NCBI identical protein database) for each protein is saved to your output folder (output_dir_name/assemblies_list.tsv). A single protein accession number can correspond to multiple assemblies (up to thousands). To control which assemblies are included in the analysis, this table can be filtered (simply by removing rows) and then passed with this parameter as part of the input to the next run.
    In addition, the config file (see the config parameters section) has a max_number_of_assemblies parameter, which limits the number of assemblies included in the analysis. If the number of assemblies exceeds the cutoff, random sampling is used to take only a subset of them.

  • -annot
    Retrieve sequence annotations (to make sure that annotated uORFs do not overlap a known CDS).

  • -ul length
    Length of upstream sequences to retrieve.

  • -dl length
    Length of downstream sequences to retrieve.

  • -asc
    Include alternative start codons in the uORF annotation step. The list of alternative start codons is taken from the NCBI genetic code.

  • -nsd
    Deactivate filtering of ORFs by the presence of a Shine-Dalgarno (SD) sequence. [default: True for 'prokaryotes' config and False for 'eukaryotes' config].

  • -at aa|nt
    Alignment type used by uorf4u for conserved ORFs searching [default: aa].

  • -pc cutoff [0-1]
    Presence cutoff (number of ORFs in a set / number of sequences) for an ORF set to be called conserved and returned [default: 0.4, set in the config].

  • -fast
    Faster but less accurate search mode, intended for large inputs (more than ~300 sequences or ~2000 ORFs).

  • -o dirname
    Output directory name. It will be created if it does not exist. All output directories will then be created inside this folder [default: uorf4u_{current_date}; e.g. uorf4u_2022_07_25-20_41].
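
The -al workflow described above (filter assemblies_list.tsv by removing rows, then pass the result to the next run) can be scripted. The sketch below is illustrative only: the helper `filter_assemblies`, the column names `protein_id` and `assembly`, and the presence of a header row are assumptions, not the documented layout of the file. Check the columns of your own assemblies_list.tsv before adapting it.

```python
import csv
import os
import tempfile


def filter_assemblies(src_path, dst_path, keep):
    """Copy a TSV table, keeping the header and only rows where keep(row) is true.

    Assumes the table has a header row; returns the number of rows kept.
    """
    kept = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            if keep(row):
                writer.writerow(row)
                kept += 1
    return kept


# Demo on a made-up table; the column names and values are hypothetical.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "assemblies_list.tsv")
dst = os.path.join(workdir, "assemblies_list_filtered.tsv")
with open(src, "w", newline="") as handle:
    handle.write("protein_id\tassembly\nPROT_A\tASM_1\nPROT_A\tASM_2\nPROT_B\tASM_3\n")

# Drop one unwanted assembly and keep everything else.
kept = filter_assemblies(src, dst, lambda row: row["assembly"] != "ASM_2")
print(kept, "rows kept")
```

The filtered file can then be supplied to the next run with -al, alongside the same input arguments as before.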

MISCELLANEOUS ARGUMENTS

  • -h, --help
    Show help message and exit.

  • -v, --version
    Show program version.

  • --debug
    Provide detailed stack trace for debugging purposes.

  • --quiet
    Don't show progress messages.