baseq: An R Package for Basic Sequence Processing in Biological Data
Processing biological sequence data is an essential task in biology. Such data includes DNA and RNA sequences, which carry information about the genetic makeup of organisms. The “baseq” package in R offers several functions for this kind of basic sequence processing.
Functions in baseq:
clean_sequence()
count_bases()
count_seq_pattern()
dna_to_protein()
dna_to_rna()
gc_content()
read_fasta()
reverse_complement()
rna_reverse_complement()
rna_to_dna()
rna_to_protein()
How to install
Using the R console:
install.packages("baseq")
One of the most fundamental tasks when working with DNA sequences is cleaning them. Non-DNA characters can cause issues downstream in analyses, which is where the clean_sequence() function comes in handy. It removes all non-DNA characters from a DNA sequence, leaving only valid DNA characters (A, C, G, T).
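To make the cleaning step concrete, here is a minimal base-R sketch of the behaviour described above. It is not the package's own code: clean_dna_sketch is a hypothetical helper, and converting the input to upper case before filtering is an assumption.

```r
# Base-R sketch of DNA cleaning (illustrative only, not baseq's implementation).
# clean_dna_sketch is a hypothetical name introduced for this example.
clean_dna_sketch <- function(dna) {
  dna <- toupper(dna)          # assumption: lower-case bases are accepted
  gsub("[^ACGT]", "", dna)     # drop every character that is not A, C, G or T
}

raw <- "atg-cNNgt 12\nTTa"
cleaned <- clean_dna_sketch(raw)
print(cleaned)
```

With baseq installed, the same string would be passed to clean_sequence() instead.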
Another critical function in the “baseq” package is count_bases(). It counts the number of occurrences of each nucleotide (A, C, G, and T) in a DNA sequence. This function provides a quick and straightforward way to obtain a summary of the nucleotide composition of a DNA sequence.
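The kind of summary described above can be sketched in base R as follows. The helper name count_bases_sketch and the named-vector return shape are assumptions for illustration; the shape returned by count_bases() itself may differ.

```r
# Base-R sketch of nucleotide counting (illustrative only).
# count_bases_sketch is a hypothetical name; the return shape is an assumption.
count_bases_sketch <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  # Fixed factor levels keep a zero count for any base that is absent
  counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
  setNames(as.integer(counts), names(counts))
}

base_counts <- count_bases_sketch("ATGCGTTTA")
print(base_counts)
```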
Sometimes, it’s necessary to count the occurrence of a particular pattern within a sequence. The count_seq_pattern() function is designed for this purpose. It takes two arguments: the sequence to be searched and the pattern to be counted. The function returns the number of occurrences of the pattern in the sequence.
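A rough base-R equivalent of this two-argument pattern count is sketched below. The helper count_pattern_sketch is hypothetical. It counts non-overlapping literal matches, which is an assumption: the description does not say how count_seq_pattern() treats overlapping matches.

```r
# Base-R sketch of pattern counting (illustrative only).
# count_pattern_sketch is a hypothetical name. Matches are literal and
# non-overlapping, which may or may not match baseq's behaviour.
count_pattern_sketch <- function(dna, pattern) {
  hits <- gregexpr(pattern, dna, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)   # gregexpr returns -1 when nothing matches
}

n_hits    <- count_pattern_sketch("ATGATGCCATG", "ATG")
n_none    <- count_pattern_sketch("ATGATGCCATG", "TTT")
n_overlap <- count_pattern_sketch("AAAA", "AA")   # overlapping windows are not counted
print(c(n_hits, n_none, n_overlap))
```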
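The reverse_complement() function takes a DNA sequence as input and returns its reverse complement: each base is replaced by its pairing partner (A with T, C with G) and the result is read in the opposite direction. Similarly, the rna_reverse_complement() function returns the reverse complement of an RNA sequence, pairing A with U instead of T. These functions are useful when a region of interest lies on the opposite strand from the one stored in the file.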
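The operation itself is short enough to sketch in base R. The helper names below are hypothetical, and the sketch assumes the input contains only the four standard bases.

```r
# Base-R sketch of reverse complementation (illustrative only).
# All helper names here are hypothetical.
reverse_string <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")

revcomp_dna_sketch <- function(dna) {
  reverse_string(chartr("ACGT", "TGCA", toupper(dna)))   # A<->T, C<->G, then reverse
}

revcomp_rna_sketch <- function(rna) {
  reverse_string(chartr("ACGU", "UGCA", toupper(rna)))   # A<->U, C<->G, then reverse
}

dna_rc <- revcomp_dna_sketch("ATGCGT")
rna_rc <- revcomp_rna_sketch("AUGCGU")
print(c(dna_rc, rna_rc))
```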
Transcription is a crucial process in gene expression. The dna_to_rna() function takes a DNA sequence as input and returns its RNA transcript. Conversely, the rna_to_dna() function converts an RNA sequence back into the corresponding DNA sequence.
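As a minimal sketch, the two conversions can be modelled in base R as a simple base substitution. This assumes the input DNA is the coding strand, so that transcription amounts to replacing T with U; the helper names are hypothetical and the package's own handling may differ.

```r
# Base-R sketch of DNA <-> RNA conversion (illustrative only).
# Assumption: the DNA is the coding strand, so T simply becomes U.
dna_to_rna_sketch <- function(dna) chartr("T", "U", toupper(dna))
rna_to_dna_sketch <- function(rna) chartr("U", "T", toupper(rna))

rna  <- dna_to_rna_sketch("ATGCGT")
back <- rna_to_dna_sketch(rna)     # round trip returns the original DNA
print(c(rna, back))
```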
Proteins are essential molecules that carry out a wide range of functions in living organisms. The dna_to_protein() function takes a DNA sequence as input and returns the corresponding protein sequence using the standard genetic code. Because the correct reading frame is usually not known in advance, the function translates all six reading frames (three on the given strand and three on its reverse complement) and outputs the translated protein sequences in a list with the frame number as a prefix. Similarly, the rna_to_protein() function translates an RNA sequence in all six reading frames.
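Six-frame translation can be sketched in base R as shown below. This is an illustration of the idea, not the package's code: the helper names, the "frame1" to "frame6" list names, the use of "*" for stop codons and "X" for unrecognised codons are all assumptions. The codon table is the standard genetic code in the usual T, C, A, G ordering.

```r
# Base-R sketch of six-frame translation (illustrative only).
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)
amino_acids <- "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
genetic_code <- setNames(strsplit(amino_acids, "")[[1]], codons)

# Translate one frame; offset is the 1-based position of the first codon.
translate_frame <- function(dna, offset) {
  n <- nchar(dna)
  if (n - offset + 1 < 3) return("")         # too short for a single codon
  starts <- seq(offset, n - 2, by = 3)
  triplets <- substring(dna, starts, starts + 2)
  aa <- genetic_code[triplets]
  aa[is.na(aa)] <- "X"                       # assumption: unknown codons become X
  paste(aa, collapse = "")
}

# Frames 1-3 read the given strand, frames 4-6 its reverse complement.
six_frame_sketch <- function(dna) {
  dna <- toupper(dna)
  rc <- paste(rev(strsplit(chartr("ACGT", "TGCA", dna), "")[[1]]), collapse = "")
  frames <- c(sapply(1:3, function(i) translate_frame(dna, i)),
              sapply(1:3, function(i) translate_frame(rc, i)))
  setNames(as.list(frames), paste0("frame", 1:6))
}

frames <- six_frame_sketch("ATGGCCTAA")
print(frames)
```

An RNA sequence could be handled by the same sketch after replacing U with T, for example six_frame_sketch(chartr("U", "T", "AUGGCCUAA")).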
The gc_content() function calculates the percentage of G and C nucleotides in a DNA sequence. GC content is a common summary statistic when characterizing a sequence or comparing sequences with one another.
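The calculation is a simple ratio, sketched here in base R. The helper name gc_content_sketch is hypothetical, and using the full sequence length as the denominator is an assumption.

```r
# Base-R sketch of GC content as a percentage (illustrative only).
# Assumption: the denominator is the total number of characters in the sequence.
gc_content_sketch <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  100 * sum(bases %in% c("G", "C")) / length(bases)
}

gc_half  <- gc_content_sketch("GGCCAATT")    # 4 of 8 bases are G or C
gc_third <- gc_content_sketch("ATGCGTTTA")   # 3 of 9 bases are G or C
print(c(gc_half, gc_third))
```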
In addition to these functions, the read_fasta() function reads a file in the FASTA format and returns the sequences and sequence headers as a named list in R. This function is particularly helpful when working with large datasets that are stored in files.
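To show what "a named list of sequences keyed by header" can look like, here is a small base-R FASTA parser. It is a sketch under stated assumptions rather than the package's reader: read_fasta_sketch is a hypothetical name, the leading ">" is stripped from headers, and multi-line sequences are joined into one string.

```r
# Base-R sketch of a FASTA reader (illustrative only, not baseq's read_fasta()).
read_fasta_sketch <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]        # ignore blank lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)               # which record each line belongs to
  headers <- sub("^>", "", lines[is_header])
  # Group sequence lines by record; fixed levels keep records with no sequence
  groups <- split(lines[!is_header],
                  factor(record_id[!is_header], levels = seq_along(headers)))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  setNames(as.list(unname(seqs)), headers)
}

# Write a tiny example file to a temporary location and read it back.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 example", "ATGC", "GGTA", ">seq2", "TTAA"), fasta_path)
records <- read_fasta_sketch(fasta_path)
print(records)
```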
The “baseq” package in R provides several useful functions for basic sequence processing in biological data. These functions can help with cleaning sequences, counting nucleotides, identifying patterns, translating sequences, and calculating GC content. By utilizing these functions, researchers can perform various analyses of sequence data and gain insights into the genetic makeup of organisms.
baseq is written and published by Ambu Vijayan
Official CRAN Repository link: baseq in CRAN
Download CRAN Source file: baseq source
Official GitHub Repository link: baseq in GitHub
Download latest version from GitHub: Download baseq