This is not intended to be a comprehensive review of Bioconductor packages - there are too many of them. These are my personal notes.
First of all, I must declare a love-hate relationship with many Bioconductor packages. On one hand, they are very useful for specific purposes. On the other hand, their interfaces often follow a less consistent underlying logic than the tidyverse ecosystem does. Even the authors admit that they sometimes forget which functions their own packages contain (I should link to a Bioconductor support page here, but I am not in the mood to do so).
This post documents common pitfalls when working with bioinformatics data and how to avoid them.
Headers
Case: use janitor::clean_names to standardize column names to snake_case.
Names: use one standardized name per concept:
chr for chromosome, instead of chrom, seqnames, etc. Sometimes you have to change the name to fit a certain piece of software (e.g. GenomicRanges), but only convert the name within the function call itself, and change it back immediately afterwards. Never propagate the name change to the next function, because the dependencies between functions then become a headache to manage.
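To make the naming rule above concrete, here is a minimal sketch in R. The peaks table, its column names and the merge_overlaps helper are all made up for illustration; only clean_names, makeGRangesFromDataFrame, reduce and the dplyr verbs are real functions. The idea is that the seqnames column that GenomicRanges produces never leaves the helper.

```r
suppressPackageStartupMessages({
  library(GenomicRanges)  # attaches the S4 as.data.frame generic for GRanges
  library(dplyr)          # provides %>%; verbs are called with dplyr:: below
  library(janitor)        # clean_names()
})

# Hypothetical input table with inconsistent headers.
peaks <- data.frame(
  Chrom       = c("chr1", "chr1", "chr2"),
  `Start Pos` = c(100L, 150L, 500L),
  `End Pos`   = c(200L, 300L, 650L),
  check.names = FALSE
)

# Standardize once, at the point of import: snake_case first, then house names.
peaks <- peaks %>%
  clean_names() %>%  # chrom, start_pos, end_pos
  dplyr::rename(chr = chrom, start = start_pos, end = end_pos)

# Hypothetical helper: merges overlapping intervals via GenomicRanges.
# The GRanges object and its "seqnames" column exist only inside this function.
merge_overlaps <- function(df) {
  gr <- makeGRangesFromDataFrame(df, seqnames.field = "chr")
  merged <- as.data.frame(reduce(gr))  # columns: seqnames, start, end, width, strand
  merged %>%
    dplyr::rename(chr = seqnames) %>%          # change the name back immediately
    dplyr::mutate(chr = as.character(chr)) %>% # seqnames comes back as a factor
    dplyr::select(chr, start, end)
}

merged <- merge_overlaps(peaks)
print(merged)
```

The dplyr:: prefixes are deliberate: attaching GenomicRanges also attaches packages that export their own rename, so being explicit avoids depending on the order in which the libraries were loaded.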
Recently I have been tidying up data for my research projects at NUS. Dealing with a few TBs of data in one day made me slightly paranoid about the integrity of the data: where it should be stored, which archiving and compression protocol to use, which local/remote file transfer algorithms to use, and even which medium to use, i.e. whether the data should be transferred via USB or Ethernet.
I am writing this post not as a guideline, but mainly for self-reference and hopefully a prompt for discussion.
The boom of bioinformatics in recent years has gone hand in hand with cheaper technologies and, consequently, a surge in the amount of data available. The rapid development of the field is itself an anti-establishment movement: even the most experienced bioinformaticians must spend a significant amount of time keeping up with new resources and toolkits.