Data Science

String processing in R

Case: snakecase::to_any_case() helps fix the case of words, which is useful for converting between presentation data and processing data. Regex: The regex in R is not 100% Perl-flavored. For example, the escape character in an R string is \\ instead of \. I love the [{rex}](https://github.com/kevinushey/rex) package. The most salient problem this package solves is the interpretation of regex. When not to use regex: Not all strings are interpreted as a regex. Sometimes, one needs to opt in via a parameter in the function.
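
As a small illustration of the escaping and opt-in points, here is a base R sketch. The example strings are made up, and `grepl()` and `startsWith()` are only two of many functions that differ in whether a pattern is read as a regex.

```r
# Made-up example strings: one contains a literal dot, one does not.
x <- c("a.b", "axb")

# An unescaped dot is a regex wildcard, so it matches both strings.
any_char <- grepl(".", x)

# To match a literal dot, the backslash itself must be escaped in the R string.
literal_dot <- grepl("\\.", x)

# fixed = TRUE turns regex interpretation off, so "." is taken literally.
fixed_dot <- grepl(".", x, fixed = TRUE)

# startsWith() never interprets its prefix as a regex.
prefix_dot <- startsWith(c(".b", "xb"), ".")

print(any_char)     # TRUE TRUE
print(literal_dot)  # TRUE FALSE
print(fixed_dot)    # TRUE FALSE
print(prefix_dot)   # TRUE FALSE
```

Checking the documentation of each function for arguments such as `fixed` or `perl` is usually the quickest way to find out how a pattern will be interpreted.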

The bright side of plots (R plot notes)

Contents: Font; Font size; Font format (e.g. subscripts); Font family; Combine plots; Helpful packages: Colours; 3D rendering; Rough plots. It will be useful to consult the Rmarkdown notes because I often use Rmarkdown to render ggplot. Font. Font size: Global adjustment (e.g. the default font size is small when rendering with Rmarkdown with ) theme’s basic size. See individual adjustment here and also in this RStudio2021 conference talk and this

Working with NA in R

NAs are necessary markers for missing data. However, working with them can be tricky because of their special properties. Care should also be taken when reading in and presenting the data. Properties of NA. Types: There are different types of NA, denoted by the NA_* constants (e.g. NA_character_). This should be noted when working with NA data in a data.frame. Operations like case_when require all output data to be of the same type.
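
The typed NA constants can be inspected directly in base R. The sketch below uses a made-up `scores` vector and builds a character column with `NA_character_`, so that every value has the same type.

```r
# Each NA constant carries its own type; plain NA is logical.
na_types <- c(
  plain     = typeof(NA),
  integer   = typeof(NA_integer_),
  double    = typeof(NA_real_),
  character = typeof(NA_character_)
)
print(na_types)

# Comparisons with NA return NA, so use is.na() to test for missingness.
comparison <- NA == NA
missing_flags <- is.na(c(1, NA, 3))

# Made-up data: label scores while keeping the result a character vector.
scores <- c(10, NA, 30)
labels <- rep(NA_character_, length(scores))
labels[!is.na(scores) & scores > 20] <- "high"
labels[!is.na(scores) & scores <= 20] <- "low"
print(labels)
```

Starting from `NA_character_` rather than plain `NA` makes the intended output type explicit, which is the same idea as supplying a typed NA to a function that checks output types.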

Learning functional programming in R

Why use functional programming. Avoid intermediate objects: In a loop, the standard practice is to create a new list before the loop, process each element of the input list inside the loop, and then add each result to the new list at the same index. It makes programming more fun: Thinking about that index i is simply not as fun as working with the whole list.
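
The contrast between the loop pattern and the functional style can be sketched in base R. The `values` list is made up for illustration, and `lapply()` stands in for whichever mapping function is preferred (the purrr package offers similar helpers).

```r
# Made-up input: a named list of integer vectors.
values <- list(a = 1:3, b = 4:6, c = 7:9)

# Loop version: pre-allocate a result list, then fill it index by index.
loop_result <- vector("list", length(values))
for (i in seq_along(values)) {
  loop_result[[i]] <- sum(values[[i]])
}
names(loop_result) <- names(values)

# Functional version: no index and no pre-allocated intermediate object.
functional_result <- lapply(values, sum)

# vapply() additionally checks that each result is a single integer.
totals <- vapply(values, sum, integer(1))

print(identical(loop_result, functional_result))
print(totals)
```

Both versions produce the same list; the functional one simply leaves the bookkeeping of indices and names to `lapply()`.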

Notes on Bioconductor packages

This is not intended to be a comprehensive review of Bioconductor packages - there are too many of them. These are my personal notes. First of all, I must declare a love-hate relationship with many Bioconductor packages. On one hand, they are very useful for specific purposes. On the other hand, these packages often have less underlying logic than the tidyverse ecosystem. Even the authors admit that they sometimes forget which functions are in their packages (I should link here to a Bioconductor support page, but I am not in the mood to do so).

Research tools

Semantic Scholar: An NLP-powered “PubMed” that generates quick summaries for articles. Scite: For each article, it indicates the nature of the citation, i.e. approving, neutral, or disproving. Meta: A research feed generator. Others: Connected Papers and CORE. Both are available on arXiv and show relationships between papers. Connected Papers shows a map.

Caveats when working with bioinformatics data

This documents the common pitfalls when working with bioinformatics data and how to prevent them. Headers. Case: use janitor::clean_names to standardize names to snake case. Names: use a standardized name, e.g. chr for chromosome instead of chrom, seqnames etc. Sometimes you have to change the name to fit a certain software (e.g. GenomicRanges), but only convert the name within the call of the function itself, and immediately change it back. Never propagate the name change to the next function because it will then be a headache to deal with the dependencies between functions.
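
One way to keep a name change local to a single call is a small wrapper, sketched here in base R. Everything in it is hypothetical: `needs_seqnames()` is a stand-in for a tool that insists on a `seqnames` column, and `rename_col()` and `with_seqnames()` are illustrative helpers, not functions from any package.

```r
# Hypothetical stand-in for software that requires a `seqnames` column.
needs_seqnames <- function(df) {
  stopifnot("seqnames" %in% names(df))
  df[df$seqnames == "chr1", , drop = FALSE]
}

# Rename one column and leave the others untouched.
rename_col <- function(df, from, to) {
  names(df)[names(df) == from] <- to
  df
}

# Convert chr -> seqnames only for the call, then convert straight back.
with_seqnames <- function(df, f) {
  out <- f(rename_col(df, "chr", "seqnames"))
  rename_col(out, "seqnames", "chr")
}

# Made-up regions using the standardized `chr` column name.
regions <- data.frame(
  chr   = c("chr1", "chr2", "chr1"),
  start = c(100L, 200L, 300L),
  end   = c(150L, 250L, 350L)
)

result <- with_seqnames(regions, needs_seqnames)
print(names(result))
```

Because the rename happens inside the wrapper, downstream functions only ever see the standardized `chr` column.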

How to split a string column by length

Intro: This documents how I split a string column by length and combine the pieces in a directory format (which was a necessary step for me to check whether each directory existed in my analysis).

    library(tidyverse)
    data <- tibble(string = c("123456", "987654"))
    print(data)
    ## # A tibble: 2 x 1
    ##   string
    ##   <chr>
    ## 1 123456
    ## 2 987654

Step 1: strsplit splits the string into a list of strings, and in a tibble it will show up as a column of list type.
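
For comparison, the overall idea can be sketched in base R with `substring()` instead of `strsplit()`. The chunk width of 2 and the `/` separator are assumptions made for illustration, and `split_by_length()` is a hypothetical helper that expects non-empty strings.

```r
# Hypothetical helper: cut one non-empty string into chunks of `width` characters.
split_by_length <- function(x, width) {
  starts <- seq(1L, nchar(x), by = width)
  ends <- pmin(starts + width - 1L, nchar(x))
  substring(x, starts, ends)
}

strings <- c("123456", "987654")

# One character vector of chunks per input string.
chunks <- lapply(strings, split_by_length, width = 2L)

# Join the chunks of each string into a directory-style path.
paths <- vapply(chunks, paste, character(1), collapse = "/")

print(paths)
```

The resulting paths could then be passed to `dir.exists()` to check which directories are present.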

How I would learn programming in 7 days

This will be part of a series of articles on learning programming and data science. There are many articles on this topic already, but these are for my friends. This post focuses on learning programming. Most data scientists use Python and R. Between the two, I think Python is the more programming-oriented language. The types of objects are more straightforward, the syntax is easier, and the object-oriented approach is clearer.

Should I reflect daily or hourly

This is a fun and quick Christmas project and a reflection on whether I overthink too much. “Compound interest is the eighth wonder of the world. He who understands it, earns it … he who doesn’t … pays it.” - Einstein. Regardless of whether Einstein actually said that, there is no doubt that even small incremental improvements make a large difference in the long term. Here I decide to apply the model to personal growth.
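
The compound interest model can be sketched in a few lines of R. The numbers are assumptions chosen for illustration, not figures from the post: a 1% improvement per day, and a total yearly rate of 100% split across daily or hourly periods. `growth()` is a hypothetical helper implementing the standard formula (1 + rate / periods)^periods.

```r
# Hypothetical helper: growth factor after compounding `rate` over `periods` steps.
growth <- function(rate, periods) {
  (1 + rate / periods)^periods
}

# Assumed scenario 1: improving by 1% every day for a year.
one_percent_daily <- 1.01^365

# Assumed scenario 2: the same total yearly rate, compounded daily vs hourly.
daily <- growth(rate = 1, periods = 365)
hourly <- growth(rate = 1, periods = 365 * 24)

print(one_percent_daily)
print(c(daily = daily, hourly = hourly, limit = exp(1)))
```

Under these assumptions, compounding more often helps only marginally once the total rate is fixed, because the growth factor approaches e as the number of periods increases.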