Trial, error, and the help of a like-minded community…

The tools necessary to become a great data scientist are openly accessible online. They have been built by contributions from an active community of millions of people working to understand a variety of abstract questions. Accessibility is not the same as approachability, however, and many people approaching data science get stuck at the barrier of “where to even start?” In many ways, the sheer number of resources available on the internet has made approaching data science daunting. This is unfortunate, because opportunities in data science careers abound in the twenty-first century.

These posts are meant to cover how I approach data science research. I’ve been privileged to have a career that has allowed me to approach the challenge of learning data science at my own pace, with many trials and errors. There are many great tutorials online, covering many topics, and all freely accessible. As a differentiator, I hope these posts can provide accessibility and approachability to data science by:

  1. Focusing on the scientific and critical mentality necessary to approach new questions.
  2. Outlining not only how I perform each step in an analysis, but the reasoning behind why.
  3. Using real-world examples on complex biological datasets.
  4. Not only properly analyzing results, but placing them in perspective of the larger question being addressed.

With these posts I hope to more actively join the community of people making data science accessible online through examples, while also detailing my approach, the resources I use, my common pitfalls, and the best practices I’ve developed. I don’t claim my approaches are the best, but I can say they have resulted in high-quality published science, and I hope they can serve as a resource for others.

What is a data scientist?

Many paths incorporate data and computers: programmers, bioinformaticians, analysts…

What sets a data scientist apart? I believe it is the role of critical scientific thinking. A data scientist is first and foremost a scientist, focused on a novel question and on testing hypotheses that support or refute their model. To put it another way, other computational paths solve problems, while data scientists must first define those problems.

Advice

  1. Before beginning, always outline the question and hypothesis:

“When your only tool is a hammer, everything looks like a nail”

One of the biggest problems facing many people learning data science is a hammer-driven mentality. “But,” you say, “there are so many shiny new computational tools being created all the time!” For me as a young data scientist, these tools created incredible opportunities to address novel questions. They also posed one of the biggest pitfalls: easy access to tools distracts from finding the right solution. Nothing is worse than coding an elegant and functional script, only to realize the output doesn’t address the right question.

In statistics there are three types of error which should be taught. Many people only learn the first two, but in many ways the third is the most important and can be the most nefarious in drawing the wrong conclusion.

| Type | Error | Example |
|------|-------|---------|
| 1 | False Positive | Diagnosing a disease in a healthy person. |
| 2 | False Negative | Diagnosing someone as healthy who has a disease. |
| 3 | False Hypothesis | Diagnosing a disease based on symptoms not caused by the disease. |

Without a clear question and *hypothesis*, the third type of error arises.
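As a minimal sketch of the distinction, the first two error types can be counted directly by comparing diagnoses against the truth. The function name `count_errors` and the toy data below are hypothetical, made up purely for illustration. Note that the third type never shows up in a tally like this: if the diagnosis is answering the wrong question, the counts can look perfect and the conclusion can still be wrong.

```python
from collections import Counter


def count_errors(truth, diagnosis):
    """Count type 1 (false positive) and type 2 (false negative) errors.

    Both arguments are equal-length sequences of booleans,
    where True means "has the disease".
    """
    if len(truth) != len(diagnosis):
        raise ValueError("truth and diagnosis must be the same length")

    counts = Counter()
    for has_disease, diagnosed in zip(truth, diagnosis):
        if diagnosed and not has_disease:
            counts["false_positive"] += 1  # type 1: healthy, diagnosed sick
        elif has_disease and not diagnosed:
            counts["false_negative"] += 1  # type 2: sick, diagnosed healthy
        else:
            counts["correct"] += 1
    return counts


# Hypothetical toy data: five patients, invented for this example.
truth = [True, True, False, False, False]
diagnosis = [True, False, True, False, False]

errors = count_errors(truth, diagnosis)
print(dict(errors))
```

A type 3 error has no line in this function because it lives in the choice of what `diagnosis` measures in the first place, which is why the question and hypothesis need to be written down before any code.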

  2. Don’t ask “how long will it take me to become a data scientist?”

The same applies to people asking how long it will take them to learn to code. At least monthly I look at solutions to problems I have previously solved, and am astonished by how differently I would approach them months or even weeks later. Science by definition pushes the forefront of human knowledge, and for this reason constant progress is the only way not to fall behind.

  3. With a strong question and hypothetical prediction of the outcome, next consider your resources