Learning Data Science and Machine Learning: First Steps
Everything was good until a few aspirants pointed out that there are too many resources and that many of them are expensive. Python programming was the only area with a number of really good courses, but the good options for beginners end right there.
A few important questions on foundational data science struck me:
- What should one do after learning how to code? Are there topics that help you strengthen your foundations for data science?
- I hate math, and the tutorials I find are either too basic or too deep for me. Can you recommend a compact yet comprehensive course on math and statistics?
- How much math is enough to start learning how ML algorithms work?
- What are some essential statistics topics to get started with data analysis or data science?
So here is the essence of this article: the first steps to learning data science or ML, along with the best books for machine learning.
The Three Pillars of Data Science & ML
If you go through the pre-requisites or pre-work of any ML/DS course, you’ll find a combination of programming, math, and statistics.
1. Essential Programming
Most data roles are programming-based, except for a few such as business intelligence, market analysis, and product analytics.
I am going to focus on technical data jobs that require expertise in at least one programming language. I personally prefer Python over any other language because of its versatility and ease of learning — hands-down, a good pick for developing end-to-end projects.
A glimpse of topics/libraries one must master for data science:
- Common data structures (data types, lists, dictionaries, sets, tuples), writing functions, logic, control flow, searching and sorting algorithms, object-oriented programming, and working with external libraries.
- Writing Python scripts to extract, format, and store data into files or back to databases.
- Handling multi-dimensional arrays, indexing, slicing, transposing, broadcasting and pseudorandom number generation using NumPy.
- Performing vectorized operations using scientific computing libraries like NumPy.
- Manipulating data with Pandas — series, dataframes, indexing in a dataframe, comparison operators, merging dataframes, mapping, and applying functions.
- Wrangling data using Pandas — checking for null values, imputing them, grouping data, describing it, performing exploratory analysis, etc.
- Data Visualization using Matplotlib — the API hierarchy, adding styles, color, and markers to a plot, knowledge of various plots and when to use them, line plots, bar plots, scatter plots, histograms, boxplots, and seaborn for more advanced plotting.
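To make a few of the items above concrete, here is a minimal sketch of NumPy indexing, slicing, transposing, and broadcasting, followed by a small Pandas wrangling pass. The arrays, city names, and column names are made-up toy data, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# NumPy: indexing, slicing, transposing, and broadcasting on a 2-D array.
rng = np.random.default_rng(seed=0)      # seeded pseudorandom generator
scores = np.arange(12).reshape(3, 4)     # 3 rows, 4 columns, values 0..11
first_row = scores[0]                    # indexing: the first row
last_two_cols = scores[:, -2:]           # slicing: all rows, last two columns
transposed = scores.T                    # transposing: shape becomes (4, 3)
col_means = scores.mean(axis=0)          # one mean per column, shape (4,)
centered = scores - col_means            # broadcasting: (3, 4) minus (4,)
noise = rng.normal(size=scores.shape)    # pseudorandom draws, shape (3, 4)

# Pandas: null check, imputation, grouping, and merging (toy data).
sales = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi"],
    "units": [10.0, None, 7.0, 5.0],
})
n_missing = int(sales["units"].isnull().sum())                 # count nulls
sales["units"] = sales["units"].fillna(sales["units"].mean())  # impute with the mean
units_by_city = sales.groupby("city")["units"].sum()           # group and aggregate

regions = pd.DataFrame({"city": ["Pune", "Delhi"], "region": ["West", "North"]})
merged = sales.merge(regions, on="city", how="left")           # merge two dataframes
```

Mean imputation is used here only because it is the simplest option to show; whether it is appropriate depends on the data.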
2. Essential Mathematics
There are practical reasons why math is essential for folks who want a career as a Machine Learning practitioner, Data Scientist, or Deep Learning Engineer.
#1 Linear algebra to represent data
ML is inherently data-driven: data is at the heart of every algorithm. We can think of data as vectors, objects that adhere to arithmetic rules such as addition and scalar multiplication. This lets us understand how the rules of linear algebra operate over arrays of data.
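As a small illustration of treating data as vectors, the sketch below represents two observations as NumPy arrays and applies basic vector arithmetic to them. The feature values and the weight vector `w` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Two observations, each described by three numeric features (made-up values).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

total = a + b            # element-wise addition
scaled = 2.0 * a         # scalar multiplication
dot = a @ b              # dot product: 1*4 + 2*5 + 3*6

# Stacking observations row by row gives the usual data matrix X.
X = np.vstack([a, b])            # shape (2, 3): 2 observations, 3 features
w = np.array([0.5, 0.0, -1.0])   # hypothetical weight vector
outputs = X @ w                  # one weighted sum per observation
```

The last line is the same matrix-vector product that a linear model computes when it scores every row of a dataset at once.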
#2 Calculus to train ML models
If you are under the impression that model training happens “automatically,” then you are wrong. Calculus is what drives the learning of most ML and DL algorithms.
One of the most commonly used optimization algorithms — gradient descent — is an application of partial derivatives.
A model is a mathematical representation of certain beliefs and assumptions. It is said to learn (approximate) the process (linear, polynomial, etc.) by which the provided data was generated in the first place, and then to make predictions based on that learned process.
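The idea can be sketched in a few lines: generate data from a known linear process, then let gradient descent recover it from the partial derivatives of the mean squared error. This is a minimal illustration under assumed settings; the function name `fit_line`, the learning rate, the step count, and the synthetic data are all choices made for this example.

```python
import numpy as np

# Synthetic data from a known linear process: y = 3x + 2 plus small noise.
rng = np.random.default_rng(seed=42)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

def fit_line(x, y, lr=0.1, steps=1000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = (w * x + b) - y
        grad_w = 2.0 * np.mean(error * x)   # partial derivative of MSE w.r.t. w
        grad_b = 2.0 * np.mean(error)       # partial derivative of MSE w.r.t. b
        w -= lr * grad_w                    # step against the gradient
        b -= lr * grad_b
    return w, b

w, b = fit_line(x, y)
prediction = w * 0.5 + b   # prediction for a new input x = 0.5
```

The learned `w` and `b` should land close to the 3 and 2 used to generate the data, which is what "learning the process" means in this simple setting.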
Important topics include:
- Basic algebra — variables, coefficients, equations, and linear, exponential, logarithmic functions, etc.
- Linear Algebra — scalars, vectors, tensors, norms (L1 & L2), dot product, types of matrices, linear transformations, representing linear equations in matrix notation, solving a linear regression problem using vectors and matrices.
- Calculus — derivatives and limits, derivative rules, chain rule (for backpropagation algorithm), partial derivatives (to compute gradients), the convexity of functions, local/global minima, the math behind a regression model, applied math for training a model from scratch.
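Two of the linear algebra topics above, norms and solving a regression problem in matrix form, fit in a short sketch. The data points are made up to lie exactly on the line y = 2x + 1, and `np.linalg.lstsq` is used here as one convenient way to solve the least-squares problem, not the only one.

```python
import numpy as np

# L1 and L2 norms of a small vector.
v = np.array([3.0, -4.0])
l1_norm = np.linalg.norm(v, ord=1)   # sum of absolute values
l2_norm = np.linalg.norm(v)          # Euclidean length (default is the L2 norm)

# Toy data that lies exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Design matrix: a column of ones (intercept) next to the feature column.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X @ theta ≈ y.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = theta
```

Because the toy data has no noise, the recovered intercept and slope match the generating line up to floating-point error.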
3. Essential Statistics
Every organisation today is striving to become data-driven. To achieve that, analysts and scientists are required to put data to use in different ways in order to drive decision making.
Describing data — from data to insights
Data always comes in raw and ugly. The initial exploration tells you what’s missing, how the data is distributed, and what’s the best way to clean it to meet the end goal.
In order to answer the defined questions, descriptive statistics enables you to transform each observation in your data into insights that make sense.
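A tiny example shows how descriptive statistics turn raw observations into something readable. The order counts below are made up, with one deliberate outlier, and the sketch uses Python's built-in `statistics` module only to keep it dependency-free.

```python
import statistics

# Made-up daily order counts; 95 is a deliberate outlier.
orders = [12, 15, 11, 14, 13, 95, 12]

mean = statistics.mean(orders)       # pulled upward by the outlier
median = statistics.median(orders)   # middle value, robust to the outlier
stdev = statistics.stdev(orders)     # sample standard deviation
```

Comparing the mean with the median is a quick first check: a large gap between them, as in this toy data, hints at skew or outliers worth investigating.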
Furthermore, the ability to quantify uncertainty is a valuable skill that is highly regarded at any data company. Knowing the chances of success of an experiment or decision is crucial for every business.
Here are a few of the main staples of statistics that constitute the bare minimum:
- Estimates of location — mean, median, and other variants of these.
- Estimates of variability
- Correlation and covariance
- Random variables — discrete and continuous
- Data distributions — PMF, PDF, CDF
- Conditional probability — Bayesian statistics
- Commonly used statistical distributions — Gaussian, Binomial, Poisson, Exponential
- Important theorems — Law of large numbers and Central limit theorem.
- Inferential Statistics — A more practical and advanced branch of statistics that helps in designing hypothesis testing experiments, pushes us to understand the meaning of metrics deeply and at the same time helps us in quantifying the significance of the results.
- Important tests — Student’s t-Test, Chi-Square test, ANOVA test, etc.
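Two items from this list can be sketched with the standard library alone: a simulation of the central limit theorem, and a two-sample Student's t statistic computed by hand. The sample sizes, the seed, the helper name `pooled_t`, and the group measurements are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(7)

# Central limit theorem: means of samples drawn from a skewed distribution
# (exponential with mean 1) cluster around the true mean.
n = 50            # size of each sample
trials = 2000     # number of sample means to collect
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]
mean_of_means = statistics.mean(sample_means)   # close to the true mean, 1.0
spread = statistics.stdev(sample_means)         # close to 1 / sqrt(n)

def pooled_t(a, b):
    """Two-sample Student's t statistic, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.4, 5.0, 5.2]
group_b = [5.8, 6.1, 5.9, 6.0, 6.2]
t_stat = pooled_t(group_a, group_b)   # negative: group_a has the lower mean
```

Turning the statistic into a p-value requires the t distribution with `len(a) + len(b) - 2` degrees of freedom, which the standard library does not provide; in practice a statistics library such as SciPy is typically used for that step.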