About The Author
My name is Michael Grogan. I am a data scientist with a deep passion for statistics and programming, and I am particularly interested in the role of programming languages in conducting advanced statistical analysis.
My background is originally in economics, in which I hold a Master's degree. Throughout my studies, however, I found myself increasingly drawn to the more statistical elements of the subject, such as econometrics, business analytics and quantitative finance. As a result, I set about improving my knowledge of programming languages closely linked to statistics and big data, including Python, R and SQL, which I have increasingly used in my own analysis when working with business-related data for a wide variety of clients.
I founded this website to illustrate the use of these languages in conducting statistical analysis, with programs that are applicable across a wide range of datasets. The site also goes into depth on a range of cross-sectional and time series methods, including probability and forecasting techniques. My goal is to explore in depth the tools and methods used across the ever-expanding field of data science.
Please contact me at firstname.lastname@example.org.
Programs and Tutorials
ARIMA (Autoregressive Integrated Moving Average) and Holt-Winters are two major tools used in time series analysis to forecast future values of a variable from its past values. In the following example, I use a Johnson & Johnson (JNJ) stock price dataset covering 2006-2016 and apply both models to forecast the price of this stock.
A K-Means clustering algorithm groups observations around cluster means (centroids), which allows data to be categorised into specific segments more efficiently. This type of analysis can be employed in many situations where we wish to break data down into specific groups and networks, including social network analysis, market segmentation, and so on.
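A quick sketch of the idea using scikit-learn, with two well-separated synthetic groups standing in for real segments (the data and cluster count here are assumptions, not the tutorial's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic, well-separated groups (e.g. two customer segments).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.5, (50, 2)),
                  rng.normal(5, 0.5, (50, 2))])

# Fit K-Means with k=2; each observation is assigned to the nearest centroid.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(data)
print(kmeans.cluster_centers_)
```

With real data, the number of clusters k is itself a modelling choice, often guided by the elbow method or silhouette scores.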
Statisticians and web developers may have seemed an unlikely mix until now, but make no mistake: the interaction between these two groups will continue to increase as web-based platforms become ever more popular in the world of data science. In this regard, the combination of R and Shiny is quickly becoming a cornerstone of interaction between the world of data and the web.
When running a regression in Python, it is often the case that we wish to do so using multiple independent variables. While this can be done using scikit-learn or SciPy, it can be trickier to set up, and I find it is accomplished more easily using the statsmodels library.
The power of a test is the probability that it detects an effect that is actually present; a power analysis therefore allows us to determine the minimum sample size that would be expected to yield significant results. We have already discussed the law of large numbers: the idea that as the sample size grows, the sample mean converges to the expected value. However, is it really efficient to collect hundreds or potentially thousands of observations in order to yield a significant result? Of course not. Data collection on such a scale would likely incur far greater time and cost than is necessary.
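As a sketch of this calculation, statsmodels can solve for the sample size of a two-sample t-test; the effect size, significance level and power below are conventional illustrative choices (Cohen's d = 0.5, alpha = 0.05, power = 0.8), not figures from the tutorial:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size needed to detect a medium effect
# (Cohen's d = 0.5) at 5% significance with 80% power.
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n))
```

The point of the paragraph above follows directly: rather than collecting data indefinitely, a few dozen observations per group already suffice under these assumptions.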
The following program calculates a variance-covariance matrix in R, along with a shrinkage estimate of the covariance and the conversion of the covariance matrix into a correlation matrix. In a variance-covariance matrix, the diagonal entries give the variance of each variable, while the off-diagonal entries give the covariance between each pair of variables.
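The tutorial itself is written in R; for consistency with the other examples on this page, here is a parallel sketch in Python on simulated data, covering the same three steps (covariance, correlation, and a Ledoit-Wolf shrinkage estimate):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

# Simulated returns: 250 observations of 3 variables.
rng = np.random.default_rng(3)
returns = rng.normal(size=(250, 3))

cov = np.cov(returns, rowvar=False)        # diagonal entries are variances
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)              # covariance rescaled to correlation
shrunk = LedoitWolf().fit(returns).covariance_   # shrinkage estimate
print(np.round(corr, 3))
```

The division by the outer product of standard deviations is exactly the covariance-to-correlation conversion mentioned above; shrinkage pulls the sample covariance towards a structured target, which stabilises estimates when observations are scarce relative to the number of variables.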
This particular program analyses returns contained in a CSV file and generates a histogram of them in order to examine the return distribution. As we saw previously, Python's csv module allows us to read data from a CSV file. Specifically, this program imports the numpy library to calculate the mean and standard deviation of the array, and then uses matplotlib to plot the histogram.
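A condensed version of this workflow, with simulated returns in place of the tutorial's CSV column (the file name and distribution parameters below are assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # non-interactive backend; runs headless
import matplotlib.pyplot as plt

# Simulated daily returns standing in for the CSV data.
rng = np.random.default_rng(4)
returns = rng.normal(0.0005, 0.01, 1000)

# Summary statistics via numpy, then the histogram via matplotlib.
mu, sigma = np.mean(returns), np.std(returns)
plt.hist(returns, bins=30)
plt.title(f"Return distribution (mean={mu:.4f}, sd={sigma:.4f})")
plt.savefig("returns_hist.png")
```

With real data, the `returns` array would instead be built by reading the relevant CSV column, e.g. via the csv module or `numpy.genfromtxt`.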
This program, constructed using SQL Server, contains a hypothetical dataset of 20 securities with various financial variables for each. As a database language, SQL allows us to select specific data as specified by the user, as well as perform calculations on the data already available. In this regard, we use SQL below to illustrate certain queries for manipulating the database and conducting various calculations.
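To keep the examples on this page runnable in one language, here is the same flavour of query driven from Python's built-in sqlite3 module rather than SQL Server; the table, columns and figures are invented for illustration, not the tutorial's schema:

```python
import sqlite3

# A tiny in-memory stand-in for the tutorial's securities database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE securities (ticker TEXT, price REAL, eps REAL)")
conn.executemany(
    "INSERT INTO securities VALUES (?, ?, ?)",
    [("AAA", 50.0, 2.5), ("BBB", 30.0, 1.0), ("CCC", 90.0, 5.0)],
)

# A derived calculation in SQL itself: price/earnings ratio, sorted.
rows = conn.execute(
    "SELECT ticker, price / eps AS pe FROM securities "
    "WHERE eps > 0 ORDER BY pe"
).fetchall()
print(rows)
```

The same SELECT/WHERE/ORDER BY structure carries over to SQL Server, though vendor-specific functions and syntax details differ.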
The purpose of a Monte Carlo simulation is to observe a range of potential outcomes based on a numerical simulation. For instance, if an investor chooses to hold an asset with a given level of return and volatility, these can be modelled to examine a range of potential gains and losses over a specified period. This particular program calculates a price path of a stock using a random walk for a given level of return (mu) and volatility (vol).
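A minimal version of such a price path, using a geometric Brownian motion random walk; the drift, volatility and starting price below are illustrative assumptions:

```python
import numpy as np

mu, vol = 0.08, 0.20       # annual drift and volatility (illustrative)
s0, days = 100.0, 252      # starting price, trading days in one year
dt = 1 / days

rng = np.random.default_rng(5)
# Geometric Brownian motion: daily log-returns accumulated into a path.
shocks = rng.normal((mu - 0.5 * vol**2) * dt, vol * np.sqrt(dt), days)
path = s0 * np.exp(np.cumsum(shocks))
print(path[-1])            # terminal price for this one simulation
```

Repeating this many times and collecting the terminal prices is what turns a single random walk into a Monte Carlo distribution of gains and losses.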
In conducting probability analysis, the two quantities that determine the chance of an event happening are N (the number of trials) and λ (lambda, our hit rate, i.e. the chance of occurrence in a single interval). The larger the number of trials, the larger the probability that the event occurs at least once, even if the probability within a single trial is very low.
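This relationship can be seen directly by computing the probability of at least one occurrence, both exactly and via the Poisson approximation with λ = N × p; the per-trial probability below is an illustrative assumption:

```python
import math

p = 0.01      # assumed probability of the event in a single trial
for n in (10, 100, 1000):
    # Exact probability of at least one occurrence across n trials.
    at_least_one = 1 - (1 - p) ** n
    # Poisson approximation with rate lambda = n * p.
    poisson = 1 - math.exp(-n * p)
    print(n, round(at_least_one, 4), round(poisson, 4))
```

Even with a 1% chance per trial, the event becomes near certain over 1,000 trials, which is the point made above.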
Get my free e-book!
Please don’t forget to subscribe to my mailing list to receive your free copy of my e-book, “How To Use Regression Modelling To Analyse Financial Markets”.
Along with this e-book, you will also gain access to customised templates demonstrating how to conduct statistical analysis using the R Programming Language, along with analysis which can be applied across a variety of trading strategies.