A Complete Tutorial On Time Series Modeling In R



Introduction

'Time' is one of the most important factors that determines success in a business. It's difficult to keep up with the pace of time. But technology has developed some powerful methods with which we can 'see things' ahead of time. Don't worry, I am not talking about a Time Machine. Let's be realistic here!

I'm talking about the methods of prediction & forecasting. One such method, which deals with time-based data, is Time Series Modeling. As the name suggests, it involves working on time-based data (years, days, hours, minutes) to derive hidden insights that support informed decision making.

Time series models are very useful when you have serially correlated data. Most businesses work on time series data to analyze sales numbers for the next year, website traffic, competitive position and much more. However, it is also one of the areas which many analysts do not understand.

So, if you aren't sure about the complete process of time series modeling, this guide will introduce you to the various levels of time series modeling and its related techniques.

What Is Time Series Modeling?

Let's begin with the basics. This includes stationary series, random walks, the Rho coefficient and the Dickey-Fuller test of stationarity. If these terms are already scaring you, don't worry – they will become clear in a bit, and I bet you will start enjoying the subject as I explain it.

Stationary Series

There are three basic criteria for a series to be classified as a stationary series:

1. The mean of the series should not be a function of time; rather, it should be a constant. The image below has the left-hand graph satisfying the condition, whereas the graph in red has a time-dependent mean.

2. The variance of the series should not be a function of time. This property is known as homoscedasticity. The following graph depicts what is and what is not a stationary series. (Notice the varying spread of the distribution in the right-hand graph.)

3. The covariance of the i-th term and the (i + m)-th term should not be a function of time. In the following graph, you will notice that the spread becomes closer as time increases. Hence, the covariance is not constant with time for the 'red series'.

Why do I care about ‘stationarity’ of a time series?

The reason I took up this section first was that unless your time series is stationary, you cannot build a time series model. In cases where the stationarity criteria are violated, the first requisite is to stationarize the time series and then try stochastic models to predict it. There are multiple ways of bringing about this stationarity, such as detrending and differencing.

Random Walk

This is the most basic concept of time series. You might know the concept well, but I found many people in the industry who interpret a random walk as a stationary process. In this section, with the help of some mathematics, I will make this concept crystal clear forever. Let's take an example.

Example: Imagine a girl moving randomly on a giant chess board. In this case, the next position of the girl depends only on her last position.

Now imagine you are sitting in another room and are not able to see the girl. You want to predict the position of the girl over time. How accurate will you be? Of course, you will become more and more inaccurate as the position of the girl changes. At t=0 you know exactly where the girl is. At the next step, she can move to any of 8 squares, so your probability of guessing correctly dips to 1/8 instead of 1, and it keeps going down from there. Now let's try to formulate this series:

X(t) = X(t-1) + Er(t)

where Er(t) is the error at time point t. This is the randomness the girl brings at every point in time.

Now, if we substitute recursively for all the Xs, we finally end up with the following equation:

X(t) = X(0) + Sum(Er(1),Er(2),Er(3).....Er(t))

Now, lets try validating our assumptions of stationary series on this random walk formulation:

1. Is the Mean constant?

E[X(t)] = E[X(0)] + Sum(E[Er(1)],E[Er(2)],E[Er(3)].....E[Er(t)])

We know that the expectation of each error term is zero, since the errors are random with zero mean.

Hence we get E[X(t)] = E[X(0)] = Constant.

2. Is the Variance constant?

Var[X(t)] = Var[X(0)] + Sum(Var[Er(1)], Var[Er(2)], Var[Er(3)], ..., Var[Er(t)])

Var[X(t)] = t * Var(Error) = Time dependent.

Hence, we infer that the random walk is not a stationary process as it has a time variant variance. Also, if we check the covariance, we see that too is dependent on time.

Let's spice things up a bit.

We already know that a random walk is a non-stationary process. Let us introduce a new coefficient in the equation to see if we can make the formulation stationary.

Introduced coefficient: Rho

X(t) = Rho * X(t-1) + Er(t)

Now, we will vary the value of Rho to see if we can make the series stationary. Here we will interpret the scatter visually and not do any test to check stationarity.

Let’s start with a perfectly stationary series with Rho = 0 . Here is the plot for the time series :

Increasing the value of Rho to 0.5 gives us the following graph:

You might notice that our cycles have become broader, but essentially there does not seem to be a serious violation of the stationarity assumptions. Let's now take a more extreme case of Rho = 0.9.

We still see that X returns from extreme values back to zero after some intervals. This series too does not violate stationarity significantly. Now, let's take a look at the random walk with Rho = 1.

This obviously is a violation of the stationarity conditions. What makes Rho = 1 a special case, which comes out badly in a stationarity test? We will find the mathematical reason for this.

Let’s take expectation on each side of the equation  “X(t) = Rho * X(t-1) + Er(t)”

E[X(t)] = Rho *E[ X(t-1)]

This equation is very insightful. The expected value of the next X (at time point t) is pulled towards Rho times the last value of X.

For instance, if X(t-1) = 1, then E[X(t)] = 0.5 (for Rho = 0.5). Now, if X moves in any direction away from zero, it is pulled back towards zero in the next step. The only component which can drive it further out is the error term, and the error term is equally likely to go in either direction. What happens when Rho becomes 1? No force can pull X back down in the next step.
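If you want to see these four scenarios yourself, here is a small simulation sketch in R (the seed, series length and plotting layout are arbitrary choices for illustration):

set.seed(123)
simulate_rho <- function(rho, n = 200) {
  x <- numeric(n)
  for (t in 2:n) x[t] <- rho * x[t - 1] + rnorm(1)   # X(t) = Rho * X(t-1) + Er(t)
  x
}
par(mfrow = c(2, 2))
for (rho in c(0, 0.5, 0.9, 1)) plot.ts(simulate_rho(rho), main = paste("Rho =", rho), ylab = "X(t)")

You should see the Rho = 1 panel wandering away from zero for long stretches, while the other panels keep getting pulled back towards it.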

Dickey Fuller Test of Stationarity

What you just learnt in the last section is formally known as the Dickey-Fuller test. Here is a small tweak made to our equation to convert it into a Dickey-Fuller test:

X(t) = Rho * X(t-1) + Er(t)
=> X(t) - X(t-1) = (Rho - 1) * X(t-1) + Er(t)

We have to test whether Rho - 1 is significantly different from zero or not. If this null hypothesis (Rho = 1, i.e. a unit root) gets rejected, we get a stationary time series.

Stationarity testing and converting a series into a stationary series are the most critical processes in time series modelling. You need to memorize each and every detail of this concept to move on to the next step of time series modelling.
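In R, the test is available as adf.test() in the tseries package (the same function is used later in this article). A quick sketch on simulated data, assuming tseries is installed:

library(tseries)
set.seed(42)
stationary_series <- arima.sim(model = list(ar = 0.5), n = 200)   # Rho = 0.5
random_walk <- cumsum(rnorm(200))                                  # Rho = 1
adf.test(stationary_series)   # small p-value: reject the unit root, series looks stationary
adf.test(random_walk)         # large p-value: cannot reject the unit root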

Let’s now consider an example to show you what a time series looks like.

Exploration of Time Series Data in R

Here we'll learn to handle time series data in R. Our scope will be restricted to exploring a time series data set, and we will not yet build time series models.

I have used a built-in data set of R called AirPassengers. The dataset consists of monthly totals of international airline passengers from 1949 to 1960.

Loading the Data Set

Following is the code which will help you load the data set and spill out a few top level metrics.
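A minimal sketch of the standard calls that produce these metrics (the plotting and trend-line calls are assumed reconstructions based on the comments shown with the output):

data(AirPassengers)
class(AirPassengers)       # storage class of the series
start(AirPassengers)       # first observation
end(AirPassengers)         # last observation
frequency(AirPassengers)   # observations per cycle
summary(AirPassengers)     # distribution of passenger counts
plot(AirPassengers)                                    # plot the time series
abline(reg = lm(AirPassengers ~ time(AirPassengers)))  # fit in a trend line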

[1] "ts"
#This tells you that the data series is in a time series format

[1] 1949 1
#This is the start of the time series

[1] 1960 12
#This is the end of the time series

[1] 12
#The cycle of this time series is 12 months in a year

Min. 1st Qu. Median Mean 3rd Qu. Max.
104.0 180.0 265.5 280.3 360.5 622.0
#The number of passengers are distributed across the spectrum

Detailed Metrics

#This will plot the time series
#This will fit in a line

Here are a few more operations you can do:
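A likely sketch of these operations, in the same order as the comments that follow (the aggregation function used here is an assumption):

cycle(AirPassengers)
plot(aggregate(AirPassengers, FUN = mean))
boxplot(AirPassengers ~ cycle(AirPassengers))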

#This will print the cycle across years.
#This will aggregate the cycles and display a year on year trend
#Box plot across months will give us a sense on seasonal effect

Important Inferences

The year-on-year trend clearly shows that the number of passengers has been increasing without fail.

The variance and the mean value in July and August are much higher than in the rest of the months.

Even though the mean value of each month is quite different, the variance within each month is small. Hence, we have a strong seasonal effect with a cycle of 12 months or less.

Exploring the data is the most important step in a time series model – without this exploration, you will not know whether a series is stationary or not. In this case, we already learnt many details about the kind of model we are looking for.

Let’s now take up a few time series models and their characteristics. We will also take this problem forward and make a few predictions.

Introduction to ARMA Time Series Modeling

ARMA models are commonly used in time series modeling. In an ARMA model, AR stands for auto-regression and MA stands for moving average. If these words sound intimidating to you, worry not – I'll simplify these concepts for you in the next few minutes!

We will now develop a knack for these terms and understand the characteristics associated with these models. But before we start, you should remember that AR and MA models are not applicable to non-stationary series.

In case you get a non-stationary series, you first need to stationarize the series (by taking a difference / transformation) and then choose from the available time series models.

First, I’ll explain each of these two models (AR & MA) individually. Next, we will look at the characteristics of these models.

Auto-Regressive Time Series Model

Let's understand AR models using the case below:

The current GDP of a country, say x(t), is dependent on last year's GDP, i.e. x(t - 1). The hypothesis is that the total value of goods & services produced in a country in a fiscal year (known as GDP) depends on the manufacturing plants / services set up in the previous year and on the newly set up industries / plants / services in the current year. But the primary component of the GDP is the former one.

Hence, we can formally write the equation of GDP as:

x(t) = alpha *  x(t – 1) + error (t)

This equation is known as the AR(1) formulation. The numeral one (1) denotes that the next instance is solely dependent on the previous instance. The alpha is a coefficient which we seek so as to minimize the error function. Notice that x(t-1) is in turn linked to x(t-2) in the same fashion. Hence, any shock to x(t) will gradually fade off in the future.

For instance, let's say x(t) is the number of juice bottles sold in a city on a particular day. During winter, very few vendors purchased juice bottles. Suddenly, on a particular day, the temperature rose and the demand for juice bottles soared to 1000. However, after a few days, the climate became cold again. But, knowing that the people got used to drinking juice during the hot days, 50% of the people were still drinking juice during the cold days. In the following days, the proportion went down to 25% (50% of 50%) and then gradually to a small number after a significant number of days. The following graph explains the inertia property of AR series:

Moving Average Time Series Model

Let’s take another case to understand Moving average time series model.

A manufacturer produces a certain type of bag, which was readily available in the market. Being a competitive market, the sales of the bag stood at zero for many days. So, one day he experimented with the design and produced a different type of bag, which was not available anywhere in the market. Thus, he was able to sell the entire stock of 1000 bags (let's call this x(t)). The demand got so high that the bag ran out of stock. As a result, some 100-odd customers couldn't purchase this bag. Let's call this gap the error at that time point. With time, the bag lost its wow factor. But there were still a few customers left who had gone empty-handed the previous day. Following is a simple formulation to depict the scenario:

x(t) = beta *  error(t-1) + error (t)

If we try plotting this graph, it will look something like this:

Did you notice the difference between the MA and AR models? In an MA model, a noise / shock quickly vanishes with time. In an AR model, the effect of a shock lasts much longer.

Difference Between AR and MA Models: Exploiting ACF and PACF Plots

Once we have got a stationary time series, we must answer two primary questions: is it an AR or an MA process, and what order of AR or MA process do we need to use?

The trick to solving these questions is available in the previous section. Didn't you notice?

The first question can be answered using the Total Correlation Chart (also known as the Auto-correlation Function / ACF). The ACF is a plot of the total correlation between a series and its different lags. For instance, in the GDP problem, the GDP at time point t is x(t). We are interested in the correlation of x(t) with x(t-1), x(t-2) and so on. Now let's reflect on what we have learnt above.

In a moving average series of lag n, we will not get any correlation between x(t) and x(t-n-1). Hence, the total correlation chart cuts off at the nth lag. So it becomes simple to find the lag for an MA series. For an AR series, this correlation will gradually go down without any cut-off value. So what do we do if it is an AR series?

Here is the second trick. If we find the partial correlation at each lag, it will cut off after the degree of the AR series. For instance, if we have an AR(1) series and we exclude the effect of the 1st lag (x(t-1)), our 2nd lag (x(t-2)) is independent of x(t). Hence, the partial autocorrelation function (PACF) will drop sharply after the 1st lag. The following examples will clarify any doubts you have about this concept:

The blue lines mark the threshold beyond which values are significantly different from zero. Clearly, the graph above has a cut-off on the PACF curve after the 2nd lag, which means this is mostly an AR(2) process.

Clearly, the graph above has a cut-off on the ACF curve after the 2nd lag, which means this is mostly an MA(2) process.
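You can reproduce these signature shapes on simulated data; here is a small sketch (the coefficients are chosen arbitrarily for illustration):

set.seed(7)
ar2 <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 500)   # AR(2) process
ma2 <- arima.sim(model = list(ma = c(0.6, 0.3)), n = 500)   # MA(2) process
par(mfrow = c(2, 2))
acf(ar2)    # decays gradually
pacf(ar2)   # cuts off after lag 2
acf(ma2)    # cuts off after lag 2
pacf(ma2)   # decays gradually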

Till now, we have covered how to identify the type of stationary series using ACF & PACF plots. Now, I'll introduce you to a comprehensive framework to build a time series model. In addition, we'll also discuss the practical applications of time series modelling.

Framework and Application of ARIMA Time Series Modeling

A quick revision: till here we've learnt the basics of time series modeling, time series in R and ARMA modeling. Now is the time to join these pieces and make an interesting story.

Overview of the Framework

This framework(shown below) specifies the step by step approach on ‘How to do a Time Series Analysis‘:

As you would be aware, the first three steps have already been discussed above. Nevertheless, the same has been delineated briefly below:

Step 1: Visualize the Time Series

It is essential to analyze the trends prior to building any kind of time series model. The details we are interested in pertain to any kind of trend, seasonality or random behaviour in the series. We have covered this part in the second part of this series.

Step 2: Stationarize the Series

Once we know the patterns, trends, cycles and seasonality, we can check whether the series is stationary or not. The Dickey-Fuller test is one of the popular tests to check this. We covered this test in the first part of this article. But it doesn't end here! What if the series is found to be non-stationary?

There are three commonly used techniques to make a time series stationary:

1.  Detrending : Here, we simply remove the trend component from the time series. For instance, the equation of my time series is:

x(t) = (mean + trend * t) + error

We’ll simply remove the part in the parentheses and build model for the rest.

2. Differencing: This is a commonly used technique to remove non-stationarity. Here we try to model the differences of the terms rather than the actual terms. For instance,

x(t) – x(t-1) = ARMA (p ,  q)

This differencing is called the Integration part in AR(I)MA. Now, we have three parameters:

p : the order of the AR term

d : the order of differencing (the I term)

q : the order of the MA term

3. Seasonality: Seasonality can easily be incorporated in the ARIMA model directly. More on this has been discussed in the applications part below.

Step 3: Find Optimal Parameters

The parameters p, d, q can be found using ACF and PACF plots. An addition to this approach: if both the ACF and PACF decrease gradually, it indicates that we need to make the time series stationary and introduce a value for "d".

Step 4: Build ARIMA Model

With the parameters in hand, we can now try to build the ARIMA model. The values found in the previous section might be approximate estimates, and we need to explore more (p,d,q) combinations. The one with the lowest BIC and AIC should be our choice. We can also try some models with a seasonal component, in case we notice any seasonality in the ACF/PACF plots.

Step 5: Make Predictions

Once we have the final ARIMA model, we are now ready to make predictions on the future time points. We can also visualize the trends to cross validate if the model works fine.

Applications of Time Series Model

Now, we’ll use the same example that we have used above. Then, using time series, we’ll make future predictions. We recommend you to check out the example before proceeding further.

Where did we start?

Following is the plot of the number of passengers with years. Try and make observations on this plot before moving further in the article.

Here are my observations:

1. There is a trend component which grows the number of passengers year by year.

2. There looks to be a seasonal component which has a cycle less than 12 months.

3. The variance in the data keeps on increasing with time.

We know that we need to address two issues before we test for stationarity. One, we need to remove unequal variances; we do this by taking the log of the series. Two, we need to address the trend component; we do this by taking the difference of the series. Now, let's test the resultant series.

adf.test(diff(log(AirPassengers)), alternative="stationary", k=0)

Augmented Dickey-Fuller Test

data: diff(log(AirPassengers))
Dickey-Fuller = -9.6003, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary

We see that the series is stationary enough to do any kind of time series modelling.

The next step is to find the right parameters to be used in the ARIMA model. We already know that the 'd' component is 1, as we need one difference to make the series stationary. We find the other parameters using the correlation plots. Following are the ACF plots for the series:

#ACF Plots

acf(log(AirPassengers))

What do you see in the chart shown above?

Clearly, the decay of the ACF chart is very slow, which means that the series is not stationary. We have already discussed above that we now intend to regress on the difference of logs rather than on the log directly. Let's see how the ACF and PACF curves come out after differencing.

acf(diff(log(AirPassengers)))
pacf(diff(log(AirPassengers)))

Clearly, the ACF plot cuts off after the first lag. Since it is the ACF (and not the PACF) curve that shows the cut-off, the value of p should be 0, while the value of q should be 1 or 2. After a few iterations, we found that (0,1,1), as (p,d,q), comes out to be the combination with the least AIC and BIC.
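If you want to run those iterations yourself, a quick non-seasonal search sketch over a few candidate orders could look like this (the 0-2 ranges are an arbitrary choice for illustration):

for (p in 0:2) for (q in 0:2) {
  fit <- arima(log(AirPassengers), order = c(p, 1, q))
  cat("ARIMA(", p, ",1,", q, ")  AIC =", AIC(fit), "  BIC =", BIC(fit), "\n")
}
# pick the (p,d,q) combination with the smallest AIC/BIC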

Let’s fit an ARIMA model and predict the future 10 years. Also, we will try fitting in a seasonal component in the ARIMA formulation. Then, we will visualize the prediction along with the training data. You can use the following code to do the same :

(fit <- arima(log(AirPassengers), c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12)))
pred <- predict(fit, n.ahead = 10*12)
ts.plot(AirPassengers, 2.718^pred$pred, log = "y", lty = c(1,3))

Practice Projects

Now, it's time to take the plunge and actually play with some other real datasets. So, are you ready to take on the challenge? Test the techniques discussed in this post and accelerate your learning in Time Series Analysis with the following practice problems:

Conclusion

With this, we come to the end of this tutorial on Time Series Modelling. I hope it will help you improve your knowledge of working with time-based data. To reap maximum benefits from this tutorial, I'd suggest you practice the R code shown here side by side and check your progress.


Tree Based Algorithms: A Complete Tutorial From Scratch (In R & Python)

Overview

Explanation of tree based algorithms from scratch in R and Python

Learn machine learning concepts like decision trees, random forest, boosting, bagging, ensemble methods

Implementation of these tree based algorithms in R and Python

Introduction to Tree Based Algorithms

Tree based algorithms are considered to be among the best and most widely used supervised learning methods. They empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They are adaptable to solving either kind of problem at hand (classification or regression).

Methods like decision trees, random forest and gradient boosting are popularly used in all kinds of data science problems. Hence, for every analyst (freshers included), it's important to learn these algorithms and use them for modeling.

This tutorial is meant to help beginners learn tree based algorithms from scratch. After the successful completion of this tutorial, one is expected to become proficient at using tree based algorithms and build predictive models.

Note: This tutorial requires no prior knowledge of machine learning. However, elementary knowledge of R or Python will be helpful. To get started you can follow full tutorial in R and full tutorial in Python. You can also check out the ‘Introduction to Data Science‘ course covering Python, Statistics and Predictive Modeling.

We will also cover some ensemble techniques using tree-based models below. To learn more about them and other ensemble learning techniques in a comprehensive manner, you can check out the following courses:

1. What is a Decision Tree? How does it work?

A decision tree is a type of supervised learning algorithm (having a predefined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator in the input variables.

Example:-

Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, I want to create a model to predict who will play cricket during the leisure period. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant input variable among all three.

This is where a decision tree helps: it will segregate the students based on all values of the three variables and identify the variable which creates the most homogeneous sets of students (sets which are heterogeneous to each other). In the snapshot below, you can see that the variable Gender is able to identify the most homogeneous sets compared to the other two variables.

As mentioned above, a decision tree identifies the most significant variable and the value of that variable which gives the most homogeneous sets of the population. Now the question which arises is: how does it identify the variable and the split? To do this, the decision tree uses various algorithms, which we will discuss in the following section.

Types of Decision Trees

The type of decision tree is based on the type of target variable we have. It can be of two types:

Categorical Variable Decision Tree: A decision tree which has a categorical target variable is called a categorical variable decision tree. Example: in the above student problem, the target variable was "Student will play cricket or not", i.e. YES or NO.

Continuous Variable Decision Tree: A decision tree which has a continuous target variable is called a continuous variable decision tree.

Example: Let's say we have a problem predicting whether a customer will pay his renewal premium with an insurance company (yes/no). Here we know that the income of the customer is a significant variable, but the insurance company does not have income details for all customers. Now, as we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values for a continuous variable.

Important Terminology related to Tree based Algorithms

Let’s look at the basic terminology used with Decision trees:

Root Node: It represents entire population or sample and this further gets divided into two or more homogeneous sets.

Splitting: It is a process of dividing a node into two or more sub-nodes.

Decision Node: When a sub-node splits into further sub-nodes, then it is called decision node.

Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.

Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say it is the opposite of splitting.

Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.

Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

Advantages

Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret. Its graphical representation is very intuitive, and users can easily relate it to their hypotheses.

Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables / features that have better power to predict the target variable. You can refer to the article (Trick to enhance power of regression model) for one such trick. Decision trees can also be used in the data exploration stage. For example, if we are working on a problem where information is available in hundreds of variables, a decision tree will help to identify the most significant ones.

Less data cleaning required: It requires less data cleaning compared to some other modeling techniques. It is fairly robust to outliers and missing values.

Data type is not a constraint: It can handle both numerical and categorical variables.

Non Parametric Method: Decision tree is considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution and the classifier structure.

Disadvantages

Overfitting: Overfitting is one of the most practical difficulties for decision tree models. This problem gets solved by setting constraints on the model parameters and by pruning (discussed in detail below).

Not fit for continuous variables: While working with continuous numerical variables, a decision tree loses information when it categorizes the variables into different bins.

2. Regression Trees vs Classification Trees

We all know that the terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, such that leaves are at the bottom & roots are at the top (shown below).

Both types of trees work in almost the same way; let's look at the primary differences & similarities between classification and regression trees:

Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.

In the case of a regression tree, the value obtained by a terminal node in the training data is the mean response of the observations falling in that region. Thus, if an unseen data observation falls in that region, we'll make its prediction with the mean value.

In the case of a classification tree, the value (class) obtained by a terminal node in the training data is the mode of the observations falling in that region. Thus, if an unseen data observation falls in that region, we'll make its prediction with the mode value.

Both types of trees divide the predictor space (independent variables) into distinct and non-overlapping regions. For the sake of simplicity, you can think of these regions as high-dimensional boxes.

Both types of trees follow a top-down greedy approach known as recursive binary splitting. We call it 'top-down' because it begins from the top of the tree, when all the observations are in a single region, and successively splits the predictor space into two new branches down the tree. It is known as 'greedy' because the algorithm cares only about the current split (it looks for the best variable available), and not about future splits which might lead to a better tree.

This splitting process is continued until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.

In both cases, the splitting process results in fully grown trees until the stopping criterion is reached. But a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings in 'pruning', one of the techniques used to tackle overfitting. We'll learn more about it in a following section.

3. How does a tree based algorithm decide where to split?

The decision to make strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, we can say that the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.

The algorithm selection is also based on the type of target variable. Let's look at the four most commonly used algorithms in decision trees:

Gini

Gini says that if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure.

It works with a categorical target variable "Success" or "Failure".

It performs only binary splits.

The higher the value of Gini, the higher the homogeneity.

CART (Classification and Regression Tree) uses Gini method to create binary splits.

Steps to Calculate Gini for a split

Calculate the Gini for sub-nodes using the formula: sum of squares of the probabilities of success and failure (p^2 + q^2).

Calculate the Gini for the split using the weighted Gini score of each node of that split.

Example: Referring to the example used above, where we want to segregate the students based on the target variable (playing cricket or not). In the snapshot below, we split the population using the two input variables Gender and Class. Now, I want to identify which split produces more homogeneous sub-nodes using Gini.

Split on Gender:

Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68

Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55

Calculate weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59

Similarly, for the split on Class:

Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57)=0.51

Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44)=0.51

Calculate weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51

Above, you can see that the Gini score for the split on Gender is higher than for the split on Class; hence, the node split will take place on Gender.
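The same arithmetic takes only a few lines of R, mirroring the numbers above (a quick verification sketch):

gini_node <- function(p) p^2 + (1 - p)^2            # p = proportion playing cricket in the node
# Gender split: Female node has 2/10 playing, Male node has 13/20 playing
gini_gender <- (10/30) * gini_node(2/10) + (20/30) * gini_node(13/20)   # about 0.59
# Class split: Class IX node has 6/14 playing, Class X node has 9/16 playing
gini_class <- (14/30) * gini_node(6/14) + (16/30) * gini_node(9/16)     # about 0.51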

You might often come across the term ‘Gini Impurity’ which is determined by subtracting the gini value from 1. So mathematically we can say,

Gini Impurity = 1-Gini

Chi-Square

It is an algorithm to find the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.

It works with categorical target variable “Success” or “Failure”.

It can perform two or more splits.

The higher the value of Chi-Square, the higher the statistical significance of the differences between the sub-node and the parent node.

The Chi-Square of each node is calculated using the formula:

Chi-square = ((Actual – Expected)^2 / Expected)^(1/2)

It generates a tree called CHAID (Chi-square Automatic Interaction Detector).

Steps to Calculate Chi-square for a split:

Calculate the Chi-square for each individual node by calculating the deviation for both Success and Failure.

Calculate the Chi-square of the split as the sum of all the Chi-square values for Success and Failure of each node of the split.

Example: Let’s work with above example that we have used to calculate Gini.

Split on Gender:

First, we populate the Female node: the actual values for "Play Cricket" and "Not Play Cricket" are 2 and 8 respectively.

Calculate the expected values for "Play Cricket" and "Not Play Cricket": here they would be 5 for both, because the parent node has a probability of 50% and we have applied the same probability to the Female count (10).

Calculate the deviations using the formula Actual – Expected. It is -3 for "Play Cricket" (2 – 5) and 3 for "Not Play Cricket" (8 – 5).

Calculate the Chi-square of the node for "Play Cricket" and "Not Play Cricket" using the formula ((Actual – Expected)^2 / Expected)^(1/2). You can refer to the table below for the calculation.

Follow similar steps for calculating Chi-square value for Male node.

Now add all Chi-square values to calculate Chi-square for split Gender.

Split on Class:

Perform similar steps of calculation for split on Class and you will come up with below table.

Above, you can see that Chi-square also identifies the Gender split as more significant compared to Class.
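As with Gini, the split-level Chi-square values can be reproduced in R from the counts in the example (a verification sketch only):

chi_node <- function(actual, expected) sum(sqrt((actual - expected)^2 / expected))
# Gender split: Female node (2 play, 8 not), Male node (13 play, 7 not); expected = half the node size
chi_gender <- chi_node(c(2, 8), c(5, 5)) + chi_node(c(13, 7), c(10, 10))   # about 4.58
# Class split: IX node (6 play, 8 not), X node (9 play, 7 not)
chi_class <- chi_node(c(6, 8), c(7, 7)) + chi_node(c(9, 7), c(8, 8))       # about 1.46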

Information Gain:

Look at the image below and think about which node can be described easily. I am sure your answer is C, because it requires less information as all its values are similar. On the other hand, B requires more information to describe it, and A requires the maximum information. In other words, we can say that C is a pure node, B is less impure and A is more impure.

Now, we can conclude that a less impure node requires less information to describe it, and a more impure node requires more information. Information theory defines a measure of this degree of disorganization in a system, known as Entropy. If a sample is completely homogeneous, its entropy is zero, and if a sample is equally divided (50% – 50%), it has an entropy of one.

Entropy can be calculated using the formula:

Entropy = -p log2(p) - q log2(q)

Here p and q are the probabilities of success and failure respectively in that node. Entropy is also used with a categorical target variable. We choose the split which has the lowest entropy compared to the parent node and other splits. The lesser the entropy, the better it is.

Steps to calculate entropy for a split:

Calculate entropy of parent node

Calculate entropy of each individual node of split and calculate weighted average of all sub-nodes available in split.

Example: Let’s use this method to identify best split for student example.

Entropy for the parent node = -(15/30) log2 (15/30) – (15/30) log2 (15/30) = 1. Here 1 shows that it is an impure node.

Entropy for Female node = -(2/10) log2 (2/10) – (8/10) log2 (8/10) = 0.72 and for male node,  -(13/20) log2 (13/20) – (7/20) log2 (7/20) = 0.93

Entropy for split Gender = Weighted entropy of sub-nodes = (10/30)*0.72 + (20/30)*0.93 = 0.86

Entropy for Class IX node, -(6/14) log2 (6/14) – (8/14) log2 (8/14) = 0.99 and for Class X node,  -(9/16) log2 (9/16) – (7/16) log2 (7/16) = 0.99.

Entropy for split Class =  (14/30)*0.99 + (16/30)*0.99 = 0.99

Above, you can see that the entropy for the split on Gender is the lowest among all, so the tree will split on Gender. We can derive the information gain from entropy as 1 – Entropy (here the parent entropy is 1; in general, information gain = parent entropy – weighted entropy of the split).
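Again, the arithmetic is easy to check in R (a verification sketch of the numbers above):

entropy <- function(p) -p * log2(p) - (1 - p) * log2(1 - p)      # p = proportion playing cricket
entropy(15/30)                                                    # parent node = 1
e_gender <- (10/30) * entropy(2/10) + (20/30) * entropy(13/20)    # about 0.86
e_class <- (14/30) * entropy(6/14) + (16/30) * entropy(9/16)      # about 0.99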

Reduction in Variance

Till now, we have discussed the algorithms for a categorical target variable. Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm uses the standard formula of variance to choose the best split: the split with the lower weighted variance is selected as the criterion to split the population.

Variance = sum((X – X-bar)^2) / n, where X-bar is the mean of the values, X is the actual value and n is the number of values.

Steps to calculate Variance:

Calculate variance for each node.

Calculate variance for each split as weighted average of each node variance.

Example:- Let’s assign numerical value 1 for play cricket and 0 for not playing cricket. Now follow the steps to identify the right split:

Variance for the Root node: here the mean value is (15*1 + 15*0)/30 = 0.5, and we have 15 ones and 15 zeros. The variance would be ((1-0.5)^2+(1-0.5)^2+….15 times+(0-0.5)^2+(0-0.5)^2+…15 times) / 30, which can be written as (15*(1-0.5)^2+15*(0-0.5)^2) / 30 = 0.25

Mean of Female node =  (2*1+8*0)/10=0.2 and Variance = (2*(1-0.2)^2+8*(0-0.2)^2) / 10 = 0.16

Mean of Male Node = (13*1+7*0)/20=0.65 and Variance = (13*(1-0.65)^2+7*(0-0.65)^2) / 20 = 0.23

Variance for Split Gender = Weighted Variance of Sub-nodes = (10/30)*0.16 + (20/30) *0.23 = 0.21

Mean of Class IX node =  (6*1+8*0)/14=0.43 and Variance = (6*(1-0.43)^2+8*(0-0.43)^2) / 14= 0.24

Mean of Class X node =  (9*1+7*0)/16=0.56 and Variance = (9*(1-0.56)^2+7*(0-0.56)^2) / 16 = 0.25

Variance for Split Class = (14/30)*0.24 + (16/30)*0.25 = 0.25

Above, you can see that the Gender split has a lower variance compared to the parent node (and to the Class split), so the split would take place on the Gender variable.
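The same numbers can be checked in R by encoding playing cricket as 1 and not playing as 0 (a verification sketch):

node_var <- function(play, total) {
  x <- c(rep(1, play), rep(0, total - play))   # 1 = plays cricket, 0 = does not
  mean((x - mean(x))^2)
}
node_var(15, 30)                                                     # root node = 0.25
v_gender <- (10/30) * node_var(2, 10) + (20/30) * node_var(13, 20)   # about 0.21
v_class <- (14/30) * node_var(6, 14) + (16/30) * node_var(9, 16)     # about 0.25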

Until here, we learnt about the basics of decision trees and the decision making process involved to choose the best splits in building a tree model. As I said, decision tree can be applied both on regression and classification problems. Let’s understand these aspects in detail.

4. What are the key parameters of tree based algorithms and how can we avoid over-fitting in decision trees?

Overfitting is one of the key challenges faced while using tree based algorithms. If there is no limit set on a decision tree, it will give you 100% accuracy on the training set because, in the worst case, it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in two ways:

Setting constraints on tree size

Tree pruning

Let’s discuss both of these briefly.

Setting Constraints on tree based algorithms

This can be done by using the various parameters which define a tree. First, let's look at the general structure of a decision tree:

The parameters used for defining a tree are explained below. They are described irrespective of tool; it is important to understand their role in tree modeling. These parameters are available in both R & Python.

Minimum samples for a node split

Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.

Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.

Values that are too high can lead to under-fitting; hence, this parameter should be tuned using CV.

Minimum samples for a terminal node (leaf)

Defines the minimum samples (or observations) required in a terminal node or leaf.

Used to control over-fitting similar to min_samples_split.

Generally lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in majority will be very small.

Maximum depth of tree (vertical depth)

The maximum depth of a tree.

Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.

Should be tuned using CV.

Maximum number of terminal nodes

The maximum number of terminal nodes or leaves in a tree.

Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.

Maximum features to consider for split

The number of features to consider while searching for a best split. These will be randomly selected.

As a thumb rule, the square root of the total number of features works great, but we should check up to 30-40% of the total number of features.

Higher values can lead to over-fitting, but this depends on the case.

Pruning in tree based algorithms

As discussed earlier, the technique of setting constraints is a greedy approach. In other words, it will check for the best split instantaneously and move forward until one of the specified stopping conditions is reached. Let's consider the following case when you're driving:

There are 2 lanes:

A lane with cars moving at 80km/h

A lane with trucks moving at 30km/h

At this instant, you are the yellow car and you have 2 choices:

Take a left and overtake the other 2 cars quickly

Keep moving in the present lane

Let's analyze these choices. In the former choice, you'll immediately overtake the car ahead, end up behind the truck and start moving at 30 km/h, looking for an opportunity to move back right. All the cars originally behind you move ahead in the meanwhile. This would be the optimum choice if your objective is to maximize the distance covered in the next, say, 10 seconds. In the latter choice, you sail through at the same speed, cross the trucks and then overtake, maybe, depending on the situation ahead. Greedy you!

This is exactly the difference between a normal decision tree & pruning. A decision tree with constraints won't see the truck ahead and will adopt a greedy approach by taking a left. On the other hand, if we use pruning, we in effect look a few steps ahead and make a choice.

So we know pruning is better. But how to implement it in decision tree? The idea is simple.

We first make the decision tree to a large depth.

Then we start at the bottom and start removing leaves which give us negative returns when compared from the top.

Suppose a split gives us a gain of, say, -10 (a loss of 10) and then the next split on that gives us a gain of 20. A simple decision tree will stop at step 1, but with pruning we will see that the overall gain is +10 and keep both leaves.

Note that, at the time this was written, sklearn's decision tree classifier did not support pruning (recent versions add cost-complexity pruning). Advanced packages like xgboost have adopted tree pruning in their implementation. And the rpart library in R provides a function to prune. Good for R users!
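For R users, a minimal pruning sketch with rpart might look like this (here 'train' and its target column 'y' are placeholders for your own data):

library(rpart)
# grow a deliberately deep tree, then prune it back using the cross-validated error
fit <- rpart(y ~ ., data = train, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))
printcp(fit)                                                     # error for each complexity value
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)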

5. Are tree based algorithms better than linear models?

“If I can use logistic regression for classification problems and linear regression for regression problems, why is there a need to use trees”? Many of us have this question. And, this is a valid one too.

Actually, you can use any algorithm. It is dependent on the type of problem you are solving. Let’s look at some key factors which will help you to decide which algorithm to use:

If the relationship between dependent & independent variable is well approximated by a linear model, linear regression will outperform tree based model.

If there is a high non-linearity & complex relationship between dependent & independent variables, a tree model will outperform a classical regression method.

If you need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model. Decision tree models are even simpler to interpret than linear regression!

6. Working with Decision Trees in R and Python

For R users and Python users, decision tree is quite easy to implement. Let’s quickly look at the set of codes that can get you started with this algorithm. For ease of use, I’ve shared standard codes where you’ll need to replace your data set name and variables to get started.




For R users, there are multiple packages available to implement decision trees, such as ctree, rpart, tree, etc.

> x <- cbind(x_train,y_train)
# grow tree
> fit <- rpart(y_train ~ ., data = x, method = "class")
#Predict Output
> predicted = predict(fit, x_test)

In the code above:

y_train – represents the dependent variable.

x_train – represents the independent variables.

x – represents the training data.

For Python users, below is the code:

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini')
# for classification, here you can change the algorithm as gini or entropy (information gain), by default it is gini
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)

7. What are ensemble methods in tree based algorithms?

The literal meaning of the word 'ensemble' is group. Ensemble methods involve a group of predictive models working together to achieve better accuracy and model stability. Ensemble methods are known to impart a supreme boost to tree based models.

Like every other model, a tree based algorithm also suffers from the plague of bias and variance. Bias means 'how much, on average, the predicted values differ from the actual value'. Variance means 'how different the predictions of the model would be at the same point if different samples were taken from the same population'.

You build a small tree and you will get a model with low variance and high bias. How do you manage to balance the trade-off between bias and variance?

Normally, as you increase the complexity of your model, you will see a reduction in prediction error due to lower bias in the model. As you continue to make your model more complex, you end up over-fitting your model and your model will start suffering from high variance.

A champion model should maintain a balance between these two types of errors. This is known as the trade-off management of bias-variance errors. Ensemble learning is one way to execute this trade off analysis.

Some of the commonly used ensemble methods include: Bagging, Boosting and Stacking. In this tutorial, we’ll focus on Bagging and Boosting in detail.

8. What is Bagging? How does it work?

Bagging is an ensemble technique used to reduce the variance of our predictions by combining the result of multiple classifiers modeled on different sub-samples of the same data set. The following figure will make it clearer:

The steps followed in bagging are:

Create Multiple DataSets:

Sampling is done with replacement on the original data and new datasets are formed.

The new data sets can have a fraction of the columns as well as rows, which are generally hyper-parameters in a bagging model

Taking row and column fractions less than 1 helps in making robust models, less prone to overfitting

Build Multiple Classifiers:

Classifiers are built on each data set.

Generally the same classifier is modeled on each data set and predictions are made.

Combine Classifiers:

The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.

The combined values are generally more robust than a single model.

Note that here the number of models built is not a hyper-parameter. A higher number of models is always better, or may give similar performance to a lower number. It can be shown theoretically that, under some assumptions, the variance of the combined predictions is reduced to 1/n (n: number of classifiers) of the original variance.
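The 1/n claim is easy to see with a toy simulation in R: each 'model' below is just the mean of a random sample, and averaging 25 independent estimates cuts the variance roughly 25-fold (this only illustrates the idealized, independent case):

set.seed(1)
single <- replicate(2000, mean(rnorm(50)))                         # one model's estimate
bagged <- replicate(2000, mean(replicate(25, mean(rnorm(50)))))    # average of 25 models
var(bagged) / var(single)                                          # close to 1/25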

There are various implementations of bagging models. Random forest is one of them and we’ll discuss it next.

9. What is Random Forest ? How does it work?

Random Forest is considered to be a panacea of all data science problems. On a funny note, when you can’t think of any algorithm (irrespective of situation), use random forest!

Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also undertakes dimensional reduction methods, treats missing values, outlier values and other essential steps of data exploration, and does a fairly good job. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

How does it work?

In Random Forest, we grow multiple trees as opposed to a single tree in CART model (see comparison between CART and Random Forest here, part1 and part2). To classify a new object based on attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest) and in case of regression, it takes the average of outputs by different trees.

It works in the following manner. Each tree is planted & grown as follows:

Assume the number of cases in the training set is N. A sample of these N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.

If there are M input variables, a number m<M is specified such that at each node, m variables are selected at random out of the M. The best split on these m is used to split the node. The value of m is held constant while we grow the forest.

Each tree is grown to the largest extent possible and  there is no pruning.

Predict new data by aggregating the predictions of the ntree trees (i.e., majority votes for classification, average for regression).

To understand more in detail about this algorithm using a case study, please read this article “Introduction to Random forest – Simplified“.

Advantages of Random Forest

This algorithm can solve both types of problems, i.e. classification and regression, and does a decent estimation on both fronts.

It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.

It has methods for balancing errors in data sets where classes are imbalanced.

The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.

Random Forest involves sampling of the input data with replacement, called bootstrap sampling. Here, about one third of the data is not used for training and can be used for testing. These are called the out-of-bag samples. The error estimated on these out-of-bag samples is known as the out-of-bag error. Studies of such error estimates give evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.

Disadvantages of Random Forest

It surely does a good job at classification, but not as good a job for regression problems, as it does not give precise continuous predictions. In the case of regression, it doesn't predict beyond the range of the training data, and it may over-fit data sets that are particularly noisy.

Random Forest can feel like a black box approach for statistical modelers – you have very little control over what the model does. You can, at best, try different parameters and random seeds!

Python & R implementation

Random forests have commonly known implementations in R packages and Python scikit-learn. Let’s look at the code of loading random forest model in R and Python below:

R Code

> x <- cbind(x_train,y_train)

# Fitting model
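The fit and predict lines are a likely completion (assuming the randomForest package and the same x_train, y_train, x_test objects as in the decision tree example):

> library(randomForest)
> fit <- randomForest(y_train ~ ., data = x, ntree = 500)   # ntree = 500 is an arbitrary choice
> summary(fit)
#Predict Output
> predicted = predict(fit, x_test)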

10. What is Boosting? How does it work?

Definition: The term ‘Boosting’ refers to a family of algorithms which converts weak learner to strong learners.

Let’s understand this definition in detail by solving a problem of spam email identification:

How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify 'spam' and 'not spam' emails using the following criteria. If:

Email has only one image file (promotional image), It’s a SPAM

Email has only link(s), It’s a SPAM

Email body consist of sentence like “You won a prize money of $ xxxxxx”, It’s a SPAM

Email from known source, Not a SPAM

Above, we've defined multiple rules to classify an email as 'spam' or 'not spam'. But do you think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email as 'spam' or 'not spam'. Therefore, these rules are called weak learners.

To convert weak learner to strong learner, we’ll combine the prediction of each weak learner using methods like:

Using average/ weighted average

Considering the prediction which has the higher vote

For example: above, we have defined 5 weak learners. Out of these 5, 3 vote 'SPAM' and 2 vote 'Not a SPAM'. In this case, by default, we'll consider the email as SPAM because we have the higher vote (3) for 'SPAM'.

How does it work?

Now we know that boosting combines weak learners (a.k.a. base learners) to form a strong rule. An immediate question which should pop into your mind is, 'How does boosting identify weak rules?'

To find a weak rule, we apply base learning (ML) algorithms with a different distribution each time. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule. This is how the ensemble model is built.

Here’s another question which might haunt you, ‘How do we choose different distribution for each round?’

For choosing the right distribution, here are the following steps:

Step 1: The base learner takes all of the distribution and assigns equal weight or attention to each observation.

Step 2: If there is any prediction error caused by the first base learning algorithm, then we pay higher attention to the observations with prediction errors. Then, we apply the next base learning algorithm.

Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or a higher accuracy is achieved.
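To make the re-weighting idea concrete, here is a tiny AdaBoost-style sketch in R using rpart stumps on synthetic data (the data, the number of rounds and the specific weight-update rule are illustrative choices, not the only way to boost):

library(rpart)
set.seed(1)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$y <- factor(ifelse(train$x1 + train$x2 + rnorm(200, sd = 0.5) > 0, "yes", "no"))
n <- nrow(train)
w <- rep(1, n)                                   # Step 1: equal weight on every observation
for (m in 1:10) {
  stump <- rpart(y ~ x1 + x2, data = train, weights = w, method = "class",
                 control = rpart.control(maxdepth = 1))
  pred <- predict(stump, train, type = "class")
  miss <- as.numeric(pred != train$y)
  err <- sum(w * miss) / sum(w)
  alpha <- 0.5 * log((1 - err) / err)            # learner weight (sketch assumes 0 < err < 0.5)
  w <- w * exp(alpha * miss)                     # Step 2: pay higher attention to mistakes
  w <- n * w / sum(w)                            # keep the weights on a comparable scale
}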

There are many boosting algorithms which impart additional boost to model’s accuracy. In this tutorial, we’ll learn about the two most commonly used algorithms i.e. Gradient Boosting (GBM) and XGboost.

11. Which is more powerful: GBM or Xgboost?

Regularization:

A standard GBM implementation has no regularization like XGBoost; this regularization also helps XGBoost reduce overfitting.

In fact, XGBoost is also known as ‘regularized boosting‘ technique.

Parallel Processing:

XGBoost implements parallel processing and is blazingly faster as compared to GBM.

But hang on, we know that boosting is a sequential process, so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from building a tree using all cores? I hope you get where I'm coming from. Check this link out to explore further.

XGBoost also supports implementation on Hadoop.

High Flexibility

XGBoost allow users to define custom optimization objectives and evaluation criteria.

This adds a whole new dimension to the model and there is no limit to what we can do.

Handling Missing Values

XGBoost has an in-built routine to handle missing values.

The user is required to supply a value different from other observations and pass that as a parameter. XGBoost tries different things when it encounters a missing value on each node and learns which path to take for missing values in the future.

Tree Pruning:

A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.

XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.

Built-in Cross-Validation

XGBoost allows the user to run a cross-validation at each iteration of the boosting process, and thus it is easy to get the exact optimum number of boosting iterations in a single run.

This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.

Continue on Existing Model

XGBoost can continue training from the last iteration of a previous run, which can be a significant advantage in certain applications. The GBM implementation of sklearn also has this feature (via warm_start), so the two are even on this point.

12. Working with GBM in R and Python

Before we start working, let’s quickly understand the important parameters and the working of this algorithm. This will be helpful for both R and Python users. Below is the overall pseudo-code of GBM algorithm for 2 classes:

1. Initialize the outcome
2. Iterate from 1 to the total number of trees
  2.1 Update the weights for targets based on the previous run (higher for the ones mis-classified)
  2.2 Fit the model on a selected subsample of data
  2.3 Make predictions on the full set of observations
  2.4 Update the output with the current results, taking into account the learning rate
3. Return the final output

This is an extremely simplified (probably naive) explanation of how GBM works, but it will help beginners understand the algorithm; a toy sketch of this loop follows below.
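The following is a deliberately naive illustration of the pseudo-code above, not a faithful gradient boosting implementation: the toy data, the re-weighting factor (1.5) and the tree depth are all made up for the example.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data: 200 samples, 3 features, binary target
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

learning_rate, n_trees, subsample = 0.1, 50, 0.8
weights = np.ones(len(y)) / len(y)                                    # 1. initialize
output = np.zeros(len(y))

for _ in range(n_trees):                                              # 2. iterate over trees
    idx = rng.choice(len(y), int(subsample * len(y)), replace=False)  # 2.2 fit on a subsample
    tree = DecisionTreeClassifier(max_depth=3)
    tree.fit(X[idx], y[idx], sample_weight=weights[idx])
    pred = tree.predict(X)                                            # 2.3 predict on the full set
    weights = np.where(pred != y, weights * 1.5, weights)             # 2.1 up-weight mis-classified points
    weights /= weights.sum()
    output += learning_rate * (2 * pred - 1)                          # 2.4 update output using the learning rate

final_class = (output > 0).astype(int)                                # 3. final output
print('training accuracy:', (final_class == y).mean())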

Let’s consider the important GBM parameters used to improve model performance in Python:

learning_rate

This determines the impact of each tree on the final outcome (step 2.4). GBM works by starting with an initial estimate which is updated using the output of each tree. The learning parameter controls the magnitude of this change in the estimates.

Lower values are generally preferred as they make the model robust to the specific characteristics of each tree, thus allowing it to generalize well.

However, lower values require a higher number of trees to model all the relations, which is computationally expensive.

n_estimators

The number of sequential trees to be modeled (step 2)

Though GBM is fairly robust with a higher number of trees, it can still overfit at some point. Hence, this should be tuned using CV for a particular learning rate.

subsample

The fraction of observations to be selected for each tree. Selection is done by random sampling.

Values slightly less than 1 make the model robust by reducing the variance.

Typical values ~0.8 generally work fine but can be fine-tuned further.

Apart from these, there are certain miscellaneous parameters which affect overall functionality:

loss

It refers to the loss function to be minimized in each split.

It can have various values for classification and regression case. Generally the default values work fine. Other values should be chosen only if you understand their impact on the model.

init

This affects initialization of the output.

This can be used if we have made another model whose outcome is to be used as the initial estimates for GBM.

random_state

The random number seed so that same random numbers are generated every time.

This is important for parameter tuning. If we don’t fix the random number, then we’ll have different outcomes for subsequent runs on the same parameters and it becomes difficult to compare models.

Fixing the seed can potentially result in overfitting to the particular random sample selected. We can try running models for different random samples, but this is computationally expensive and generally not done.

verbose

The type of output to be printed when the model fits. The different values can be:

0: no output generated (default)

1: output generated for trees in certain intervals

warm_start

This parameter has an interesting application and can help a lot if used judiciously: with it, additional trees can be fitted on top of a previous fit of the model instead of training from scratch.

presort 

 Select whether to presort data for faster splits.

It makes the selection automatically by default but it can be changed if needed.

I know its a long list of parameters but I have simplified it for you in an excel file which you can download from this GitHub repository.

For R users, using caret package, there are 3 main tuning parameters:

n.trees – It refers to the number of iterations, i.e., the number of trees that will be grown

interaction.depth – It determines the complexity of the tree i.e. total number of splits it has to perform on a tree (starting from a single node)

 shrinkage – It refers to the learning rate. This is similar to learning_rate in python (shown above).

n.minobsinnode – It refers to minimum number of training samples required in a node to perform splitting

GBM in R (with cross validation)

I’ve shared the standard codes in R and Python. At your end, you’ll be required to change the value of dependent variable and data set name used in the codes below. Considering the ease of implementing GBM in R, one can easily perform tasks like cross validation and grid search with this package.

A minimal caret setup looks like the following (the value of interaction.depth, the formula and the data frame names are illustrative placeholders to adapt to your data):

library(caret)

fitControl <- trainControl(method = "cv", number = 10)   # 10-fold cross validation

gbmGrid <- expand.grid(interaction.depth = 2,
                       n.trees = 500,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)

fit <- train(target ~ ., data = trainSet,
             method = "gbm",
             trControl = fitControl,
             verbose = FALSE,
             tuneGrid = gbmGrid)

predict(fit, newdata = testSet, type = "prob")

GBM in Python
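The Python snippet below is a minimal sketch using scikit-learn’s GradientBoostingClassifier together with a grid search; X_train and y_train are assumed to be your prepared features and target, and the grid values are only starting points.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# assumed: X_train (feature matrix) and y_train (binary target) already prepared
param_grid = {
    'n_estimators': [100, 500],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8],
    'max_depth': [3, 5],
}

gbm = GradientBoostingClassifier(random_state=10)
grid = GridSearchCV(gbm, param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)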

13. Working with XGBoost in R and Python

R Tutorial: For R users, this is a complete tutorial on XGboost which explains the parameters along with codes in R. Check Tutorial.

Python Tutorial: For Python users, this is a comprehensive tutorial on XGBoost, good to get you started. Check Tutorial.
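If you just want a quick start before diving into those tutorials, a minimal sketch with the scikit-learn-style wrapper looks like this (X_train, X_test, y_train, y_test are assumed pre-split data, and the parameter values are only illustrative):

from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4, subsample=0.8)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))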

14. Where to practice ?

Practice is the one and true method of mastering any concept. Hence, you need to start practicing if you wish to master these algorithms.

Till here, you’ve gained significant knowledge of tree-based algorithms along with their practical implementation. It’s time that you start working on them. Here are open practice problems where you can participate and check your live rankings on the leaderboard:

End Notes

Tree based algorithms are important for every data scientist to learn. In fact, tree models are known to deliver some of the best model performance in the whole family of machine learning algorithms. In this tutorial, we covered everything up to GBM and XGBoost. And with this, we come to the end of this tutorial.

We discussed tree based algorithms from scratch. We learnt the importance of decision trees and how that simple concept is used in boosting algorithms. For better understanding, I would suggest you keep practicing these algorithms hands-on. Also, do keep note of the parameters associated with boosting algorithms. I hope this tutorial has enriched you with complete knowledge of tree based modeling.

Note – The discussions of this article are going on at AV’s Discuss portal. Join here! You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.


Jsp Implicit Objects: Complete Tutorial

What is JSP Implicit object?

JSP implicit objects are created during the translation phase of JSP to the servlet.

These objects can be directly used in scriptlets that go in the service method.

They are created by the container automatically and can be accessed directly without being explicitly declared.

How many Implicit Objects are available in JSP?

There are 9 types of implicit objects available in the container:

Lets study One By One

Out

Out is one of the implicit objects used to write data to the buffer and send output to the client in the response

The out object allows us to access the servlet’s output stream

Out is an object of the javax.servlet.jsp.JspWriter class

While working with servlets, we need a PrintWriter object

Example:

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>
<%
int num1 = 10;
int num2 = 20;
out.println("num1 is " + num1);
out.println("num2 is " + num2);
%>

Explanation of the code:

Code Line 11-12– out is used to print into output stream

When we execute the above code, we get the following output:

Output:

In the output, we get the values of num1 and num2

Request

It will be created by the container for every request.

It is used to get request information like parameters, header information, server name, etc.

It uses getParameter() to access the request parameters.

Example:

Implicit_jsp2.jsp(form from which request is sent to guru.jsp)

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>

Guru.jsp (where the action is taken)

Explanation of code:

Code Line 10-13: In implicit_jsp2.jsp (the form), the request is sent, so the variable username is processed and sent to guru.jsp, which is the action of the JSP.

Guru.jsp

Code Line10-11: It is action jsp where the request is processed, and username is taken from form jsp.

When you execute the above code, you get the following output

Output:

Response

“Response” is an instance of class which implements HttpServletResponse interface

The container generates this object and passes it to the _jspService() method as a parameter

“Response object” will be created by the container for each request.

It represents the response that can be given to the client

The response implicit object is used to set the content type, add cookies, and redirect to a response page

Example:

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>

Explanation of the code:

Code Line 11: In the response object we can set the content type

Here we are setting only the content type in the response object. Hence, there is no output for this.

Config

“Config” is of the type javax.servlet.ServletConfig

It is created by the container for each jsp page

It is used to get the initialization parameter in web.xml

Example:

Web.xml (specifies the name and mapping of the servlet)

Implicit_jsp5.jsp (getting the value of servlet name)

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>
<%
String servletName = config.getServletName();
out.println("Servlet Name is " + servletName);
%>

Explanation of the code:

In web.xml

Code Line 14-17: In web.xml, we have the mapping of servlets to the classes.

Implicit_jsp5.jsp

Code Line 10-11: To get the name of the servlet in JSP, we can use config.getServletName, which will help us to get the name of the servlet.

When you execute the above code you get the following output:

Output:

Servlet name is “GuruServlet” as the name is present in web.xml

Application

Application object (code line 10) is an instance of javax.servlet.ServletContext and it is used to get the context information and attributes in JSP.

Application object is created by container one per application, when the application gets deployed.

The ServletContext object contains a set of methods which are used to interact with the servlet container. We can find information about the servlet container through it.

Example:

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>

Explanation of the code:

In the above code, application attribute helps to get the context path of the JSP page.

Session

Session object is used to get, set and remove attributes to session scope and also used to get session information

Example:

Implicit_jsp7(attribute is set)

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>

Implicit_jsp8.jsp (getAttribute)

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>
<%
String name = (String) session.getAttribute("user");
out.println("User Name is " + name);
%>

Explanation of the code:

Implicit_jsp7.jsp

Code Line 11: we are setting the attribute user in the session variable, and that value can be fetched from the session in whichever jsp is called from that (_jsp8.jsp).

Code Line 12: We are calling another jsp on href in which we will get the value for attribute user which is set.

Implicit_jsp8.jsp

Code Line 11: We are getting the value of user attribute from session object and displaying that value

When you execute the above code, you get the following output:

Output:

PageContext

This object is of the type of pagecontext.

It is used to get, set and remove the attributes from a particular scope

Scopes are of 4 types:

Page

Request

Session

Application

Example:

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>
<%
pageContext.setAttribute("student", "gurustudent", pageContext.PAGE_SCOPE);
String name = (String) pageContext.getAttribute("student");
out.println("student name is " + name);
%>

Explanation of the code:

Code Line 11: we are setting the attribute using pageContext object, and it has three parameters:

Key

Value

Scope

In the above code, the key is student and the value is “gurustudent”, while the scope is the page scope. Since the scope is “page”, the attribute can be retrieved using the page scope only.

Code Line 12: We are getting the value of the attribute using pageContext

When you execute the above code, you get the following output:

Output:

The output will print “student name is gurustudent”.

Page

Page implicit variable holds the currently executed servlet object for the corresponding jsp.

Acts as this object for current jsp page.

Example:

In this example, we are using page object to get the page name using toString method

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>
<%
String pageName = page.toString();
out.println("Page Name is " + pageName);
%>

Explanation of the code:

Code Line 10-11: In this example, we use the toString() method of the page object to get the string name of the JSP page.

When you execute the code you get the following output:

Output:

Output is string name of above jsp page

Exception

Exception is the implicit object of the java.lang.Throwable class.

It is used for exception handling in JSP.

The exception object can only be used in error pages.

Example:

<%@ page language="java" contentType="text/html; charset=ISO-8859-1" %>
<%
int[] num1 = {1, 2, 3, 4};
out.println(num1[4]);  // index 4 is the fifth element, which does not exist
%>

Explanation of the code:

Code Line 10-12: It has an array of numbers, i.e., num1 with four elements. In the output, we are trying to print the fifth element of the array num1, which was never declared, so the resulting exception can be retrieved through the exception object of the JSP.

Output:

We get an ArrayIndexOutOfBoundsException because we are trying to access the fifth element of the num1 array, which only has four elements.

Complete Tutorial On Text Classification Using Conditional Random Fields Model (In Python)

Introduction

News sites and other online media alone generate tons of text content on an hourly basis. Analyzing patterns in that data can become daunting if you don’t have the right tools. Here we will discuss one such approach, using entity recognition, called Conditional Random Fields (CRF).

This article explains the concept and python implementation of conditional random fields on a self-annotated dataset. This is a really fun concept and I’m sure you’ll enjoy taking this ride with me!

Table of contents

What is Entity Recognition?

Case Study Objective and Understanding Different Approaches

Formulating Conditional Random Fields (CRFs)

Annotating Training Data

Annotations using GATE

Building and Training a CRF Module in Python

What is Entity Recognition?

Entity recognition has seen a recent surge in adoption with the interest in Natural Language Processing (NLP). An entity can generally be defined as a part of text that is of interest to the data scientist or the business. Examples of frequently extracted entities are names of people, address, account numbers, locations etc. These are only simple examples and one could come up with one’s own entity for the problem at hand.

To take a simple application of entity recognition, if there’s any text with “London” in the dataset, the algorithm would automatically categorize or classify that as a location (you must be getting a general idea of where I’m going with this).

Let’s take a simple case study to understand our topic in a better way.

Case Study Objective & Understanding Different Approaches

Suppose that you are part of an analytics team in an insurance company where each day, the claims team receives thousands of emails from customers regarding their claims. The claims operations team goes through each email and updates an online form with the details before acting on them.

You are asked to work with the IT team to automate the process of pre-populating the online form. For this task, the analytics team needs to build a custom entity recognition algorithm.

To identify entities in text, one must be able to identify the pattern. For example, if we need to identify the claim number, we can look at the words around it such as “my id is” or “my number is”, etc. Let us examine a few approaches mentioned below for identifying the patterns.

Regular expressions

: Regular expressions (RegEx) are a form of finite state automaton. They are very helpful in identifying patterns that follow a certain structure. For example, email ID, phone number, etc. can be identified well using RegEx. However, the downside of this approach is that one needs to be aware of all the possible exact words that occur before the claim number. This is not a learning approach, but rather a brute force one
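As a rough sketch of this brute-force idea (the pattern below only works when the exact phrase “my id is” precedes the claim number, as in the sample email used later in this article):

import re

email = "Hi, I am writing this email to claim my insurance amount. My id is abc123 and I claimed it on 1st January 2023."
match = re.search(r"my id is\s+(\w+)", email, flags=re.IGNORECASE)
if match:
    print(match.group(1))   # abc123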

Hidden Markov Model (HMM)

: This is a sequence modelling algorithm that identifies and learns the pattern. Although HMM considers the future observations around the entities for learning a pattern, it assumes that the features are independent of each other. This approach is better than regular expressions as we do not need to model the exact set of word(s). But in terms of performance, it is not known to be the best method for entity recognition

MaxEnt Markov Model (MEMM)

: This is also a sequence modelling algorithm. This does not assume that features are independent of each other and also does not consider future observations for learning the pattern. In terms of performance, it is not known to be the best method for identifying entity relationships either

Conditional Random Fields (CRF)

: This is also a sequence modelling algorithm. This not only assumes that features are dependent on each other, but also considers the future observations while learning a pattern. This combines the best of both HMM and MEMM. In terms of performance, it is considered to be the best method for entity recognition problem

Formulating Conditional Random Fields (CRF)

The bag of words (BoW) approach works well for multiple text classification problems. This approach assumes that presence or absence of word(s) matter more than the sequence of the words. However, there are problems such as entity recognition, part of speech identification where word sequences matter as much, if not more. Conditional Random Fields (CRF) comes to the rescue here as it uses word sequences as opposed to just words.

Let us now understand how CRF is formulated.

Below is the formula for CRF where Y is the hidden state (for example, part of speech) and X is the observed variable (in our example this is the entity or other words around it).
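In its standard linear-chain form, this probability can be written as:

P(y | x) = (1 / Z(x)) * exp( Σt Σk wk · fk(yt, yt-1, x, t) )

where the fk are the feature functions, the wk are the corresponding weights, and Z(x) is the normalization constant obtained by summing the exponentiated score over all possible state sequences.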

Broadly speaking, there are 2 components to the CRF formula:

Normalization

: You may have observed that there are no probabilities on the right side of the equation where we have the weights and features. However, the output is expected to be a probability and hence there is a need for normalization. The normalization constant Z(x) is a sum of all possible state sequences such that the total becomes 1. You can find more details in the reference section of this article to understand how we arrived at this value.

Weights and Features

: This component can be thought of as the logistic regression formula with weights and the corresponding features. The weight estimation is performed by maximum likelihood estimation and the features are defined by us.

Annotating training data

Now that you are aware of the CRF model, let us curate the training data. The first step to doing this is annotation.  Annotation is a process of tagging the word(s) with the corresponding tag. For simplicity, let us suppose that we only need 2 entities to populate the online form, namely the claimant name and the claim number.

The following is a sample email received as is. Such emails need to be annotated so that the CRF model can be trained. The annotated text needs to be in an XML format. Although you may choose to annotate the documents in your way, I’ll walk you through the use of the GATE architecture to do the same.

Email received:

“Hi,

I am writing this email to claim my insurance amount. My id is abc123 and I claimed it on 1st January 2023. I did not receive any acknowledgement. Please help.

Thanks,

randomperson”

Annotated Email:

Annotations using GATE

Let us understand how to use the General Architecture for Text Engineering (GATE). Please follow the below steps to install GATE.

Install the GATE platform by executing the downloaded installer and following the installation steps appropriately

Create annotations on the fly and use them; in this article, we will demonstrate this approach.

The above process will save all the annotated emails in one folder.

Building and Training a CRF Module in Python

First download the pycrf module. For PIP installation, the command is “pip install python-crfsuite” and for conda installation, the command is “conda install -c conda-forge python-crfsuite”

Once the installation is complete, you are ready to train and build your own CRF module. Let’s do this!

#invoke libraries

from bs4 import BeautifulSoup as bs

from bs4.element import Tag

import codecs

import nltk

from nltk import word_tokenize, pos_tag

from sklearn.model_selection import train_test_split

import pycrfsuite

import os, os.path, sys

import glob

from xml.etree import ElementTree

import numpy as np

from sklearn.metrics import classification_report

Let’s define and build a few functions.

#this function appends all annotated files

def append_annotations(files):

   xml_files = glob.glob(files +"/*.xml")

   xml_element_tree = None

   new_data = ""

   for xml_file in xml_files:

       data = ElementTree.parse(xml_file).getroot()

       #print ElementTree.tostring(data)        

       temp = ElementTree.tostring(data)

       new_data += temp.decode('utf-8')  # ElementTree.tostring() returns bytes in Python 3

   return(new_data)

#this function removes special characters and punctuations

def remov_punct(withpunct):

   punctuations = '''!()-[]{};:'",<>./?@#$%^&*_~'''

   without_punct = ""

   char = 'nan'

   for char in withpunct:

       if char not in punctuations:

           without_punct = without_punct + char

   return(without_punct)

# functions for extracting features in documents

def extract_features(doc):

   return [word2features(doc, i) for i in range(len(doc))]

def get_labels(doc):

   return [label for (token, postag, label) in doc]

Now we will import the annotated training data.

files_path = "D:/Annotated/"

allxmlfiles = append_annotations(files_path)

soup = bs(allxmlfiles, "html5lib")

#identify the tagged element

docs = []

sents = []

for d in soup.find_all("document"):

  for wrd in d.contents:    

   tags = []

   NoneType = type(None)   

   if isinstance(wrd.name, NoneType) == True:

       withoutpunct = remov_punct(wrd)

       temp = word_tokenize(withoutpunct)

       for token in temp:

           tags.append((token,'NA'))            

   else:

       withoutpunct = remov_punct(wrd)

       temp = word_tokenize(withoutpunct)

       for token in temp:

           tags.append((token,wrd.name))    

   sents = sents + tags

  docs.append(sents) #appends all the individual documents into one list

Generate features. These are the default features that NER algorithm uses in nltk. One can modify it for customization.

data = []

for i, doc in enumerate(docs):

   tokens = [t for t, label in doc]    

   tagged = nltk.pos_tag(tokens)    

   data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

def word2features(doc, i):

   word = doc[i][0]

   postag = doc[i][1]

   # Common features for all words. You may add more features here based on your custom use case
   features = [

       'bias',

       'word.lower=' + word.lower(),

       'word[-3:]=' + word[-3:],

       'word[-2:]=' + word[-2:],

       'word.isupper=%s' % word.isupper(),

       'word.istitle=%s' % word.istitle(),

       'word.isdigit=%s' % word.isdigit(),

       'postag=' + postag

   ]

   # Features for words that are not at the beginning of a document
   if i > 0:
       word1 = doc[i-1][0]

       postag1 = doc[i-1][1]

       features.extend([

           '-1:word.lower=' + word1.lower(),

           '-1:word.istitle=%s' % word1.istitle(),

           '-1:word.isupper=%s' % word1.isupper(),

           '-1:word.isdigit=%s' % word1.isdigit(),

           '-1:postag=' + postag1

       ])

   else:

       # Indicate that it is the 'beginning of a document'

       features.append('BOS')

   # Features for words that are not at the end of a document
   if i < len(doc)-1:

       word1 = doc[i+1][0]

       postag1 = doc[i+1][1]

       features.extend([

           '+1:word.lower=' + word1.lower(),

           '+1:word.istitle=%s' % word1.istitle(),

           '+1:word.isupper=%s' % word1.isupper(),

           '+1:word.isdigit=%s' % word1.isdigit(),

           '+1:postag=' + postag1

       ])

   else:

       # Indicate that it is the 'end of a document'

       features.append('EOS')

   return features

Now we’ll build features and create train and test data frames.

X = [extract_features(doc) for doc in data]

y = [get_labels(doc) for doc in data]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
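The actual training step is assumed to look roughly like the sketch below, since the model file crf.model is opened for tagging right after; the regularization values and iteration count are only illustrative.

trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 0.1,                           # L1 regularization
    'c2': 0.01,                          # L2 regularization
    'max_iterations': 200,
    'feature.possible_transitions': True
})

trainer.train('crf.model')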

Let’s test our model.

tagger = pycrfsuite.Tagger()

tagger.open('crf.model')

y_pred = [tagger.tag(xseq) for xseq in X_test]

You can inspect any predicted value by selecting the corresponding row number “i”.

i = 0

for x, y in zip(y_pred[i], [x[1].split("=")[1] for x in X_test[i]]):

   print("%s (%s)" % (y, x))

Check the performance of the model.

# Create a mapping of labels to indices

labels = {"NA": 0, "claim_number": 1, "claimant": 2}

# Convert the sequences of tags into a 1-dimensional array

predictions = np.array([labels[tag] for row in y_pred for tag in row])

truths = np.array([labels[tag] for row in y_test for tag in row])

Print out the classification report. Based on the model performance, build better features to improve the performance.

print(classification_report(
   truths, predictions,
   target_names=["NA", "claim_number", "claimant"]))

#predict new data

with codecs.open("D:/SampleEmail6.xml", "r", "utf-8") as infile:

   soup_test = bs(infile, "html5lib")

docs = []

sents = []

for d in soup_test.find_all("document"):

  for wrd in d.contents:    

   tags = []

   NoneType = type(None)   

   if isinstance(wrd.name, NoneType) == True:

       withoutpunct = remov_punct(wrd)

       temp = word_tokenize(withoutpunct)

       for token in temp:

           tags.append((token,'NA'))            

   else:

       withoutpunct = remov_punct(wrd)

       temp = word_tokenize(withoutpunct)

       for token in temp:

           tags.append((token,wrd.name))

   #docs.append(tags)

   sents = sents + tags # puts all the sentences of a document in one element of the list

  docs.append(sents) #appends all the individual documents into one list

data_test = []

for i, doc in enumerate(docs):

   tokens = [t for t, label in doc]    

   tagged = nltk.pos_tag(tokens)    

   data_test.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

data_test_feats = [extract_features(doc) for doc in data_test]

tagger.open('crf.model')

newdata_pred = [tagger.tag(xseq) for xseq in data_test_feats]

# Let's check predicted data

i = 0

for x, y in zip(newdata_pred[i], [x[1].split("=")[1] for x in data_test_feats[i]]):

   print("%s (%s)" % (y, x))

By now, you would have understood how to annotate training data, how to use Python to train a CRF model, and finally how to identify entities from new text. Although this algorithm provides some basic set of features, you can come up with your own set of features to improve the accuracy of the model.

End Notes

To summarize, here are the key points that we have covered in this article:

Entities are parts of text that are of interest for the business problem at hand

Sequence of words or tokens matter in identifying entities

Pattern recognition approaches such as Regular Expressions or graph-based models such as Hidden Markov Model and Maximum Entropy Markov Model can help in identifying entities. However, Conditional Random Fields (CRF) is a popular and arguably a better candidate for entity recognition problems

CRF is an undirected graph-based model that considers words that occur not only before the entity but also after it

The training data can be annotated by using GATE architecture

The Python code provided helps in training a CRF model and extracting entities from text

In conclusion, this article should give you a good starting point for your business problem

References

About the Author

Sidharth Macherla – Independent Researcher, Natural Language Processing

Sidharth Macherla has over 12 years of experience in data science and his current area of focus is Natural Language Processing. He has worked across Banking, Insurance, Investment Research and Retail domains.


Time Series Analysis And Forecasting

Introduction

Time series data analysis is a way of studying the characteristics of a response variable with respect to time as the independent variable. To estimate the target variable when predicting or forecasting, we use the time variable as the point of reference. A time series represents a series of time-based orderings: years, months, weeks, days, hours, minutes, and seconds. It is a set of observations taken at successive, discrete time intervals.

The time variable/feature is the independent variable and supports the target variable to predict the results. Time Series Analysis (TSA) is used in different fields for time-based predictions – like Weather Forecasting models, Stock market predictions, Signal processing, Engineering domain – Control Systems, and Communications Systems. Since TSA involves producing the set of information in a particular sequence, this makes it distinct from spatial and other analyses. We could predict the future using AR, MA, ARMA, and ARIMA models.

Learning Objectives

We will discuss in detail TSA Objectives, Assumptions, and Components (stationary and non-stationary).

We will look at the TSA algorithms.

Finally, we will look at specific use cases in Python.

This article was published as a part of the Data Science Blogathon.

What Is Time Series Analysis?

Definition: There are many definitions of TSA, but let’s keep it simple.

A time series is nothing but a sequence of various data points that occurred in a successive order for a given period of time

Objectives of Time Series Analysis:

To understand how time series works and what factors affect a certain variable(s) at different points in time.

Time series analysis will provide the consequences and insights of the given dataset’s features that change over time.

Supporting the prediction of future values of the time series variable.

Assumptions: There is only one assumption in TSA, which is “stationarity”: the origin of time does not affect the statistical properties of the process under study.

How to Analyze Time Series?

To perform the time series analysis, we have to follow the following steps:

Collecting the data and cleaning it

Preparing Visualization with respect to time vs key feature

Observing the stationarity of the series

Developing charts to understand its nature.

Model building – AR, MA, ARMA and ARIMA

Extracting insights from prediction

Significance of Time Series

TSA is the backbone for prediction and forecasting analysis, specific to time-based problem statements.

Analyzing the historical dataset and its patterns

Understanding and matching the current situation with patterns derived from the previous stage.

Understanding the factor or factors influencing certain variable(s) in different periods.

With the help of “Time Series,” we can prepare numerous time-based analyses and results.

Forecasting: Predicting any value for the future.

Segmentation: Grouping similar items together.

Classification: Classifying a set of items into given classes.

Descriptive analysis: Analysis of a given dataset to find out what is there in it.

Intervention analysis: Effect of changing a given variable on the outcome.

Components of Time Series Analysis

Let’s look at the various components of Time Series Analysis-

Trend: a long-term movement in which there is no fixed interval; any divergence within the given dataset runs along a continuous timeline. The trend can be positive, negative, or null.

Seasonality: regular or fixed-interval shifts within the dataset along a continuous timeline; the shape is often a bell curve or a sawtooth pattern.

Cyclical: movements in which there is no fixed interval, with uncertainty in the movement and its pattern.

Irregularity: unexpected situations/events/scenarios and spikes over a short time span.

What Are the limitations of Time Series Analysis?

Time series has the below-mentioned limitations; we have to take care of those during our data analysis.

Similar to other models, the missing values are not supported by TSA

The data points must be linear in their relationship.

Data transformations are mandatory, so they are a little expensive.

Models mostly work on Uni-variate data.

Data Types of Time Series

Let’s discuss the time series’ data types and their influence. While discussing TS data types, there are two major types – stationary and non-stationary.

Stationary: A dataset should follow the below thumb rules without having Trend, Seasonality, Cyclical, and Irregularity components of the time series.

The mean value should be completely constant over time during the analysis.

The variance should be constant with respect to the time-frame

The covariance, which measures the relationship between observations at different lags, should not be a function of time.

Non- Stationary: If either the mean-variance or covariance is changing with respect to time, the dataset is called non-stationary.

Methods to Check Stationarity

During the TSA model preparation workflow, we must assess whether the dataset is stationary or not. This is done using Statistical Tests. There are two tests available to test if the dataset is stationary:

Augmented Dickey-Fuller (ADF) Test

Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test

Augmented Dickey-Fuller (ADF) Test or Unit Root Test

The ADF test is the most popular statistical test. It is done with the following assumptions:

Null Hypothesis (H0): Series is non-stationary

Alternate Hypothesis (HA): Series is stationary

If the p-value <= 0.05, we reject the null hypothesis and conclude that the series is stationary; otherwise, we fail to reject H0 and treat the series as non-stationary.

Kwiatkowski–Phillips–Schmidt–Shin (KPSS) Test

The KPSS test checks a null hypothesis (H0) that the time series is stationary around a deterministic trend, against the alternative of a unit root. Since TSA requires stationary data for further analysis, we have to ensure that the dataset is stationary.
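A short sketch of both tests with statsmodels, assuming a univariate pandas Series (for example the yearly average temperature series built later in this article):

from statsmodels.tsa.stattools import adfuller, kpss

series = df_temperature['average_temperature'].dropna()   # assumed series

adf_stat, adf_p = adfuller(series)[:2]
print('ADF p-value:', adf_p)     # <= 0.05 suggests the series is stationary

kpss_stat, kpss_p = kpss(series, regression='c', nlags='auto')[:2]
print('KPSS p-value:', kpss_p)   # <= 0.05 suggests the series is NOT stationary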

Converting Non-Stationary Into Stationary

Let’s discuss quickly how to convert non-stationary to stationary for effective time series modeling. There are three methods available for this conversion – detrending, differencing, and transformation.

Detrending

It involves removing the trend effects from the given dataset and showing only the differences in values from the trend. It always allows cyclical patterns to be identified.

Differencing

This is a simple transformation of the series into a new time series, which we use to remove the series dependence on time and stabilize the mean of the time series, so trend and seasonality are reduced during this transformation.

Y’t = Yt – Yt-1

where Yt is the value at time t and Y’t is the differenced series.

Transformation

This includes three different methods: Power Transform, Square Root, and Log Transform. The most commonly used one is the Log Transform.
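A quick sketch of differencing and the log transform, assuming a univariate pandas Series named series:

import numpy as np

diff_1 = series.diff().dropna()           # first-order differencing: Yt - Yt-1
log_series = np.log(series)               # log transform to stabilize the variance
log_diff = log_series.diff().dropna()     # differencing of the log-transformed series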

Moving Average Methodology

The Moving Average (MA), or Rolling Mean: the value of the MA is calculated by averaging the data of the time series within k periods.

Let’s see the types of moving averages:

Simple Moving Average (SMA),

Cumulative Moving Average (CMA)

Exponential Moving Average (EMA)

Simple Moving Average (SMA)

The SMA is the unweighted mean of the previous M or N points. The choice of the sliding window of data points depends on the amount of smoothing required, since increasing the value of M or N improves the smoothing at the expense of accuracy.

To understand better, I will use the air temperature dataset.

import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

df_temperature = pd.read_csv('temperature_TSA.csv', encoding='utf-8')
df_temperature.head()
df_temperature.info()

# set index for year column
df_temperature.set_index('Any', inplace=True)
df_temperature.index.name = 'year'

# Yearly average air temperature - calculation
df_temperature['average_temperature'] = df_temperature.mean(axis=1)

# drop unwanted columns and reset the dataframe
df_temperature = df_temperature[['average_temperature']]
df_temperature.head()

# SMA over a period of 10 and 20 years
df_temperature['SMA_10'] = df_temperature.average_temperature.rolling(10, min_periods=1).mean()
df_temperature['SMA_20'] = df_temperature.average_temperature.rolling(20, min_periods=1).mean()

# Green = Avg Air Temp, Red = 10-year SMA, Orange = 20-year SMA
colors = ['green', 'red', 'orange']

# Line plot
df_temperature.plot(color=colors, linewidth=3, figsize=(12,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Average air temperature', '10-years SMA', '20-years SMA'], fontsize=14)
plt.title('The yearly average air temperature in city', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Temperature [°C]', fontsize=16)

Cumulative Moving Average (CMA)

The CMA is the unweighted mean of past values till the current time.

# CMA Air temperature
df_temperature['CMA'] = df_temperature.average_temperature.expanding().mean()

# Green = Avg Air Temp, Orange = CMA
colors = ['green', 'orange']

# Line plot
df_temperature[['average_temperature', 'CMA']].plot(color=colors, linewidth=3, figsize=(12,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Average Air Temperature', 'CMA'], fontsize=14)
plt.title('The yearly average air temperature in city', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Temperature [°C]', fontsize=16)

Exponential Moving Average (EMA)

EMA is mainly used to identify trends and filter out noise. The weight of elements decreases gradually over time, meaning it gives more weight to recent data points than to historical ones. Compared with the SMA, the EMA is faster to change and more sensitive.

The smoothing factor (alpha) has a value between 0 and 1 and represents the weighting applied to the most recent period.

Let’s apply the exponential moving averages with a smoothing factor of 0.1 and 0.3 in the given dataset.

# EMA Air Temperature
# Smoothing factor - 0.1
df_temperature['EMA_0.1'] = df_temperature.average_temperature.ewm(alpha=0.1, adjust=False).mean()
# Smoothing factor - 0.3
df_temperature['EMA_0.3'] = df_temperature.average_temperature.ewm(alpha=0.3, adjust=False).mean()

# green = Avg Air Temp, red = smoothing factor 0.1, yellow = smoothing factor 0.3
colors = ['green', 'red', 'yellow']

df_temperature[['average_temperature', 'EMA_0.1', 'EMA_0.3']].plot(color=colors, linewidth=3, figsize=(12,6), alpha=0.8)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Average air temperature', 'EMA - alpha=0.1', 'EMA - alpha=0.3'], fontsize=14)
plt.title('The yearly average air temperature in city', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Temperature [°C]', fontsize=16)

Time Series Analysis in Data Science and Machine Learning

When dealing with TSA in Data Science and Machine Learning, multiple model options are available, among them the Auto-Regressive (Integrated) Moving Average family – ARMA/ARIMA – with parameters [p, d, q], where:

p == auto-regressive lags

d == order of differencing

q == moving average lags

Before we get to know about Arima, first, you should understand the below terms better.

Auto-Correlation Function (ACF)

Partial Auto-Correlation Function (PACF)

Auto-Correlation Function (ACF)

ACF is used to indicate how similar a value is within a given time series and the previous value. (OR) It measures the degree of the similarity between a given time series and the lagged version of that time series at the various intervals we observed.

Python Statsmodels library calculates autocorrelation. This is used to identify a set of trends in the given dataset and the influence of former observed values on the currently observed values.

Partial Auto-Correlation (PACF)

PACF is similar to Auto-Correlation Function and is a little challenging to understand. It always shows the correlation of the sequence with itself with some number of time units per sequence order in which only the direct effect has been shown, and all other intermediary effects are removed from the given time series.

Auto-Correlation and Partial Auto-Correlation

plot_acf(df_temperature)
plt.show()

plot_acf(df_temperature, lags=30)
plt.show()

Observation: The previous temperature influences the current temperature, and in the visualization above the significance of that influence decreases as the lag increases, with slight rises at regular time intervals.

Types of Auto-Correlation

Interpret ACF and PACF plots

ACF | PACF | Perfect ML Model
Plot declines gradually | Plot drops instantly | Auto-Regressive (AR) model
Plot drops instantly | Plot declines gradually | Moving Average (MA) model
Plot declines gradually | Plot declines gradually | ARMA model
Plot drops instantly | Plot drops instantly | You wouldn’t perform any of these models

Remember that both ACF and PACF require stationary time series for analysis.

Now, we will learn about the Auto-Regressive model.

What Is an Auto-Regressive Model?

An auto-regressive model is a simple model that predicts future performance based on past performance. It is mainly used for forecasting when there is some correlation between values in a given time series and the values that precede and succeed (back and forth).

An AR model is a linear regression model that uses lagged variables as input. Such a model could be built with the scikit-learn library by constructing the lagged inputs yourself, but the statsmodels library provides autoregression-specific functions where you only have to specify an appropriate lag value and train the model. This functionality is provided in the AutoReg class, which gets results with a few simple steps:

Creating the model AutoReg()

Call fit() to train it on our dataset.

Returns an AutoRegResults object.

Once fit, make a prediction by calling the predict () function

The equation for the AR model (Let’s compare Y=mX+c)

Yt = C + b1*Yt-1 + b2*Yt-2 + …… + bp*Yt-p + Ert

Key Parameters

p=past values

Yt=Function of different past values

Ert=errors in time

C=intercept

Let’s check whether the given dataset or time series is random or not

from matplotlib import pyplot
from pandas.plotting import lag_plot

lag_plot(df_temperature)
pyplot.show()

Observation: Yes, it looks random and scattered.

Implementation of Auto-Regressive Model

#import libraries
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
from math import sqrt

# load csv as dataset
#series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)

# split dataset for test and training
X = df_temperature.values
train, test = X[1:len(X)-7], X[len(X)-7:]

# train autoregression
model = AutoReg(train, lags=20)
model_fit = model.fit()
print('Coefficients: %s' % model_fit.params)

# Predictions
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)

# plot results
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()

Output:

predicted=15.893972, expected=16.275000
predicted=15.917959, expected=16.600000
predicted=15.812741, expected=16.475000
predicted=15.787555, expected=16.375000
predicted=16.023780, expected=16.283333
predicted=15.940271, expected=16.525000
predicted=15.831538, expected=16.758333
Test RMSE: 0.617

Observation: Expected (blue) against predicted (red). The forecast looks good up to the 4th day, with a visible deviation around the 6th day.

Implementation of Moving Average (Weights – Simple Moving Average)

import numpy as np

alpha = 0.3
n = 10
w_sma = np.repeat(1/n, n)
colors = ['green', 'yellow']

# weights - exponential moving average, alpha=0.3, adjust=False
w_ema = [(1-alpha)**i if i==n-1 else alpha*(1-alpha)**i for i in range(n)]

pd.DataFrame({'w_sma': w_sma, 'w_ema': w_ema}).plot(color=colors, kind='bar', figsize=(8,5))
plt.xticks([])
plt.yticks(fontsize=10)
plt.legend(labels=['Simple moving average', 'Exponential moving average (α=0.3)'], fontsize=10)

# title and labels
plt.title('Moving Average Weights', fontsize=10)
plt.ylabel('Weights', fontsize=10)

Understanding ARMA and ARIMA

ARMA is a combination of the Auto-Regressive and Moving Average models for forecasting. This model provides a weakly stationary stochastic process in terms of two polynomials, one for the Auto-Regressive and the second for the Moving Average.

ARMA is best for predicting stationary series. ARIMA was thus developed to support both stationary as well as non-stationary series.

AR+I+MA= ARIMA

Understand the signature of ARIMA

Implementation Steps for ARIMA

Step 1: Plot a time series format

Step 2: Difference to make stationary on mean by removing the trend

Step 3: Make stationary by applying log transform.

Step 4: Difference log transform to make as stationary on both statistic mean and variance

Step 5: Plot ACF & PACF, and identify the potential AR and MA model

Step 6: Discovery of best fit ARIMA model

Step 7: Forecast/Predict the value using the best fit ARIMA model

Step 8: Plot ACF & PACF for residuals of the ARIMA model, and ensure no more information is left.

Implementation of ARIMA in Python

We have already discussed steps 1-5 which will remain the same; let’s focus on the rest here.

from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(df_temperature, order=(0, 1, 1))
results_ARIMA = model.fit()
results_ARIMA.summary()
results_ARIMA.forecast(3)[0]

Output:

array([16.47648941, 16.48621826, 16.49594711])

results_ARIMA.plot_predict(start=200)
plt.show()

Process Flow (Re-Gap)

In recent years, the use of Deep Learning for Time Series Analysis and Forecasting has increased to resolve problem statements that couldn’t be handled using Machine Learning techniques. Let’s discuss this briefly.

Recurrent Neural Networks (RNNs) are the most traditional and widely accepted architecture for time-series forecasting problems.

RNN is organized into successive layers and divided into

Input

Hidden

Output

The weights are shared across time steps, and every neuron is assigned to a fixed time step. Remember that the input and output layers are fully connected to the hidden layer at the same time step, and the hidden states are propagated forward in a time-dependent direction.

Components of RNN

Input: The function vector of x(t)​ is the input at time step t.

Hidden:

The function vector h(t)​ is the hidden state at time t,

This is a kind of memory of the established network;

This is calculated based on the current input x(t) and the previous time step’s hidden state h(t-1) (see the update equations below).

Output: The function vector y(t) ​is the output at time step t.

Weights: In RNNs, the input vector at time t is connected to the hidden layer neurons by a weight matrix U.

Internally, a weight matrix W connects the hidden layer neurons of time t-1 to those of time t. Following this, the hidden layer is connected to the output vector y(t) of time t by a weight matrix V; all the weight matrices U, W, and V are constant (shared) across time steps.
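Putting the three weight matrices together, the standard (vanilla) RNN update at time step t can be written as:

h(t) = f( U · x(t) + W · h(t-1) )

y(t) = g( V · h(t) )

where f and g are activation functions (commonly tanh for the hidden state and softmax for the output).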

Conclusion

A time series is constructed by data that is measured over time at evenly spaced intervals. I hope this comprehensive guide has helped you all understand the time series, its flow, and how it works. Although the TSA is widely used to handle data science problems, it has certain limitations, such as not supporting missing values. Note that the data points must be linear in their relationship for Time Series Analysis to be done.

Key Takeaways

Time series is a sequence of various data points that occurred in a successive order for a given period of time.

Trend, Seasonality, Cyclical, and Irregularity are components of TSA.


Components Of Time Series Analysis

Definition of Components of time series analysis


Components of time series analysis

We already know that an arrangement of data points in their chronological order of occurrence is known as a time series, and that time series analysis studies the relationship between two variables, one of which is time and the other a quantitative variable. There are varied uses of time series, which we will glance at before studying the components, so that it becomes evident how the components help solve a time series analysis.

Time series analysis is performed to predict the future behavior of a quantitative variable on the basis of its past behavior. For example, more umbrellas are sold during the rainy season than in other seasons, although umbrellas still sell in other periods. So, to predict future behavior, we may expect more umbrellas to be sold during the rainy season!

While evaluating the performance of the business against the expected or planned one, time series analysis helps a great deal in taking informed decisions to improve it.

Time series also enables business analysts to compare changes in different values at different times or places.

1. Long term movements or Trend

This component looks at the movement of attributes over a long-term window of time and mostly tries to understand the increase or decrease of the quantitative value attached to the behavior. It is more like an average tendency of the parameter being measured. The tendencies observed can be increasing, decreasing, or stable at different sections of the time period, and on this basis we can classify the trend as linear or non-linear. In a linear trend, the value is continuously increasing or continuously decreasing, whereas in a non-linear trend we can segment the time period into different frames and describe the trend within each. There are many ways in which non-linear trends can be included in the analysis. We can either take a higher order of the variable in hand, which is realistically hard to interpret, or, as a better approach, use a piecewise specification of the function, where each piecewise segment is linear and collectively they make up a non-linear trend at the overall level.

2. Short term movements

In contrast to the long-term movements, this component looks into the shorter period of time to get the behavior of the quantitative variable during this time frame. This movement in the time series sometimes repeats itself over certain period of time or even in regular spasmodic manner. These movements over a shorter time frame give rise to 2 sub-components namely,

Seasonality: These are the variations seen in the variable under study due to forces that span less than a year. These movements are mainly present in data recorded at shorter intervals, such as daily, weekly, or monthly. The example we discussed, where umbrella sales rise during the rainy season, is a case of seasonality; the sale of ACs during summertime is again a seasonality effect. Some man-made conventions, like festivals and occasions, also affect seasonality.

Cyclic Variations: Even though this component is a short-term movement analysis of time series, but it is rather longer than the seasonality, where the span of similar variations to be seen is more than a year. The completion of all the steps in that movement is crucial to say that the variation is a cyclic one. Sometimes we even refer them to as a business cycle. For example, the product lifecycle is a case of cycle variation, where a product goes through the steps of the life cycle, that is Introduction, growth, maturity, decline and just before the product reaches below a threshold of decline, we look for re-launch of the product with some newer features.

3. Irregular variation or Random variations

These are the fluctuations that remain after the trend, seasonal, and cyclic movements have been accounted for; they are unexpected and random in nature. Finally, we now know that trend, seasonality, cyclic variations, and residuals together constitute a time series, and these components may take the form of an additive model or a multiplicative model, depending on the use case!

Conclusion

With this, we come to an end of the components of time series analysis article, where in we looked at each of these components in great details and got familiar with them before moving to the usage of these components in a time series analysis!

Recommended Articles

This is a guide to Components of time series analysis. Here we discuss the different components that constitute the time series analysis. You may also have a look at the following articles to learn more –
