Basic Understanding Of Time Series Modelling With Auto Arimax

This article was published as a part of the Data Science Blogathon.

Introduction

Data science deals with a huge variety of problems in our daily life, and one major class of problems involves examining how a situation evolves over time. Time series forecasting is used extensively in scenarios like sales, weather, and prices, where the values of concern are data points measured over a period of time. This article lays out the essential structure of some of the algorithms for solving this class of problems, and we will explore various methods for time series forecasting. Most of us have heard of the ARIMA models used in modern time series forecasting. In this article, we will work through an understanding of ARIMA and see how the Auto ARIMAX model can be applied to a stock market dataset to forecast results.

Understanding ARIMA and Auto ARIMAX

Traditionally, everyone reaches for ARIMA when it comes to time series prediction. It stands for 'Auto-Regressive Integrated Moving Average', a class of models that explains a given time series in terms of its own past values, lags, and lagged forecast errors, so that the resulting equation can be used to forecast future values.

Any non-seasonal time series that exhibits patterns and is not pure white noise can be modelled with ARIMA models.

An ARIMA model is characterised by three terms: p, d, q, where

p is the order of the AR term

q is the order of the MA term

d is the number of differences required to make the time series stationary

If a time series has seasonal patterns, you need to add seasonal terms, and it becomes SARIMA, short for 'Seasonal ARIMA'.

The 'Auto Regressive' in ARIMA indicates a linear regression model that uses its own lags as predictors. Linear regression models work best when the predictors are not correlated and remain independent of each other, so we want to make the series stationary. The standard approach is to difference it: subtract the previous value from the current value. Depending on how complex the series is, more than one difference may be required.

Hence, the value of d is the minimum number of differences needed to make the series stationary. If the time series is already stationary, we set d to zero.
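As a quick illustration of differencing, here is a minimal pandas sketch (the series values are made up for the example):

import pandas as pd

# a small hypothetical series
y = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])

# first-order differencing (d = 1): subtract the previous value from the current one
y_diff1 = y.diff().dropna()         # 2, 3, 4, 5

# if the result is still non-stationary, difference again (d = 2)
y_diff2 = y.diff().diff().dropna()  # 1, 1, 1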

The "Auto Regressive" (AR) term is indicated by "p", the number of lags of Y to be used as predictors. The "Moving Average" (MA) term is indicated by "q", the number of lagged forecast errors that should go into the ARIMA model.

In a pure Auto-Regressive (AR only) model, Yt depends exclusively on its own lags; that is, Yt is a function of the 'lags of Yt'.

In a pure Moving Average (MA only) model, by contrast, Yt depends only on the lagged forecast errors.

In an ARIMA model, the time series is differenced at least once to make it stationary, and the AR and MA terms are combined. Hence, in words, the equation becomes:

Predicted Yt = Constant + Linear combination of the lags of Y (up to p lags) + Linear combination of the lagged forecast errors (up to q lags)

Up to now, the approach has been to manually fit various models and determine which one is best. We would rather automate this process. Auto-ARIMA takes the data, fits several models with different orders, and compares their characteristics to pick the best one, even though processing time increases considerably when it has to fit complicated models. This is why we move to Auto-ARIMA models.

Implementation of Auto ARIMAX:

We will now look at 'auto-arima', the auto_arima function from the pmdarima package. We can install the package with pip:

!pip install pmdarima

The dataset used is stock market data of the Nifty-50 index of the NSE (National Stock Exchange), India, across the last twenty years. The well-known VWAP (Volume Weighted Average Price) is the target variable to predict. VWAP is a trading benchmark that gives the average price a stock has traded at throughout the day, based on both volume and price.

df.set_index("Date", drop=False, inplace=True)
df.head()

df.VWAP.plot(figsize=(14, 7))

Almost all time series problems need external features or internal feature engineering to improve the model.

We add some essential features, such as lag values of the available numeric features, which are widely used for time series problems. Since we need to predict the stock price for a day, we cannot use the feature values of that same day, as they will be unavailable at actual inference time. Instead, we use statistics like the mean and standard deviation of their lagged values. Three sets of lagged values are used: one for the previous day, one looking back seven days, and another looking back 30 days, as proxies for last week's and last month's metrics. A sketch of this feature engineering follows.
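A minimal sketch of this kind of feature engineering, assuming the dataframe and the 'VWAP' column from this dataset (the helper function itself is hypothetical):

import pandas as pd

def add_lag_features(df, col="VWAP"):
    # shift by one day so only past values are visible at inference time
    shifted = df[col].shift(1)
    df[f"{col}_lag_1"] = shifted
    # rolling statistics over the previous 7 and 30 days (week / month proxies)
    for window in (7, 30):
        df[f"{col}_mean_lag_{window}"] = shifted.rolling(window, min_periods=1).mean()
        df[f"{col}_std_lag_{window}"] = shifted.rolling(window, min_periods=1).std()
    return df

df = add_lag_features(df)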

For boosting models, it is very beneficial to add DateTime features, such as hour, day, and month, as appropriate, to give the model knowledge of the time element in the data. For time series models it is not strictly required to pass this information, but we do so here so that all models are analysed on the exact same set of features.

The data, along with its features, is split into train and validation sets:

train: 26th May 2008 to 31st December 2022.

valid: 1st January 2023 to 31st December 2023.
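With the split in place, here is a minimal sketch of fitting auto_arima with exogenous regressors, which is what makes this an ARIMAX. The exact feature column names are assumptions based on the lag features described above:

from pmdarima import auto_arima

exog_cols = ["VWAP_mean_lag_7", "VWAP_std_lag_7",
             "VWAP_mean_lag_30", "VWAP_std_lag_30"]  # hypothetical names

model = auto_arima(
    train["VWAP"],               # target series
    exogenous=train[exog_cols],  # external regressors (the 'X' in ARIMAX);
                                 # newer pmdarima versions name this argument X
    trace=True,                  # print each candidate order and its AIC
    error_action="ignore",
    suppress_warnings=True,
)

forecast = model.predict(n_periods=len(valid), exogenous=valid[exog_cols])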

The most suitable model found is ARIMA(2, 0, 1), which has the lowest AIC.

Conclusion:

In this article, we explored the details of ARIMA and saw how Auto ARIMAX can be applied to a time series dataset. We implemented the model and obtained a final score of about 147.086 RMSE and 104.019 MAE.


About Me: I am a Research Student interested in the field of Deep Learning and Natural Language Processing and currently pursuing post-graduation in Artificial Intelligence.


Components Of Time Series Analysis

Definition of Components of time series analysis


Components of time series analysis

We already know that an arrangement of data points in chronological order of occurrence is called a time series, and that time series analysis studies the relationship between two variables, one of which is time and the other a quantitative variable. Before we study the components of time series analysis, let us glance at some uses of time series, so that it becomes evident how the components help solve a time series problem.

Time series analysis is performed to predict the future behaviour of a quantitative variable on the basis of its past behaviour. For example, umbrellas sell mostly in the rainy season, although some umbrellas still sell in other periods. So, to predict future behaviour: more umbrellas will be sold during the rainy season!

When evaluating the performance of a business against the expected or planned one, time series analysis helps a great deal in making informed decisions to improve it.

Time series also enables business analysts to compare changes in different values at different times or places.

1. Long term movements or Trend

This component looks at the movement of attributes over a long-term window and mostly tries to capture the increase or decrease of the quantitative value attached to the behaviour. It is more like an average tendency of the parameter being measured. The tendencies observed can be increasing, decreasing, or stable in different sections of the time period, and on this basis the trend can be linear or non-linear. A linear trend is simply continuously increasing or continuously decreasing, whereas for a non-linear trend we can segment the time period into different frames and describe the trend piecewise. There are many ways to include non-linear trends in the analysis: we can take higher orders of the variable in hand, which is realistically non-interpretable, or, a better approach, use a piecewise specification of the function, where each piece is linear and collectively they make a non-linear trend at the overall level.

2. Short term movements

In contrast to long-term movements, this component looks at shorter periods of time to capture the behaviour of the quantitative variable during that time frame. These movements in the time series sometimes repeat over a certain period, or even in an irregular, spasmodic manner. Movements over a shorter time frame give rise to two sub-components, namely:

Seasonality: These are the variations seen in the variable under study due to forces that span less than a year. Such movements are mainly present in data recorded at short intervals, such as daily, weekly, or monthly. The example we discussed, umbrella sales rising during the rainy season, is a case of seasonality, as is the sale of air conditioners during the summertime. Some man-made conventions, like festivals and occasions, also affect seasonality.

Cyclic Variations: Although this component is also a short-term movement in the time series, it is longer than seasonality: the span over which similar variations are seen is more than a year. The completion of all the steps in the movement is crucial to call the variation cyclic. We sometimes refer to these as business cycles. For example, a product lifecycle is a case of cyclic variation, where a product goes through the steps of introduction, growth, maturity, and decline, and, just before the product falls below a threshold of decline, we look to relaunch it with newer features.

3. Irregular variation or Random variations

Irregular or random variations are the unpredictable fluctuations left over once the trend, seasonal, and cyclic movements have been accounted for, typically caused by unforeseen events. Finally, we now know that trend, seasonality, cyclic variation, and the residuals together constitute a time series, and these components may combine in an additive or a multiplicative model, depending on the use case, as the sketch below shows.
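A minimal sketch of decomposing a series under both models with statsmodels (the monthly series here is fabricated for illustration):

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# hypothetical monthly series with an upward trend and a summer bump
idx = pd.date_range("2015-01-31", periods=48, freq="M")
y = pd.Series([100 + 2 * i + 10 * ((i % 12) in (5, 6, 7)) for i in range(48)], index=idx)

# additive model:       y = trend + seasonal + residual
add_result = seasonal_decompose(y, model="additive")

# multiplicative model: y = trend * seasonal * residual (requires positive values)
mul_result = seasonal_decompose(y, model="multiplicative")

add_result.plot()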

Conclusion

With this, we come to the end of this article on the components of time series analysis, wherein we looked at each of these components in detail and got familiar with them before moving on to their usage in a time series analysis.


Mathematical Modelling: Modelling the Spread of Diseases with the SIRD Model

This article was published as a part of the Data Science Blogathon

Introduction

According to Haines and Crouch, mathematical modelling is a process in which real-life situations and relations in those situations are expressed using mathematics. In simpler terms, it is the process of describing real-world systems, processes, and occurrences with mathematics.

Mathematical modelling is used in virtually every sector. In the manufacturing industry it is used to model heat and mass transfer in fluid flows, the transformation of materials, etc. The construction industry is not spared from the beauty of mathematical modelling either: it is used to optimise the amount of material in structures, and to calculate the stress that will be imposed on buildings and how to counterbalance it. You have probably seen the tallest building in the world, either virtually or physically; you would be in awe if you were to see all the mathematical models that were used to design it.

Burj Khalifa (the tallest building in the world).

Football players use mathematical modelling to score goals. Football lovers have probably seen how Messi, Ronaldo, and other popular footballers score from free kicks. Free-kick goals can be modelled with mathematics, by modelling the angle of the trajectory, the drag, etc.

Modelling a free kick with mathematics.

The astronomy industry relies heavily on mathematical modelling: mathematics is used to model the movement of spacecraft and other orbital objects. Katherine Johnson, a former mathematician at NASA, used her mathematical prowess to help put an astronaut into orbit around the Earth, and her skills were also used to land a man on the Moon.

Photos of Katherine Johnson.

I could go on listing the sacrosanct roles of mathematics in our world, but for the sake of time I will stop here. The reality is that the world could exist without the English language, but it cannot exist without mathematics.

This article will walk you through the process of modelling disease spread with mathematical models. You might be wondering: can mathematics really model disease spread? The answer is yes; mathematics is very important in the health sector. According to TheConversation, "Mathematical models are used to create a simplified representation of infection spread in a population and to understand how an infection may progress in the future. These predictions can help us effectively use public health resources such as hospital space or a vaccination programme. For example, knowing how many people in a population are likely to become infected can tell hospitals how much space and resources they will need to allocate for treatment."

What it takes to mathematically model any disease 

At the beginning of an epidemic there exist people who are infected, people prone to getting infected, and people who may recover from the disease or die as a result of it. Those who are initially prone to the disease get infected when they come in contact with infected people, and those who die originate from the infected group. Mathematicians have long sought a way to model the relationship between those who are prone to infection, those who are infected, and those who recover. In 1927, Kermack & McKendrick came up with what is called the Susceptible, Infected and Recovered (SIR) mathematical model. The SIR model assumes that, for any given disease, there exist three categories of people: the Susceptible (prone to contracting the disease but not yet infected), the Infected, and the Removed (recovered, either by death or with the aid of drugs). The SIR model has been a great help to mathematicians and has made modelling disease spread easy.

To mathematically model a disease using the SIR model, you need to assume that the population remains constant, i.e. no births take place, nobody migrates into the population, and there are no natural deaths (with the exception of deaths from the disease). The SIR model captures the movement of people from the Susceptible state into the Infected state, and from the Infected state into the Removed state, through a few constants. These constants are the tripod the SIR model sits on, and they are discussed shortly. You will agree that, for any disease to spread, there must be contact between susceptible people and infected people (disease carriers).

Assume that for a particular epidemic there are 1000 susceptible people and 3 infected persons, and that every day 1 person gets infected through contact between susceptible and infected people. Then, on the fifth day, 8 people will be infected and the number of susceptible will be 995. We might instead assume that 2 or 3 persons get infected per day; the point is that we are just making assumptions that might not be mathematically accurate. Hence the need for the SIR model to model the spread of the disease accurately.

The SIR model computes the number of new infections by assuming that everyone in the susceptible category has an equal probability of being infected, governed by a constant fraction called the contact rate (infection rate). The number of newly infected people is the contact rate multiplied by the number of infected and the number of susceptible, divided by the population, i.e. (contact rate * S * I)/N, where S is the Susceptible count, I the Infected count, and N the total population. The contact rate is a fraction computed by analysing the number of contacts made with infected people per day. The SIR model likewise models the number of people removed via a constant fraction called the recovery rate: the number removed is the recovery rate multiplied by the number of infected people, i.e. recovery rate * I.

The SIR model equations are:

dS/dt = -(beta * S * I) / N
dI/dt = (beta * S * I) / N - gamma * I
dR/dt = gamma * I

where

dS/dt is the rate of change of the susceptible over time,

dI/dt is the rate of change of the infected over time, and

dR/dt is the rate of change of the removed over time.

The equation simply states that susceptible people will be reduced over time based on the contact rate (beta), the number of susceptible, the number of infected, and the total population (N). You will notice the presence of the negative sign, this is to show the fraction of people that will be lost from the susceptible category. The fraction of people that are lost from the susceptible category will be added to the infected category, hence the presence of the positive sign in the infected equation. Recall that the removed people originate from the infected category and the number of people that are removed is based on the removal rate multiplied by the number of infected people (gamma * I). Those that are removed will be a loss to the infected people hence the need to subtract the number of removed from the number of infected. The removed people will be gain to the removed category, hence the positive sign for the removed category.

Multiplying both sides by dt (and taking a time step of one day, dt = 1) gives:

dS = -(beta * Sold * Iold) / N
dI = (beta * Sold * Iold) / N - gamma * Iold
dR = gamma * Iold

dS is the change in the susceptible, i.e. the difference between the new and the old susceptible numbers (Snew - Sold), and likewise for dI and dR. The numbers of susceptible, infected, and recovered for the next day can therefore be modelled by moving the old values to the other side of the equations, to give:

Snew = Sold - (beta * Sold * Iold) / N
Inew = Iold + (beta * Sold * Iold) / N - gamma * Iold
Rnew = Rold + gamma * Iold

The above equations can be used to model the number of susceptible, infected, and recovered for the next day. The number of infected people on a given day depends on the contact rate (beta) and the recovery rate (gamma).

Other Types of Mathematical Models Used to Model Diseases

Apart from the SIR model, several other mathematical models can be used to model diseases. Models derived from the SIR model include the SEIR model, the SIRV model, the SIRD model, etc. The SEIR model describes disease spread using four categories: Susceptible, Exposed (susceptible people who have been exposed to infected people), Infected, and Recovered (Removed). The Susceptible, Infected, Recovered (Removed), and Vaccinated (SIRV) model is another such extension. The focus of this article is the SIRD or SIID model: Susceptible, Infected, Removed (recovered with immunity), and Dead, also read as Susceptible, Infected, Immune, and Dead.

The SIID or SIRD model extends the SIR model with two additional assumptions: recovery with immunity, and death. For the rest of this article I will use SIRD and SIID interchangeably; both refer to the same model.

The SIRD model equations are:

dS/dt = -(beta * S * I) / N
dI/dt = (beta * S * I) / N - gamma * I - mu * I
dR/dt = gamma * I
dD/dt = mu * I

You will notice that the difference between the SIR and the SIRD model is the addition of dD/dt, the death rate over time. The SIID model captures deaths through a constant called the mortality rate (mu), the rate at which infected people die: the number of dead is the product of the mortality rate and the number of infected. Naturally, the people who were infected and died must be removed from the number of infected, so the rate of change of infections over time is modified to accommodate the loss due to death, as the term mu * I in the dI/dt equation above shows.


Simulating Diseases with SIRD(SIID) (Practical) 

Given that immunity exists and people die as a result of the disease, we will use the SIID model. Let's assume the number of people infected by the disease is 3, the numbers of dead and recovered are zero, the infection rate (beta) is 0.5, the recovery rate (gamma) is 0.035, and the death rate (mu) is 0.005. Note that the infection, recovery, and death rates were taken from an external source, but you can try any numbers.

SIRD Model modified from SIR. Image by Author. 

The Susceptible number for the next day can be computed by using this method

Snew = Sold – (beta * Sold* Iold )/N

Sold = N – Iold = 1000-3 = 997 (i.e the susceptible number for the current day is the difference between the total population and the number of infected people in the current day)

beta = 0.5

Iold  = 3

N = 1000 (The total Population)

Snew = 997 – (0.5 * 997 *3 )/1000

Snew = 997 – 1.4955

Snew = 995.5045

The total number of Susceptible for the next day is approximately 995.5

Let us compute the rest. The number of infected for the next day can be computed with this method:

Inew = Iold + (((beta * Sold * Iold)/N) – (gamma * Iold) – (mu * Iold))

gamma ( recovery rate) = 0.035

mu (death rate) = 0.005

Inew = 3 + (((0.5 * 997 * 3)/1000) – (0.035 * 3) – (0.005 * 3))

Inew = 3 + (1.4955 – 0.105 – 0.015)

Inew = 3 + 1.3755

Inew = 4.3755

The number of people that will be infected the next day is approximately 4.4

The number of people who would have recovered with immunity by the next day can be modelled with this equation:

Rnew = Rold + gamma * Iold

Rold = 0

Rnew = 0 + 0.035 * 3

Rnew = 0 + 0.105

Rnew = 0.105

The number of people who would have recovered with immunity the next day is approximately 0.11

Lastly, the number of people who would be dead the next day can be computed by applying the last equation:

Dnew = Dold + mu * Iold

Dold = 0

Dnew = 0 + 0.005 * 3

Dnew = 0.015

The number of people who would be dead the next day is 0.015.

These steps can be repeated to model the number of susceptible, infected, recovered, and dead for the next two days, and for as many further days as desired. Rather than computing the numbers manually, the process can be automated; the Python programming language will be used to automate it and plot the result.

Modelling Disease with Python

Prerequisites

To follow along, you will need Python and preferably Jupyter Notebook installed on your system. You can download Anaconda, which comes with Jupyter Notebook and Python, and familiarise yourself with the notebook and how to install it.

Now that you have Jupyter Notebook installed, you are good to go. Let's dive in.

# importing necessary libraries
import matplotlib.pyplot as plt
%matplotlib inline

# defining the variables
total_population = 1000
total_infected = 3
total_susceptible = total_population - total_infected
total_recovered = 0
total_dead = 0

# number of days to simulate the disease
simulation_days = 500

# lists to store the numbers over time;
# the first element of each list is the initial value
recovered_list = [total_recovered]
dead_list = [total_dead]
infected_list = [total_infected]
susceptible_list = [total_susceptible]

infection_rate = 0.5   # beta
recovery_rate = 0.035  # gamma
death_rate = 0.005     # mu

# simulate for the remaining 499 days of the simulation period
for day in range(1, simulation_days):
    num_infected_daily = (infection_rate * total_infected * total_susceptible) / total_population
    # susceptible number for the next day
    total_susceptible = total_susceptible - num_infected_daily
    num_recovered_daily = recovery_rate * total_infected
    num_dead_daily = death_rate * total_infected
    total_infected = total_infected + (num_infected_daily - num_recovered_daily - num_dead_daily)
    total_recovered = total_recovered + num_recovered_daily
    total_dead = total_dead + num_dead_daily
    # record the day's numbers
    susceptible_list.append(total_susceptible)
    infected_list.append(total_infected)
    recovered_list.append(total_recovered)
    dead_list.append(total_dead)

Now that we have simulated Konvid-18 for 500 days, we can visualize our result.

Visualizing the result

# using matplotlib.pyplot to plot
plt.plot(range(0, simulation_days), susceptible_list, color='blue', label='Susceptible')
plt.plot(range(0, simulation_days), infected_list, color='red', label='Infected')
plt.plot(range(0, simulation_days), recovered_list, color='green', label='Recovered')
plt.plot(range(0, simulation_days), dead_list, color='orange', label='Dead')
plt.legend()
# add the labels to the plot
plt.title('Konvid-18 Disease Simulation in JavaGo city')
plt.xlabel('Days')
plt.ylabel('Total Population')
plt.show()

After running the above code, the image below will be displayed.

Visualization Result. Image by Author

Conclusion

This article has shown you the importance of mathematical models, how to model diseases with the SIRD model, how to automate the process over many days, and how to visualize the result. It introduced the SIRD model, but there are other mathematical models you can explore further and dive deeper into, such as SEIR, SIS, and SIRV. The article also did not cover the mathematics of deriving the contact rate, recovery rate, and death rate; you can explore these concepts further. I hope you have realised the importance of mathematics in the healthcare industry.

I created a demo web app for further exploration; it was developed with Streamlit.

You can connect with me on LinkedIn.



Time Series Analysis And Forecasting

Introduction

Time series data analysis is a way of studying the characteristics of a response variable with respect to time as the independent variable: to estimate the target variable when predicting or forecasting, we use the time variable as the point of reference. A time series is a series of time-ordered observations, whether over years, months, weeks, days, hours, minutes, or seconds; it is a sequence of observations taken at discrete, successive intervals.

The time variable/feature is the independent variable and supports the target variable in predicting results. Time Series Analysis (TSA) is used for time-based predictions in many fields, such as weather forecasting, stock market prediction, signal processing, and engineering domains like control systems and communications systems. Since TSA deals with information in a particular temporal sequence, it is distinct from spatial and other analyses. We can predict the future using AR, MA, ARMA, and ARIMA models.

Learning Objectives

We will discuss in detail the objectives, assumptions, and components of TSA, along with its stationary and non-stationary data types.

We will look at the TSA algorithms.

Finally, we will look at specific use cases in Python.

This article was published as a part of the Data Science Blogathon.

What Is Time Series Analysis?

Definition: There are many definitions of time series analysis, but let's keep it simple:

A time series is nothing but a sequence of data points that occur in successive order over a given period of time.

Objectives of Time Series Analysis:

To understand how time series works and what factors affect a certain variable(s) at different points in time.

Time series analysis will provide the consequences and insights of the given dataset’s features that change over time.

Supporting the prediction of future values of the time series variable.

Assumptions: There is only one assumption in TSA, namely stationarity: the origin of time does not affect the statistical properties of the process.

How to Analyze Time Series?

To perform time series analysis, we have to follow these steps:

Collecting the data and cleaning it

Preparing Visualization with respect to time vs key feature

Observing the stationarity of the series

Developing charts to understand its nature.

Model building – AR, MA, ARMA and ARIMA

Extracting insights from prediction

Significance of Time Series

TSA is the backbone for prediction and forecasting analysis, specific to time-based problem statements.

Analyzing the historical dataset and its patterns

Understanding and matching the current situation with patterns derived from the previous stage.

Understanding the factor or factors influencing certain variable(s) in different periods.

With the help of “Time Series,” we can prepare numerous time-based analyses and results.

Forecasting: Predicting any value for the future.

Segmentation: Grouping similar items together.

Classification: Classifying a set of items into given classes.

Descriptive analysis: Analysis of a given dataset to find out what is there in it.

Intervention analysis: Effect of changing a given variable on the outcome.

Components of Time Series Analysis

Let’s look at the various components of Time Series Analysis-

Trend: Movement with no fixed interval; the overall divergence of the data over the continuous timeline. A trend can be negative, positive, or null.

Seasonality: Shifts that occur at regular or fixed intervals within the dataset over the continuous timeline; the shape can be bell-curve-like or sawtooth.

Cyclical: Movement with no fixed interval, with uncertainty in the movement and its pattern.

Irregularity: Unexpected situations/events/scenarios and spikes in a short time span.

What Are the limitations of Time Series Analysis?

Time series has the below-mentioned limitations; we have to take care of those during our data analysis.

As with many other models, missing values are not supported by TSA.

The data points must be linear in their relationship.

Data transformations are mandatory, so they are a little expensive.

Models mostly work on Uni-variate data.

Data Types of Time Series

Let’s discuss the time series’ data types and their influence. While discussing TS data types, there are two major types – stationary and non-stationary.

Stationary: A dataset should follow the below thumb rules without having Trend, Seasonality, Cyclical, and Irregularity components of the time series.

The mean value should be completely constant over time during the analysis.

The variance should be constant with respect to the time-frame

The covariance, which measures the relationship between two points of the series, should be constant with respect to the time frame.

Non-stationary: If either the mean, variance, or covariance varies with respect to time, the dataset is called non-stationary.

Methods to Check Stationarity

During the TSA model preparation workflow, we must assess whether the dataset is stationary. This is done using statistical tests; two common ones are:

Augmented Dickey-Fuller (ADF) Test

Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test

Augmented Dickey-Fuller (ADF) Test or Unit Root Test

The ADF test is the most popular statistical test for stationarity. It is done with the following hypotheses:

Null Hypothesis (H0): Series is non-stationary

Alternate Hypothesis (HA): Series is stationary

If the p-value <= 0.05, we reject the null hypothesis and accept the alternative: the series is stationary.

Kwiatkowski–Phillips–Schmidt–Shin (KPSS) Test

The KPSS test reverses the hypotheses: its null hypothesis (H0) is that the time series is stationary around a deterministic trend, against the alternative of a unit root. Since TSA requires stationary data for further analysis, we must ensure the dataset is stationary; a sketch of both tests follows.
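A minimal sketch of running these tests with statsmodels ('series' is a placeholder for any pandas Series of observations):

from statsmodels.tsa.stattools import adfuller, kpss

result = adfuller(series)
print("ADF statistic:", result[0])
print("p-value:", result[1])
if result[1] <= 0.05:
    print("Reject H0: the series is stationary")
else:
    print("Fail to reject H0: the series is non-stationary")

# KPSS reverses the hypotheses: a small p-value suggests NON-stationarity
kpss_stat, kpss_p, _, _ = kpss(series)
print("KPSS statistic:", kpss_stat, "p-value:", kpss_p)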

Converting Non-Stationary Into Stationary

Let’s discuss quickly how to convert non-stationary to stationary for effective time series modeling. There are three methods available for this conversion – detrending, differencing, and transformation.

Detrending

Detrending involves removing the trend effects from the given dataset, showing only the differences in values from the trend. It allows cyclical patterns to be identified.

Differencing

This is a simple transformation of the series into a new time series, which we use to remove the series dependence on time and stabilize the mean of the time series, so trend and seasonality are reduced during this transformation.

Y't = Yt – Yt-1

where Yt is the value at time t and Y't is the differenced series.

Transformation

This includes three different methods: Power Transform, Square Root, and Log Transform. The most commonly used is the Log Transform, as sketched below.
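A minimal sketch of the three transformations ('series' is a placeholder for a strictly positive pandas Series):

import numpy as np
from scipy import stats

log_y = np.log(series)                 # log transform
sqrt_y = np.sqrt(series)               # square-root transform
boxcox_y, lam = stats.boxcox(series)   # power (Box-Cox) transform

# a log transform followed by differencing often yields a stationary series
stationary_y = np.log(series).diff().dropna()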

Moving Average Methodology

The Moving Average (MA), or rolling mean: the value of the MA is calculated by averaging the data of the time series over a window of k periods.

Let’s see the types of moving averages:

Simple Moving Average (SMA),

Cumulative Moving Average (CMA)

Exponential Moving Average (EMA)

Simple Moving Average (SMA)

The SMA is the unweighted mean of the previous M or N points. The size of the sliding window of data points is chosen based on the amount of smoothing desired, since increasing M or N improves smoothing at the expense of accuracy.

To understand better, I will use the air temperature dataset.

import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

df_temperature = pd.read_csv('temperature_TSA.csv', encoding='utf-8')
df_temperature.head()
df_temperature.info()

# set the index to the year column
df_temperature.set_index('Any', inplace=True)
df_temperature.index.name = 'year'

# yearly average air temperature - calculation
df_temperature['average_temperature'] = df_temperature.mean(axis=1)

# drop unwanted columns and reset the dataframe
df_temperature = df_temperature[['average_temperature']]
df_temperature.head()

# SMA over periods of 10 and 20 years
df_temperature['SMA_10'] = df_temperature.average_temperature.rolling(10, min_periods=1).mean()
df_temperature['SMA_20'] = df_temperature.average_temperature.rolling(20, min_periods=1).mean()

# green = average air temperature, red = 10-year SMA, orange = 20-year SMA
colors = ['green', 'red', 'orange']

# line plot
df_temperature.plot(color=colors, linewidth=3, figsize=(12, 6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Average air temperature', '10-years SMA', '20-years SMA'], fontsize=14)
plt.title('The yearly average air temperature in city', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Temperature [°C]', fontsize=16)

Cumulative Moving Average (CMA)

The CMA is the unweighted mean of past values till the current time.

# CMA of the air temperature
df_temperature['CMA'] = df_temperature.average_temperature.expanding().mean()

# green = average air temperature, orange = CMA
colors = ['green', 'orange']

# line plot
df_temperature[['average_temperature', 'CMA']].plot(color=colors, linewidth=3, figsize=(12, 6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Average Air Temperature', 'CMA'], fontsize=14)
plt.title('The yearly average air temperature in city', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Temperature [°C]', fontsize=16)

Exponential Moving Average (EMA)

EMA is mainly used to identify trends and filter out noise. The weights of elements decrease gradually over time, meaning it gives more weight to recent data points than to historical ones. Compared with the SMA, the EMA reacts faster to change and is more sensitive. Its smoothing factor alpha has a value between 0 and 1 and represents the weighting applied to the most recent period.

Let's apply exponential moving averages with smoothing factors of 0.1 and 0.3 to the given dataset.

# EMA of the air temperature
# smoothing factor - 0.1
df_temperature['EMA_0.1'] = df_temperature.average_temperature.ewm(alpha=0.1, adjust=False).mean()
# smoothing factor - 0.3
df_temperature['EMA_0.3'] = df_temperature.average_temperature.ewm(alpha=0.3, adjust=False).mean()

# green = average air temperature, red = smoothing factor 0.1, yellow = smoothing factor 0.3
colors = ['green', 'red', 'yellow']

df_temperature[['average_temperature', 'EMA_0.1', 'EMA_0.3']].plot(color=colors, linewidth=3, figsize=(12, 6), alpha=0.8)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Average air temperature', 'EMA - alpha=0.1', 'EMA - alpha=0.3'], fontsize=14)
plt.title('The yearly average air temperature in city', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Temperature [°C]', fontsize=16)

Time Series Analysis in Data Science and Machine Learning

When dealing with TSA in data science and machine learning, multiple model options are available, among them the Autoregressive Moving Average (ARMA) family of models, described by the parameters [p, d, q]:

p = auto-regressive lags

d = order of differencing

q = moving-average lags

Before getting to ARIMA, you should first understand the terms below.

Auto-Correlation Function (ACF)

Partial Auto-Correlation Function (PACF)

Auto-Correlation Function (ACF)

ACF indicates how similar a value within a given time series is to its previous values; in other words, it measures the degree of similarity between a time series and a lagged version of itself at the various intervals we observe.

The Python statsmodels library calculates autocorrelation. It is used to identify trends in the given dataset and the influence of formerly observed values on the currently observed values.

Partial Auto-Correlation (PACF)

PACF is similar to the autocorrelation function but a little more challenging to understand. It shows the correlation of the sequence with itself at a given number of lags, but with only the direct effect retained: all intermediary effects are removed from the time series.

Auto-Correlation and Partial Auto-Correlation

plot_acf(df_temperature)
plt.show()
plot_acf(df_temperature, lags=30)
plt.show()
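The code above plots only the ACF; the PACF plot can be produced the same way with the companion statsmodels helper (a minimal sketch):

from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(df_temperature, lags=30)
plt.show()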

Observation: The previous temperature influences the current temperature, but the significance of that influence decreases as the lag grows, rising only slightly at regular intervals.

Types of Auto-Correlation

Interpret ACF and PACF plots

ACF | PACF | Perfect ML model
Plot declines gradually | Plot drops instantly | Auto-Regressive model
Plot drops instantly | Plot declines gradually | Moving Average model
Plot declines gradually | Plot declines gradually | ARMA
Plot drops instantly | Plot drops instantly | You wouldn't perform any model

Remember that both ACF and PACF require stationary time series for analysis.

Now, we will learn about the Auto-Regressive model.

What Is an Auto-Regressive Model?

An auto-regressive model is a simple model that predicts future performance based on past performance. It is mainly used for forecasting when there is some correlation between values in a given time series and the values that precede and succeed (back and forth).

An AR model is a linear regression model that uses lagged variables as input. A linear regression model can be built easily with the scikit-learn library by specifying the input, but the statsmodels library provides autoregression-specific functions where you specify an appropriate lag value and train the model; its AutoReg class gets results in a few simple steps:

Creating the model AutoReg()

Call fit() to train it on our dataset.

Returns an AutoRegResults object.

Once fit, make a prediction by calling the predict() function.

The equation for the AR model (compare with Y = mX + c):

Yt = C + b1*Yt-1 + b2*Yt-2 + … + bp*Yt-p + Ert

Key Parameters

p=past values

Yt=Function of different past values

Ert=errors in time

C=intercept

Let's check whether the given dataset or time series is random:

from matplotlib import pyplot
from pandas.plotting import lag_plot

lag_plot(df_temperature)
pyplot.show()

Observation: Yes, it looks random and scattered.

Implementation of the Auto-Regressive Model

# import libraries
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
from math import sqrt

# load csv as dataset
# series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)

# split dataset into train and test sets
X = df_temperature.values
train, test = X[1:len(X)-7], X[len(X)-7:]

# train the autoregression model
model = AutoReg(train, lags=20)
model_fit = model.fit()
print('Coefficients: %s' % model_fit.params)

# predictions
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)

# plot results
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()

Output:

predicted=15.893972, expected=16.275000
predicted=15.917959, expected=16.600000
predicted=15.812741, expected=16.475000
predicted=15.787555, expected=16.375000
predicted=16.023780, expected=16.283333
predicted=15.940271, expected=16.525000
predicted=15.831538, expected=16.758333
Test RMSE: 0.617

Observation: Expected (blue) against predicted (red). The forecast looks good up to the 4th day, with noticeable deviation on the 6th day.

Implementation of Moving Average (Weights – Simple Moving Average)

import numpy as np

alpha = 0.3
n = 10

# weights - simple moving average
w_sma = np.repeat(1/n, n)

colors = ['green', 'yellow']

# weights - exponential moving average, alpha=0.3, adjust=False
w_ema = [(1-alpha)**i if i == n-1 else alpha*(1-alpha)**i for i in range(n)]

pd.DataFrame({'w_sma': w_sma, 'w_ema': w_ema}).plot(color=colors, kind='bar', figsize=(8,5))
plt.xticks([])
plt.yticks(fontsize=10)
plt.legend(labels=['Simple moving average', 'Exponential moving average (α=0.3)'], fontsize=10)
# title and labels
plt.title('Moving Average Weights', fontsize=10)
plt.ylabel('Weights', fontsize=10)

Understanding ARMA and ARIMA

ARMA is a combination of the Auto-Regressive and Moving Average models for forecasting. This model provides a weakly stationary stochastic process in terms of two polynomials, one for the Auto-Regressive and the second for the Moving Average.

ARMA is best for predicting stationary series. ARIMA was thus developed to support both stationary as well as non-stationary series.

AR+I+MA= ARIMA

Understand the signature of ARIMA

Implementation Steps for ARIMA

Step 1: Plot a time series format

Step 2: Difference the series to make it stationary in mean by removing the trend

Step 3: Make the series stationary in variance by applying a log transform

Step 4: Difference the log-transformed series to make it stationary in both mean and variance

Step 5: Plot ACF & PACF, and identify the potential AR and MA model

Step 6: Discovery of best fit ARIMA model

Step 7: Forecast/Predict the value using the best fit ARIMA model

Step 8: Plot ACF & PACF for residuals of the ARIMA model, and ensure no more information is left.

Implementation of ARIMA in Python

We have already discussed steps 1-5 which will remain the same; let’s focus on the rest here.

from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(df_temperature, order=(0, 1, 1))
results_ARIMA = model.fit()
results_ARIMA.summary()
results_ARIMA.forecast(3)[0]

Output:

array([16.47648941, 16.48621826, 16.49594711])

results_ARIMA.plot_predict(start=200)
plt.show()

Process Flow (Re-Gap)

In recent years, the use of Deep Learning for Time Series Analysis and Forecasting has increased to resolve problem statements that couldn’t be handled using Machine Learning techniques. Let’s discuss this briefly.

Recurrent Neural Networks (RNNs) are the most traditional and widely accepted architecture for time series forecasting problems.

RNN is organized into successive layers and divided into

Input

Hidden

Output

Each layer shares the same weights, and every neuron is assigned to a fixed time step. Remember that each input and output is fully connected with a hidden layer at the same time step, and the hidden layers are connected forward through time.

Components of RNN

Input: The vector x(t) is the input at time step t.

Hidden:

The vector h(t) is the hidden state at time t;

it acts as a kind of memory of the network;

it is calculated from the current input x(t) and the previous time step's hidden state h(t-1).

Output: The vector y(t) is the output at time step t.

Weights: In an RNN, the input vector is connected to the hidden-layer neurons at time t by a weight matrix U. The hidden-layer neurons are connected across time steps (from t-1 to t) by a weight matrix W, and the hidden layer is connected to the output vector y(t) by a weight matrix V. All the weight matrices U, W, and V are shared across time steps, as the sketch below makes concrete.
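A minimal numpy sketch of a single recurrent step, just to make the roles of U, W, and V concrete (the dimensions are arbitrary assumptions):

import numpy as np

input_size, hidden_size, output_size = 3, 5, 1
rng = np.random.default_rng(0)

U = rng.normal(size=(hidden_size, input_size))   # input -> hidden
W = rng.normal(size=(hidden_size, hidden_size))  # hidden(t-1) -> hidden(t)
V = rng.normal(size=(output_size, hidden_size))  # hidden -> output

def rnn_step(x_t, h_prev):
    # h(t) is computed from the current input x(t) and the previous hidden state h(t-1)
    h_t = np.tanh(U @ x_t + W @ h_prev)
    y_t = V @ h_t
    return h_t, y_t

# unroll over a short sequence; the same U, W, V are reused at every time step
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(4, input_size)):
    h, y = rnn_step(x_t, h)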

Conclusion

A time series is constructed by data that is measured over time at evenly spaced intervals. I hope this comprehensive guide has helped you all understand the time series, its flow, and how it works. Although the TSA is widely used to handle data science problems, it has certain limitations, such as not supporting missing values. Note that the data points must be linear in their relationship for Time Series Analysis to be done.

Key Takeaways

Time series is a sequence of various data points that occurred in a successive order for a given period of time.

Trend, Seasonality, Cyclical, and Irregularity are components of TSA.


A Complete Tutorial On Time Series Modeling In R


Introduction

'Time' is the most important factor in ensuring success in a business. It's difficult to keep up with the pace of time. But technology has developed some powerful methods with which we can 'see things' ahead of time. Don't worry, I am not talking about a Time Machine. Let's be realistic here!

I'm talking about the methods of prediction and forecasting. One such method, which deals with time-based data, is time series modeling. As the name suggests, it involves working on time-based data (years, days, hours, minutes) to derive hidden insights for informed decision making.

Time series models are very useful when you have serially correlated data. Most businesses work on time series data to analyse sales numbers for the next year, website traffic, competitive position, and much more. However, it is also one of the areas which many analysts do not understand.

So, if you aren’t sure about complete process of time series modeling, this guide would introduce you to various levels of time series modeling and its related techniques.

What Is Time Series Modeling?

Let's begin with the basics. This includes stationary series, random walks, the Rho coefficient, and the Dickey-Fuller test of stationarity. If these terms are already scaring you, don't worry – they will become clear in a bit, and I bet you will start enjoying the subject as I explain it.

Stationary Series

There are three basic criterion for a series to be classified as stationary series:

1. The mean of the series should not be a function of time; rather, it should be a constant. The image below has the left-hand graph satisfying the condition, whereas the graph in red has a time-dependent mean.

2. The variance of the series should not be a function of time. This property is known as homoscedasticity. The following graph depicts what is and what is not a stationary series. (Notice the varying spread of the distribution in the right-hand graph.)

3. The covariance of the i-th term and the (i+m)-th term should not be a function of time. In the following graph, you will notice the spread becomes closer as time increases. Hence, the covariance is not constant with time for the 'red series'.

Why do I care about ‘stationarity’ of a time series?

The reason I took up this section first is that unless your time series is stationary, you cannot build a time series model. In cases where the stationarity criteria are violated, the first requirement is to stationarize the time series and then try stochastic models to predict it. There are multiple ways of achieving stationarity, such as detrending and differencing.

Random Walk

This is the most basic concept of the time series. You might know the concept well, but I have found many people in the industry who interpret a random walk as a stationary process. In this section, with the help of some mathematics, I will make this concept crystal clear forever. Let's take an example.

Example: Imagine a girl moving randomly on a giant chess board. In this case, next position of the girl is only dependent on the last position.

Now imagine, you are sitting in another room and are not able to see the girl. You want to predict the position of the girl with time. How accurate will you be? Of course you will become more and more inaccurate as the position of the girl changes. At t=0 you exactly know where the girl is. Next time, she can only move to 8 squares and hence your probability dips to 1/8 instead of 1 and it keeps on going down. Now let’s try to formulate this series :

X(t) = X(t-1) + Er(t)

where Er(t) is the error at time point t. This is the randomness the girl brings at every point in time.

Now, if we recursively fit in all the Xs, we will finally end up to the following equation :

X(t) = X(0) + Sum(Er(1),Er(2),Er(3).....Er(t))

Now, lets try validating our assumptions of stationary series on this random walk formulation:

1. Is the Mean constant?

E[X(t)] = E[X(0)] + Sum(E[Er(1)],E[Er(2)],E[Er(3)].....E[Er(t)])

We know that Expectation of any Error will be zero as it is random.

Hence we get E[X(t)] = E[X(0)] = Constant.

2. Is the Variance constant?

Var[X(t)] = Var[X(0)] + Sum(Var[Er(1)], Var[Er(2)], Var[Er(3)] ..... Var[Er(t)])

Var[X(t)] = t * Var(Error) = time dependent

Hence, we infer that the random walk is not a stationary process as it has a time variant variance. Also, if we check the covariance, we see that too is dependent on time.

Let's spice things up a bit.

We already know that a random walk is a non-stationary process. Let us introduce a new coefficient in the equation to see if we can make the formulation stationary.

Introduced coefficient: Rho

X(t) = Rho * X(t-1) + Er(t)

Now, we will vary the value of Rho to see if we can make the series stationary. Here we will interpret the scatter visually and not do any test to check stationarity.

Let’s start with a perfectly stationary series with Rho = 0 . Here is the plot for the time series :

Increasing the value of Rho to 0.5 gives us the following graph:

You might notice that our cycles have become broader, but essentially there does not seem to be a serious violation of the stationarity assumptions. Let's now take a more extreme case of Rho = 0.9.

We still see that X returns from extreme values back towards zero after some intervals. This series, too, does not violate stationarity significantly. Now, let's take a look at the random walk with Rho = 1; the simulation sketch below reproduces all four plots.
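These plots are easy to reproduce with a small simulation (a sketch in Python; any random seed works):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

def simulate(rho, n=1000):
    # X(t) = Rho * X(t-1) + Er(t)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.normal()
    return x

for rho in (0.0, 0.5, 0.9, 1.0):
    plt.plot(simulate(rho), label=f"rho = {rho}")
plt.legend()
plt.show()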

This is obviously a violation of the stationarity conditions. What makes Rho = 1 a special case that fails the stationarity test? We will find the mathematical reason for this.

Let’s take expectation on each side of the equation  “X(t) = Rho * X(t-1) + Er(t)”

E[X(t)] = Rho * E[X(t-1)]

This equation is very insightful. The next X (or at time point t) is being pulled down to Rho * Last value of X.

For instance, if X(t-1) = 1, then E[X(t)] = 0.5 (for Rho = 0.5). Now, if X moves in any direction away from zero, it is pulled back towards zero in the next step. The only component that can drive it further out is the error term, which is equally likely to go in either direction. What happens when Rho becomes 1? No force can pull X back down in the next step.

Dickey Fuller Test of Stationarity

What you just learnt in the last section is formally known as Dickey Fuller test. Here is a small tweak which is made for our equation to convert it to a Dickey Fuller test:

X(t) - X(t-1) = (Rho - 1) * X(t-1) + Er(t)

We have to test whether Rho – 1 is significantly different from zero. If the null hypothesis (Rho = 1) is rejected, we get a stationary time series.

Stationary testing and converting a series into a stationary series are the most critical processes in a time series modelling. You need to memorize each and every detail of this concept to move on to the next step of time series modelling.

Let’s now consider an example to show you what a time series looks like.

Exploration of Time Series Data in R

Here we'll learn to handle time series data in R. Our scope is restricted to exploring time series data; we will not go into building time series models yet.

I have used an inbuilt data set of R called AirPassengers. The dataset consists of monthly totals of international airline passengers, 1949 to 1960.

Loading the Data Set

Following is the code which will help you load the data set and spill out a few top level metrics.

data(AirPassengers)
class(AirPassengers)
[1] "ts"
# This tells you that the data series is in a time series format
start(AirPassengers)
[1] 1949 1
# This is the start of the time series
end(AirPassengers)
[1] 1960 12
# This is the end of the time series
frequency(AirPassengers)
[1] 12
# The cycle of this time series is 12 months in a year
summary(AirPassengers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
104.0 180.0 265.5 280.3 360.5 622.0
# The number of passengers is distributed across the spectrum

Detailed Metrics

plot(AirPassengers)
# This will plot the time series
abline(reg=lm(AirPassengers~time(AirPassengers)))
# This will fit in a line

Here are a few more operations you can do:

cycle(AirPassengers)
# This will print the cycle across years
plot(aggregate(AirPassengers, FUN=mean))
# This will aggregate the cycles and display a year-on-year trend
boxplot(AirPassengers~cycle(AirPassengers))
# A box plot across months will give us a sense of the seasonal effect

Important Inferences

The year-on-year trend clearly shows that passenger numbers have been increasing without fail.

The variance and the mean value in July and August are much higher than in the rest of the months.

Even though the mean value of each month is quite different, the variance within each month is small. Hence, we have a strong seasonal effect with a cycle of 12 months or less.

Exploring the data is the most important part of a time series model – without this exploration, you will not know whether a series is stationary or not. In this case, we have already gathered many details about the kind of model we are looking for.

Let’s now take up a few time series models and their characteristics. We will also take this problem forward and make a few predictions.

Introduction to ARMA Time Series Modeling

ARMA models are commonly used in time series modeling. In an ARMA model, AR stands for auto-regression and MA stands for moving average. If these words sound intimidating to you, worry not – I'll simplify these concepts for you in the next few minutes!

We will now develop a knack for these terms and understand the characteristics associated with these models. But before we start, you should remember, AR or MA are not applicable on non-stationary series.

If you get a non-stationary series, you first need to stationarize it (by differencing/transformation) and then choose from the available time series models.

First, I’ll explain each of these two models (AR & MA) individually. Next, we will look at the characteristics of these models.

Auto-Regressive Time Series Model

Let’s understanding AR models using the case below:

The current GDP of a country say x(t) is dependent on the last year’s GDP i.e. x(t – 1). The hypothesis being that the total cost of production of products & services in a country in a fiscal year (known as GDP) is dependent on the set up of manufacturing plants / services in the previous year and the newly set up industries / plants / services in the current year. But the primary component of the GDP is the former one.

Hence, we can formally write the equation of GDP as:

x(t) = alpha * x(t-1) + error(t)

This equation is known as the AR(1) formulation. The numeral one (1) denotes that the next instance depends solely on the previous instance. The alpha is a coefficient which we seek so as to minimize the error function. Notice that x(t-1) is, in turn, linked to x(t-2) in the same fashion. Hence, any shock to x(t) will gradually fade off in the future.

For instance, let’s say x(t) is the number of juice bottles sold in a city on a particular day. During winter, very few vendors purchased juice bottles. Suddenly, on a particular day, the temperature rose and the demand for juice bottles soared to 1000. However, after a few days, the climate became cold again. But, since people had got used to drinking juice during the hot days, 50% of them were still drinking juice during the cold days. In the following days, the proportion went down to 25% (50% of 50%) and then gradually to a small number after a significant number of days. The following graph explains the inertia property of an AR series:

Moving Average Time Series Model

Let’s take another case to understand Moving average time series model.

A manufacturer produces a certain type of bag, which was readily available in the market. Being a competitive market, the sale of the bag stood at zero for many days. So, one day he experimented with the design and produced a different type of bag, which was not available anywhere in the market. Thus, he was able to sell the entire stock of 1000 bags (let’s call this x(t)). The demand got so high that the bag ran out of stock. As a result, some 100-odd customers couldn’t purchase this bag. Let’s call this gap the error at that time point. With time, the bag lost its wow factor, but a few customers were still left who had gone empty-handed the previous day. Following is a simple formulation to depict the scenario:

x(t) = beta * error(t-1) + error(t)

If we try plotting this graph, it will look something like this:

Did you notice the difference between the MA and AR models? In the MA model, a noise/shock quickly vanishes with time, whereas in the AR model the shock has a much longer-lasting effect.
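
To make this contrast concrete, here is a minimal sketch in base R (the coefficient 0.7 is an arbitrary illustrative choice) that feeds a single one-time shock through an AR(1) and an MA(1) recursion and plots how long the shock lingers:

shock <- c(rep(0, 10), 1, rep(0, 89)) #a unit shock at t = 11, zero innovations elsewhere
ar1 <- filter(shock, filter = 0.7, method = "recursive") #x(t) = 0.7*x(t-1) + error(t)
ma1 <- filter(shock, filter = c(1, 0.7), method = "convolution", sides = 1) #x(t) = error(t) + 0.7*error(t-1)
par(mfrow = c(2, 1))
plot(ar1, type = "h", main = "AR(1): shock decays gradually")
plot(ma1, type = "h", main = "MA(1): shock vanishes after one lag")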

Difference Between AR and MA Models: Exploiting ACF and PACF Plots

Once we have got the stationary time series, we must answer two primary questions:

1. Is it an AR or MA process?

2. What order of AR or MA process do we need?

The trick to answering both questions lies in the previous section. Didn’t you notice?

The first question can be answered using the Total Correlation Chart (also known as the Auto-correlation Function / ACF). The ACF is a plot of the total correlation between different lags. For instance, in the GDP problem, the GDP at time point t is x(t). We are interested in the correlation of x(t) with x(t-1), x(t-2), and so on. Now let’s reflect on what we have learnt above.

In a moving average series of lag n, we will not get any correlation between x(t) and x(t-n-1). Hence, the total correlation chart cuts off at the nth lag, so it becomes simple to find the lag of an MA series. For an AR series, this correlation goes down gradually without any cut-off value. So what do we do if it is an AR series?

Here is the second trick. If we find the partial correlation at each lag, it will cut off after the degree of the AR series. For instance, if we have an AR(1) series and we exclude the effect of the 1st lag (x(t-1)), our 2nd lag (x(t-2)) is independent of x(t). Hence, the partial auto-correlation function (PACF) will drop sharply after the 1st lag. Following are examples which will clarify any doubts you have on this concept:

The blue line above marks the threshold for values significantly different from zero. Clearly, the graph above has a cut-off on the PACF curve after the 2nd lag, which means this is mostly an AR(2) process.

Clearly, the graph above has a cut-off on the ACF curve after the 2nd lag, which means this is mostly an MA(2) process.
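
These signatures are easy to reproduce yourself. Below is a minimal sketch in R (the coefficients are arbitrary illustrative choices) that simulates an AR(2) and an MA(2) series and draws their ACF/PACF plots, which should show exactly the cut-off and tail-off patterns described above:

set.seed(123)
ar2 <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 500) #a stationary AR(2) process
ma2 <- arima.sim(model = list(ma = c(0.6, 0.4)), n = 500) #an MA(2) process
par(mfrow = c(2, 2))
acf(ar2, main = "AR(2): ACF tails off")
pacf(ar2, main = "AR(2): PACF cuts off after lag 2")
acf(ma2, main = "MA(2): ACF cuts off after lag 2")
pacf(ma2, main = "MA(2): PACF tails off")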

Till now, we have covered how to identify the type of stationary series using ACF & PACF plots. Now, I’ll introduce you to a comprehensive framework to build a time series model. In addition, we’ll also discuss the practical applications of time series modelling.

Framework and Application of ARIMA Time Series Modeling

A quick revision: till here we’ve learnt the basics of time series modeling, time series in R, and ARMA modeling. Now is the time to join these pieces and make an interesting story.

Overview of the Framework

This framework (shown below) specifies the step-by-step approach to ‘How to do a Time Series Analysis’:

As you would be aware, the first three steps have already been discussed above. Nevertheless, they are delineated briefly below:

Step 1: Visualize the Time Series

It is essential to analyze the trends prior to building any kind of time series model. The details we are interested in pertain to any kind of trend, seasonality, or random behaviour in the series. We have covered this part in the data exploration section above.

Step 2: Stationarize the Series

Once we know the patterns, trends, cycles, and seasonality, we can check whether the series is stationary or not. The Dickey-Fuller test is one of the popular tests to check this; we covered it in the stationarity discussion above. It doesn’t end here! What if the series is found to be non-stationary?

There are three commonly used techniques to make a time series stationary (a short sketch contrasting the first two follows this list):

1. Detrending: Here, we simply remove the trend component from the time series. For instance, the equation of my time series is:

x(t) = (mean + trend * t) + error

We’ll simply remove the part in the parentheses and build a model for the rest.

2. Differencing: This is the most commonly used technique to remove non-stationarity. Here we try to model the differences of the terms rather than the actual terms. For instance,

x(t) - x(t-1) = ARMA(p, q)

This differencing is called the Integration part in AR(I)MA. Now, we have three parameters:

p : AR

d : I

q : MA

3. Seasonality: Seasonality can easily be incorporated in the ARIMA model directly. More on this has been discussed in the applications part below.
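
As promised above, here is a minimal sketch in R contrasting detrending and differencing on the AirPassengers series used earlier (the log is taken first to stabilize the growing variance):

lp <- log(AirPassengers) #log stabilizes the increasing variance
detrended <- residuals(lm(lp ~ time(lp))) #1. remove a fitted linear trend
differenced <- diff(lp) #2. model the period-to-period differences instead
par(mfrow = c(2, 1))
plot(ts(detrended, start = start(lp), frequency = 12), main = "Detrended log series")
plot(differenced, main = "Differenced log series")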

Step 3: Find Optimal Parameters

The parameters p, d, q can be found using the ACF and PACF plots. An addition to this approach: if both the ACF and PACF decrease gradually, it indicates that we need to make the time series stationary and introduce a value for ‘d’.

Step 4: Build ARIMA Model

With the parameters in hand, we can now try to build the ARIMA model. The values found in the previous section might be approximate estimates, so we need to explore more (p, d, q) combinations; the one with the lowest BIC and AIC should be our choice. We can also try some models with a seasonal component, in case we notice any seasonality in the ACF/PACF plots.
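
This search can also be automated. Here is a hedged sketch, assuming the forecast package is installed, using auto.arima to pick the (p,d,q) and seasonal orders by an information criterion; treat it as a cross-check on the manually chosen orders rather than a replacement for the plots:

library(forecast) #assumes the forecast package is installed
fit <- auto.arima(log(AirPassengers), seasonal = TRUE, ic = "bic") #searches over (p,d,q) and seasonal orders
summary(fit) #reports the chosen orders together with AIC/BIC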

Step 5: Make Predictions

Once we have the final ARIMA model, we are ready to make predictions on future time points. We can also visualize the trends to cross-validate that the model works fine.

Applications of Time Series Model

Now, we’ll use the same AirPassengers example from above and make future predictions using a time series model. We recommend you check out that example before proceeding further.

Where did we start?

Following is the plot of the number of passengers across the years. Try to make observations on this plot before moving further in the article.

Here are my observations:

1. There is a trend component: the number of passengers grows year by year.

2. There looks to be a seasonal component with a cycle of 12 months or less.

3. The variance in the data keeps on increasing with time.

We know that we need to address two issues before we test for stationarity. One, we need to remove the unequal variances; we do this by taking the log of the series. Two, we need to address the trend component; we do this by taking the difference of the series. Now, let’s test the resultant series.

adf.test(diff(log(AirPassengers)), alternative="stationary", k=0)

Augmented Dickey-Fuller Test

data: diff(log(AirPassengers))
Dickey-Fuller = -9.6003, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary

With a p-value of 0.01, the null hypothesis of non-stationarity is rejected: the series is stationary enough for any kind of time series modelling.

The next step is to find the right parameters for the ARIMA model. We already know that the ‘d’ component is 1, as we need one difference to make the series stationary. We find the remaining parameters using the correlation plots. Following are the ACF plots for the series:

#ACF Plots

acf(log(AirPassengers))

What do you see in the chart shown above?

Clearly, the decay of the ACF chart is very slow, which means that the series is not stationary. We have already discussed above that we intend to regress on the difference of logs rather than on the log directly. Let’s see how the ACF and PACF curves come out after regressing on the difference.

acf(diff(log(AirPassengers)))
pacf(diff(log(AirPassengers)))

Clearly, the ACF plot cuts off after the first lag. Since it is the ACF (not the PACF) that cuts off, the value of p should be 0, while the value of q should be 1 or 2. After a few iterations, we found that (0,1,1) as (p,d,q) comes out to be the combination with the least AIC and BIC, as the sketch below illustrates.
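
Those few iterations can be written down explicitly. Here is a minimal sketch of the comparison (the candidate orders below are illustrative, not exhaustive):

candidates <- list(c(0, 1, 1), c(1, 1, 0), c(1, 1, 1), c(0, 1, 2))
for (ord in candidates) {
  fit <- arima(log(AirPassengers), order = ord)
  cat(sprintf("(%d,%d,%d): AIC = %.2f\n", ord[1], ord[2], ord[3], AIC(fit)))
}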

Let’s fit an ARIMA model and predict the next 10 years. We will also fit a seasonal component in the ARIMA formulation and then visualize the prediction along with the training data. You can use the following code to do the same:

(fit <- arima(log(AirPassengers), c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12)))
pred <- predict(fit, n.ahead = 10*12)
ts.plot(AirPassengers, exp(pred$pred), log = "y", lty = c(1, 3)) #exp() back-transforms the log-scale predictions

Practice Projects

Now, it’s time to take the plunge and actually play with some other real datasets. So, are you ready to take on the challenge? Test the techniques discussed in this post and accelerate your learning in Time Series Analysis with hands-on practice problems.

Conclusion

With this, we come to the end of this tutorial on Time Series Modelling. I hope it helps you improve your ability to work with time-based data. To reap maximum benefit from this tutorial, I’d suggest you practice these R codes side by side and check your progress.

