Perform Regression Analysis With PyTorch Seamlessly!


Introduction

This is probably the 1000th article that talks about implementing regression analysis using PyTorch. So how is it different?

Well, before I answer that, let me describe the series of events that led to this article. When I started learning regression in PyTorch, I was excited, but I had so many whys and why-nots that I got frustrated at one point. So I thought, why not start from scratch: understand the deep learning framework a little better before delving into complex concepts like CNN, RNN, LSTM, etc. The easiest way to do that is to take a familiar dataset and explore it as much as you can, so that you understand the basic building blocks and the key working principles.

I have tried to explain the modules that are imported, why certain steps are mandatory, and how we evaluate a regression model in PyTorch.

So people, if you have just started or are looking for answers as I did, then you are definitely in the right place. 🙂

Okay, so let’s start with the imports first.

Imports

import torch
import torch.nn as nn

The torch.nn module helps us create and train neural networks, so we definitely need that. Let's move on:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

They look familiar, right? We need them because we have to do some preprocessing on the dataset we will be using.

Data

The dataset I am going to use is Celsius to Fahrenheit data which can be found here: link

Data preprocessing step 1: separate out the feature and the label
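As a reference, here is a minimal sketch of this step. I am assuming the CSV has columns named celsius and fahrenheit; the actual file name and column names in your copy of the dataset may differ.

# Hypothetical file and column names; adjust them to your copy of the dataset.
df = pd.read_csv("celsius_to_fahrenheit.csv")
X_train = df["celsius"].values      # feature
y_train = df["fahrenheit"].values   # label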



Data preprocessing step 2: scale the data, as the values are large and varied

So if you don’t do that for this particular case, then later while training the model, you will likely get inf or nan for loss values, meaning, the model cannot perform backpropagation properly and will result in a faulty model.

sc = MinMaxScaler()
sct = MinMaxScaler()
X_train = sc.fit_transform(X_train.reshape(-1,1))
y_train = sct.fit_transform(y_train.reshape(-1,1))

We have to make sure that X_train and y_train are 2-D, which is why we reshape them with (-1, 1).

Okay, so far so good. Now let’s enter into the world of tensors

Data preprocessing Step 3: Convert the numpy arrays to tensors

X_train = torch.from_numpy(X_train.astype(np.float32)).view(-1,1)
y_train = torch.from_numpy(y_train.astype(np.float32)).view(-1,1)

view() plays the same role for tensors that reshape() does in NumPy: it keeps the data 2-D.

input_size = 1   # input = celsius
output_size = 1  # output = fahrenheit

Define layer

class LinearRegressionModel(torch.nn.Module):
    def __init__(self):
        super(LinearRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(1, 1)  # One in and one out

    def forward(self, x):
        y_pred = self.linear(x)
        return y_pred

Or, since it is just a single layer, we could simply have done this:

model = nn.Linear(input_size, output_size)

In both cases, we are using nn.Linear to create our first linear layer. It applies a linear transformation to the data: for a straight line it is as simple as y = w*x + b, where y is the label, x the feature, w the weight, and b the bias. In our data, celsius and fahrenheit follow a linear relation, so we are happy with one layer, but in cases where the relationship is non-linear, we add additional steps to take care of the non-linearity, for example by adding a sigmoid activation.
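Just to illustrate that idea, here is a sketch of what a small non-linear variant could look like; it is not needed for this dataset, and the hidden size of 8 is an arbitrary choice.

# A hypothetical two-layer model with a sigmoid non-linearity in between.
nonlinear_model = nn.Sequential(
    nn.Linear(1, 8),   # expand to a small hidden layer
    nn.Sigmoid(),      # introduce non-linearity
    nn.Linear(8, 1),   # map back to a single output
)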

Define loss and optimizer

learning_rate = 0.0001
l = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

As you can see, the loss function in this case is MSE, or mean squared error. Our goal is to reduce this loss, and that is the optimizer's job; here we use stochastic gradient descent (SGD). SGD needs the model parameters (the weights it will update) and a learning rate.

Okay, now let’s start the training.

Training

num_epochs = 100

for epoch in range(num_epochs):
    # forward feed
    y_pred = model(X_train.requires_grad_())
    # calculate the loss
    loss = l(y_pred, y_train)
    # backward propagation: calculate gradients
    loss.backward()
    # update the weights
    optimizer.step()
    # clear out the gradients from the last step
    optimizer.zero_grad()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

forward feed: in this phase, we are just calculating the y_pred by using some initial weights and the feature values.

loss phase: after computing y_pred, we measure how much prediction error occurred. We use MSE for that.

backpropagation: in this phase gradients are calculated.

step: the weights are now updated.

zero_grad: finally, clear the gradients from the last step and make room for the new ones.

Evaluation

predicted = model(X_train).detach().numpy()

detach() tells PyTorch that we no longer need to track gradients for this tensor, so it is detached from the computation graph. Now, let's visualize the model quality with the first 100 data points.

plt.scatter(X_train.detach().numpy()[:100], y_train.detach().numpy()[:100])
plt.plot(X_train.detach().numpy()[:100], predicted[:100], "red")
plt.xlabel("Celsius")
plt.ylabel("Fahrenheit")
plt.show()

Notice how the predictions get better as the number of epochs increases. There are multiple other strategies to optimize the network, for example changing the learning rate, using different weight initialization techniques, and so on.

Lastly, try with a known celsius value and see if the model is able to predict the fahrenheit value correctly. The values are transformed, so make sure you do a sc.inverse_transform() and sct.inverse_transform() to get the actual values.
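As a reference, here is a minimal sketch of that sanity check, assuming the scalers sc and sct and the trained model defined above:

# Predict Fahrenheit for 100 °C (should come out close to 212 °F if the model trained well).
celsius_value = np.array([[100.0]], dtype=np.float32)
scaled_input = torch.from_numpy(sc.transform(celsius_value).astype(np.float32))
scaled_output = model(scaled_input).detach().numpy()
print(sct.inverse_transform(scaled_output))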


About the Author

Prior to that I have worked with Cray Inc. in Seattle and United Educators in Washington DC as a Data Scientist.

I have also done my Master’s in Data Science from Indiana University, Bloomington.



Learn To Predict Using Linear Regression In R With Ease (Updated 2023)

Introduction

Can you predict a company’s revenue by analyzing the budget it allocates to its marketing team? Yes, you can. Do you know how to predict using linear regression in R? Not yet? Well, let me show you how. In this article, we will discuss one of the simplest machine-learning techniques, linear regression. Regression is almost a 200-year-old tool that is still effective in data science. It is one of the oldest statistical tools used in machine learning predictive analysis.

Learning Objectives

Understand the definition and significance of Linear regression.

Explore the various applications of linear regression.

Learn to implement linear regression algorithms through the sample codes in R found in this tutorial.

This article was published as a part of the Data Science Blogathon.

What Is Linear Regression?

Simple linear regression analysis is a technique to find the association between two variables: a dependent variable (response variable) and an independent variable (predictor variable). The dependent variable responds to changes in the independent variable. Note that we are not calculating the dependency of the dependent variable on the independent variable, just the association.

For example, a firm is investing some amount of money in the marketing of a product, and it has also collected sales data throughout the years. By analyzing the correlation between the marketing budget and the sales data, we can predict next year's sales if the company allocates a certain amount of money to the marketing department. The above idea of prediction sounds magical, but it's pure statistics. The linear regression algorithm basically fits a straight line to our dataset using the least squares method so that we can predict future events. One limitation of linear regression is that it is sensitive to outliers. The best-fit line would be of the form Y = B0 + B1*X, where B0 is the intercept and B1 is the slope.

Practical Application of Linear Regression Using R

Let’s try to understand the practical application of linear regression in R with another example.

Let’s say we have a dataset of the blood pressure and age of a certain group of people. With the help of this data, we can train a simple linear regression model in R, which will be able to predict blood pressure at ages that are not present in our dataset.

You can download the Dataset from below:

Equation of the regression line in our dataset.

Now let’s see how to do this

Step 1: Import the Dataset

Import the Age vs. Blood Pressure dataset, a CSV file, using the read.csv( ) function in R, and store it in a data frame called bp.

bp <- read.csv("bp.csv")

Step 2: Create the Data Frame for Predicting Values

Create a data frame that will store Age 53. This data frame will help us predict blood pressure at Age 53 after creating a linear regression model.

p <- as.data.frame(53)
colnames(p) <- "Age"

Step 3: Create a Scatter Plot Using the ggplot2 Library

Using the ggplot2 library in R, we can see that there is a correlation between blood pressure and age: an increase in age is accompanied by an increase in blood pressure.

We can also use the plot( ) function in R for the scatter plot and the abline( ) function to plot straight lines.

It is quite evident from the graph that the distribution on the plot is scattered in a manner that we can fit a straight line through the data points.

Step 4: Calculate the Correlation Between Age and Blood Pressure

We can also verify our above analysis that there is a correlation between Blood Pressure and Age by taking the help of the cor( ) function in R, which is used to calculate the correlation between two variables.

cor(bp$BP,bp$Age)

[1] 0.6575673

Step 5: Create a Linear Regression Model


Now, with the help of the lm( ) function, we will build a linear model. lm( ) takes two main arguments. The first is a formula, here "BP ~ Age", because Age is the independent variable and Blood Pressure the dependent variable. The second is data, the name of the data frame containing the data, which in this case is bp. The model is fit as follows:

model <- lm(BP ~ Age, data = bp)

Summary of Our Linear Regression Model

summary(model)

Output:

##
## Call:
## lm(formula = BP ~ Age, data = bp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.724  -6.994  -0.520   2.931  75.654
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## Age           0.9709     0.2102   4.618 7.87e-05 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121
## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05

Interpretation of the Model

## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## Age           0.9709     0.2102   4.618 7.87e-05 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

B0 = 98.7147 (Y-intercept)
B1 = 0.9709 (Age coefficient)

BP = 98.7147 + 0.9709 Age

It means that a one-unit change in Age brings a 0.9709-unit change in Blood Pressure.

Standard Error

The standard error is the variability to expect in a coefficient estimate; it captures sampling variability. Here, the intercept estimate can vary by about 10.0005 and the Age coefficient by about 0.2102.

T value

The t value is the coefficient divided by its standard error; it is basically how big the estimate is relative to its error. The bigger the coefficient relative to the standard error, the bigger the t score. Each t score comes with a p-value, which indicates how statistically significant the variable is to the model at a 95% confidence level. We compare this p-value with alpha = 0.05. In our case, the p-values of both the intercept and Age are less than alpha, which implies that both are statistically significant to our model.

We can calculate the confidence interval using the confint(model, level=.95) method.

Residual Standard Error

The residual standard error, or the standard error of the model, is basically the average error of the model, which is 17.31 in our case. It means that our model can be off by an average of 17.31 while predicting blood pressure. The smaller the error, the better the model's predictions.

Multiple R-squared

Multiple R-squared is 1 - (sum of squared errors / total sum of squares); it measures the proportion of variation in the response explained by the model.

Adjusted R-squared

If we keep adding variables, whether or not they are significant for prediction, the value of R-squared will increase. That is why adjusted R-squared is used: if an added variable is not significant for the model's predictions, the adjusted R-squared value decreases. It is one of the most helpful tools to avoid overfitting the model.

F – statistics

The F-statistic is the ratio of the mean square of the model to the mean square of the error. In other words, it compares how well the model is doing with how much error remains; the higher the F value, the better the model is doing relative to the error.

Here, 1 is the degrees of freedom of the numerator of the F-statistic, and 28 is the degrees of freedom of the errors.

Step 6: Run a Sample Test

Now, let’s try using our model to predict the value of blood pressure for someone at age 53.

BP = 98.7147 + 0.9709 Age

The above formula will be used to calculate blood pressure at the age of 53, and this will be achieved using the predict( ) function. We pass the name of the linear regression model first, followed by the newdata argument set to the data frame p, which stores the Age value 53.

predict(model, newdata = p)

Output:

## 1

## 150.1708

So, the predicted value of blood pressure is 150.17 at age 53.

So far, we have predicted blood pressure from its association with age alone. Often, more than one independent variable is correlated with the dependent variable. This is called multiple regression.

Multiple Linear Regression Model

Multiple linear regression analysis is a statistical technique to find the association of multiple independent variables with a dependent variable. For example, the revenue generated by a company depends on various factors, including market size, price, promotion, competitors' prices, etc. Basically, a multiple linear regression model establishes a linear relationship between a dependent variable and multiple independent variables.

Taking another example, the wine dataset: with the help of AGST and HarvestRain, we are going to predict the price of wine. Here, AGST and HarvestRain are the independent (predictor) variables.

Here’s how we can build a multiple linear regression model.

Step 1: Import the Dataset

Using the read.csv( ) function, import both datasets, wine.csv and wine_test.csv, into the data frames wine and wine_test, respectively.

wine <- read.csv("wine.csv")
wine_test <- read.csv("wine_test.csv")

You can download the dataset below.

Step 2: Find the Correlation Between Different Variables

Using the cor( ) function and round( ) function, we can round off the correlation between all variables of the dataset wine to two decimal places.

round(cor(wine), 2)

Output:

##              Year Price WinterRain  AGST HarvestRain   Age FrancePop
## Year         1.00 -0.45       0.02 -0.25        0.03 -1.00      0.99
## Price       -0.45  1.00       0.14  0.66       -0.56  0.45     -0.47
## WinterRain   0.02  0.14       1.00 -0.32       -0.28 -0.02      0.00
## AGST        -0.25  0.66      -0.32  1.00       -0.06  0.25     -0.26
## HarvestRain  0.03 -0.56      -0.28 -0.06        1.00 -0.03      0.04
## Age         -1.00  0.45      -0.02  0.25       -0.03  1.00     -0.99
## FrancePop    0.99 -0.47       0.00 -0.26        0.04 -0.99      1.00

Step 3: Create Scatter Plots Using the ggplot2 Library

Create a scatter plot using the library ggplot2 in R. This clearly shows that AGST and the Price of the wine are highly correlated. Similarly, the scatter plot between HarvestRain and the Price of wine also shows their correlation.

ggplot(wine, aes(x = AGST, y = Price)) + geom_point() + geom_smooth(method = "lm")

ggplot(wine, aes(x = HarvestRain, y = Price)) + geom_point() + geom_smooth(method = "lm")

Step 4: Create a Multilinear Regression Model

model1 <- lm(Price ~ AGST + HarvestRain, data = wine)
summary(model1)

Output:

##
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.88321 -0.19600  0.06178  0.15379  0.59722
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265    1.85443  -1.188 0.247585
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06

Interpretation of the Model

## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20265    1.85443  -1.188 0.247585
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Price = -2.20265 + 0.60262 AGST - 0.00457 HarvestRain

It means that a one-unit change in AGST increases Price by 0.60262 units, and a one-unit change in HarvestRain decreases Price by 0.00457 units.

Standard Error

The standard error is the variability to expect in a coefficient estimate; it captures sampling variability. Here, the intercept estimate can vary by about 1.85443, the AGST coefficient by about 0.11128, and the HarvestRain coefficient by about 0.00101.

In this case, the p-value of intercept, AGST, and HarvestRain are less than alpha (alpha = 0.05), which implies that all are statistically significant to our model.

Residual Standard Error

The residual standard error or the standard error of the model is 0.3674 in our case, which means that our model can be off by an average of 0.3674 while predicting the Price of wines. The lesser the error, the better the model while predicting. We have also looked at the residuals, which need to follow a normal distribution.

As before, multiple R-squared is 1 - (sum of squared errors / total sum of squares).

Here, 2 is the degrees of freedom of the numerator of the F-statistic, and 22 is the degrees of freedom of the errors.

Step 5: Predict the Values for Our Test Set

prediction <- predict(model1, newdata = wine_test)

Predicted values with the test data set

wine_test

##   Year  Price WinterRain    AGST HarvestRain Age FrancePop
## 1 1979 6.9541        717 16.1667         122   4  54835.83
## 2 1980 6.4979        578 16.0000          74   3  55110.24

prediction

##        1        2
## 6.982126 7.101033

Conclusion

Linear regression is a versatile model which is suitable for many situations. As we can see from the available datasets, we can create a simple linear regression model or multiple linear regression model and train that model to accurately predict new events or future outcomes if enough data is available.

Key Takeaways

Simple linear regression analysis is a statistical technique to find the association between an independent and a dependent variable.

Multiple linear regression analysis is a technique to find the association of multiple independent variables with a single dependent variable.

Both of these methods are widely used to design ML models in R for various applications.

Frequently Asked Questions

Q1. What does LM () do in R?

A. The lm() function is used to fit a linear regression model to data in R.

Q2. How do you find the correlation coefficient in R?

A. You can find the correlation coefficient in R by using the cor( ) function.

Q3. What are the slope and the intercept in linear regression?

A. The slope indicates the rate of change in the dependent variable per unit change in the independent variable. The y-intercept indicates the dependent variable when the independent variable is 0.


Power BI Copilot: Enhancing Data Analysis With AI Integration

Are you ready to elevate your data analysis capabilities? Then let’s delve into the realm of Power BI Copilot and its AI integration. This tool isn’t just another addition to your data analysis toolkit; it’s akin to having an intelligent assistant, always ready to help you navigate through your data.

In this article, we’ll explain how Power BI Copilot works and how you can leverage it to empower you and your organization.

Let’s get started!

Copilot is an AI tool that provides suggestions for code completion. The tool is powered by Codex, an AI system developed by OpenAI that can generate code from a user’s natural language prompts.

Copilot already has Git integration in GitHub Codespaces, where it can be used as a tool for writing tests, fixing bugs, and autocompleting snippets from plain English prompts like “Create a function that checks the time” or “Sort the following information into an alphabetical list.”

The addition of Copilot in Power BI has infused the power of large language models into Power BI. Generative AI can help users get more out of their data. All you have to do is describe the visuals or the insights you want, and Copilot will do the rest.

With Copilot, you can:

Create and tailor Power BI reports and gain insights in minutes

Generate and refine DAX calculations

Ask questions about your data

Create narrative summaries

All of the above can be done using conversational language. Power BI already had some AI features, such as the quick measure suggestions that help you come up with DAX measures using natural language, but Copilot takes it to the next level.

With Copilot, you can say goodbye to the tedious and time-consuming task of sifting through data and hello to instant, actionable insights. It’s the ultimate assistant for uncovering and sharing insights faster than ever before.

Some of its key features include:

Automated report generation: Copilot can automatically generate well-designed dashboards, data narratives, and interactive elements, reducing manual report creation time and effort.

Conversational language interface: You can describe your data requests and queries using simple, conversational language, making it easier to interact with your data and obtain insights.

Real-time analytics: Power BI users can harness Copilot’s real-time analytics capabilities to visualize data and respond quickly to changes and trends.

Alright, now that we’ve gone over some of the key features of Power BI Copilot, let’s go over how it can benefit your workflow in the next section.

Looking at Power BI Copilot’s key features, it’s easy to see how the tool has the potential to enhance your data analysis experience and business decision-making process.

Some benefits include:

Faster insights: With the help of generative AI, Copilot allows you to quickly uncover valuable insights from your data, saving time and resources.

Ease of use: The conversational language interface makes it easy for business users with varying levels of technical expertise to interact effectively with the data.

Reduced time to market: Using Copilot in Power Automate can reduce the time to develop workflows and increase your organization’s efficiency.

Using Power BI Copilot’s features in your production environments will enable you to uncover meaningful insights from your data more efficiently and make well-informed decisions for your organization. However, the product is not without its limitations, as you’ll see in the next section.

Copilot for Microsoft Power BI is a new product that was announced together with Microsoft Fabric in May 2023. However, it’s still in private preview mode and hasn’t yet been released to the public. There is no official public release date, but it’ll likely be launched before 2024.

Some other limitations of Copilot include:

Quality of suggestions: Copilot is trained in all programming languages available on public repositories. However, the quality of the suggestions may depend on the volume of the available training dataset for that language. Suggestions for niche programming languages (APL, Erlang, Haskell, etc.) won’t be as good as those of popular languages like Python, Java, C++, etc.

Doesn’t understand context like a human: While the AI has been trained to understand context, it is still not as capable as a human developer in fully understanding the high-level objectives of a complex project. It may fail to provide appropriate suggestions in some complicated scenarios.

Lack of creative problem solving: Unlike a human developer, the tool cannot come up with innovative solutions or creatively solve problems. It can only suggest code based on what it has been trained on.

Possible legal and licensing issues: As Copilot uses code snippets from open-source projects, there are questions about the legal implications of using these snippets in commercial projects, especially if the original code was under a license that required derivative works to be open source as well.

Inefficient for large codebases: The tool is not optimized for navigating and understanding large codebases. It’s most effective at suggesting code for small tasks.

While Power BI Copilot offers a compelling platform for data analytics and visualization, its limitations shouldn’t be overlooked. You have to balance the undeniable benefits of Copilot with its constraints and align the tool with your unique operational needs.

As we mentioned in the previous section, Copilot for Power BI was announced at the same time as Microsoft Fabric, so naturally, there’s a lot of confusion about whether Fabric is replacing Power BI or whether Power BI is now a Microsoft Fabric product.

Microsoft Fabric is a unified data foundation that’s bringing together several data analysis tools under one umbrella. It’s not replacing Power BI; instead, it’s meant to enhance your Power BI experience.

Power BI is now one of the main products available under the Microsoft Fabric tenant setting. Some other components that fall under the Fabric umbrella include:

Data Factory: This component brings together the best of Power Query and Azure Data Factory. With Data Factory, you can integrate your data pipelines right inside Fabric and access a variety of data estates.

Synapse Data Engineering: Synapse-powered data engineering gives data professionals an easy way to collaborate on projects that involve data science, business intelligence, data integration, and data warehousing.

Synapse Data Science: Synapse Data Science is designed for data scientists and other data professionals who work with large data models and want industry-leading SQL performance. It brings machine-learning tools, collaborative code authoring, and low-code tools to Fabric.

Synapse Data Warehousing: For data warehousing professionals, Synapse Data Warehouse brings the next-gen of data warehousing capabilities to Fabric with open data formats, cross-querying, and automatic scaling.

Synapse Real-Time Analytics: This component simplifies data integration for large organizations and enables business users to gain quick access to data insights through auto-generated visualizations and automatic data streaming, partitioning, and indexing.

OneLake: The “OneDrive for data,” OneLake is a multi-cloud data lake where you can store all an organization’s data. It’s a lake-centric SaaS solution with universal compute capacities to enable multiple developer collaboration.

Through Fabric, Microsoft is bringing the capabilities of machine learning models to its most popular data science tools. There are other components, like Data Activator, which are still in private preview and are not yet available in Fabric.

Microsoft Fabric is available to all Power BI Premium users with a free 60-day trial. To get started, go to the Power BI admin portal and opt-in to start the free trial.

In a world brimming with data, Copilot might just be the ‘wingman’ you need to make your data speak volumes. It’s turning Power BI into a human-centered analytics product that enables both data engineers and non-technical users to explore data using AI models.

Whether you’re a small business trying to make sense of customer data or a multinational figuring out global trends, give Copilot a whirl and let it take your data analysis to the next level. Happy analyzing!

To learn more about how to use Power BI with ChatGPT to supercharge your organization’s reports, check out the playlist below:

Copilot in Power BI is still in private preview, but it will become available to Power BI customers soon. With this tool, users can use natural language queries to write DAX formulas, auto-generate complete reports using Power BI data, and add visualizations to existing reports.

To use Copilot in Power BI, all you have to do is write a question or request describing what you want, such as “Help me build a report summarizing the profile of customers who have visited our homepage.” If you want Copilot to give you suggestions, type “/” in the query box.

Once Copilot for Power BI comes out of private preview, it’ll be available at no extra cost to all Power BI license holders (pro or premium).

Learning Database for Data Science Tutorial – Perform MongoDB Indexing Using PyMongo

You can’t get away from learning about databases in data science. In fact, we need to become quite familiar with how to handle databases, how to quickly execute queries, etc. as data science professionals. There’s just no way around it!

There are two things you should know – learn all you can about database management and then figure out how to efficiently go about it. Trust me, you will go a long way in the data science domain.

MongoDB is a popular NoSQL database that is designed for ease of development and scaling. It can handle huge volumes of data with great efficiency (a must-have feature in today's data-driven world). So how does MongoDB do this?

One word – Indexing!

Just like any other database, MongoDB also provides support for indexing data. Indexing in MongoDB allows us to fetch the documents at a much faster rate thereby improving the performance of the database. This is significantly important when the database is handling huge amounts of data, probably in terabytes!

MongoDB has different index types with various properties that can handle complex queries. The indexes can be created and dropped as and when required and on any type of field in the document.

So in this article, we will look at indexing in MongoDB and when you should use it appropriately.

If you are a beginner in MongoDB, I suggest going through the below articles:

Table of Contents

What is Indexing?

Connecting to MongoDB Atlas

Accessing Data with PyMongo

Indexing in MongoDB

MongoDB Collection Indexes

Explain Query Result

Create a Single Field Index

Dropping Indexes

Compound Indexes

Multikey Indexes

Text Indexes

Geospatial Indexes

Index Properties

Unique Indexes

Partial Indexes

What is Indexing?

I’m sure you’ve done this – instantly jump to the relevant page in a book just by looking at its index. That is what indexing feels like in databases as well. Just by looking at the index, we instantly hop to the appropriate location in memory without having to look over all the other data in the database.

Indexing allows us to reduce the number of documents our database needs to scan through to look for the specific document thereby improving the performance of the database by leaps and bounds.

For example, if you have a database storing information about employees in a company and you frequently query the database based on their department field, then it is a wise thing to create an index on the department field. The database will arrange the department’s field values in order. Now, whenever you try to query based on the department, the database will simply go over the index first, jump to the location where the relevant records are, and retrieve the documents. This, in simple terms, is how indexing works.

However, in order to use indexes appropriately, you need to be aware of the queries you will be performing on the database. Otherwise creating random indexes will just be a waste of resources.

If in the example above we had created indexes on the name of the employee instead of the department, then it would have been of no use as the database would still have to go over all the documents. Therefore, it is important to know your queries for indexing.

Now, let’s look at how we can perform indexing in MongoDB. But first, we need to load our dataset.

Connecting to MongoDB Atlas

MongoDB Atlas is a global cloud database service. Using it, we can deploy fully managed MongoDB across AWS, Azure, or GCP and create truly flexible and scalable databases.

We will be using the sample dataset available with MongoDB Atlas clusters. So before we get our hands dirty with indexing in MongoDB, we need to first create a MongoDB Atlas account and then create a free cluster to access the sample dataset.

Creating a MongoDB Atlas account is pretty straightforward. All you have to do is register your account here and log in to your account

Then you need to deploy a Free Tier Cluster. These never expire but you can only create one per project

Now you need to whitelist your IP address so that you can connect to the cluster and access the database

Next, you need to create a user to access the cluster. Provide the username and password

Now you need to connect to your cluster

Note: My Python version is 3.7.4; however, selecting the "3.6 or later" option from the "Version" dropdown list gave me an error. Selecting "3.4 or later" worked fine. You can try that if you get an error.

Once you have the connection string, fire up your Jupyter notebook because we will be needing the PyMongo library!

Accessing Data with PyMongo

View the code on Gist.

We will be working with the “sample_restaurants” database here. Load the database using the following command:

View the code on Gist.
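The embedded snippets are not reproduced in this extract, but connecting and selecting the database with PyMongo typically looks like the sketch below; the connection string is a placeholder that you get from the Atlas "Connect" dialog.

import pymongo
from pprint import pprint

# Placeholder connection string; use the one Atlas gives you.
client = pymongo.MongoClient("mongodb+srv://<user>:<password>@<cluster-url>/test")
db = client["sample_restaurants"]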

You can have a look at all the collections within this database:

View the code on Gist.

You can count the number of documents within each collection:

View the code on Gist.
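Again, the embedded snippets are not shown here; listing the collections and counting their documents can be sketched as follows, using the db handle from the connection sketch above.

# List the collections in the database.
print(db.list_collection_names())   # e.g. ['restaurants', 'neighborhoods']

# Count the documents in each collection.
print(db.restaurants.count_documents({}))
print(db.neighborhoods.count_documents({}))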

Here is a look at the documents from each collection:

Indexing in MongoDB

Without indexes, a MongoDB database has to scan every document in a collection to look for the relevant documents. This is called COLLSCAN, or a collection scan. However, using indexes greatly reduces the number of documents MongoDB needs to scan. Hence, using indexes makes the execution of queries quite efficient, especially if the database stores a large number of documents, for which MongoDB is very popular.

MongoDB offers a number of different index types, each with various properties. So, let's explore these indexes and when best to use them!

MongoDB Collection Indexes

Every MongoDB collection has a default index: _id. It is created when the collection is created, and it makes sure that no two documents in the collection contain duplicate _id values.

To know how many indexes there are in a collection and other related information about them, we use the index_information() function.

If we execute it right now, it will return only the _id index, since it is the only index present in the collection so far. In fact, even if you drop the custom indexes from the collection, you will still be left with this default index.

Here is the index from the restaurant collection:

View the code on Gist.

{'_id_': {'v': 2, 'key': [('_id', 1)], 'ns': 'sample_restaurants.restaurants'}}

It returns the result as a dictionary. The key is the index name, '_id_' for the _id index. The values are dictionaries containing information about the index.

Similarly, if we retrieve the index for the neighborhoods collection, we will get the default index:

View the code on Gist.

{'_id_': {'v': 2, 'key': [('_id', 1)], 'ns': 'sample_restaurants.neighborhoods'}}

Explain Results

When we run a query in MongoDB, we can actually determine a lot of information about the query using the explain() function.

It returns a document that contains information about the query plans and the execution statistics. We are interested in the execution statistics.

Execution statistics contain a lot of information about how MongoDB was able to retrieve the query results. But first, let’s have a look at the document returned by the explain() function:

Here, focus on the executionStats key which contains the execution statistics. Below are the important keys to focus on:

explain.executionStats.nReturned returns the number of documents that match the query

explain.executionStats.executionTimeMillis returns the total time in milliseconds required for query plan selection and query execution

explain.executionStats.totalKeysExamined returns the number of index entries scanned

explain.executionStats.totalDocsExamined returns the number of documents examined during query execution. These are not documents that are returned

The stage entries under explain.executionStats.executionStages describe the operation: COLLSCAN represents a collection scan, while IXSCAN, which we will see later, represents an index key scan.

In this query, you will notice that "executionStats.executionStages.stage" is COLLSCAN. This means that no index could be used for the query, so MongoDB scanned the whole collection. Also, notice that explain.executionStats.nReturned and explain.executionStats.totalDocsExamined equal the number of documents in the collection, because every document had to be examined. These values will change when we use indexes.

So, without further ado, let’s create our first index in MongoDB!

Create a Single Field Index

Suppose we want to find all the records of the restaurants that offer American cuisine. Let’s try finding that using the MongoDB find() function:

Now, we can use the explain() function to elicit the statistics about this query. Specifically, we will look at the executionStats key:
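The original code is embedded as a Gist; a minimal sketch of what such a query and its explain output look like with PyMongo is shown below, continuing with the db handle from the connection sketch (the cuisine value "American" comes from the text).

# Find restaurants serving American cuisine and inspect the execution statistics.
explain_result = db.restaurants.find({"cuisine": "American"}).explain()
pprint(explain_result["executionStats"])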

As you can see, it took 10 milliseconds and MongoDB had to scan all the 25,359 documents while it returned only 6,185 documents because it was using a collection scan. Can we make this querying faster?

Yes, we can do that using indexes.

Since we are querying on the cuisine key, we can create a new index on it. This we can do by using the create_index() function:
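A minimal sketch of creating that single field index with PyMongo; the auto-generated name for an ascending index on cuisine is cuisine_1, which matches the name used later in the text.

# Create an ascending single field index on "cuisine".
db.restaurants.create_index([("cuisine", pymongo.ASCENDING)])

# Check the indexes on the collection.
pprint(db.restaurants.index_information())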

As you can see here, we have two indexes now: the default index _id and the custom index cuisine_1. Since we created the index on a single field, it is called a single field index.

Now, let’s use this index for the previous query and have a look at the statistics:

Notice the “executionStats.totalDocsExamined” values for both the queries. In the former query where we were using the default index, all the 25,359 records had to be scanned to look for the relevant records. But when we created the index, only 6,183 records were scanned. This considerably reduces the amount of time to query from the database as indicated by the “executionStats.executionTimeMillis” values.

Also, notice that the “executionStats.executionStages.stage” is IXSCAN here instead of COLLSCAN because MongoDB has scanned the index key we generated.

You can even name your index during creation. create_index() has a name argument where you can provide your custom name.

Let’s create a new index because creating the same index again with a different name will give you an error. So, I will create a new index using the borough key and name it borough_index (so much for originality!):

Here, notice that the new index is referenced as “borough_index”, just like we wanted it to. Now, let’s see how to drop indexes before we move on to compound indexes in MongoDB.

Dropping Indexes

We can drop indexes with the same ease with which we created them. You will have to provide the name of the index you want to drop in the drop_index() function:

Here, we have dropped the custom index “cuisine_1”. You can also drop all the custom indexes for a collection with a single command using the drop_indexes() function:
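The original snippets are not shown in this extract; sketches of both drop operations look like this:

# Drop a single index by name.
db.restaurants.drop_index("cuisine_1")

# Drop all custom indexes on the collection (the default _id index stays).
db.restaurants.drop_indexes()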

Note that this only drops the custom indexes we created.

The default _id index can never be dropped. If you try to do that, you will get an error:

Now, let’s check out how to create compound indexes in MongoDB!

Compound Indexes

Let’s say you query the collection on more than one field, then instead of creating a single index, you can create a compound index that contains all the fields in the query.

MongoDB imposes a limit of 32 fields for any compound index.

Now, if we wanted to query the collection to retrieve documents based on the cuisine in a specific borough, we can create a compound key that contains these two fields. First, let’s have a look at all the unique cuisines and boroughs in the collection.

We can run the query before creating the index to compare the performance of the database:

As expected, we are using COLLSCAN and had to scan all the 25,359 documents while we returned only 152 relevant documents that matched the query.

Now we can create a compound index using the borough and cuisine fields:
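A sketch of creating such a compound index with PyMongo; the name borough_cuisine matches the index name referenced later in the text.

# Compound index: borough first, then cuisine, both ascending.
db.restaurants.create_index(
    [("borough", pymongo.ASCENDING), ("cuisine", pymongo.ASCENDING)],
    name="borough_cuisine",
)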

The index will contain references to documents sorted first by the values of the borough field and, within each value of the borough field, sorted by values of the cuisine field. That is, “Bronx”, “Afghan”; “Bronx”, “African”; etc.

You must have noticed that we are now providing the fields as tuples, where the first value is the name of the field and the second value may be unfamiliar. That second value determines the index direction, which means we can order the items based on their index value. Since we use ASCENDING order for both fields here, borough will be ordered A-Z alphabetically and, within each borough value, cuisine will also be ordered A-Z alphabetically. For example:

“Bronx”, “Afghan”;…;”Bronx”, “Vegetarian”

.

.

.

“Staten Island”, “Afghan”;…;”Staten Island”, “Vegetarian”

If, however, we had ASCENDING for borough and DESCENDING for cuisine, the index would be ordered A-Z for borough and, within each borough, Z-A for cuisine, as shown below:

“Bronx”, “Vegetarian”;…;”Bronx”, “Afghan”

.

.

.

“Staten Island”, “Vegetarian”;…;”Staten Island”, “Afghan”

The code for such an index is:

# Create Compound index
# db.restaurants.create_index([('borough', pymongo.ASCENDING),
#                              ('cuisine', pymongo.DESCENDING)],
#                             name='cuisine_borough')
# Get indexes
# pprint(db.restaurants.index_information())

Now, let us try to run the same query as before and notice the difference:

Here, MongoDB used the “borough_cuisine” index that we created to retrieve the result in 1 millisecond and it only had to scan 152 documents, which is equal to the number of documents returned as well. Therefore, we were able to optimize the query by creating the compound index.

But, in addition to supporting queries that match on all the index fields, compound indexes can also support queries that match the prefix subsets of the compound index fields. That is, we can query on just borough in addition to querying on borough combined with cuisine as we have done so far.

Let’s query to find the restaurants in “Manhattan” using the index:

As you can see, MongoDB is using the “borough_cuisine” index to scan for the relevant documents.

Next, let’s see how to create multikey indexes in MongoDB.

Multikey Indexes

We can even create indexes on fields that contain array values. These are called Multikey indexes. MongoDB creates an index key for each element in the array. This is created automatically by MongoDB, so you do not have to explicitly specify it.

Multikey indexes can be built over arrays that hold scalar values (values that are neither embedded documents nor arrays, such as strings and numbers) as well as arrays of nested documents.

But before creating a multikey index, let’s first drop the indexes we have created so far.

View the code on Gist.

Now, as mentioned before, you can create a multikey index on arrays of scalar values, like:

{ _id: 1, a: [ 1, 2 ], b: [ 1, 2 ]}

Or create multikey indexes on array fields that contain nested objects like:

{_id: 1, a: [{'score': 4, 'grade': 'A'}, {'score': 2, 'grade': 'B'}], b: "ABC"}

Now, since our collection here has nested objects in array format, we will be using that to create the multikey index. We will create a multikey index for the “grades.grade” field in the restaurant collection.

First, we will look at the execution statistics for the query we want to run. I want to find out the restaurants that have had a grade Z in their review.

As you can notice, MongoDB is scanning all the collection documents and returned the results in 28 milliseconds. Now, let’s try to create a multikey index on this field and notice the change in query optimization.
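A sketch of creating the index and re-running the query with explain() to compare; MongoDB marks the index as multikey automatically because grades is an array field.

# Create the index on the embedded array field.
db.restaurants.create_index([("grades.grade", pymongo.ASCENDING)])

# Re-run the query and compare the execution statistics.
explain_result = db.restaurants.find({"grades.grade": "Z"}).explain()
pprint(explain_result["executionStats"])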

Execution statistics for the query.

As you can notice, MongoDB was able to retrieve the documents in just 4 milliseconds as opposed to 28 milliseconds initially. Also, it only had to examine 1337 documents.

Text Indexes

Now, regular expressions are useful for matching an exact value within a text field. But if you are looking to match a specific word within a text field, then you ought to use the text index.

Let’s have a look at one document from the restaurant collection.

If you wanted to retrieve all the restaurants that have the word “Kitchen” in their name, you can easily do that using the text index.

Now if we wanted to retrieve all the restaurants with the keyword “Kitchen” in their name, we can simply write the following query.
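A sketch of creating a text index on the name field and searching it, assuming the restaurant name lives in the name field as shown in the sample document above:

# Create a text index on the restaurant name.
db.restaurants.create_index([("name", pymongo.TEXT)])

# Find restaurants whose name contains the word "Kitchen".
for doc in db.restaurants.find({"$text": {"$search": "Kitchen"}}).limit(5):
    print(doc["name"])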

You can even search for multiple keywords in the name. The following query will return those documents that have either “Chinese” or “Kitchen” in their name.

You can even negate certain documents that contain a specific keyword. For example, the following query retrieves all the documents that have the keyword “Chinese” in them but not the keyword “Restaurant”.
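Sketches of the two variations just described, using MongoDB's text search syntax (a space-separated list matches any of the terms, and a leading minus negates a term):

# Match either "Chinese" or "Kitchen" in the name.
either = db.restaurants.find({"$text": {"$search": "Chinese Kitchen"}})

# Match "Chinese" but exclude documents whose name contains "Restaurant".
negated = db.restaurants.find({"$text": {"$search": "Chinese -Restaurant"}})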

There is a lot more you can do with text indexes in MongoDB and I implore you to check it out.

Geospatial Indexes

Location-based data is commonly used these days because of the proliferation of mobile devices. This means that finding the closest places to a location is a very common query that you will need to perform in today’s time. To handle such queries efficiently, MongoDB provides the geospatial indexes for querying the coordinates.

Let’s look at the documents present in our collections.

The restaurant’s collection contains the coordinates of the restaurant.

The neighborhoods collection contains the boundaries of each restaurant's neighborhood. We can create geospatial indexes for both fields. First, let's create one for the restaurant coordinates.

The 2d index is used to index points on a two-dimensional plane.

$near returns documents from nearest to farthest on a geospatial index.
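A sketch of creating the 2d index and running a $near query; the field name address.coord and the example coordinates are assumptions based on the sample_restaurants dataset.

# 2d index on the legacy [longitude, latitude] coordinate pairs.
db.restaurants.create_index([("address.coord", pymongo.GEO2D)])

# Return restaurants sorted from nearest to farthest from a hypothetical point.
nearest = db.restaurants.find(
    {"address.coord": {"$near": [-73.96, 40.78]}}
).limit(5)
for doc in nearest:
    print(doc["name"])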

View the code on Gist.

25357

The following query returns documents that are at least 10 meters from and at most 1000 meters from the specified coordinates, sorted from nearest to farthest.

View the code on Gist.

11

Here we returned only 11 documents based on their proximity to the given location. Now let's create a geospatial index for the neighborhoods collection.

The coordinates in the neighborhoods collection describe shapes on an Earth-like sphere; you must have noticed that the shape of the neighborhood was provided along with the coordinates. Therefore, we need to create a 2dsphere index, which supports queries that calculate geometries on an Earth-like sphere.
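A sketch of creating that 2dsphere index; the geometry field name is an assumption based on the sample_restaurants neighborhoods documents.

# 2dsphere index on the GeoJSON geometry of each neighborhood.
db.neighborhoods.create_index([("geometry", pymongo.GEOSPHERE)])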

0

Since there are no documents that exist within our specified geospatial shape, therefore 0 documents were returned.

Now let’s talk about some properties of these index types we have looked at so far.

Index properties

Just like there are many different indexes in MongoDB, there are many properties for these indexes as well. Here we look at some of those properties.

Unique Indexes

So far, the indexes we created were not unique. Meaning, there could be more than a single document for a given index value. But the Unique index guarantees that, for a given field, every document in the collection will have a unique value. For example, if you want to make sure no two documents can have the same value in the “username” field, you can create a unique index.

A unique index that you are probably already familiar with is the index on “_id”, which is created whenever you create a collection.

Looking at the neighborhoods records, we can create a unique index on the neighborhood name.
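A sketch of creating that unique index:

# Unique index: no two neighborhoods may share the same name.
db.neighborhoods.create_index([("name", pymongo.ASCENDING)], unique=True)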

Now if we try to insert a duplicate value for the “name” field, we will get an error.

We can even create a unique compound index.

We can create one on cuisine and borough from the restaurants collection, although we will very likely get a duplicate key error, because two restaurants in the same borough often serve the same cuisine.
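A sketch of that attempt; as the text notes, it is expected to fail with a duplicate key error on this dataset.

# Unique compound index on (cuisine, borough); likely fails with a duplicate key error.
db.restaurants.create_index(
    [("cuisine", pymongo.ASCENDING), ("borough", pymongo.ASCENDING)],
    unique=True,
)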

Similarly, unique multikey indexes can also be created.

We get the same duplicate key error because there are already duplicate values present for our index field.

Alright, now let’s look at partial indexes.

Partial Indexes

Sometimes we might not want to index all the documents in the collection, but just a subset of documents. That is where Partial indexes come into the picture.

Partial indexes only index those documents in the collection that meet a specified filter expression. This lowers the storage requirement and reduces the overhead cost involved in the creation and maintenance of the index.

Suppose we only want to query documents based on the cuisine served in those restaurants that have a review with a 'grades.score' greater than 5. In that case, instead of having the index cover every document, we can create a partial index.

Let’s first have a look at the execution statistics for the query before creating the index.

To use the partial index property, we need to use the partialFilterExpression argument of the create_index() and provide the filter condition that will specify the document subset.

We will create an index for the cuisine field and since we will be querying restaurants with a score of greater than 5, we can provide that as our filter condition.
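A sketch of creating such a partial index with PyMongo:

# Partial index on "cuisine", limited to documents with a review score above 5.
db.restaurants.create_index(
    [("cuisine", pymongo.ASCENDING)],
    partialFilterExpression={"grades.score": {"$gt": 5}},
)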

The following query uses the index since the query predicate includes the condition grades.score: { $gt: 7 } that matches a subset of documents matched by the index filter expression grades.score: { $gt: 5 }:

It took only 1 millisecond to retrieve the documents.

The following query does not use the index because its predicate uses grades.score: { $gt: 2 } while the index has the filter grades.score: { $gt: 5 }. That is, the query would match more documents than are indexed; therefore, it cannot use the index we created.

View the code on Gist.

As you can see, the query uses the COLLSCAN instead of an index to retrieve the documents.

The following query also cannot use the partial index because the query predicate does not include the filter expression and using the index would return an incomplete result set.

As you can notice, COLLSCAN was used and all the documents were checked for the query.

End Notes

I hope this article gave you a good idea about how important indexing is in databases and how MongoDB just makes it really easy to create indexes for every query pattern possible.

An important point to remember while creating indexes is that you should create them according to your queries. Also, make sure the indexes themselves are not so large that they cannot fit in RAM; this ensures the database does not have to read indexes from disk, which would hurt performance. Finally, create your indexes wisely so that they enable the database to scan the minimum number of documents.

In case you want to learn more about querying data, I recommend the following course – Structured Query Language (SQL) for Data Science.

I encourage you to try out creating indexes on your MongoDB collections and witness the magic yourself.

Time Series Analysis And Forecasting

Introduction

Time series data analysis is a way of studying the characteristics of a response variable with respect to time as the independent variable. When estimating the target variable for prediction or forecasting, the time variable is the point of reference. A time series is a sequence of time-ordered observations, indexed by years, months, weeks, days, hours, minutes, or seconds, taken at successive, discrete intervals.

The time variable/feature is the independent variable and supports the prediction of the target variable. Time Series Analysis (TSA) is used in many fields for time-based predictions, such as weather forecasting, stock market prediction, signal processing, and engineering domains like control systems and communication systems. Since TSA involves handling information in a particular sequence, it is distinct from spatial and other analyses. We can predict the future using AR, MA, ARMA, and ARIMA models.

Learning Objectives

We will discuss in detail TSA Objectives, Assumptions, and Components (stationary and non-stationary).

We will look at the TSA algorithms.

Finally, we will look at specific use cases in Python.

This article was published as a part of the Data Science Blogathon.

Table of Contents

What Is Time Series Analysis?

Definition: There are many definitions of TSA, but let's keep it simple.

A time series is nothing but a sequence of data points that occur in successive order over a given period of time.

Objectives of Time Series Analysis:

To understand how time series works and what factors affect a certain variable(s) at different points in time.

Time series analysis will provide the consequences and insights of the given dataset’s features that change over time.

To support predicting the future values of the time series variable.

Assumptions: There is only one key assumption in TSA, stationarity, which means that shifting the origin of time does not affect the statistical properties of the process.

How to Analyze Time Series?

To perform the time series analysis, we have to follow the following steps:

Collecting the data and cleaning it

Preparing visualizations of the key feature with respect to time

Observing the stationarity of the series

Developing charts to understand its nature.

Model building – AR, MA, ARMA and ARIMA

Extracting insights from prediction

Significance of Time Series

TSA is the backbone for prediction and forecasting analysis, specific to time-based problem statements.

Analyzing the historical dataset and its patterns

Understanding and matching the current situation with patterns derived from the previous stage.

Understanding the factor or factors influencing certain variable(s) in different periods.

With the help of “Time Series,” we can prepare numerous time-based analyses and results.

Forecasting: Predicting any value for the future.

Segmentation: Grouping similar items together.

Classification: Classifying a set of items into given classes.

Descriptive analysis: Analysis of a given dataset to find out what is there in it.

Intervention analysis: Effect of changing a given variable on the outcome.

Components of Time Series Analysis

Let’s look at the various components of Time Series Analysis-

Trend: Movement with no fixed interval; any divergence within the given dataset over a continuous timeline. A trend can be negative, positive, or null.

Seasonality: Regular shifts at fixed intervals within the dataset over a continuous timeline. Its shape may resemble a bell curve or a sawtooth.

Cyclical: Movement with no fixed interval and with uncertainty in its pattern.

Irregularity: Unexpected situations/events/scenarios and spikes in a short time span.

What Are the limitations of Time Series Analysis?

Time series has the below-mentioned limitations; we have to take care of those during our data analysis.

Like many other models, TSA does not support missing values.

The data points must be linear in their relationship.

Data transformations are mandatory, so they are a little expensive.

Models mostly work on Uni-variate data.

Data Types of Time Series

Let’s discuss the time series’ data types and their influence. While discussing TS data types, there are two major types – stationary and non-stationary.

Stationary: A dataset should follow the thumb rules below and be free of the trend, seasonality, cyclical, and irregularity components of the time series.

The mean value of them should be completely constant in the data during the analysis.

The variance should be constant with respect to the time-frame

Covariance measures the relationship between two variables.

Non-Stationary: If the mean, variance, or covariance changes with respect to time, the dataset is called non-stationary. A quick visual check is to plot rolling statistics, as in the sketch below.
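A minimal sketch of that visual check, assuming a pandas Series named series holding the observations (the name and the 12-point window are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

def plot_rolling_stats(series: pd.Series, window: int = 12):
    # a roughly flat rolling mean and rolling std suggests a stationary series
    stats = pd.DataFrame({
        'original': series,
        'rolling mean': series.rolling(window).mean(),
        'rolling std': series.rolling(window).std(),
    })
    stats.plot(figsize=(12, 6))
    plt.show()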

Methods to Check Stationarity

During the TSA model preparation workflow, we must assess whether the dataset is stationary or not. This is done using Statistical Tests. There are two tests available to test if the dataset is stationary:

Augmented Dickey-Fuller (ADF) Test

Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test

Augmented Dickey-Fuller (ADF) Test or Unit Root Test

The ADF test is the most popular statistical test for stationarity. It is carried out with the following hypotheses:

Null Hypothesis (H0): Series is non-stationary

Alternate Hypothesis (HA): Series is stationary

If the p-value <= 0.05, we reject H0 and accept the alternate hypothesis: the series is stationary.

Kwiatkowski–Phillips–Schmidt–Shin (KPSS) Test

The KPSS test reverses the hypotheses: its null hypothesis (H0) is that the time series is stationary around a deterministic trend, against the alternative of a unit root. Since TSA requires stationary data for further analysis, we have to ensure that the dataset is stationary before modeling.
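Both tests are available in statsmodels; here is a minimal sketch of running them, again assuming a pandas Series named series (an illustrative name, not from the original):

from statsmodels.tsa.stattools import adfuller, kpss

# ADF: H0 = series has a unit root (non-stationary); a small p-value rejects H0
adf_stat, adf_p, *_ = adfuller(series)
print('ADF statistic: %.3f, p-value: %.3f' % (adf_stat, adf_p))

# KPSS: H0 = series is stationary; a small p-value rejects stationarity
kpss_stat, kpss_p, *_ = kpss(series, regression='c', nlags='auto')
print('KPSS statistic: %.3f, p-value: %.3f' % (kpss_stat, kpss_p))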

Converting Non-Stationary Into Stationary

Let’s discuss quickly how to convert non-stationary to stationary for effective time series modeling. There are three methods available for this conversion – detrending, differencing, and transformation.

Detrending

It involves removing the trend effects from the given dataset and showing only the deviations of the values from that trend. This makes cyclical patterns easier to identify.
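One rough way to detrend, sketched below, is to subtract a fitted straight line with scipy (series is again a hypothetical pandas Series without missing values):

import pandas as pd
from scipy.signal import detrend

# remove the best-fit linear trend; what remains highlights seasonal and cyclical swings
detrended = pd.Series(detrend(series.values), index=series.index)
detrended.plot(figsize=(12, 6))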

Differencing

This is a simple transformation of the series into a new time series, which we use to remove the series dependence on time and stabilize the mean of the time series, so trend and seasonality are reduced during this transformation.

Y't = Yt - Yt-1

where Yt is the value at time t and Yt-1 is the value at the previous time step.
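In pandas this is a one-liner; the sketch below assumes a Series named series (illustrative name):

# first-order differencing: Y't = Yt - Yt-1
diff_series = series.diff().dropna()

# seasonal differencing, e.g. a lag of 12 for monthly data with a yearly cycle
seasonal_diff = series.diff(12).dropna()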

Transformation

This includes three different methods: power transform, square root, and log transform. The most commonly used is the log transform.
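For instance, a log transform followed by differencing often stabilizes both the variance and the mean (a sketch; it assumes strictly positive values in a Series named series):

import numpy as np

log_series = np.log(series)            # dampens large values, stabilizing the variance
log_diff = log_series.diff().dropna()  # differencing the logs then stabilizes the mean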

Moving Average Methodology

The Moving Average (MA), or rolling mean, is calculated by averaging the data of the time series over a window of k periods.

Let’s see the types of moving averages:

Simple Moving Average (SMA),

Cumulative Moving Average (CMA)

Exponential Moving Average (EMA)

Simple Moving Average (SMA)

The SMA is the unweighted mean of the previous M or N points. The size of the sliding window controls the amount of smoothing: increasing M or N improves the smoothing at the expense of accuracy.

To understand better, I will use the air temperature dataset.

import pandas as pd
from matplotlib import pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

df_temperature = pd.read_csv('temperature_TSA.csv', encoding='utf-8')
df_temperature.head()
df_temperature.info()

# set the year column ('Any') as the index
df_temperature.set_index('Any', inplace=True)
df_temperature.index.name = 'year'

# yearly average air temperature - calculation
df_temperature['average_temperature'] = df_temperature.mean(axis=1)

# drop unwanted columns and reset the dataframe
df_temperature = df_temperature[['average_temperature']]
df_temperature.head()

# SMA over periods of 10 and 20 years
df_temperature['SMA_10'] = df_temperature.average_temperature.rolling(10, min_periods=1).mean()
df_temperature['SMA_20'] = df_temperature.average_temperature.rolling(20, min_periods=1).mean()

# green = average air temperature, red = 10-year SMA, orange = 20-year SMA
colors = ['green', 'red', 'orange']

# line plot
df_temperature.plot(color=colors, linewidth=3, figsize=(12,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Average air temperature', '10-years SMA', '20-years SMA'], fontsize=14)
plt.title('The yearly average air temperature in city', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Temperature [°C]', fontsize=16)

Cumulative Moving Average (CMA)

The CMA is the unweighted mean of past values till the current time.

# CMA of the air temperature
df_temperature['CMA'] = df_temperature.average_temperature.expanding().mean()

# green - Avg Air Temp and orange - CMA
colors = ['green', 'orange']

# line plot
df_temperature[['average_temperature', 'CMA']].plot(color=colors, linewidth=3, figsize=(12,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Average Air Temperature', 'CMA'], fontsize=14)
plt.title('The yearly average air temperature in city', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Temperature [°C]', fontsize=16)

Exponential Moving Average (EMA)

EMA is mainly used to identify trends and filter out noise. The weight of older observations is decreased gradually over time, which means it gives more weight to recent data points than to historical ones. Compared with the SMA, the EMA is faster to change and more sensitive.

The smoothing factor α has a value between 0 and 1.

It represents the weighting applied to the most recent period.

Let’s apply the exponential moving averages with a smoothing factor of 0.1 and 0.3 in the given dataset.

# EMA of the air temperature
# smoothing factor - 0.1
df_temperature['EMA_0.1'] = df_temperature.average_temperature.ewm(alpha=0.1, adjust=False).mean()
# smoothing factor - 0.3
df_temperature['EMA_0.3'] = df_temperature.average_temperature.ewm(alpha=0.3, adjust=False).mean()

# green - Avg Air Temp, red - smoothing factor 0.1, yellow - smoothing factor 0.3
colors = ['green', 'red', 'yellow']

df_temperature[['average_temperature', 'EMA_0.1', 'EMA_0.3']].plot(color=colors, linewidth=3, figsize=(12,6), alpha=0.8)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Average air temperature', 'EMA - alpha=0.1', 'EMA - alpha=0.3'], fontsize=14)
plt.title('The yearly average air temperature in city', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Temperature [°C]', fontsize=16)

Time Series Analysis in Data Science and Machine Learning

When dealing with TSA in Data Science and Machine Learning, multiple model options are available, among them the Autoregressive Moving Average family of models specified with [p, d, q]:

p == autoregressive lags

d == order of differencing

q == moving average lags

Before we get to know ARIMA, you should first understand the terms below.

Auto-Correlation Function (ACF)

Partial Auto-Correlation Function (PACF)

Auto-Correlation Function (ACF)

ACF indicates how similar a value is to the values that came before it within a given time series. In other words, it measures the degree of similarity between a given time series and a lagged version of that time series at the various intervals we observe.

The Python statsmodels library calculates autocorrelation. It is used to identify trends in the given dataset and the influence of previously observed values on the currently observed values.

Partial Auto-Correlation (PACF)

PACF is similar to the ACF but a little more challenging to understand. It shows the correlation of the sequence with itself at a given number of lags, but keeps only the direct effect, removing all intermediary effects from the given time series.

Auto-Correlation and Partial Auto-Correlation

plot_acf(df_temperature)
plt.show()

plot_acf(df_temperature, lags=30)
plt.show()
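The partial autocorrelations can be plotted the same way; the call below is not in the original listing but uses plot_pacf from the same statsmodels module, applied here to the average-temperature column:

from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(df_temperature['average_temperature'], lags=30)
plt.show()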

Observation: The previous temperature influences the current temperature, but the plot shows that the significance of this influence decays as the lag grows, rising again slightly at regular intervals.

Types of Auto-Correlation

Interpret ACF and PACF plots

ACF declines gradually, PACF drops instantly: Auto-Regressive (AR) model.

ACF drops instantly, PACF declines gradually: Moving Average (MA) model.

ACF declines gradually, PACF declines gradually: ARMA model.

ACF drops instantly, PACF drops instantly: you would not fit any of these models.

Remember that both ACF and PACF require stationary time series for analysis.

Now, we will learn about the Auto-Regressive model.

What Is an Auto-Regressive Model?

An auto-regressive model is a simple model that predicts future performance based on past performance. It is mainly used for forecasting when there is some correlation between values in a given time series and the values that precede and succeed (back and forth).

An AR model is a linear regression model that uses lagged variables as input. Such a regression can be built with the scikit-learn library by constructing the lagged inputs yourself, but the statsmodels library provides autoregression-specific functions where you only have to specify an appropriate lag value and train the model. Its AutoReg class gets the results in a few simple steps:

Create the model with AutoReg().

Call fit() to train it on our dataset.

fit() returns an AutoRegResults object.

Once fit, make a prediction by calling the predict() function.

The equation for the AR model (compare it with the straight line Y = mX + c):

Yt = C + b1*Yt-1 + b2*Yt-2 + … + bp*Yt-p + Ert

Key Parameters

p = the number of past values (lags) used

Yt = the value at time t, expressed as a function of the past values

Ert = the error term at time t

C = intercept

Let’s check whether the given dataset or time series is random or not:

from matplotlib import pyplot
from pandas.plotting import lag_plot

lag_plot(df_temperature)
pyplot.show()

Observation: Yes, it looks random and scattered.

Implementation of Auto-Regressive Model

# import libraries
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
from math import sqrt

# load csv as dataset
# series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)

# split dataset into train and test sets
X = df_temperature.values
train, test = X[1:len(X)-7], X[len(X)-7:]

# train the autoregression model
model = AutoReg(train, lags=20)
model_fit = model.fit()
print('Coefficients: %s' % model_fit.params)

# predictions
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)

# plot results
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()

Output:

predicted=15.893972, expected=16.275000
predicted=15.917959, expected=16.600000
predicted=15.812741, expected=16.475000
predicted=15.787555, expected=16.375000
predicted=16.023780, expected=16.283333
predicted=15.940271, expected=16.525000
predicted=15.831538, expected=16.758333
Test RMSE: 0.617

Observation: Expected (blue) against predicted (red). The forecast tracks the expected values well around the 4th day, with the largest deviation on the 6th day.

Implementation of Moving Average (Weights – Simple Moving Average)

import numpy as np

alpha = 0.3
n = 10

# weights - simple moving average
w_sma = np.repeat(1/n, n)

colors = ['green', 'yellow']

# weights - exponential moving average, alpha=0.3, adjust=False
w_ema = [(1-alpha)**i if i == n-1 else alpha*(1-alpha)**i for i in range(n)]

pd.DataFrame({'w_sma': w_sma, 'w_ema': w_ema}).plot(color=colors, kind='bar', figsize=(8,5))
plt.xticks([])
plt.yticks(fontsize=10)
plt.legend(labels=['Simple moving average', 'Exponential moving average (α=0.3)'], fontsize=10)

# title and labels
plt.title('Moving Average Weights', fontsize=10)
plt.ylabel('Weights', fontsize=10)

Understanding ARMA and ARIMA

ARMA is a combination of the Auto-Regressive and Moving Average models for forecasting. This model provides a weakly stationary stochastic process in terms of two polynomials, one for the Auto-Regressive and the second for the Moving Average.

ARMA is best for predicting stationary series. ARIMA was thus developed to support both stationary as well as non-stationary series.

AR + I + MA = ARIMA, where the I (integrated) part refers to the differencing step.

Understand the signature of ARIMA: ARIMA(p, d, q), where p is the AR order, d the degree of differencing, and q the MA order.

Implementation Steps for ARIMA

Step 1: Plot a time series format

Step 2: Difference to make stationary on mean by removing the trend

Step 3: Make stationary by applying log transform.

Step 4: Difference log transform to make as stationary on both statistic mean and variance

Step 5: Plot ACF & PACF, and identify the potential AR and MA model

Step 6: Discovery of best fit ARIMA model

Step 7: Forecast/Predict the value using the best fit ARIMA model

Step 8: Plot ACF & PACF for residuals of the ARIMA model, and ensure no more information is left.

Implementation of ARIMA in Python

We have already discussed steps 1-5 which will remain the same; let’s focus on the rest here.

from statsmodels.tsa.arima_model import ARIMA
# note: recent statsmodels releases removed this module;
# the replacement import is statsmodels.tsa.arima.model.ARIMA

model = ARIMA(df_temperature, order=(0, 1, 1))
results_ARIMA = model.fit()
results_ARIMA.summary()

results_ARIMA.forecast(3)[0]

Output:

array([16.47648941, 16.48621826, 16.49594711])

results_ARIMA.plot_predict(start=200)
plt.show()

Process Flow (Re-Gap)

In recent years, the use of Deep Learning for Time Series Analysis and Forecasting has increased to resolve problem statements that couldn’t be handled using Machine Learning techniques. Let’s discuss this briefly.

The Recurrent Neural Network (RNN) is the most traditional and widely accepted architecture for time-series forecasting problems.

RNN is organized into successive layers and divided into

Input

Hidden

Output

Each layer shares the same weights, and every neuron is assigned to a fixed time step. Remember that the input and output layers are fully connected to the hidden layer at the same time step, and that the hidden states are carried forward in a time-dependent direction.

Components of RNN

Input: The function vector of x(t)​ is the input at time step t.

Hidden:

The function vector h(t)​ is the hidden state at time t,

This is a kind of memory of the established network;

It is calculated based on the current input x(t) and the previous time step’s hidden state h(t-1).

Output: The function vector y(t) ​is the output at time step t.

Weights: In an RNN, the input vector is connected to the hidden-layer neurons at time t by a weight matrix U.

The hidden-layer neurons of consecutive time steps (t-1, t, t+1) are connected to each other by a weight matrix W, and the hidden layer is connected to the output vector y(t) at time t by a weight matrix V. All the weight matrices U, W, and V are shared (constant) across time steps.
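As a concrete illustration (not part of the original article; the class name, hidden size, and window length are illustrative), a minimal RNN forecaster in PyTorch could look like this, with U and W living inside nn.RNN and V playing the role of the final linear layer:

import torch
import torch.nn as nn

class RNNForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=32, output_size=1):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)  # holds U and W
        self.fc = nn.Linear(hidden_size, output_size)                 # plays the role of V

    def forward(self, x):
        # x has shape (batch, time_steps, input_size)
        out, h_n = self.rnn(x)          # hidden states h(t) for every time step
        return self.fc(out[:, -1, :])   # predict the next value from the last hidden state

# hypothetical usage: forecast the next temperature from the previous 20 readings
model = RNNForecaster()
window = torch.randn(8, 20, 1)          # a batch of 8 sequences, 20 time steps each
next_value = model(window)              # shape: (8, 1)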

Conclusion

A time series is constructed from data measured over time at evenly spaced intervals. I hope this comprehensive guide has helped you understand time series, its flow, and how it works. Although TSA is widely used to handle data science problems, it has certain limitations, such as not supporting missing values. Note that the data points must have a linear relationship for Time Series Analysis to be performed.

Key Takeaways

Time series is a sequence of various data points that occurred in a successive order for a given period of time.

Trend, Seasonality, Cyclical, and Irregularity are components of TSA.

Components of Time Series Analysis

We already know that an arrangement of data points in chronological order of occurrence is a time series, and that time series analysis studies the relationship between two variables, one of which is time and the other a quantitative variable. There are varied uses of time series, which we will glance at before covering the components, so that it becomes evident how the components help solve a time series analysis.

Time series analysis is performed to predict the future behavior of a quantitative variable on the basis of its past behavior. For example, umbrellas sell mostly during the rainy season, although they still sell in other periods. So, to predict future behavior, we would expect more umbrellas to be sold during the rainy seasons.

When evaluating the performance of a business against the expected or planned targets, time series analysis helps a great deal in taking informed decisions to improve it.

Time series also enables business analysts to compare changes in different values at different times or places.

1. Long term movements or Trend

This component looks at the movement of an attribute over a long-term window and mostly tries to capture the increase or decrease of the quantitative value attached to the behavior. It is more like the average tendency of the parameter being measured. The tendency observed can be increasing, decreasing, or stable over different sections of the time period, and on this basis we can classify the trend as linear or non-linear. In a linear trend, the value continuously increases or continuously decreases, whereas in a non-linear trend we can segment the time period into different frames and describe the trend within each. There are many ways to include non-linear trends in the analysis: we can either take a higher order of the variable in hand, which is realistically non-interpretable, or, as a better approach, use a piecewise specification of the function, where each piece has a linear form and collectively they make a non-linear trend at the overall level.

2. Short term movements

In contrast to long-term movements, this component looks at a shorter period of time to capture the behavior of the quantitative variable within that time frame. This movement sometimes repeats itself over a certain period or even in a regular spasmodic manner. Movements over a shorter time frame give rise to two sub-components, namely:

Seasonality: These are variations seen in the variable under study due to forces that span less than a year. These movements are mainly present in data recorded at short intervals, such as daily, weekly, or monthly. The umbrella example above, where sales rise during the rainy season, is a case of seasonality, as is the sale of air conditioners during the summer. Some man-made conventions, such as festivals and occasions, also affect seasonality.

Cyclic Variations: Although this component is grouped under short-term movements, it is longer than seasonality: the span over which similar variations are seen is more than a year. The completion of all the steps in that movement is crucial to call the variation cyclic; we sometimes refer to it as a business cycle. For example, the product life cycle is a case of cyclic variation, where a product goes through the stages of introduction, growth, maturity, and decline, and just before the product falls below a threshold of decline, we look to re-launch it with newer features.

3. Irregular variation or Random variations

Irregular or random variations are the unpredictable, residual fluctuations that remain once the trend, seasonal, and cyclic movements have been accounted for. So, the trend, seasonality, cyclic variations, and these residuals together constitute the time series, and the components may combine in an additive model or a multiplicative model, depending on the use case.
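A sketch of splitting a series into these components with statsmodels under either model (series is a hypothetical pandas Series with a regular frequency; the period of 12 assumes monthly data with a yearly cycle):

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# additive model: series = trend + seasonal + residual
result_add = seasonal_decompose(series, model='additive', period=12)

# multiplicative model: series = trend * seasonal * residual (requires strictly positive values)
result_mul = seasonal_decompose(series, model='multiplicative', period=12)

result_add.plot()
plt.show()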

Conclusion

With this, we come to the end of the components of time series analysis section, in which we looked at each of these components in detail and got familiar with them before moving on to their usage in a time series analysis.

