Topic Modeling Using Latent Dirichlet Allocation (LDA)


Introduction

The internet is a wealth of knowledge and information, which may confuse readers and make them use more time and energy looking for accurate information about particular areas of interest. To recognize and analyze content in online social networks (OSNs), there is a need for more effective techniques and tools, especially for those who employ user-generated content (UGC) as a source of data.

In NLP(Natural Language Processing), Topic Modeling identifies and extracts abstract topics from large collections of text documents. It uses algorithms such as Latent Dirichlet Allocation (LDA) to identify latent topics in the text and represent documents as a mixture of these topics. Some uses of topic modeling include:

Text classification and document organization

Recommendation systems to suggest similar content

News categorization and information retrieval systems

Customer service and support to categorize customer inquiries.

Latent Dirichlet Allocation is a statistical and visual concept used to find connections between documents in a corpus. The Variational Expectation-Maximization (VEM) technique is used to obtain the maximum likelihood estimate from the full corpus of text.

Learning Objectives

This project aims to perform topic modeling on a dataset of news headlines to show the topics that stand out and uncover patterns and trends in the news.

The second objective of this project will be to have a  visual representation of the dominant topics, which news aggregators, journalists, and individuals can use to gain a broad understanding of the current news landscape quickly.

Understanding the topic modeling pipeline and being able to implement it.

This article was published as a part of the Data Science Blogathon.

Important Libraries in the Topic Modeling Project

In a topic modeling project, knowledge of the following libraries plays important roles:

Gensim: It is a library for unsupervised topic modeling and document indexing. It provides efficient algorithms for modeling latent topics in large-scale text collections, such as those generated by search engines or online platforms.

NLTK: The Natural Language Toolkit (NLTK) is a library for working with human language data. It provides tools for tokenizing, stemming, and lemmatizing text and for performing part-of-speech tagging, named entity recognition, and sentiment analysis.

Matplotlib: It is a plotting library for Python. It is used for visualizing the results of topic models, such as the distribution of topics over documents or the relationships between words and topics.

Scikit-learn: It is a library for machine learning in Python. It provides a wide range of algorithms for modeling topics, including Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and others.

Pandas: It is a library for data analysis in Python. It provides data structures and functions for working with structured data, such as the results of topic models, in a convenient and efficient manner.

Dataset Description of the Topic Modeling Project

The dataset used is from Kaggle’s A Million News Headlines. The data contains 1.2 million rows and 2 columns, namely “publish_date” and “headline_text”. The “headline_text” column contains news headlines, and the “publish_date” column contains the date the headline was published.

Step 1: Importing Necessary Dependencies

The code below imports the libraries(listed in the introduction section above) needed for our project.

import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.matutils import corpus2csc
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

Step 2: Importing and Reading Dataset

#loading the file from its local path into a dataframe
df=pd.read_csv(r"path\abcnews-date-text.csv\abcnews-date-text.csv")
df




Output:

Step 3: Data Preprocessing

The code below randomly samples 100,000 rows from the dataset, keeps only the “headline_text” column, and names the resulting series ‘data.’

data = df.sample(n=100000, axis=0) #randomly sample 100,000 rows to use in our dataset
data = data['headline_text'] #extract the headline_text column and give it the variable name data

Next, we perform lemmatization and removal of stop-words from the data.

Lemmatization reduces words to their base root, reducing the dimensionality and complexity of the textual data. We assign WordNetLemmatizer() to the variable lemmatizer. This is important to improve the algorithm’s performance and helps the algorithm focus on the meaning of the words rather than the surface form.

Stop-words are common words like “the” and “a” that often appear in text data but do not carry lots of meaning. Removing them helps reduce the data’s complexity, speeds up the algorithm, and makes it easier to find meaningful patterns.

# lemmatization and removing stopwords
#downloading dependencies
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

#function to lemmatize and remove stopwords from the text data
def preprocess(text):
    text = text.lower()
    words = word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return words

#applying the function to the dataset
data = data.apply(preprocess)
data

Output:

Step 4: Training the Model

The number of topics is set to 5 (which can be set to as many topics as one wants to extract from the data), the number of passes is 20, and the alpha and eta are set to “auto.” This lets the model estimate the appropriate values. You can experiment with different parameters to see the impact on results.

The code below processes the data to remove words that appear in fewer than 5 documents and those that appear in more than 50% of the documents. This ensures that the model does not include words that appear too rarely or too frequently. For example, news headlines in a country will mention that country a lot, which would reduce the effectiveness of our model. Then we create a bag-of-words corpus from the filtered dictionary, select the number of topics, train the LDA model, get the topics from the model using show_topics, and print them.

# Create a dictionary from the preprocessed data
dictionary = Dictionary(data)

# Filter out words that appear in fewer than 5 documents or more than 50% of the documents
dictionary.filter_extremes(no_below=5, no_above=0.5)
bow_corpus = [dictionary.doc2bow(text) for text in data]

# Train the LDA model
num_topics = 5
ldamodel = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=20, alpha='auto', eta='auto')

# Get the topics
topics = ldamodel.show_topics(num_topics=num_topics, num_words=10, log=False, formatted=False)

# Print the topics
for topic_id, topic in topics:
    print("Topic: {}".format(topic_id))
    print("Words: {}".format([word for word, _ in topic]))

Output:
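Because each document is modeled as a mixture of topics (as noted in the introduction), it can also be useful to look at that mixture for a single headline. Here is a minimal sketch, not part of the original article, that reuses the bow_corpus and ldamodel objects created above; get_document_topics is the Gensim call that returns the per-document topic distribution.

# Inspect the topic mixture of a single preprocessed headline
doc_bow = bow_corpus[0]  # bag-of-words vector of the first headline
doc_topics = ldamodel.get_document_topics(doc_bow, minimum_probability=0.0)
for topic_id, proportion in sorted(doc_topics, key=lambda x: -x[1]):
    print("Topic {}: {:.3f}".format(topic_id, proportion))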

Step 5: Plotting a Word Cloud for the Topics

Word cloud is a data visualization tool used to visualize the most frequently occurring words in a large amount of text data and can be useful in understanding the topics present in data. It’s important in text data analysis, and it provides valuable insights into the structure and content of the data.

Word cloud is a simple but effective way of visualizing the content of large amounts of text data. It displays the most frequent words in a graphical format, allowing the user to easily identify the key topics and themes present in the data. The size of each word in the word cloud represents its frequency of occurrence so that the largest words in the cloud correspond to the most commonly occurring words in the data.

This visualization tool can be a valuable asset in text data analysis, providing an easy-to-understand representation of the data’s content. For example, a word cloud can be used to quickly identify the dominant topics in a large corpus of news articles, customer reviews, or social media posts. This information can then guide further analysis, such as sentiment analysis or topic modeling, or inform decision-making, such as product development or marketing strategy.

The code below plots word clouds using topic words from the topic id using matplotlib.

# Plotting a wordcloud of the topics
for topic_id, topic in enumerate(ldamodel.print_topics(num_topics=num_topics, num_words=20)):
    topic_words = " ".join([word.split("*")[1].strip().strip('"') for word in topic[1].split(" + ")])
    wordcloud = WordCloud(width=800, height=800, random_state=21, max_font_size=110).generate(topic_words)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title("Topic: {}".format(topic_id))
    plt.show()

Output:

Topic 0 and 1

Topic 2, 3 and 4

Conclusion

Topic modeling is a powerful tool for analyzing and understanding large collections of text data. By discovering latent topics and the relationships between words and documents, it can help uncover hidden patterns and trends and provide valuable insights into the underlying structure of text data.

The combination of powerful libraries such as Gensim, NLTK, Matplotlib, scikit-learn, and Pandas makes it easier to perform topic modeling and gain insights from text data. As the amount of text data generated by individuals, organizations, and society continues to grow, the importance of topic modeling and its role in data analysis and understanding is only set to increase.

The code can be found in my GitHub repository.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


Foundations of Real Estate Financial Modeling

Overview of key concepts

Written by

Tim Vipond

Published April 2, 2023

Updated July 7, 2023

The Foundations of Real Estate Financial Modeling

This guide will outline the foundations of real estate financial modeling and the key concepts you need to get started building your own models for development projects.

In order to get started, we will begin by defining some of the key terms you’ll need to know before building your model.

Real Estate Terms and Definitions

LTV – “loan to value” – the amount of debt financing a lender will provide as a percent of the market value (e.g., 80%)

LTC – “loan to cost”  – the amount of debt financing a lender will provide as a percent of the cost of a development (e.g., 70%)

NOI – “net operating income” – gross rental revenue less operating expenses (property taxes, insurance, repairs & maintenance, capital expenditures, etc.)

Cap Rate – net operating income divided by the value of the property, expressed as a percentage (e.g., 4.5%)

Amortization period – the number of months/years the principal repayments of a loan are spread out over. The total length of time it will take you to pay off your mortgage (e.g., 30 years).

Structures and Joint Ventures (JVs)

Most developments are structured as a joint venture between General Partners (GPs) and Limited Partners (LPs).

Key points about GPs:

Responsible for all management decisions

Fiduciary duty to act for the benefit of the limited partners

Fully liable for its actions

May have guarantees as security on borrowing

Key points about LPs:

Limited refers to “limited liability”

Have priority on liquidation, ahead of the GPs

Provide capital to fund the development project

Have no control over the management of the fund/project

Assumptions Section of the Financial Model

As covered in CFI’s real estate financial modeling course, the key assumptions that will be input into the model include:

Schedule

Property Stats

Development Costs

Purchase and Sale

These are discussed in great detail in our actual course.

Development Cash Flow Model

To set the foundations of real estate financial modeling, it is important to cover the key sections that will be built based on project assumptions.

The key sections in the development model include:

Absorption (timing and pace of sales)

Revenue

Commissions

Warranty

Land acquisition (capital cost)

Pre-construction costs

Construction costs

Financing and interest expense

Levered Free Cash Flow

Output and Pro Forma

Once the model is built, it’s important to create a one-page summary document or Pro Forma that can be shared with bankers, investors, partners, and anyone else who needs to analyze the deal.

This output pro forma should include the following information:

Property stats

Schedule – summary key dates

Financing assumptions

Sales assumptions ($ total / per unit / per SF)

Budget ($ total / per unit / per SF)

Returns (IRRs)

Return on cost

Return on sales

Sensitivity analysis

Example of a Real Estate Financial Model

This is an example of the one-page output from our real estate financial modeling course. As you can see, it clearly displays all the information listed above and makes it easy for someone to evaluate the deal.

Here is an example of the actual inner workings of the model, where you can see absorption by month for the development project, which builds up to revenue and, ultimately, cash flow.

These foundations of real estate financial modeling are covered along with much more detail in our online course.

Cap Rate and Net Operating Income (NOI) Example

Net operating income, which is equal to gross rental revenue less operating expenses (property taxes, insurance, repairs & maintenance, capital expenditures), is the key profitability or cash flow measure used to evaluate real estate development transactions.

Cap rate, which is equal to net operating income divided by the value of the property, is expressed as a percentage and used to value real estate. The lower the cap rate, the more highly valued a piece of real estate is, and the higher the cap rate, the less valued the real estate is. Price and cap rate move in inverse directions to each other, just like a bond.  Learn more in our financial math course.
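To make this relationship concrete, here is a small illustrative calculation in Python (the numbers are hypothetical and not taken from the course) showing how the value implied by a given NOI rises as the cap rate falls:

# Hypothetical example: property value implied by NOI at different cap rates
gross_rental_revenue = 500_000    # annual gross rental revenue
operating_expenses = 200_000      # taxes, insurance, repairs & maintenance, etc.

noi = gross_rental_revenue - operating_expenses   # net operating income = 300,000

for cap_rate in (0.045, 0.055, 0.065):
    implied_value = noi / cap_rate                # value = NOI / cap rate
    print(f"Cap rate {cap_rate:.1%}: implied value {implied_value:,.0f}")

# The lower cap rate (4.5%) implies a higher value (about 6.67M) than the higher cap rate (6.5%, about 4.62M).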

Financial Modeling Course

The best way to learn is by doing, and CFI’s real estate financial modeling course gives you the step-by-step instruction you need to build financial models on your own.  It comes with both a blank template and a completed version, so you can easily build a model on your own or just go straight to the completed version. The high-quality video instruction will guide you every step of the way as you work through a case study for a townhouse real estate development project.

Additional Resources

Thank you for reading CFI’s guide to Foundations of Real Estate Financial Modeling. To learn more about valuation, corporate finance, financial modeling, and more, we highly recommend these additional free CFI resources:

A Complete Tutorial On Time Series Modeling In R


Introduction

‘Time’ is the most important factor that ensures success in a business. It’s difficult to keep up with the pace of time. But technology has developed some powerful methods with which we can ‘see things’ ahead of time. Don’t worry, I am not talking about a Time Machine. Let’s be realistic here!

I’m talking about the methods of prediction and forecasting. One such method, which deals with time-based data, is Time Series Modeling. As the name suggests, it involves working with time-based data (years, days, hours, minutes) to derive hidden insights that support informed decision-making.

Time series models are very useful when you have serially correlated data. Most businesses work with time series data to analyze sales numbers for the next year, website traffic, competitive position, and much more. However, it is also one of the areas many analysts do not understand well.

So, if you aren’t sure about the complete process of time series modeling, this guide will introduce you to the various levels of time series modeling and its related techniques.

What Is Time Series Modeling?

Let’s begin with the basics: stationary series, random walks, the Rho coefficient, and the Dickey-Fuller test of stationarity. If these terms are already scaring you, don’t worry – they will become clear in a bit, and I bet you will start enjoying the subject as I explain it.

Stationary Series

There are three basic criteria for a series to be classified as a stationary series:

1. The mean of the series should not be a function of time; rather, it should be constant. The image below has the left-hand graph satisfying the condition, whereas the graph in red has a time-dependent mean.

2. The variance of the series should not be a function of time. This property is known as homoscedasticity. The following graph depicts what is and what is not a stationary series. (Notice the varying spread of the distribution in the right-hand graph.)

3. The covariance of the i-th term and the (i + m)-th term should not be a function of time. In the following graph, you will notice the spread becomes narrower as time increases. Hence, the covariance is not constant with time for the ‘red series’.

Why do I care about ‘stationarity’ of a time series?

The reason I took up this section first is that unless your time series is stationary, you cannot build a time series model. In cases where the stationarity criteria are violated, the first requirement is to stationarize the time series and then try stochastic models to predict it. There are multiple ways of bringing about this stationarity, such as detrending and differencing.

Random Walk

This is the most basic concept of a time series. You might know the concept well, but I have found many people in the industry who interpret a random walk as a stationary process. In this section, with the help of some mathematics, I will make this concept crystal clear forever. Let’s take an example.

Example: Imagine a girl moving randomly on a giant chess board. In this case, next position of the girl is only dependent on the last position.

Now imagine, you are sitting in another room and are not able to see the girl. You want to predict the position of the girl with time. How accurate will you be? Of course you will become more and more inaccurate as the position of the girl changes. At t=0 you exactly know where the girl is. Next time, she can only move to 8 squares and hence your probability dips to 1/8 instead of 1 and it keeps on going down. Now let’s try to formulate this series :

X(t) = X(t-1) + Er(t)

where Er(t) is the error at time point t. This is the randomness the girl brings at every point in time.

Now, if we recursively fit in all the Xs, we will finally end up to the following equation :

X(t) = X(0) + Sum(Er(1),Er(2),Er(3).....Er(t))

Now, let’s try validating our assumptions of a stationary series on this random walk formulation:

1. Is the Mean constant?

E[X(t)] = E[X(0)] + Sum(E[Er(1)],E[Er(2)],E[Er(3)].....E[Er(t)])

We know that Expectation of any Error will be zero as it is random.

Hence we get E[X(t)] = E[X(0)] = Constant.

2. Is the Variance constant?

Var[X(t)] = Var[X(0)] + Sum(Var[Er(1)],Var[Er(2)],Var[Er(3)].....Var[Er(t)])
Var[X(t)] = t * Var(Error) = Time dependent.

Hence, we infer that the random walk is not a stationary process as it has a time variant variance. Also, if we check the covariance, we see that too is dependent on time.

Let’s spice up things a bit,

We already know that a random walk is a non-stationary process. Let us introduce a new coefficient in the equation to see if we can make the formulation stationary.

Introduced coefficient: Rho

X(t) = Rho * X(t-1) + Er(t)

Now, we will vary the value of Rho to see if we can make the series stationary. Here we will interpret the scatter visually and not do any test to check stationarity.

Let’s start with a perfectly stationary series with Rho = 0 . Here is the plot for the time series :

Increasing the value of Rho to 0.5 gives us the following graph:

You might notice that our cycles have become broader, but essentially there does not seem to be a serious violation of the stationarity assumptions. Let’s now take a more extreme case of Rho = 0.9.

We still see that X returns from extreme values to zero after some intervals. This series, too, does not violate stationarity significantly. Now, let’s take a look at the random walk with Rho = 1.

This obviously is a violation of the stationarity conditions. What makes Rho = 1 a special case that comes out badly in stationarity tests? We will find the mathematical reason for this.

Let’s take expectation on each side of the equation  “X(t) = Rho * X(t-1) + Er(t)”

E[X(t)] = Rho *E[ X(t-1)]

This equation is very insightful. The next X (or at time point t) is being pulled down to Rho * Last value of X.

For instance, if X(t – 1 ) = 1, E[X(t)] = 0.5 ( for Rho = 0.5) . Now, if X moves to any direction from zero, it is pulled back to zero in next step. The only component which can drive it even further is the error term. Error term is equally probable to go in either direction. What happens when the Rho becomes 1? No force can pull the X down in the next step.
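Since the plots for the different values of Rho are only referenced above, here is a minimal simulation sketch you can run to reproduce that comparison yourself. It is written in Python for illustration (the article’s own code is in R), and it simply iterates X(t) = Rho * X(t-1) + Er(t) with standard normal errors:

import numpy as np
import matplotlib.pyplot as plt

def simulate_series(rho, n=500, seed=0):
    # X(t) = rho * X(t-1) + Er(t), with Er(t) ~ N(0, 1)
    rng = np.random.default_rng(seed)
    errors = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + errors[t]
    return x

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, rho in zip(axes.ravel(), [0.0, 0.5, 0.9, 1.0]):
    ax.plot(simulate_series(rho))
    ax.set_title("Rho = {}".format(rho))
plt.tight_layout()
plt.show()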

Dickey Fuller Test of Stationarity

What you just learned in the last section is formally known as the Dickey-Fuller test. Here is a small tweak made to our equation to convert it to a Dickey-Fuller test:

X(t) = Rho * X(t-1) + Er(t)
=> X(t) - X(t-1) = (Rho - 1) * X(t-1) + Er(t)

We have to test whether Rho – 1 is significantly different from zero. If the null hypothesis is rejected, we get a stationary time series.

Stationarity testing and converting a series into a stationary series are the most critical processes in time series modelling. You need to memorize each and every detail of this concept to move on to the next step of time series modelling.

Let’s now consider an example to show you what a time series looks like.

Exploration of Time Series Data in R

Here we’ll learn to handle time series data in R. Our scope will be restricted to exploring a time series data set; we will not go as far as building time series models.

I have used an inbuilt data set of R called AirPassengers. The dataset consists of monthly totals of international airline passengers, 1949 to 1960.

Loading the Data Set

Following is the code which will help you load the data set and spill out a few top level metrics.

data(AirPassengers)
class(AirPassengers)

[1] "ts"

#This tells you that the data series is in a time series format

start(AirPassengers)

[1] 1949 1

#This is the start of the time series

end(AirPassengers)

[1] 1960 12

#This is the end of the time series

frequency(AirPassengers)

[1] 12

#The cycle of this time series is 12 months in a year

summary(AirPassengers)

Min. 1st Qu. Median Mean 3rd Qu. Max.
104.0 180.0 265.5 280.3 360.5 622.0

Detailed Metrics

#The number of passengers are distributed across the spectrum

plot(AirPassengers)
#This will plot the time series

abline(reg=lm(AirPassengers~time(AirPassengers)))
# This will fit in a line

Here are a few more operations you can do:

cycle(AirPassengers)
#This will print the cycle across years.

plot(aggregate(AirPassengers,FUN=mean))
#This will aggregate the cycles and display a year on year trend

boxplot(AirPassengers~cycle(AirPassengers))
#Box plot across months will give us a sense on seasonal effect

Important Inferences

The year-on-year trend clearly shows that the number of passengers has been increasing without fail.

The variance and the mean value in July and August are much higher than in the rest of the months.

Even though the mean value of each month is quite different, their variance is small. Hence, we have a strong seasonal effect with a cycle of 12 months or less.

Exploring the data is the most important step in a time series model – without this exploration, you will not know whether a series is stationary or not. In this case, we already know many details about the kind of model we are looking for.

Let’s now take up a few time series models and their characteristics. We will also take this problem forward and make a few predictions.

Introduction to ARMA Time Series Modeling

ARMA models are commonly used in time series modeling. In an ARMA model, AR stands for auto-regression and MA stands for moving average. If these words sound intimidating to you, worry not – I’ll simplify these concepts for you in the next few minutes!

We will now develop a knack for these terms and understand the characteristics associated with these models. But before we start, you should remember that AR and MA models are not applicable to non-stationary series.

In case you get a non-stationary series, you first need to stationarize the series (by taking a difference / transformation) and then choose from the available time series models.

First, I’ll explain each of these two models (AR & MA) individually. Next, we will look at the characteristics of these models.

Auto-Regressive Time Series Model

Let’s understand AR models using the case below:

The current GDP of a country, say x(t), depends on last year’s GDP, i.e., x(t – 1). The hypothesis is that the total value of goods and services produced in a country in a fiscal year (known as GDP) depends on the manufacturing plants/services set up in previous years and on the newly set up industries/plants/services in the current year. But the primary component of GDP is the former.

Hence, we can formally write the equation of GDP as:

x(t) = alpha *  x(t – 1) + error (t)

This equation is known as AR(1) formulation. The numeral one (1) denotes that the next instance is solely dependent on the previous instance.  The alpha is a coefficient which we seek so as to minimize the error function. Notice that x(t- 1) is indeed linked to x(t-2) in the same fashion. Hence, any shock to x(t) will gradually fade off in future.

For instance, let’s say x(t) is the number of juice bottles sold in a city on a particular day. During winter, very few vendors purchased juice bottles. Suddenly, on a particular day, the temperature rose and the demand for juice bottles soared to 1,000. However, after a few days, the climate became cold again. But, since people had gotten used to drinking juice during the hot days, 50% of them were still drinking juice during the cold days. In the following days, the proportion went down to 25% (50% of 50%) and then gradually to a small number after a significant number of days. The following graph explains the inertia property of an AR series:

Moving Average Time Series Model

Let’s take another case to understand Moving average time series model.

A manufacturer produces a certain type of bag, which is readily available in the market. The market being competitive, sales of the bag stood at zero for many days. So, one day, he experimented with the design and produced a different type of bag that was not available anywhere else in the market. He was thus able to sell the entire stock of 1,000 bags (let’s call this x(t)). The demand was so high that the bag ran out of stock. As a result, some 100-odd customers couldn’t purchase it. Let’s call this gap the error at that time point. With time, the bag lost its wow factor, but there were still a few customers left who had gone home empty-handed the previous day. The following is a simple formulation to depict the scenario:

x(t) = beta *  error(t-1) + error (t)

If we try plotting this graph, it will look something like this:

Did you notice the difference between the MA and AR models? In an MA model, the noise/shock quickly vanishes with time, whereas in an AR model the shock has a much longer-lasting effect.

Difference Between AR and MA Models

Exploiting ACF and PACF Plots

Once we have got a stationary time series, we must answer two primary questions: is it an AR or MA process, and what order of AR or MA process do we need to use?

The trick to solving these questions is available in the previous section. Didn’t you notice?

The first question can be answered using the Total Correlation Chart (also known as the Auto-correlation Function / ACF). The ACF is a plot of the total correlation between different lags. For instance, in the GDP problem, the GDP at time point t is x(t). We are interested in the correlation of x(t) with x(t-1), x(t-2), and so on. Now let’s reflect on what we have learned above.

In a moving average series of lag n, we will not get any correlation between x(t) and x(t – n – 1). Hence, the total correlation chart cuts off at the nth lag. So it becomes simple to find the lag for an MA series. For an AR series, this correlation will gradually go down without any cut-off value. So what do we do if it is an AR series?

Here is the second trick. If we find the partial correlation at each lag, it will cut off after the degree of the AR series. For instance, if we have an AR(1) series and we exclude the effect of the 1st lag (x(t-1)), our 2nd lag (x(t-2)) is independent of x(t). Hence, the partial auto-correlation function (PACF) will drop sharply after the 1st lag. Following are examples which will clarify any doubts you have on this concept:

The blue lines mark the threshold beyond which values are significantly different from zero. Clearly, the graph above has a cut-off on the PACF curve after the 2nd lag, which means this is mostly an AR(2) process.

Clearly, the graph above has a cut-off on the ACF curve after the 2nd lag, which means this is mostly an MA(2) process.

Till now, we have covered how to identify the type of stationary series using ACF & PACF plots. Now, I’ll introduce you to a comprehensive framework to build a time series model. In addition, we’ll also discuss the practical applications of time series modelling.

Framework and Application of ARIMA Time Series Modeling

A quick revision: till here, we’ve learned the basics of time series modeling, time series in R, and ARMA modeling. Now is the time to join these pieces and make an interesting story.

Overview of the Framework

This framework (shown below) specifies the step-by-step approach on ‘How to do a Time Series Analysis‘:

As you would be aware, the first three steps have already been discussed above. Nevertheless, the same has been delineated briefly below:

Step 1: Visualize the Time Series

It is essential to analyze the trends prior to building any kind of time series model. The details we are interested in pertain to any kind of trend, seasonality, or random behaviour in the series. We have covered this part in the second part of this series.

Step 2: Stationarize the Series

Once we know the patterns, trends, cycles, and seasonality, we can check whether the series is stationary or not. The Dickey-Fuller test is one of the popular tests to check this. We covered this test in the first part of this article. This doesn’t end here! What if the series is found to be non-stationary?

There are three commonly used techniques to make a time series stationary:

1. Detrending: Here, we simply remove the trend component from the time series. For instance, the equation of my time series is:

x(t) = (mean + trend * t) + error

We’ll simply remove the part in the parentheses and build the model for the rest.

2. Differencing: This is a commonly used technique to remove non-stationarity. Here we model the differences between consecutive terms rather than the actual terms. For instance,

x(t) – x(t-1) = ARMA (p ,  q)

This differencing is called the Integration part in AR(I)MA. Now, we have three parameters:

p : AR

d : I

q : MA

3. Seasonality: Seasonality can easily be incorporated in the ARIMA model directly. More on this has been discussed in the applications part below.

Step 3: Find Optimal Parameters

The parameters p, d, and q can be found using ACF and PACF plots. An addition to this approach: if both ACF and PACF decrease gradually, it indicates that we need to make the time series stationary and introduce a value for “d”.

Step 4: Build ARIMA Model

With the parameters in hand, we can now try to build the ARIMA model. The values found in the previous section might be approximate estimates, and we need to explore more (p,d,q) combinations. The one with the lowest BIC and AIC should be our choice. We can also try some models with a seasonal component, in case we notice any seasonality in the ACF/PACF plots.

Step 5: Make Predictions

Once we have the final ARIMA model, we are now ready to make predictions on the future time points. We can also visualize the trends to cross validate if the model works fine.

Applications of Time Series Model

Now, we’ll use the same example that we have used above. Then, using time series, we’ll make future predictions. We recommend you to check out the example before proceeding further.

Where did we start?

Following is the plot of the number of passengers with years. Try and make observations on this plot before moving further in the article.

Here are my observations:

1. There is a trend component which grows the passenger year by year.

2. There looks to be a seasonal component which has a cycle less than 12 months.

3. The variance in the data keeps on increasing with time.

We know that we need to address two issues before we test for a stationary series. One, we need to remove the unequal variances; we do this by taking the log of the series. Two, we need to address the trend component; we do this by taking the difference of the series. Now, let’s test the resultant series.

library(tseries)
adf.test(diff(log(AirPassengers)), alternative="stationary", k=0)

Augmented Dickey-Fuller Test

data: diff(log(AirPassengers))
Dickey-Fuller = -9.6003, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary

We see that the series is stationary enough to do any kind of time series modelling.

Next step is to find the right parameters to be used in the ARIMA model. We already know that the ‘d’ component is 1 as we need 1 difference to make the series stationary. We do this using the Correlation plots. Following are the ACF plots for the series:

#ACF Plots

acf(log(AirPassengers))

What do you see in the chart shown above?

Clearly, the decay of ACF chart is very slow, which means that the population is not stationary. We have already discussed above that we now intend to regress on the difference of logs rather than log directly. Let’s see how ACF and PACF curve come out after regressing on the difference.

acf(diff(log(AirPassengers)))
pacf(diff(log(AirPassengers)))

Clearly, the ACF plot cuts off after the first lag. Since it is the ACF (and not the PACF) that cuts off, this points to an MA process: the value of p should be 0, while the value of q should be 1 or 2. After a few iterations, we found that (0,1,1) as (p,d,q) is the combination with the least AIC and BIC.

Let’s fit an ARIMA model and predict the future 10 years. Also, we will try fitting in a seasonal component in the ARIMA formulation. Then, we will visualize the prediction along with the training data. You can use the following code to do the same :

(fit <- arima(log(AirPassengers), c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12)))
pred <- predict(fit, n.ahead = 10*12)
ts.plot(AirPassengers, 2.718^pred$pred, log = "y", lty = c(1,3))

Practice Projects

Now, it’s time to take the plunge and actually play with some other real datasets. So are you ready to take on the challenge? Test the techniques discussed in this post and accelerate your learning in Time Series Analysis with the following Practice Problems:

Conclusion

With this, we come to the end of this tutorial on Time Series Modelling. I hope it will help you improve your ability to work with time-based data. To reap the maximum benefit from this tutorial, I’d suggest you practice this R code side by side and check your progress.


Do You Want to Be a Trending Topic on Twitter? Here's How It's Done

Do you want to be a Trending Topic on Twitter? Then you have to follow these steps to improve your chances of making your hashtag go viral.

Chances are that on your phone, under Twitter's search (magnifying glass) tab, you always see between 10 and 20 titles that become trends day after day, hour after hour. Do you know what they are? Do you know how they work? Did you know you could generate a trend yourself? Here I'll tell you how to do it and what you need to keep in mind.

But What Is a Trending Topic?

According to Wikipedia, the official definition would be: "A trending topic is one of the most repeated words or phrases at a given moment on Twitter. The ten most relevant ones are shown on the home page, and the user can choose the geographic scope they prefer, worldwide or localized, or personalized, based also on whom the user follows. The great impact they have had in the press has led this expression to also be used to refer to a topic of great interest, whether or not it is being discussed on the social network."

A Trending Topic, or TT (as many of us call it), is the most talked-about topic on Twitter at that moment. Generally, the # sign is used, which, joined to a word or a sentence, becomes the famous hashtag.

If a hashtag is repeated many times on Twitter, does it become a trend?

The answer is no. Every day people use millions of hashtags, but only those that get a considerable number of mentions within minutes, whether as tweets or RTs, become trends.

How do I get a hashtag to become a Trending Topic?

Be creative and think of a phrase or word that is easy to replicate.

The hashtag should be between 10 and 21 characters long; beyond 21, it does not become a trend.

The best time to do it depends a lot on how active the media in each country are on social networks, but the best times will always be first thing in the morning or after 10 pm.

80% of trends contain no numbers; it is better to use only letters.

31% of the trending topics Twitter picks are thanks to RTs. So RTs are often more important, since they make the hashtag spread faster.

If you are part of a company that is doing a launch, or part of the production team of a show, it is a good idea to put together a group of friends who RT and tweet at the same time so it spreads.

Some myths:

Can only one go per tweet? Actually no; you can include as many hashtags as you want as long as they do not exceed 280 characters, but it is not recommended because it looks very cluttered. Using 2 or 3 at most is recommended.

Should I only use hashtags in English? No. It has been shown that 86% of trends are in Spanish, 10% in English, and 3% in Catalan.

Does my number of followers matter when creating a trending topic? It matters, but it is not decisive. If you do not have many followers, you can ask your friends to help you by retweeting or tweeting with your hashtag. If you have an influencer or verified friend, that helps a lot.

If my hashtag is very long, is it harder to make it trend? Exactly; the shorter the hashtag, the more smoothly the trend will flow and the faster it will spread.

Should the tweets contain only the hashtag, or should I also use photos? They can be just tweets, but if photos and GIFs are used, the trend will be stronger.

Can uppercase letters be used? It is not just that you can; you should. 71% of TTs contain uppercase letters to differentiate the words within the hashtag.

How long do trending topics usually last? It is estimated that there are more than 8,900 every day, with an average lifespan of 11 minutes.

What is the best time to create a Trending Topic?

From 4:00 to 10:00, 1,200 tweets and 500 users are needed to become a trending topic.

From 10:00 to 16:00, 1,700 tweets and 734 users are needed.

From 16:00 to 22:00, 1,500 tweets and 811 users are needed.

From 22:00 to 4:00, 1,900 tweets and 923 users are needed.

Working with DataFrames Using PySpark

This article was published as a part of the Data Science Blogathon.

Introduction

Apache Spark is a fast and general engine used for large-scale data processing.

Speed – Approximately 100 times faster than traditional MapReduce Jobs.

Ease of Use – Supports many programming languages, such as Python, Java, Scala, and R.

Libraries for SQL Queries, Machine Learning, and Graph Processing applications are present.

Parallel Distributed Processing, fault tolerance, scalability, and in-memory computation features make it more powerful.

Platform Agnostic- Runs in nearly any environment.

The Components of Apache Spark

DataFrames Using PySpark

PySpark is an interface for Apache Spark in Python. Here we will learn how to manipulate dataframes using PySpark.

Our approach here would be to learn from the demonstration of small examples/problem statements(PS). First, we will write the code and see the output; then, below the output, there will be an explanation of that code.

We will write our code in Google Colaboratory, a rich coding environment from Google. You can install Apache Spark in the local system, also.

(Installation Guide: How to Install Apache Spark)

Install PySpark

First, we need to install pyspark using the pip command.

!pip install pyspark

import pyspark

Explanation:

The above Python code installs and imports PySpark in Google Colaboratory.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Explanation:

We need to create a spark session to be able to work with dataframes. The above lines of code are exactly doing the same.

Problem Statements (PS)

PS 1. Load the csv file into a dataframe

df=spark.read.csv("StudentsPerformance.csv",header=True,inferSchema=True)
df.show(5)

Explanation:

We load the csv file into a dataframe, with header=True so that the first row is treated as column names and inferSchema=True so that the column data types are inferred automatically, and then display the first 5 rows.

df.columns

Explanation:

This returns a list of all the column names in the dataframe.

df.printSchema()

Explanation:

This prints the schema of the dataframe, i.e., each column name along with its data type.

PS 2. Select an output few values/rows of the math score column

df.select('math score').show(5)

PS 3: Select and output a few values/rows of math score, reading score, and writing score columns

df.select('math score','reading score','writing score').show(5)

PS 4: Create a new column by converting the math scores out of 200 (currently it’s given out of 100)

df.withColumn("Math_Score_200",2*df["math score"]).show(5)

Explanation:

Math_Score_200 is the name of the new column we created, whose values are twice the values of the math score column, i.e., 2*df["math score"]. So now we have scores out of 200 instead of 100.

PS 5: Rename the parental level of education column

df.withColumnRenamed("parental level of education","Parental_Education_Status").show(5)

Explanation:

withColumnRenamed renames the "parental level of education" column to "Parental_Education_Status" in the displayed result.

PS 6: Sort the dataframe by reading score values in ascending order

df.orderBy('reading score').show(5)

Explanation:

Note: default is ascending order. So ascending=True is optional

For arranging the dataframe in descending order, we need to type df.orderBy(‘reading score’, ascending=False)

PS 7: Drop race/ethnicity column

df.drop('race/ethnicity').show(5)

PS 8: Show what are all the different education levels of parents

df.select('parental level of education').distinct().collect()

Explanation:

.distinct() keeps only the unique values of the "parental level of education" column, and .collect() returns them to the driver as a list of Row objects.

PS 9: Find the sum of reading scores for each gender

df.select('gender','reading score').groupBy('gender').sum('reading score').show()

Explanation:

We select our required columns for the task using .select, group the data by gender using .groupBy, and sum the reading scores for each group/category using .sum. Notice that the aggregated column comes out named sum(reading score); the sketch below shows one way to rename it during aggregation.
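As a side note, one common way to control that aggregate column's name is to use .agg with an alias. The following is a minimal sketch (the name total_reading_score is just an illustrative choice, not from the original article):

from pyspark.sql import functions as F

df.select('gender', 'reading score') \
  .groupBy('gender') \
  .agg(F.sum('reading score').alias('total_reading_score')) \
  .show()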

PS 10: Filter the dataframe where the reading score is greater than 90

df.count()

Explanation:

First, check the total number of rows in the original dataframe. It’s 1000, as seen in the above output.

df.filter(df['reading score'] > 90).count()

Explanation:

Next, apply .filter with the condition reading score > 90; calling .count() on the filtered dataframe gives the number of rows that satisfy it.

PS 11: Using pyspark.sql functions, fetch the lowest marks in the reading score column and convert the gender values to uppercase

from pyspark.sql import functions

Explanation:

The pyspark.sql.functions module provides handy functions for performing various operations on the columns of a dataframe.

print(dir(functions))

Explanation:

We can check what features/functions are available in the functions module using dir

help(functions.upper)

Explanation:

We can check what a particular function does using help.

from pyspark.sql.functions import upper,col,min

Explanation:

Imported our required functions for the task.

df.select(min(col('reading score'))).show()

Explanation:

Calculated the minimum value of the reading score column

df.select(col('gender'),upper(col('gender'))).show(5)

Explanation:

Converted gender to uppercase

PS 12: Rename column names and save them permanently

df=df.withColumnRenamed("parental level of education","Parental_Education_Status") \
     .withColumnRenamed("test preparation course","Test_Preparation_Course") \
     .withColumnRenamed("math score","Math_Score") \
     .withColumnRenamed("reading score","Reading_Score") \
     .withColumnRenamed("writing score","Writing_Score")
df.show()

Explanation:

Till now, the changes we were making were temporary. To permanently retain the changes, we need to assign our changes back to the same dataframe, i.e., df=df.withColumnRenamed(..), or, if we want to store the changes in a different dataframe, we need to assign them to a different dataframe, i.e., df_new=df.withColumnRenamed(..)

PS 13: Save DataFrame into a .csv file

df.write.csv("table_formed_2.csv",header=True)

Explanation :

We save our dataframe into a file named table_formed_2.csv

PS 14: Perform multiple transformations in a single code/query

df.select(df['parental level of education'],df['lunch'],df['math score']) \
  .filter(df['lunch']=='standard') \
  .groupBy('parental level of education') \
  .sum('math score') \
  .withColumnRenamed("sum(math score)","math score") \
  .orderBy('math score',ascending=False) \
  .show()

Explanation:

Notice how we have performed different operations/transformations on the dataframe, one transformation after another.

1st, we select the required col. using .select

2nd, we use .filter to choose lunch type as standard

3rd & 4th, we perform the summation of math score for each level of parent education

5th, we renamed aggregated column sum(math score) to math score

6th, we order/rank our result by math score

7th, we output our result using .show()

PS 15: Create a DataFrame with a single column named DATES which will contain a random date with time information. Create another column side by side with the date 5 days after the date you have chosen initially

from pyspark.sql.functions import to_date, to_timestamp, date_add

Explanation:

Importing required functions

df2=spark.createDataFrame([('2012-11-12 11:00:03',)],['DATES'])
df2.show()

Explanation:

Creating dataframes with a single row containing date & time (format: YYYY-dd-MM HH:mm:ss ) and column name DATES

df3=df2.select(to_date(col('DATES'),'yyyy-dd-MM'),to_timestamp(col('DATES'),'yyyy-dd-MM HH:mm:ss'))
renamed_cols = ['DATE','TIMESTAMP']
df4= df3.toDF(*renamed_cols)
df4.show()

Explanation:

Created dataframes df3 first with two columns, one containing only date info and another containing date & time info. Note that for the latter case, we used the to_timestamp function.

We then created a df4 dataframe from df3 with the same information, but this time added the column names DATE and TIMESTAMP.

df4.select(col('TIMESTAMP'),date_add(col('TIMESTAMP'),5)).show()

Explanation:

date_add adds 5 days to the TIMESTAMP column, creating the second column side by side with the original date, as required in PS 15.

Conclusion

• We performed different transformations on a dataframe using PySpark, with proper explanations.

• In short, PySpark is very easy to use if we know the proper syntax and have a little practice. Extra resources are available below for reference.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 


Explaining MLOps Using the MLflow Tool

This article was published as a part of the Data Science Blogathon.

Introduction

We will start by briefly looking at MLOps before diving into the use of MLflow for MLOps.

The concept of MLOps can be complex for novices. A good way to decipher it is by using an implementation tool like MLflow. The belief in this article is that MLOps tools can help understand MLOps concepts generally.

What is MLOps?

The terms “machine learning” and “DevOps” are combined to form the term “MLOps,” which is used in software development. MLOps can be seen as a set of guidelines that machine learning (ML) experts follow to hasten the deployment of ML models in real projects and enhance the overall integration of various project pipeline operations.

It can be viewed as expanding the DevOps technique to incorporate data science and machine learning. The propagation of AI in software production creates a need for agreed-upon best practices to provide testing, deployment, and monitoring of the new system.

The complete MLOps process includes three broad phases “Designing the ML-powered application,” “ML Experimentation and Development,” and “ML Operations.”

MLOps brings together design and operations in a way that lets the development of ML applications happen on a robust platform. MLOps requires all the data, or artifacts, needed for model deployment to be contained in a group of files created by a training project. After grouping these model artifacts, developers must have the means to track the code used to create them, the data used to train and test them, and the connections between them. This makes it possible to automate the steps of app creation and delivery. This supports CI/CD, so ML apps can be continually deployed, integrated, and delivered.

Benefits of Using MLOps

There are three key things MLOps brings to the table: automation, continuous deployment, and monitoring.

Automation

Automation removes the manual process of doing things. Automation helps the process of building regular ML models without any manual intervention. For instance, automated testing or debugging could reduce human error and save correction time. Before the problem gets out of hand, it is fixed or reported right away.

Monitoring

Monitoring is another form of automation, but it involves sending signals when certain conditions are met. These signals could be on models or data. It may be when an anomaly is detected, such as a drift, while for models, it may be when a metric or hyperparameter is triggered. This could be after a model is deployed so that even when it is in production, it is still receiving new data and automatically retraining it.
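As a rough illustration of the monitoring idea (this is a generic sketch, not a feature of any particular MLOps tool), a scheduled check might compare a feature's distribution in incoming data against the training data and raise a signal when they drift apart:

from scipy import stats

def drift_detected(train_values, new_values, threshold=0.01):
    """Signal drift when the new data's distribution differs significantly
    from the training data's (two-sample Kolmogorov-Smirnov test)."""
    statistic, p_value = stats.ks_2samp(train_values, new_values)
    return p_value < threshold

# Hypothetical feature values from training time and from production
train_feature = [0.10, 0.20, 0.15, 0.30, 0.25, 0.18, 0.22]
incoming_feature = [0.90, 1.10, 0.95, 1.20, 1.05, 0.98, 1.15]

if drift_detected(train_feature, incoming_feature):
    print("Drift detected: flag the model for retraining")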

Continuous X

This is another key benefit of MLOps, but what does “X” imply? It also implies automation, where there is a loop in production. This could be Continuous Delivery (commonly known as CD), Continuous Integration (CI), Continuous Training (CT), Continuous Monitoring, and so on; you can add to the list too! This feature of MLOps extends automation into and beyond deployment, so that some part of the pipeline keeps running continuously.

What are MLOps Tools?

Note that these tools are not directly meant for implementing MLOps; they simply have features that help lift the ML process up to MLOps. MLOps tools help organizations apply DevOps practices to creating and using AI and machine learning (ML). They were developed to help close the gap between developing ML models and reaping the benefits of those models in the commercial world.

The type of tool to employ depends on the nature of the project. These tools can be seen as simply platforms for effectively implementing MLOps.

What is MLflow?

MLflow is an open-source platform for managing the machine learning lifecycle, built around four primary functionalities: experiment tracking, project packaging, model packaging, and a model registry (each is described below). As said earlier, this tool does not directly do MLOps; it only has good functionality for MLOps, which is what we want to see. This implies you can use the tool without actually implementing MLOps, by just following a regular ML workflow.

MLflow Components for MLOps

MLflow provides four components to help manage the ML workflow which we have seen previously. We will see the details and how they affect MLOps:

MLflow Tracking; is an API and UI for logging and querying experiments using the Python, REST, R, and Java APIs. It is designed for logging parameters, code versions, metrics, and artifacts when running machine learning code, so that the results can be visualized later. This feature supports the MLOps guideline of creating processes with enough detail to aid future tracing.

An example is code and data versioning. MLflow Tracking runs in any environment, including a notebook. This tracking feature can be used to create robust systems that meet MLOps requirements.
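As a minimal sketch of what the Tracking API looks like in practice (the parameter and metric names below are placeholders, not from a specific project), a training script might log its configuration and results like this:

import mlflow

with mlflow.start_run():
    # Log the configuration used for this run
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)

    # ... train and evaluate the model here ...

    # Log the resulting metrics so the run can be compared later in the MLflow UI
    mlflow.log_metric("rmse", 0.78)
    mlflow.log_metric("r2", 0.64)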

MLflow Projects; Managing projects is very important for MLOps. In MLflow, a Project is a standard format for packaging data science code in a way that makes it reusable and reproducible. The component includes an API and a command-line tool for running projects, which makes it possible to chain projects into workflows.

Projects are organized as directories or Git repositories. This high-quality code management eases teamwork, which is highly important in MLOps. Tracking MLflow Projects from a Git repository is easy: when the MLflow Tracking API is used within a Project, MLflow automatically remembers the project version and any saved parameters.

MLflow Models; An MLflow Model offers a common configuration for encasing machine learning models so they may be used in multiple other tools. The configuration specifies the rules that permit users to store a model in different so-called “flavors” that different downstream tools can recognize. It offers a standard for distributing machine-learning models in various flavors. Each Model is handled as a directory with arbitrary files, and it is possible to use a descriptor file that lists the model’s various “flavors.”

MLflow provides tools to deploy many common model types to diverse platforms. When models are output through the Tracking API, MLflow automatically remembers which Project and run they came from. With all these controls, implementing good MLOps becomes a breeze!
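For instance, a scikit-learn model can be logged in its sklearn flavor and then loaded back through the generic python_function (pyfunc) flavor so any downstream tool can use it. A minimal sketch (the toy model and data are illustrative only):

import numpy as np
import mlflow
import mlflow.sklearn
import mlflow.pyfunc
from sklearn.linear_model import LogisticRegression

# Train a tiny toy model
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

with mlflow.start_run() as run:
    # Save the model in the sklearn flavor under this run's artifact store
    mlflow.sklearn.log_model(model, artifact_path="model")

# Reload it through the generic pyfunc flavor
loaded = mlflow.pyfunc.load_model("runs:/{}/model".format(run.info.run_id))
print(loaded.predict(np.array([[1.5]])))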

MLflow Registry; It provides a central model repository, a collection of APIs, and a user interface to enable collaborative management of an MLflow Model’s whole lifecycle. It offers model versioning, stage transitions (from staging to production or archived), model lineage (which MLflow experiment and run produced the model), and annotations.

This provides a one-stop model store, set of APIs, and UI to collectively control the entire lifecycle of an MLflow Model. Each registered model can have one or many versions: when a new model is added to the Model Registry, it is added with a version number, and each new model registered under the same name increments that version number. When a model is registered, it carries a unique name and contains its versions, associated transitional stages, and model lineage, along with other metadata.

(Screenshots: the MLflow UI for registering a model, and the names and versions of registered models in MLflow.)
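Registering a model version can also be done from code rather than the UI; a minimal sketch follows (the model name "SalesForecaster" is an illustrative choice, and a registry-capable tracking backend, e.g., a database-backed store, is assumed):

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

with mlflow.start_run() as run:
    model = LinearRegression().fit([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0])
    mlflow.sklearn.log_model(model, artifact_path="model")

# Each call with the same name registers a new version of that model
result = mlflow.register_model("runs:/{}/model".format(run.info.run_id), "SalesForecaster")
print(result.name, result.version)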

This versioning is a capability that MLOps strongly requires. We have seen some of the key features of the MLflow tool and how they can be used; these are the ones that cut most directly into the MLOps discussion. Generally, we can see that the strength of MLflow is in managing utilities like models and data by keeping track of them. This is very handy for robust systems, as robustness is seen in being scalable or easily upgradeable.

Conclusion

Since managing the lifecycle of ML using MLOps can be challenging, every tool that can help ease the pain becomes very useful. MLOps becomes achievable using the features of tools such as MLflow. With cutting-edge features in model and data management, and a very wide range of ways to develop models that meet MLOps standards, MLflow is a tool to look out for. Its biggest strength is data and model management.

Key Takeaways:

As you may know, a great way to learn something is through tools. Tools provide a hands-on understanding of concepts, as we saw with MLOps.

MLOps can be seen as a set of guidelines that machine learning (ML) experts follow to hasten the deployment of ML models in real projects and enhance the overall integration of various project pipeline operations.

MLOps tools help organizations apply DevOps practices to creating and using AI and machine learning (ML).

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
