Handling Missing Values With Random Forest


This article was published as a part of the Data Science Blogathon.

Introduction to Random Forest

Missing values have always been a concern in statistical analysis. They significantly reduce a study's statistical power, which may lead to faulty conclusions. Most of the algorithms used in statistical modelling, such as linear regression, logistic regression, SVM, and even neural networks, are sensitive to missing values. So, knowing how to handle missing data is a must. Several methods are popularly used to manage missing data. This article will discuss the Random Forest approach to filling missing values and see how it fares compared to another popular technique, K-Nearest Neighbours.

What is Missing Data?

Missing data is a value that is not stored for a variable in an observation of interest. There are four ways missing values can occur in a dataset:

Structurally missing data,

MCAR (Missing Completely At Random),

MAR (Missing At Random), and

NMAR (Not Missing At Random).

Structurally missing data: These values are missing because they are not supposed to exist; for example, the age of the youngest child of a couple who has no children. Such cases are best handled by excluding those observations from any analysis of the affected variable.

MCAR (Missing Completely At Random): These are missing values that occur entirely at random. There is no pattern, and each missing value is unrelated to the others. For example, a weighing machine that ran out of batteries: some of the data will be lost just because of bad luck, and the probability of being missing is the same for every observation.

MAR (Missing At Random): The assumption here is that the missing values are somewhat related to the other observations in the data. We can predict the missing values using information from other variables, such as predicting a person's missing height from their age, gender, and weight. This can be handled using specific imputation techniques or sophisticated statistical analysis.

NMAR (Not Missing At Random): If the data cannot be classified under the above three categories, the missing value falls here. These values are missing not at random but deliberately; for example, people in a survey who purposely do not answer some questions.

Imputation Techniques

There are many imputation techniques we can employ to tackle missing values. For example, imputing the mean for continuous data and the most frequent value (mode) for categorical data is the most routine approach. Or we can use machine learning algorithms like KNN and Random Forests to address the missing data problem. In this article, we will discuss two Random Forest-based methods, Miss Forest and Mice Forest, to handle missing values and compare them with the KNN imputation method. A minimal sketch of the simple mean/mode approach is shown below.
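As a point of reference, simple mean/mode imputation in pandas can be sketched as follows (the columns here are made up purely for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],      # continuous variable
    "city": ["Delhi", "Mumbai", np.nan, "Delhi"],  # categorical variable
})

df["height"] = df["height"].fillna(df["height"].mean())  # mean for continuous data
df["city"] = df["city"].fillna(df["city"].mode()[0])     # most frequent value for categorical data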

Random Forest for Missing Values

Random Forest for data imputation is an exciting and efficient way of imputation, and it has almost every quality of a good imputation technique. Random Forests scale well to large data settings, are robust to non-linearity in the data, and can handle outliers. They can also handle mixed-type data (both numerical and categorical). On top of that, they have a built-in feature selection mechanism. These distinctive qualities can easily give Random Forests an upper hand over KNN or other methods.

KNN, on the other hand, involves the calculation of Euclidean distance of data points, thus making it prone to outliers. It cannot handle categorical data, so data transformation is needed, and it requires the data to be scaled to perform better. All these things can be bypassed by using Random Forest-based imputation methods.

For this article, we will discuss two such methods: Miss Forest and Mice Forest. We will explore how they work and their Python implementation.

Miss Forest

Miss Forest is arguably one of the most precise imputation methods: if you need accuracy, this is what you should reach for. It is an iterative imputation technique powered by Random Forest. So, how does it work?

Step 1: First, the missing values are filled with the mean of the respective column for continuous variables and with the most frequent value for categorical variables.

Step 2: The dataset is divided into two parts: training data consisting of the observed values, and the rows with missing data used for prediction. The training and prediction sets are then fed to a Random Forest, and the predicted values are imputed in the appropriate places. Once all the values have been imputed, one iteration is complete.

Step 3: The above step is repeated until a stopping condition is reached. The iterative process ensures the algorithm operates on better-quality data in subsequent rounds. The process continues until the sum of squared differences between the current and previous imputation increases, or until a specified iteration limit is reached. Usually, it takes 5-6 iterations to impute the data well. A simplified sketch of this loop is shown below.

Source: Andre Ye
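To make the procedure concrete, here is a simplified, from-scratch sketch of the Miss Forest idea for purely numeric data. This is only an illustration of the algorithm described above, not the missingpy implementation, and the stopping rule is deliberately minimal:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def miss_forest_sketch(df, max_iter=6, random_state=0):
    df = df.copy()
    na_mask = df.isnull()
    # Step 1: initial fill with column means
    df = df.fillna(df.mean())
    prev = df.copy()
    for _ in range(max_iter):
        # Step 2: re-impute every column that originally had missing values
        for col in df.columns[na_mask.any()]:
            obs = ~na_mask[col]
            mis = na_mask[col]
            rf = RandomForestRegressor(n_estimators=100, random_state=random_state)
            rf.fit(df.loc[obs, df.columns != col], df.loc[obs, col])
            df.loc[mis, col] = rf.predict(df.loc[mis, df.columns != col])
        # Step 3: stop once the imputations barely change between iterations
        if ((df - prev) ** 2).sum().sum() < 1e-6:
            break
        prev = df.copy()
    return df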

Advantages of Miss Forest

It can work with mixed data, both numerical and categorical

Miss Forest can handle outliers, so there is no need for feature scaling.

Random Forests have inherent feature selection, which makes them robust to noisy data.

It can handle non-linearity in data

Disadvantages of Miss Forest

Multiple trees need to be constructed for each imputed variable in every iteration, so it becomes computationally expensive when the number of predictors and observations increases.

Also, it’s an algorithm, not a model object, meaning it must be run every time data is imputed, which could be problematic in some production environments.

Mice Forest

Another interesting imputation method is MICE, which stands for Multiple Imputation by Chained Equations. Technically, any predictive model can be used with MICE for imputation; here we will be using LightGBM for prediction. The algorithm is more or less similar to Miss Forest as far as the pseudocode is concerned. The only difference in the package we will be using is that it employs LightGBM instead of a pure Random Forest. The pseudocode for the algorithm is given below.

Source: Sam Wilson

You can find more on this here.
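Since the pseudocode itself is referenced as a figure, here is the chained-equations idea in rough outline (a paraphrase, not the exact pseudocode):

# MICE (chained equations), in outline:
# for each of m imputed datasets:
#     fill every missing cell with a simple starting value (mean / most frequent)
#     for each iteration:
#         for each variable v that has missing values:
#             set v's imputed cells back to missing
#             fit a model (LightGBM in miceforest) predicting v from the other
#             variables, using only the rows where v is observed
#             replace v's missing cells with the model's predictions
# the m completed datasets can then be used individually or pooled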

Python Example

The best way to show the efficacy of the imputers is to take a complete dataset without any missing values, amputate it at random to create missing values, use the imputers to predict the missing data, and compare the results to the original.

For this section, we will also be using the KNN imputation method to give a performance comparison between them.

Import Libraries

Before going further, we need to install a couple of packages. We will use the missingpy library for Miss Forest and the miceforest package for Mice Forest.

!pip install missingpy
!pip install miceforest

Now, we will import libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import miceforest as mf
import random
import sklearn.neighbors._base
import sys
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
from sklearn.impute import KNNImputer

For this article, we will be using the California Housing dataset.

from sklearn.datasets import fetch_california_housing

Convert it to a workable pandas DataFrame format.

Python Code:
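A minimal version of this step, using the variable names dt2 (the complete, original data) and df2 (the working copy that will receive missing values) assumed by the code below, might look like this:

california_housing = fetch_california_housing(as_frame=True)

dt2 = california_housing.frame   # complete, original DataFrame (features + MedHouseVal)
df2 = dt2.copy()                 # working copy that we will amputate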



Amputation for Random Forest

We will now introduce nan values at random in the dataset.

r1 = random.sample(range(len(df2)), 36)
r2 = random.sample(range(len(df2)), 34)
r3 = random.sample(range(len(df2)), 37)
r4 = random.sample(range(len(df2)), 30)

df2['AveRooms'] = [val if i not in r1 else np.nan for i, val in enumerate(dt2['AveRooms'])]
df2['HouseAge'] = [val if i not in r2 else np.nan for i, val in enumerate(dt2['HouseAge'])]
df2['MedHouseVal'] = [val if i not in r3 else np.nan for i, val in enumerate(dt2['MedHouseVal'])]
df2['Latitude'] = [val if i not in r4 else np.nan for i, val in enumerate(dt2['Latitude'])]

Imputation for Random Forest

Now, we will use each of the three techniques to impute data.

# Create kernel.
# mice forest
kernel = mf.ImputationKernel(
    data=df2,
    save_all_iterations=True,
    random_state=1343
)

# Run the MICE algorithm for 3 iterations on each of the datasets
kernel.mice(3, verbose=True)

# Retrieve the completed (imputed) dataset; it is referenced as completed_dataset below
completed_dataset = kernel.complete_data(dataset=0)

Miss Forest

imputer = MissForest()  # miss forest
X_imputed = imputer.fit_transform(df2)
X_imputed = pd.DataFrame(X_imputed, columns=df2.columns).round(1)

KNN Imputation

impute = KNNImputer()  # KNN imputation
KNNImputed = impute.fit_transform(df2)
KNNImputed = pd.DataFrame(KNNImputed, columns=df2.columns).round(1)

Evaluation

To evaluate the results, we will use the sum of the absolute differences between the imputed dataset and the original dataset. Only the rows containing null values are used here, which gives an overall idea of how close each method gets to the original data.

missF = np.sum(np.abs(X_imputed[df2.isnull().any(axis=1)] - dt2[df2.isnull().any(axis=1)]))
mice = np.sum(np.abs(completed_dataset[df2.isnull().any(axis=1)] - dt2[df2.isnull().any(axis=1)]))
Knn = np.sum(np.abs(KNNImputed[df2.isnull().any(axis=1)] - dt2[df2.isnull().any(axis=1)]))

for i in [missF, mice, Knn]:
    print(np.sum(i))

Output:

274.3717058165446
361.4645867313583
497.2673601714581

From the above outcome, we can get a rough estimate of the efficacy of each method, and it seems Miss Forest was able to recreate the original data best, followed by Mice Forest and then KNN.

We didn't do any feature transformation, which means KNN, which relies on Euclidean distance, suffered a lot.

We can also visualize the same using Seaborn and Matplotlib.

sns.set(style='darkgrid')
fig, ax = plt.subplots(figsize=(29, 12), nrows=2, ncols=2)

for col, i, j in zip(['AveRooms', 'HouseAge', 'Latitude', 'MedHouseVal'], [0, 0, 1, 1], [0, 1, 0, 1]):
    sns.kdeplot(x=dt2[col][df2.isnull().any(axis=1)], label='Original', ax=ax[i][j])
    sns.kdeplot(x=X_imputed[col][df2.isnull().any(axis=1)], label='MissForest', ax=ax[i][j])
    sns.kdeplot(x=completed_dataset[col][df2.isnull().any(axis=1)], label='MiceForest', ax=ax[i][j])
    sns.kdeplot(x=KNNImputed[col][df2.isnull().any(axis=1)], label='KNNImpute', ax=ax[i][j])
    ax[i][j].legend()

The previous result can be verified here, as Miss Forest traces closest to the original data, followed by Mice Forest and then KNN.

This was a rough evaluation of the three imputation methods we used. Note: We didn’t take the bias introduced by each technique into account while evaluating. This is meant to give just a rough idea of imputation accuracy.

Conclusion

Missing values can seriously weaken a statistical analysis, and Random Forest-based imputers offer a robust way to fill them in. In this rough comparison on the California Housing data, Miss Forest recreated the original values most closely, followed by Mice Forest and then KNN imputation.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.


A Beginner’s Guide To Random Forest Hyperparameter Tuning

Introduction to Random Forest

What’s the first image that comes to your mind when you think about Random Forest? It conjures up images of trees and a mystical and magical land. And that’s what the Random Forest algorithm does!

It is an ensemble algorithm that combines multiple decision trees and navigates complex problems to give us the final result.

I’ve lost count of the number of times I’ve relied on the Random Forest algorithm in my machine learning projects and even hackathons. What makes random forest different from other ensemble algorithms is the fact that each individual tree is built on a subset of data and features.

Random Forest comes with a caveat – the numerous hyperparameters that can make fresher data scientists weak in the knees. But don’t worry! In this article, we will be looking at the various Random Forest hyperparameters and understand how to tune and optimize them.

I assume you have a basic understanding of the random forest algorithm (and decision trees). If not, I encourage you to go through the below resources first:

Random Forest Hyperparameters we’ll be Looking at:

max_depth

min_samples_split

max_leaf_nodes

min_samples_leaf

n_estimators

max_samples (bootstrap sample)

max_features

Random Forest Hyperparameter #1: max_depth

Let’s discuss the critical max_depth hyperparameter first. The max_depth of a tree in Random Forest is defined as the longest path between the root node and the leaf node:

Using the max_depth parameter, I can limit up to what depth I want every tree in my random forest to grow.

In this graph, we can clearly see that as the max depth of the decision tree increases, the performance of the model over the training set increases continuously. On the other hand as the max_depth value increases, the performance over the test set increases initially but after a certain point, it starts to decrease rapidly.

Among the parameters of a decision tree, max_depth works on the macro level by greatly reducing the growth of the Decision Tree.
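As a rough way to reproduce the train/test behaviour described above, a validation-curve sketch like the one below can be used; the synthetic dataset and the depth range are placeholders, and the same pattern can be reused for the other hyperparameters discussed in the following sections:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

depths = np.arange(2, 21)
train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)

for d, tr, te in zip(depths, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  test={te:.3f}")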

Random Forest Hyperparameter #2: min_samples_split

min_samples_split – a parameter that tells the decision tree in a random forest the minimum required number of observations in any given node in order to split it.

The default value of min_samples_split is 2. This means that if any terminal node has two or more observations and is not a pure node, it can be split further into subnodes.

Having a default value as 2 poses the issue that a tree often keeps on splitting until the nodes are completely pure. As a result, the tree grows in size and therefore overfits the data.

By increasing the value of min_samples_split, we can reduce the number of splits that happen in the decision tree and therefore prevent the model from overfitting. In the above example, if we increase the min_samples_split value from 2 to 6, the tree on the left would then look like the tree on the right.

On increasing the value of the min_samples_split hyperparameter, we can clearly see that for small values of the parameter, there is a significant difference between the training score and the test score. But as the value of the parameter increases, the difference between the train score and the test score decreases.

But there’s one thing you should keep in mind. When the parameter value increases too much, there is an overall dip in both the training score and test scores. This is due to the fact that the minimum requirement of splitting a node is so high that there are no significant splits observed. As a result, the random forest starts to underfit.

You can read more about the concept of overfitting and underfitting here:

Random Forest Hyperparameter #3: max_leaf_nodes

Next, let’s move on to another Random Forest hyperparameter called max_leaf_nodes. This hyperparameter sets a condition on the splitting of the nodes in the tree and hence restricts the growth of the tree. If after splitting we have more terminal nodes than the specified number of terminal nodes, it will stop the splitting and the tree will not grow further.

Let’s say we set the maximum terminal nodes as 2 in this case. As there is only one node, it will allow the tree to grow further:

Now, after the first split, you can see that there are 2 nodes here and we have set the maximum terminal nodes as 2. Hence, the tree will terminate here and will not grow further. This is how setting the maximum terminal nodes or max_leaf_nodes can help us in preventing overfitting.

Note that if the value of the max_leaf_nodes is very small, the random forest is likely to underfit. Let’s see how this parameter affects the random forest model’s performance:

We can see that when the parameter value is very small, the tree is underfitting and as the parameter value increases, the performance of the tree over both test and train increases. According to this plot, the tree starts to overfit as the parameter value goes beyond 25.

Random Forest Hyperparameter #4: min_samples_leaf

Time to shift our focus to min_samples_leaf. This Random Forest hyperparameter specifies the minimum number of samples that should be present in the leaf node after splitting a node.

Let’s understand min_sample_leaf using an example. Let’s say we have set the minimum samples for a terminal node as 5:

The tree on the left represents an unconstrained tree. Here, the nodes marked with green color satisfy the condition as they have a minimum of 5 samples. Hence, they will be treated as the leaf or terminal nodes.

However, the red node has only 3 samples and hence it will not be considered as the leaf node. Its parent node will become the leaf node. That’s why the tree on the right represents the results when we set the minimum samples for the terminal node as 5.

If we plot the performance/parameter value plot as before:

So far, we have looked at the hyperparameters that are also covered in Decision Trees. Let’s now look at the hyperparameters that are exclusive to Random Forest. Since Random Forest is a collection of decision trees, let’s begin with the number of estimators.

Random Forest Hyperparameter #5: n_estimators

We know that a Random Forest algorithm is nothing but a grouping of trees. But how many trees should we consider? That’s a common question fresher data scientists ask. And it’s a valid one!

We might say that more trees should be able to produce a more generalized result, right? But as we add more trees, the time complexity of the Random Forest model also increases.

In this graph, we can clearly see that the performance of the model sharply increases and then stagnates at a certain level:

This means that choosing a very large number of estimators in a random forest model is not the best idea. Although more trees will not degrade the model, a moderate number saves you computational cost and prevents the need for a fire extinguisher on your CPU!

Random Forest Hyperparameter #6: max_samples

The max_samples hyperparameter determines what fraction of the original dataset is given to any individual tree. You might be thinking that more data is always better. Let’s try to see if that makes sense.

We can see that the performance of the model rises sharply and then saturates fairly quickly. Can you figure out what the key takeaway from this visualization is?

It is not necessary to give each decision tree of the Random Forest the full data. If you notice, the model performance reaches its maximum when the data provided is less than a 0.2 fraction of the original dataset. That's quite astonishing!

Although this fraction will differ from dataset to dataset, we can allocate a lesser fraction of bootstrapped data to each decision tree. As a result, the training time of the Random Forest model is reduced drastically.

Random Forest Hyperparameter #7: max_features

Finally, we will observe the effect of the max_features hyperparameter. This sets the maximum number of features each tree in a random forest is allowed to consider when looking for the best split at a node.

We know that random forest chooses some random samples from the features to find the best split. Let’s see how varying this parameter can affect our random forest model’s performance.

We can see that the performance of the model initially increases as the value of max_features increases. After a certain point, though, the train_score keeps increasing while the test_score saturates and even starts decreasing towards the end, which clearly means that the model starts to overfit.

Here, the overall performance of the model is highest when max_features is close to 6. It is a good convention to start with the default value of this parameter, which for classification is the square root of the number of features in the dataset; the ideal max_features generally tends to lie close to this value.
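Putting the pieces together, a Random Forest with all seven hyperparameters set explicitly might look like this; the specific values are illustrative rather than recommendations:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=10,          # longest allowed path from root to leaf
    min_samples_split=6,   # minimum observations required to split a node
    min_samples_leaf=5,    # minimum observations required in a leaf
    max_leaf_nodes=25,     # cap on the number of terminal nodes per tree
    max_features="sqrt",   # features considered at each split
    max_samples=0.8,       # fraction of rows bootstrapped for each tree
    bootstrap=True,        # must be True for max_samples to take effect
    random_state=42,
)
rf.fit(X, y)  # X, y as in the earlier validation-curve sketch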

End Notes

With this, we conclude our discussion on how to tune the various hyperparameters of a Random Forest model. I covered the 7 key hyperparameters here and you can explore these plus the other ones on your own. That’s the best way to learn a concept and ingrain it.
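If you would rather tune several of these hyperparameters jointly instead of one at a time, a randomized search is a common starting point; a minimal sketch, with placeholder parameter ranges:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 6, 10],
    "min_samples_leaf": [1, 3, 5],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    n_jobs=-1,
    random_state=42,
)
search.fit(X, y)  # X, y as in the earlier validation-curve sketch
print(search.best_params_, search.best_score_)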

Next, you should check out the comprehensive and popular Applied Machine Learning course as the logical step in your machine learning journey!


How To Fix Missing Values In Looker Studio (2023)

Have you ever seen zeros, unknowns, or no data in your Looker Studio dashboards?

These are indicative of missing values in Looker Studio. Other examples include null and N/A. While these are common, they aren’t the most desirable thing to see on your reports.

Here is an overview of what we’ll cover:

Let’s dive in!

Identifying Missing Values in Looker Studio

Missing values, like duplicates, are known to be troublemakers for data analysis, and they can show up under different names. Here are a few examples:

The best way to deal with missing values is to go to your data source and clean the missing data. However, there may be various reasons why you may want to do it within your visualization tool.

You may only have a small data set or don’t want to use another tool. You could also not have access to the data set or only need to edit the report.

Now, let’s jump into Looker Studio to start the process.

We first need to identify if we have missing values or not. A quick way to do this is to have a global view of your dashboard and find if there is any information referring to missing data.

For example, inspect the time series chart that breaks down the campaigns by their URLs.

If you pay attention to the legend, you’ll see this null value, which represents missing values in Looker Studio. So we know that we have missing values in this chart.

Here, we see that we have the No data value type.

It is imperative to identify if there are missing values in Looker Studio and understand the makeup of your data before making any changes. Afterward, we must decide what to do, either keep or remove them.

There are different rules of thumb for this. Usually, it's safe to keep missing data that makes up less than 20% or 30% of your dataset. A higher percentage would necessitate removing that data.

You should know that each business has different standards for this. Some may have 50% of missing data and still decide to keep it because you can still get good insights despite a large margin of error.
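If you also have access to the underlying data outside Looker Studio, for example as a CSV export loaded into pandas, the share of missing values can be checked quickly before deciding; the file name and the 20% threshold below are placeholders:

import pandas as pd

df = pd.read_csv("report_export.csv")   # hypothetical export of the data source

missing_pct = df.isnull().mean() * 100  # percentage of missing values per column
print(missing_pct.round(1))

# Columns above the chosen threshold become candidates for removal
print(missing_pct[missing_pct > 20])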

When to Keep/Remove Missing Values

In our case, should we keep or remove these missing values in Looker Studio?

A good trick I use to aid in this decision is to use a chart to identify the proportion of missing data. What I like to do is use either a pie or donut chart.

The donut chart below reflects the proportion of landing pages shown in our time series chart.

Looking at the legend, you’ll find a blue dot at the bottom with no data, representing missing values in Looker Studio. Once we select this, we can see that our missing values take up 15.8% or around 16% of our dataset.

This percentage is less than our 20% or 30% threshold, so it’s safe to keep these values for our analysis.

There are two ways to fix missing values in Looker Studio: a cosmetic solution that will not impact your data and one that will.

Using Filters to Remove Null Values

The first method involves using a filter to remove null values from the display.

Next, we’ll use the following formula for our filter:

Exclude Landing page Equal to (=) Null

We have removed the null values from the chart and its legend while retaining metrics values such as revenue and the number of users.

Next, let’s look at the second method.

Adding a Control to Remove Null Values

Let’s pretend we have 30% of the missing data we want to remove.

Next, add the dimension you want to analyze. For our example, this is the landing page dimension. We already have this created beside our date range control.

💡 Top Tip: Check out our guide on How to Make Looker Studio Dashboards Interactive to learn more about data controls, date range filters, and data source controls.

When we select the data control drop-down list, we will see a list of the different landing page URLs in our report. Unselect the null values to remove missing values in Looker Studio.

Notice the null values removed from the time series chart, and that the number of users, engagement rate, and revenue have dropped.

Please be cautious with this method because you may exclude other insightful data.

Let us see how excluding missing values this way works and its implications. Because we use data from Google Sheets, we will use it to understand what is happening in the background.

We have two columns: one for gender and another for the number of users. This sheet lists how many users we have for males and females.

In the gender column, you will see the NULL values in red.

Ideally, what we expect to happen when we use the data control method is to delete the null values, and keep the number of users similar to how it is displayed below:

However, this is not the case for data controls. Instead, Looker Studio looks at all the rows containing null values and removes the row contents entirely.

Deleting rows of values will not be much of an issue if you use a variable like an ID number because they are irrelevant to your analysis.

Since you cannot use these ID numbers for your calculations, removing null values will have no consequences. In the case of revenue, however, deleting rows of data would be problematic.

For example, this table shows the revenue generated per gender. Similarly, there are null values for gender. We have $3,000 in total revenue.

If we remove null values using data controls, we will significantly impact the total revenue, reducing it to a third of its original value. The new total is now $1,009.

Now, let’s bring back our null values to see our total numbers.

We have around $500,000 in total revenue. After excluding null values, this number drops to nearly $400,000, a decrease of approximately 16%.

In web analytics, some decrease is okay because analytics platforms will never give you an exact number as precise as those you will find in your CRM or shopping cart.

Still, we do need thresholds when it comes to revenue. Usually, removing missing values in Looker Studio that account for less than 5% of your data is acceptable. Since we are above this threshold at 16%, we must keep our null values.

Looker Studio has a native way of dealing with missing values. Look at the table to see how this works and sort the Users column. Notice our missing values showing as no data.

It is best practice to show missing values as zeros, particularly for metrics. Let us see how to change this for our table.

Select the table, go to Style, and scroll down to the Missing Data section.

This section allows us to choose how missing values should show up. We currently set it to show no data. If you go to the age and users column, you will see this no data value type.

💡 Top Tip: To learn more about Looker Studio charts, their properties, and types, check out our Google Looker Studio Charts to Create Stunning Reports guide.

To change how missing values in Looker Studio tables show up, select Show 0.

The missing data now show as zeros, while the total number of users remains the same.

When deciding whether to remove missing values, you should consider values that become irrelevant. Let’s have a look at the age dimension again.

For this, we inserted a pie chart to show the distribution of ages.

Since nearly 100% of our data is missing or empty, the age dimension is irrelevant, and we should remove it from the report. Remove the pie chart and the age dimension from the table.

The last thing to pay attention to is data types.

Null Values Data Types

We’ll inspect this scatter plot by looking at the relationship between the number of sessions and the total revenue.

Metrics, like Sessions, will have the 123 icon. To change how the total revenue is stored, select Edit data source.

Changing data types does not result in immediate changes. Sometimes, your charts might also break. What we need to do is replace the old value.

Here, we have the Total revenue in the metric section with the CTD icon.

Looker Studio aggregates dimensions or non-numeric data using Count Distinct (CTD). Replace this dimension with the Total Revenue field that has the 123 icon.

To see the changes we made, refresh the page. We should now see the correct numbers in our scatter plot.

If you prefer, another option is to rebuild the chart.

FAQ

Should I keep or remove missing values in Looker Studio?

The decision to keep or remove missing values depends on various factors. Generally, it’s safe to keep missing data that makes up less than 20% or 30% of your dataset. However, different businesses may have different standards based on their specific needs and analysis requirements.

How can I remove null values from a chart in Looker Studio?

You can use filters to remove null values from a chart. Add a filter to the chart, specifying the condition “Exclude Landing page Equal to (=) Null” (adjusting the condition based on your specific use case).

Can incorrect data types affect the handling of missing values in Looker Studio?

Yes, incorrect data types can lead to wrong conclusions or odd results. It’s important to ensure that the data types are set correctly. Looker Studio provides options to change the data type and replace incorrect values with the appropriate ones.

Summary

Great! We have covered how to identify missing values in Looker Studio. Next, we looked at different approaches to handling them, one aesthetic solution and another that impacts your analysis.

There are other improvements that we could make to this dashboard. We discussed some of them in our guide on How to Overcome GA4 Limitations with Looker Studio. These techniques can apply to other datasets, not just from GA4 data.

Vlookup To Return Multiple Values

VLOOKUP to Return Multiple Values

We all know that VLOOKUP in Excel is used to look up an exact or approximate match, and we have all been doing this in our regular tasks. VLOOKUP looks up the value in the selected table range and returns the match for the cell value it maps. But when a table contains duplicate values, we only get the first value from the lookup range; the duplicates below it will not be returned. Returning multiple values is possible, however, and the examples below show how to do it with VLOOKUP.


How to Use VLOOKUP to Return Multiple Values?

To make duplicate names unique, we can add a number or a special character so that all the values become unique. As we can see, in column B all the values have become unique after adding numbers to each cell value.

Examples of VLOOKUP to Return Multiple Values

Let us discuss some examples of using VLOOKUP to return multiple values.


Example #1

In this, we will see how to use Vlookup to get multiple values from one table to another. We have two tables below. Each table has the same headers and Owner and Product names in the same sequence. Now if we apply the vlookup in cell G2 to get the quantity sold for each Owner name, then we will only get the first value of each owner name as the owner names are repeated.

We will insert a column in the first table and make a key using the Owner name and Product columns to avoid such situations. In the Key column, we are using an underscore as a separator. We can use any separator here.

We can see the final unique key column A, as shown below.

In Table 2, we will apply Vlookup to get the value from Table 1 to Table 2. Insert the vlookup function as shown below.

As per syntax, we need to select the lookup value as we have created in Table 1. For that, combine or concatenate F1 and G1 values with the help of an underscore.

In the table array, select the complete table 1.

As we want to get Quantity Sold numbers from Table1 to Table2, we will select column 4 as Col Index.

Once we press enter and drag the formula till the end, all the values from the table1 will be fetched to the table2’s Quantity Sold column.
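For illustration, the finished formula might look something like this; the cell and range references are placeholders and depend on your sheet layout:

=VLOOKUP(F1&"_"&G1, $A$1:$D$100, 4, 0)

The lookup value concatenates the Owner name and Product with an underscore so that it matches the key column built in Table 1, and column index 4 returns the Quantity Sold.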

Example #2

To look up multiple values, we will use the INDEX function here. The INDEX function in Excel looks up a value in a table like a matrix: it returns the value at the chosen row and column index numbers of a reference range.

As per the syntax of the Index function, we need to select the array from where we want to get the value. Here our array is Column C.

As per the syntax, we will use ROW and COLUMN numbers to get the values. Here, we use the SMALL function to pull the matching row positions from the lookup array, smallest first.

If we press Enter, we will get the value for the first cell only. To execute this function properly, press SHIFT + CTRL + ENTER together. Only then, after dragging the formula down, will we get the looked-up values from Table 1.
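For reference, the classic array-formula pattern for this approach looks something like the following; the ranges and the lookup cell F2 are placeholders, and the formula must be confirmed with SHIFT + CTRL + ENTER and then dragged down:

=INDEX($C$2:$C$10, SMALL(IF($A$2:$A$10=$F$2, ROW($A$2:$A$10)-ROW($A$2)+1), ROW(1:1)))

The IF part collects the positions of every row in column A that matches the lookup value in F2, SMALL returns those positions one at a time (smallest first) as the formula is dragged down, and INDEX picks the corresponding value from column C.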

Pros of VLOOKUP to Return Multiple Values

It is quite helpful in mapping or looking up the values against duplicate values.

We can see a single or all the values against the same lookup value.

There is no limit to the values we want to look up using multiple value criteria.

Things to Remember

While using the method shown in example-2, always press the SHIFT + CTRL + ENTER keys together to execute the applied formula. If we directly press enter, we would get the value only for the first cell, not for every down cell.

Instead of concatenating different cell values, we give numbers to each duplicate value and look up the value in the same order.

We can use a blank (space) as a separator as well.

Numerous ways exist to create a unique key beyond the steps and process shown in the above examples.

Even if we add a SPACE, it will help create a unique key for Vlookup to return multiple values.

Recommended Articles

This has been a guide to VLOOKUP to Return Multiple Values. Here we discussed how to use VLOOKUP to return multiple values, along with practical examples.

Mysql Query To Order By Null Values

   StudentId int NOT NULL AUTO_INCREMENT PRIMARY KEY,
   StudentFirstName varchar(100),
   StudentMarks int
);

Query OK, 1 row affected (0.16 sec)
Query OK, 1 row affected (0.15 sec)
Query OK, 1 row affected (0.19 sec)
Query OK, 1 row affected (0.19 sec)
Query OK, 1 row affected (0.18 sec)
Query OK, 1 row affected (0.13 sec)

+-----------+------------------+--------------+
| StudentId | StudentFirstName | StudentMarks |
+-----------+------------------+--------------+
| 1         | John             | 45           |
| 2         | NULL             | 65           |
| 3         | Chris            | 78           |
| 4         | NULL             | 89           |
| 5         | Robert           | 99           |
| 6         | NULL             | 34           |
| 7         | Mike             | 43           |
+-----------+------------------+--------------+

+-----------+------------------+--------------+
| StudentId | StudentFirstName | StudentMarks |
+-----------+------------------+--------------+
| 5         | Robert           | 99           |
| 7         | Mike             | 43           |
| 1         | John             | 45           |
| 3         | Chris            | 78           |
| 2         | NULL             | 65           |
| 4         | NULL             | 89           |
| 6         | NULL             | 34           |
+-----------+------------------+--------------+
7 rows in set (0.00 sec)
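The ordering statement itself is not reproduced above; a query consistent with that second result, pushing NULL values to the end while sorting the names in descending order, might look like this (the table name DemoTable is an assumption):

SELECT * FROM DemoTable
ORDER BY StudentFirstName IS NULL, StudentFirstName DESC;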

Twitter Now Handling One Billion Tweets Per Week

With the fifth anniversary of the launch of Twitter on March 21 fast approaching, the social networking site has released a few statistics to illustrate its phenomenal growth. In a blog post, Twitter's Carolyn Penner said Twitter was, "on every measure of growth and engagement", growing at "a record pace". Her claim was backed up with selected statistics that show just how far Twitter has come since CEO and co-founder Jack Dorsey penned his first tweet. Here are just a few:

It took Twitter three years, two months and one day to hit one billion tweets. That same amount is now sent each week. The average number of tweets people send per day has also skyrocketed, from 50 million one year ago to 140 million in the last month. On March 11 alone, 177 million tweets were sent. The current tweets-per-second record stands at 6,939, set four seconds after midnight in Japan on New Year's Day (presumably Ms Penner is referring to New Year's 2010/11). Tweets per second hit 456 following the death of Michael Jackson on June 25, 2009.

Account statistics were similarly impressive, with 572,000 new accounts created on March 12 alone and an average of 460,000 new accounts per day over the last month. Twitter is also seeing massive growth in mobile users similar to Facebook, with a 182 per cent increase in mobile users over the past year. Facebook claims that more than 200 million active users access its site through mobile devices.

The figures came just days after Twitter moved to shore up control of its platform, telling developers to stop copying official Twitter apps and focus on creative ways of integrating the service into other products.

