NARY_MAX: What is it and why is it important?

A blog on Nary_Max!

So, I have a simple question: why do we use Google Data Studio? (Hint: the answer's not the fact that it's free.)

Is it to show data and its different visualizations in a beautiful, aesthetic format? Yes.

Is it to show data in a more dynamic and organized way, thus helping in analysis and decision making? Yes.

But the basic answer to this question is: reporting. Google Data Studio is a reporting platform that lets you collate data from all the platforms in which you are investing effort (well, almost all).

It is important to represent that effort properly to see the current performance of the organization, as well as for efficient future decisions. However, let’s face it, sometimes some data just cannot be represented. It may not exist, it may not be gathered, but for that time period, that data value is just that – A null.

And in Data Studio, if you are using calculated fields to create another metric, one unpredictable problem is that the data from which you are calculating the new metric contains null values. And you definitely cannot show a “null” for your organization’s efforts.

So, the question is: how do you translate a null value into a 0 (zero) when calculating another metric?

For doing that, NARY_MAX is your go-to function!

Meaning & Example

Basically, this function takes n arguments and returns the maximum value among them. This in itself is a useful application of NARY_MAX.

But we were discussing how it helps with null values, right?

Let’s take an example to explain that.

In the figure shown below, from Data Studio, there are 2 fields: cost and variable cost. Both fields contain null values.

Now, there is a requirement to aggregate these fields into a calculated field called total cost. For more information on calculated fields, you can refer to this blog post here

In the calculated field given above, the NARY_MAX function is used to compare 2 values and use the higher one. The 2 values in this case are the actual data and 0.

The entire formula used in the above field is: SUM(NARY_MAX(Cost,0))+SUM(NARY_MAX(Variable cost,0))

Therefore, it will compare the values in both fields with 0. If there is a null value in either field, the NARY_MAX function will use 0, as it is higher than null.
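Outside Data Studio, the same null-to-zero idea can be sketched in plain Python. This is only an illustration of the logic, not Data Studio's actual engine, and the row values are made up:

```python
def nary_max(*values):
    """Return the maximum of the given values, treating None (null) as lowest."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None

# Rows with possible nulls in Cost and Variable cost
rows = [
    {"cost": 120.0, "variable_cost": 30.0},
    {"cost": None,  "variable_cost": 15.0},
    {"cost": 80.0,  "variable_cost": None},
]

# Total cost = SUM(NARY_MAX(Cost, 0)) + SUM(NARY_MAX(Variable cost, 0))
total_cost = (sum(nary_max(r["cost"], 0) for r in rows)
              + sum(nary_max(r["variable_cost"], 0) for r in rows))
print(total_cost)  # 245.0
```

Each null collapses to 0 before the sum, so the total never comes out as null.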

The end result looks like this:

Total cost now shows 0 if the values in both fields are null.

In this way, the main calculated metrics observed by stakeholders are computed even if the underlying data contains null values.

If you need help with this, then we are a crazy team of Google and Adobe Certified Analytics Experts who eat, breathe, sleep and dream Analytics. And we’ve made it our purpose and passion to electrify your business’s potential through Analytics.

Contact us here.

I am pretty sure the above image speaks for itself.

Your one and only web analytics Chica here – Garima Mathur

3 Hacks from Data Studio

In today’s article, you will learn 3 tricks from Google Data Studio which can make your life easier. Data Studio is Google’s powerful reporting and visualization solution for users who want to collate data from multiple sources into a single comprehensive, yet aesthetic visualization platform.

The 3 hacks from data studio in this article are:

  1. Blended Data
  2. Calculated metrics
  3. Rolling Date ranges

1. Blended Data

So, when will you need this?

Consider a situation where you have multiple data sources and you want to create a single data source containing specific information from all the relevant sources. In this case, you could create an offline source or collate the data in Google Sheets. But if you are using connectors within Data Studio, then you cannot do this merging in Google Sheets or in offline Excel.

In such situations, blending data sources within the Data Studio Interface will be the best solution and will also help a great deal in automating your data structure.

For example: an organization wants to find out how much they spend on their marketing efforts with vs. without the retainer they pay to the marketing agency for Facebook. The first data source can be a Facebook connector within Data Studio; the second data source would be a Google Sheet containing the daily retainer cost. To add this extra retainer information to the marketing data coming from Facebook, they can create a new blended data source, with a join key based on dates between the 2 diverse data sets.

Then they can get 2 different visualizations, one with the original data set and the other with the blended one.

Let’s look at how you can do that.

  • Select the option to blend data
  • A new table will appear in which there will be an option to add another source of data.
  • On clicking that, a variety of different options would be provided for the other data source linked to your dashboard.
  • Next, you need to provide a common key among all the data sources that serve as input for the blended data
  • Add the dimensions and the metrics of both the data sources according to your requirements

Please note, your existing calculated metrics in individual data sources might not work as expected, so please cross-check them.
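Conceptually, a blend is just a join on the common key. Here is a minimal sketch of that idea in Python; the dates, field names and numbers are all made up for illustration:

```python
# Facebook connector data: date -> ad spend
facebook = {"2024-01-01": 500.0, "2024-01-02": 620.0}
# Google Sheet: date -> daily agency retainer
retainer = {"2024-01-01": 100.0, "2024-01-02": 100.0}

# Blend on the common key (the date), like Data Studio's join condition
blended = {
    date: {"spend": spend,
           "retainer": retainer.get(date, 0.0),
           "total": spend + retainer.get(date, 0.0)}
    for date, spend in facebook.items()
}
print(blended["2024-01-01"]["total"])  # 600.0
```

The blended source carries both the original spend and the joined-in retainer, so each visualization can pick whichever it needs.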

2. Calculated Metrics

In Data Studio we often need extra calculations on the available data. For example, our data source can provide cost and revenue, but to calculate ROI we need to divide revenue by cost, which might not be available directly as a metric. There are 2 ways to solve this: either create a separate data source in Google Sheets with the calculated metric values and upload that, or, the hack, use the Create field option in Data Studio itself.

  • Choose the option of create field as shown in the figure below:
  • In that a new dialogue box will open, this is represented by the figure below:
  • The metric can be calculated according to the requirements of your data. In the figure above, the example of Conversion rate is taken. The metric is named CVR and its calculation is based on 2 fields which already existed in the data. 

Another trick :

  • In contrast to the figure above, Data Studio may not give the best result with the above formula, especially in cases where you are using multiple filters and dimensions. To solve this, we can replace the above formula with this:


  • Now we have used the SUM function. This will aggregate the data based on the combinations of dimensions and filters used in the dashboard.
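To see why aggregating with SUM matters, compare summing before dividing with naively averaging per-row ratios. A quick sketch in Python with made-up click and conversion counts:

```python
# Per-row (clicks, conversions), as a data source might deliver them
rows = [(100, 10), (10, 5)]

# CVR the SUM way: SUM(conversions) / SUM(clicks)
cvr_sum = sum(c for _, c in rows) / sum(k for k, _ in rows)

# Naive average of row-level ratios: a different (usually wrong) number
cvr_avg = sum(c / k for k, c in rows) / len(rows)

print(cvr_sum)  # 15/110, about 0.136
print(cvr_avg)  # (0.1 + 0.5)/2 = 0.3
```

The SUM version correctly weights big rows more than small ones, which is why it behaves better under filters and dimension breakdowns.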

There are various other functions which can help calculate figures from both the original source and blended data sources. A few of the popular ones are:

  1. Arithmetic: addition, subtraction, multiplication and division
  2. Mathematical formulas: MAX(), MIN(), SUM()
  3. Logical comparisons: allow data to be compared using IF/ELSE or CASE/WHEN statements

3. Rolling Date Ranges

Consider a situation in which you would like a visualization chart in Data Studio to show only the last week’s data every time you open the dashboard. However, it can be extremely cumbersome, for some stakeholders at least, to update the date range every week for accurate reporting.

For such problems, the date range needs to be customized to update automatically in a rolling manner. A rolling date range lets you represent the data for the number of days/weeks/years with which you want to refresh the data automatically.

So, how to enable a rolling date range?

  • In the default date range, choose the custom option.
  • In that, there are many options to choose which category of date range is required, such as from last year to last month; there is also an advanced option.
  • In this advanced option, you can add and subtract days/years/months from the calendar and customize it accordingly. This sets the time period for which the data will be showcased in a rolling manner, irrespective of the date on which you access the dashboard. In the example above, this option will only show data from 16 days before today until the day before yesterday, whatever the current date might be.
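The rolling window described above (16 days back through the day before yesterday) is just date arithmetic under the hood. A small sketch in Python; the offsets match the example but are easy to change:

```python
from datetime import date, timedelta

def rolling_range(today=None, days_back=16, days_until=2):
    """Window from 16 days ago to the day before yesterday, whatever today is."""
    today = today or date.today()
    return today - timedelta(days=days_back), today - timedelta(days=days_until)

start, end = rolling_range(date(2024, 3, 20))
print(start, end)  # 2024-03-04 2024-03-18
```

Run it on any date and the window slides along with it, which is exactly what the advanced setting does for you.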

Therefore, these 3 hacks can really help you create reports in Google Data Studio that are clearer, more precise and visually appealing, faster.

How to enhance Google Adwords and Analytics Integration

Google Adwords and Analytics Integration: what is the objective of this blog post?

In this post you will learn how to customize your Google Adwords and Analytics Integration to link your campaign information to other areas of your site apart from landing page and conversion.

How it makes life easier –

With standard integration, one can analyse the performance of Adwords campaign along with the landing URL in terms of cost, conversion and revenue. But, what if you have other variables in the user journey that you are testing apart from the lander?

By using this customization technique you will be able to analyze how combinations of your Adwords campaigns and other areas of your site affect conversion, and see the complete picture.

Real life scenario –

Suppose an eCommerce company is using Google Optimize to test its landing pages as well as product details pages.

The company wants to see which Adwords ad groups, lander and product page combinations deliver the best results in terms of order conversions.

For this they create a Google Sheets dashboard to automate the report for analysis through the Google Analytics add-on, like this:

Now, with standard integration, the company will not be able to see this entire journey; they will miss out on critical information and instead see this ugly error:

But they don’t need to lose hope. DataVinci to the rescue. Let’s get cracking.

What is the recipe?

  • 2 custom dimensions set at session scope (there can be more, depending on the number of variables you are playing with)
  • Custom parameters in landing URL of Google Adwords campaign
  • A 3rd custom dimension to capture the Google Adwords custom parameter, again set at session scope

First, enable the custom dimensions from the Google Analytics admin. Name them appropriately and note down their index numbers. Make sure to set their scope at session level.

Next, customize your Google Adwords campaign parameters by passing in any campaign related information that you want to test. Let’s assume this information is ad group and you are passing this in “_adgroup” parameter.
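Mechanically, this just means appending your custom parameter to the landing URL. A quick sketch in Python; the domain and ad group value are placeholders, and only the "_adgroup" parameter name comes from the example above:

```python
from urllib.parse import urlencode, urlparse, parse_qs

base = "https://www.example.com/landing"
params = {"_adgroup": "summer_sale_broad"}  # hypothetical ad group value
landing_url = f"{base}?{urlencode(params)}"
print(landing_url)  # https://www.example.com/landing?_adgroup=summer_sale_broad

# On the site, the tag reads the parameter back out; this value is what
# gets pushed into the session-scoped custom dimension
adgroup = parse_qs(urlparse(landing_url).query)["_adgroup"][0]
```

Whatever you pass in the parameter is what Google Analytics will store in the custom dimension, so keep the values consistent across campaigns.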

This video tutorial provides the steps to update the custom parameters in Google Adwords :

Now, customize your Google Analytics implementation to capture the data in the respective dimensions.

This video tutorial provides steps to setup Google Analytics custom dimensions through Google Tag Manager.

Now what?

Ok, so when you set a custom dimension to a session scope, the last value that gets passed into it in a session gets associated with all the hits of that session. So, visualize this scenario in your head:

A visitor enters your site from your Google Adwords campaign. You have very smartly passed in custom information through custom URL parameters in your landing URL. Your Google Analytics account captures this custom parameter information and stores it in a custom dimension, and since this custom dimension is set at session scope, all the hits in your visitor’s session can be viewed against this value.

Also, on this landing page, we are capturing the landing URL in another custom dimension set at session scope. That means this information will also be available to be used with other data points captured throughout the session.

Next, if the user navigates to the products page, we are capturing the version of the product page url as well and that too again in the session scope.

Now we have the three important dimensions which we want to see together broken down by the conversion metrics to make a decision. And this is how the report will look now:

Through the above minimal setup, the eCommerce company can clearly analyze which combinations are working better than others and push them to the broader audience.


Hope you liked this post on customizing your Google Adwords and Analytics Integration

How can we help?

Google Analytics is a very powerful tool when setup correctly. It can be customized to great extents and provide sensational insights to optimize your digital assets.

If you need help with this, then we are a crazy team of Google and Adobe Certified Analytics Experts who eat, breathe, sleep and dream Analytics. And we’ve made it our purpose and passion to electrify your business’s potential through Analytics.

Contact us here.

Found it informative? Leave a comment! You can also give us a thumbs up by sharing it with your community. Also did you know that you can light up our day by subscribing to our blog? Subscribe here –

Bucket Testing

What Is Bucket Testing?

Bucket testing (sometimes referred to as A/B testing or split testing) is a term used to describe the method of testing two versions of a website against one another to see which performs better on specified key metrics (such as clicks, downloads or purchases).

There are at least two variations in each test: a Variation A and a Variation B. Metrics from each page variation are measured, and visitors are randomly placed into respective ‘buckets’ where the data can be recorded and analyzed to determine which performs best.

Companies that market and sell products or services online rely on bucket testing to help them maximize revenue by optimizing their websites and landing pages for conversions.

How It Works: An Example

Let’s look at a hypothetical example. Each bucket test begins with a hypothesis that a certain variation on a landing page will perform better than the control. Say you have an existing landing page for a free nutrition eBook, Eat Raw Foods and Live Longer.

The button on the bottom of your landing page’s sign-up form says ‘Submit,’ but your hypothesis is that changing the text to ‘Get Your Free Copy’ will result in more form conversions. The existing page with the ‘Submit’ button is the control, or Variation A. The page with ‘Get Your Free Copy’ on the button is Variation B. The key metric you will measure is the percentage of visitors who fill out the form, or a form completion.

Because you have an ad campaign driving several thousand visitors a day to your landing page, it only takes a few days to get the results from your bucket test. It turns out that ‘Get Your Free Copy’ has a significantly higher click rate than ‘Submit,’ but the form completion rate is basically the same. Since the form completion rate is the key metric, you decide to try something different.

Bucket Tests & Conversion Optimization

Bucket testing plays a big role in conversion rate optimization. Running a bucket test allows you to test any hypothesis that can improve a page’s conversions. You can continue to try higher-converting button text for Eat Raw Foods and Live Longer or you can go on to test other hypotheses, such as bolder headline copy, more colorful imagery or arrows pointing to the sign-up button that will get more people to convert.

Companies spend millions of dollars to drive traffic to landing pages and websites that promote their product or service. With simple variations to page copy, imagery, and layouts, you can conduct a series of bucket tests to gather data and to iterate towards your highest-performing version of the page. You simply create variations of the page, changing one element at a time and measuring key metrics, then collect the results until reaching statistically significant results for each experiment.

Bucket testing can make a significant impact on conversions per page, resulting in revenue increases on your highest-trafficked pages.

Bucket testing can also help to eliminate subjective opinions as deciding factors in a page’s design or layout. The author of Eat Raw Foods and Live Longer may think that her photo will drive more customer demand – or she may insist on a rainbow palette of colors.

With bucket testing, there is no need for debate on what design or page elements will work best to convert a customer. The quantitative data will speak for itself, and drive the decision for you.

Tests should be prioritized to run on your most highly trafficked pages, since you may need hundreds or thousands of visitors to each variation to gather statistically significant data. The more traffic a page receives, the quicker you will be able to declare a winner.
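"Statistically significant" here usually means running a two-proportion test on the conversion counts. A minimal sketch in pure Python using the normal approximation; the visitor and conversion numbers are made up:

```python
from math import sqrt, erf

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Variation A: 200/5000 convert; Variation B: 260/5000 convert
p = two_proportion_p_value(200, 5000, 260, 5000)
print(p < 0.05)  # True: this lift is unlikely to be chance alone
```

With smaller traffic or a smaller lift, the p-value climbs above 0.05, which is exactly why high-traffic pages reach a verdict faster.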

Common Page Elements To Test:

  • Headlines and sub-headlines: varying the length, size, font and specific word combinations
  • Images: varying the number of images, placement, type of imagery (photography vs. illustration) and subject matter of imagery
  • Text: varying the number of words, style, font, size and placement
  • Call-to-action (CTA) buttons: varying common ones such as ‘Buy Now,’ ‘Sign Up,’ ‘Submit,’ ‘Get Started,’ or ‘Subscribe’ and varying sizes, colors and page placement
  • Logos of customers or third party sites: build credibility and convey trustworthiness (could include Better Business Bureau, TRUSTe or VeriSign logos as well as customer logos)

Give us a shout if you need any help with A/B testing by filling in the form below

Check out our Google Analytics solutions here

Check out our Adobe Analytics solutions here

Digital Analytics on Permanent Roommates


Permanent Roommates, Created by TVF and produced by

Let’s start by understanding the business objective of CommonFloor, which is an online real estate lead generation portal. At the time of creating this post, the online real estate market in India is nearly Rs. 250 crore and is expected to grow at 50-100% CAGR.

Now, who are the customers of CommonFloor? People looking for properties? Nope. It’s the brokers and real estate developers.

These real estate portals work on a service-based model where brokers and developers subscribe to their packages and in return get the leads from these portals.

So, what is the fundamental Key Performance Indicator for these websites? Yep, no brainer: the leads. The trend of leads, the total leads generated during various time frames, etc.

So, the more leads they generate the more they can attract the brokers and developers.

And the more property seekers they can attract to the website, the more leads they can generate. Sounds simple, but as usual it is not.

The reasons –

1. There are presently 9 players in the online real estate market, so obviously the property seeker does not have a dearth of options

2. There are presently 9 players in the online real estate market 🙂 The brokers and developers (OK, only the brokers; the developers are, you know, pretty happy) will spend their money only after considerable contemplation before subscribing to the services of any of these portals.

So now, what are these portals to the property seekers? Basically, they act as a research tool. So, essentially, property seekers need more variety, more choices, an easy-to-use interface and authentic listings. Cool.

But how do you attract the users?

By generating more brand awareness and understanding the property seekers’ journey and experience on the portal. Now, my website is my shop, and as things get more and more digital, the human contact loses out. So, if a user comes to one website, pukes and leaves for another website, there is no one to ask the fellow, “Didi, what happened? Didn’t you like the color?” And here, my friend, you need people like me 🙂 the analytics folks who will evaluate the entire customer journey, research it and give the answer to

“Mirror Mirror on the Wall, Who Is the Fairest of Them All?”

Umm… OK, I am finding this interesting. Let’s make an analytics map for CommonFloor.

So, what is a regular customer journey?

To generate awareness of course they have to communicate across various customer touch points. The collaboration with TVF on YouTube is one such touch point.

As far as my understanding goes, they are trying to do two things: one, generate brand awareness; two, promote their app downloads.

Hmm, so let’s make a layered cake of descriptive analytics.

But first, let’s set the goals. I strongly believe in doing this. Goals essentially are the motivation behind doing something, and the more focused we are with the goals, the better the returns on our efforts.

Again, what is the ultimate goal of CommonFloor? Lead generation (macro goal)

What would be one step before this goal conversion? Getting the visitor to the property listing page (micro goal 1)

What would be one step before that? Getting the visitor to search for a category (micro goal 2)

What would be one step before that? Acquiring the visitor (micro goal 3)

So, my analytics layered cake should satisfy the taste buds with the above four goal flavors

Let’s start with micro goal #3. Aha!

In all the digital analytics tools, we have the birthright to set any goal completion as a metric. So, to analyse the completion of the above goals, I will configure them in my analytics tool as metrics.

Since we are analyzing a marketing campaign, the analytics will revolve around that.

I wish to add here only that TVF has shared the URL in their description but have not appended any campaign code to the URL. I consider explaining the consequences of that beyond the scope of this particular entry.

OK, in layer one of my analytics cake, I want to cook reports that tell me the total number of brand impressions through the TVF videos.

So, the brand mentions are made once at the start, once somewhere in the middle and once at the end.

As CommonFloor’s marketing manager I will ask TVF to share YouTube Analytics data with me to see the percentage of video completes and corresponding numbers. As in how many people saw 50% of the video, how many saw 75%, how many 100%

So my total impressions will be: the number of views where less than 30% of the video was seen, plus twice the number of views where more than 30% but less than 100% was seen, plus thrice the number of views where the complete video was viewed.

I used 30% here because in the last video the brand was mentioned somewhere near the 30% mark; this will of course vary from video to video.
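That impression formula is easy to encode. A sketch in Python with made-up view counts (the 30% threshold is the brand-mention point, which varies per video):

```python
def total_impressions(views_under_30, views_30_to_99, views_complete):
    """One mention reached below 30%, two before 100%, all three at 100%."""
    return views_under_30 * 1 + views_30_to_99 * 2 + views_complete * 3

# Hypothetical YouTube Analytics view buckets
print(total_impressions(10_000, 6_000, 4_000))  # 34000
```

Plug in the real buckets from YouTube Analytics and you have layer one in one line of arithmetic.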

So, this will give me total impressions. Layer one baked.

For the second layer we need to take a certain leap of faith. But, yeah it won’t taste bad either.

I would like to see the rise in traffic on my website after the launch of the videos.

Now, I would look at the average traffic volume for certain periods (daily, weekly, monthly) before the first video was launched. Fine.

Now, any rise in traffic volume above the volume I have been regularly experiencing can be attributed to the video launch. Again, as I said, we have to take a leap of faith here.

Now, the traffic can come to my site in four ways.

One, after clicking on CommonFloor’s link provided in the description of the TVF videos.

Two, by typing CommonFloor’s address in the browser.

Three, sometime later after watching the video, maybe through a different device.

Four, they have not seen the TVF video, but someone who has seen it suggests the site to them, like Tanu does with Mikesh.

So, we need to consider the above four cases to give TVF a justified credit.

For case one all the web analytics tools have a referral and campaign report which tells about the traffic being brought by a particular referrer or campaign. The tools also provide attribution reports which credit the success of a goal (the metric set) to channels through which the visitor has come to my site.

First, as mentioned above, I will set a metric to count my micro goal 3, that is, the number of times my landing page was loaded.

Next, I can look at the Acquisition > All Traffic > Source/Medium report to see the number of visitors YouTube has brought to my site and what % of those visitors are new, that is, coming to my site for the first time. Isn’t this great?

For cases two, three and four, the leap of faith comes into the picture: the traffic that comes directly to my site without clicking on the link provided in the TVF video.

All the web analytics tools will consider this traffic as “Direct”. Thus you need to add a temporal context to your analysis and compare the direct traffic you have been getting post video launch and pre video launch.

You can use the attribution tools for deeper analysis.

So, with this I have my two layers. Layer one the brand impressions through the communication and layer two the traffic that those brand impressions brought.

Micro goal #3 delivered. Let’s voyage on to micro goal #2.

My micro goal #2 is getting the visitor to search for a category.

Now, the following is the space where the visitor submits the requirement –

The above is on the home/landing page.

The next is on the listing page –

First, I will work with my developer to set up an event that sends data to the analytics server every time a user submits a query, be it on the home page or the listing page.

Then I will set a goal metric to track the number of times a query was submitted by the user.

I will again open my source/medium reports to see how the users coming from YouTube are completing this particular goal.

Again, I will compare my direct traffic’s goal completions pre and post the video launches.

The above is a screenshot of the report I am trying to explain. As you can see, you can choose your goal from the drop-down. So, I can set this to micro goal #2 for this case.

Layer #3 baked… Ta-da!

Now, layer #4 for my micro goal #1: getting the visitor to my listing page.

Now, you might question why I did not eliminate goal #2, since I am anyway tracking goal #1, which is not going to happen without goal #2.

The reason is that at times the listing page might not load, or the users may submit a query that my site cannot satisfy.

So, I want to see the conversion from users submitting a query to my site satisfying it.

So, layer #4 is pretty simple. Set a goal metric to count the number of times the listing page was loaded, open the Source/Medium report, set the goal to micro goal #1 and see the performance.

And finally the top layer. The platform for my insightful bride and groom.

Have a look below at how a user submits a lead –

So, I will work with my developer to set an analytics event every time the submit button is clicked. And I will set up a goal metric to count the number of times the submission was made. The procedure is the same: open the Source/Medium report and check the performance.

Alright, my selection of a layered cake has a reason. If you look at the structure of a layer cake, the base has the largest area and it gradually reduces as we go up, which is also the case with the journey of customers that come to our site. To find the effectiveness of the campaign, we should consider the ratios of each of these steps.

  • % Micro goal #3/Impressions = % (total visits to my site associated with the campaign)/(total impressions in the videos)
  • % Macro goal/Micro goal #3 = % (total leads generated by the campaign)/(total visits from the campaign)
  • % Macro goal/Micro goal #1 = % (total leads generated)/(Total number of time the listing page was viewed)
    • This tells me the effectiveness of my listing page and the quality of my listings.
    • I would also see a trend of this ratio. How is it trending as my site is becoming bigger
  • % Micro goal #1/Micro goal #2 = % (listing page loads)/(queries submitted)
  • % Micro goal #1/ Micro goal #3 = % (listing page loads)/(Landing page loads)
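With the goal counts in hand, these ratios reduce to simple division. A sketch in Python; every number below is hypothetical:

```python
# Hypothetical goal completions attributed to the campaign
impressions   = 34_000   # layer 1: brand impressions
visits        = 3_400    # micro goal #3: landing page loads
queries       = 1_700    # micro goal #2: search queries submitted
listing_loads = 1_500    # micro goal #1: listing page loads
leads         = 150      # macro goal: leads submitted

funnel = {
    "visits / impressions":    visits / impressions,
    "leads / visits":          leads / visits,
    "leads / listing_loads":   leads / listing_loads,
    "listing_loads / queries": listing_loads / queries,
    "listing_loads / visits":  listing_loads / visits,
}
for name, ratio in funnel.items():
    print(f"{name}: {ratio:.1%}")
```

Tracking each ratio over time shows exactly which layer of the cake is leaking.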

All the web analytics tools have some sort of goal flow report. I would definitely look at it to see the goal completion journey of my visitors.

I would also like to track a few other metrics, like the bounce rate and time spent per visit.

I would also like to track the location of my users to see in which geographies this campaign has been popular.

Alright, I am done 🙂

Hope you liked reading this one as much as I liked writing it. I would sincerely love it if you can give me feedback.

I would be extremely grateful if you can share this post.

Till then 🙂

What is the difference between Correlation and Covariance?

Correlation is the standardized form of covariance. That means you can compare the correlations of two data sets having different units.

You cannot compare the covariances of two data sets that have different units.
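The standardization is just dividing covariance by the two standard deviations. A quick dependency-free check of that relationship, sketched in Python rather than R so it is easy to verify by hand:

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance (n - 1 denominator, as in R's cov())."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    # Correlation = covariance standardized by the two standard deviations
    return covariance(x, y) / (sqrt(covariance(x, x)) * sqrt(covariance(y, y)))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(correlation(x, y), 3))  # 0.775, unit-free, always in [-1, 1]
```

Scale x or y by any constant and the covariance changes, but the correlation stays put; that invariance is why correlations are comparable across units.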

Now, how to find correlation and covariance in R?

Two functions –

  • cor() for calculating correlations
  • cov() for calculating covariance

The syntax for both of them is the same:

  • cor(x,y,use,method)

Where x and y can be a variable/matrix or data frame.

“use” basically handles missing values. It can have the following options:

  • “everything” (default) – everything is included, and if the data sets have NA values then the correlation will also have corresponding NAs
  • “all.obs” – if NAs are present, an error is returned
  • “complete.obs” – listwise deletion of NA values is done
  • “pairwise.complete.obs” – pairwise deletion of NA values is done

method specifies which correlation we want to calculate

  • Pearson
  • Spearman
  • Kendall

Most of the time we would be using Pearson, which is a parametric correlation; that is, it makes assumptions: that the data sets are normally distributed and that their relationship is linear.

Spearman is used mostly for ordinal data, when the data sets have a monotonic relationship rather than a linear one. This one is non-parametric; that is, it makes no assumptions.

But the case is not solved yet. Just by knowing the coefficient of correlation, we can’t say that two data sets have a strong/weak/no relationship. The result can also be an artifact of the choice of sample, and thus it can simply be due to chance. Thus, we also need to find the significance of the correlation.

To find the significance of a result and reject the null hypothesis, we usually use the p-value. We reject the null hypothesis when the p-value is less than the level of significance.

So, if the coefficient of correlation is high but the p-value is more than the level of significance, we can’t comment on the relationship; it simply means the result may be due to chance, and a different sample of the same population might produce a different result.
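One intuition-building way to get a p-value for a correlation is a permutation test: shuffle one variable many times and count how often chance alone produces a correlation as strong as the observed one. A dependency-free sketch in Python (rcorr() in R computes an analytic p-value instead, but the idea it protects against is the same; the data here is made up):

```python
import random
from math import sqrt

def correlation(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def permutation_p_value(x, y, trials=2000, seed=42):
    """Fraction of shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(correlation(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(y_shuffled)  # break any real relationship
        if abs(correlation(x, y_shuffled)) >= observed:
            hits += 1
    return hits / trials

x = list(range(20))
y = [2 * v + (v % 3) for v in x]  # a strongly linear relationship
print(permutation_p_value(x, y) < 0.05)  # True: not explainable by chance
```

If the shuffled data matches the observed correlation often, the "relationship" is exactly the kind of chance result the paragraph above warns about.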

In comes the “Hmisc” package.

It has a function to serve our purpose: the rcorr() function. This function provides both the strength of correlation as well as the corresponding p-values.

This one takes similar arguments. The input to this function is always a matrix, so you have to use the as.matrix() function on inputs that are not of the required type; missing values are handled pairwise, as with use=“pairwise.complete.obs”.

The method can be either Pearson or Spearman.

Again the “Hmisc” package

But, you know what would be really interesting? Plotting the correlations.

For this we need the corrgram() function of the corrgram package.

Let’s plot the correlations of all the variables of the mtcars data set.



The color scale goes from red to blue, with the intensity of red indicating the strength of negative correlation and the intensity of blue indicating the strength of positive correlation.

In the above plot we can clearly see that miles per gallon has a negative correlation with cylinders, displacement, horse power and weight.

Sweet, now my post has a picture.

Founder, DataVinci

What is a Probability Distribution?

Hello there,

So, you wanna know about probability distributions with R eh?

Trust me I had to do quite some research before coming to this one. Some people can be like – *smirk* research on probability distribution? I knew that in my kindergarten. Well. I did not 🙁

So, what is a probability distribution function f(x)? The answer is not that simple and depends on the type of x – whether x is categorical, discrete or continuous. Let's see –

If x is categorical then f(x) is called a categorical distribution

If x is discrete then f(x) is called a probability mass function (PMF)

If x is continuous then f(x) is called a probability density function (PDF)

When x is discrete, the probability of x taking an exact value can be non-zero. But when x is continuous, the probability of x taking any exact value is 0. In fact the PDF at a particular value of x relates to the probability of finding something close to that value, not exactly that value!


Now, most of the time when you are dealing with the probability distribution of a continuous variable, you will be required to find something to do with the area between certain points in the distribution.

Since a probability can never be greater than 1, the total area under the probability distribution curve is exactly 1.
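We can check this in R with numerical integration – the area under the standard normal density curve from minus infinity to infinity:

```r
# total area under the standard normal PDF
integrate(dnorm, -Inf, Inf)   # evaluates to 1 (up to numerical error)
```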

When discussing R programming, the three probability functions that you need to keep in mind are as follows –

1. dnorm()

2. pnorm()

3. qnorm()

Notice the “norm”, it stands for normal distribution.

Let's start with dnorm(). The syntax is as follows –

dnorm(x, mean, standard deviation)

If you only give x without specifying the mean and standard deviation, it assumes that x belongs to a variable with mean 0 and sd 1 (the standard normal).

dnorm() returns the value of the PDF at x – that is, it returns the density, which relates to the probability of finding not the exact value of x but something close to the provided value.
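A couple of illustrative calls:

```r
dnorm(0)                         # density of the standard normal at 0: 1/sqrt(2*pi), about 0.3989
dnorm(0, mean = 0, sd = 1)       # same thing - these are the defaults
dnorm(100, mean = 100, sd = 15)  # density at the mean of a non-standard normal
```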


pnorm(x, mean, sd, lower.tail)

pnorm() returns the probability of finding ALL values less than or equal to x. This is the integral of the PDF and is called the cumulative distribution function (CDF).

If we put lower.tail = FALSE, it returns the probability of finding all numbers higher than x.
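For example:

```r
pnorm(0)                          # 0.5 - half the area lies below the mean
pnorm(1.96)                       # about 0.975
pnorm(1.96, lower.tail = FALSE)   # about 0.025 - the area above 1.96
```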

Sweet! Now, qnorm()

Understand it like this: if you do qnorm(pnorm(x)), the output will be x. qnorm() is the quantile function – the inverse of pnorm(). You give it a probability, and it returns the corresponding value of x.
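For example:

```r
qnorm(0.975)        # about 1.96 - the value below which 97.5% of the area lies
qnorm(pnorm(1.5))   # 1.5 - qnorm() undoes pnorm()
```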

When talking about random variables there is another one I would like to discuss. The rnorm()

rnorm(n, mean, sd)

This one returns n random numbers drawn from a normal distribution with the given mean and standard deviation.
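A quick sanity check (the seed, sample size, mean and sd here are arbitrary choices for illustration):

```r
set.seed(42)                        # for reproducibility
x <- rnorm(1000, mean = 100, sd = 15)
mean(x)   # close to 100
sd(x)     # close to 15
```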

That's it.

I hope this was helpful.

Now, please provide your comments on the content. I seriously need help on how to improve  my posts.

Till then,

Stay Awesome!


What is the t-test? How to do a t-test in R?

Alright, first thing that makes me curious about Student’s t-test is why it is called so. The history is somewhat interesting.

First, why the “Student”?

“Student” is actually the pseudonym, or pen name, of the author, Mr. William Sealy Gosset. He used a pen name because at the time he came up with the concept his employer did not allow him to publish under his own name. Some company with bad GPTW ratings, it seems.

Here is the dude –

Second, why the “T”?

“t-test” comes from “t-distribution” which in turn comes from “t-statistic” which in turn stands for “test – statistic” which in turn is short for “hypothesis test statistic”

So, brother, what is hypothesis test statistic man?

Hint1: It is used to test some hypothesis!

Hint2: hypothesis to verify any existence of difference between two data sets!

Yes, you smart ass, you got it right! The t-statistic is used to perform hypothesis testing around sample means. But, wait a second. There is a plethora of methods out there to test hypotheses around sample means, so when would you prefer the t-test?

Good Question!

The fundamental requirement for using the t-test is that your population should be normally distributed. If you are not sure about the distribution of the population, you can still use the t-distribution if your sample is large enough to satisfy the central limit theorem. If not, go for a non-parametric test like the Mann-Whitney U-test.

Second, you will use the t-test when the standard deviation of the population is not known. There are thousands of posts out there that claim the t-statistic should be used when the sample size is less than or equal to 30. But this is not true. So, reiterating: use a t-test when the standard deviation of the population is not known.

Now, what is t-statistic?

Expressing it in words:

Let's understand the t-statistic for a small sample of variable X. The mean of the population is M and the mean of the sample is Ms. SE is the standard error of the sample mean, that is s/√n, where s is the sample standard deviation and n is the sample size.

The t-statistic is the ratio of the difference between the sample mean and the population mean (the signal) to the standard error of the sample mean (the noise).

So, the formula –

t-statistic = (Ms – M)/SE = (Ms – M)/(s/√n)
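To make the formula concrete, here is the t-statistic computed by hand on a small made-up sample and checked against R's built-in t.test() (the numbers are purely illustrative):

```r
x  <- c(5.1, 4.9, 5.6, 4.8, 5.3, 5.0, 5.2)   # made-up sample
M  <- 5.0                                    # hypothesised population mean
Ms <- mean(x)                                # sample mean
SE <- sd(x) / sqrt(length(x))                # standard error of the mean
t_stat <- (Ms - M) / SE
t_stat

# t.test() reports the same statistic
t.test(x, mu = M)$statistic
```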

Moving forward, lets understand the t-distribution –

Now, as we can see above, the t-statistic depends on the standard error, which in turn depends on the size of the sample. Because different sample sizes give different standard errors and different degrees of freedom, there is a different t-distribution for each sample size (strictly, for each number of degrees of freedom).

How does t-distribution look?

The t-distribution is also a bell-shaped curve, but it has more area in the tails than the normal distribution. As the size of the sample increases, the shape of the t-distribution approaches the shape of the normal distribution, and as the sample size tends to infinity the t-distribution becomes exactly the normal distribution.

Now, how do you do a t-test?

Any statistical test will have some confidence level. For a given confidence level and degree of freedom, you will use the t-statistic to calculate the p-value, based on which you will reject or fail to reject the null hypothesis.

The p-value on a t-distribution curve for a particular t-statistic is the area of the curve to the right of that t-statistic (for an upper-tail test).

The degree of freedom for a sample of size n is n-1.

Before explaining the steps, let me touch upon one-tail/two-tail tests. By default all tests are two-tail, which assumes that your sample mean can be either less or more than the population mean – that is, you test in both directions, hence "two tail".

So, the null hypothesis of a two tail test is that there is no difference between sample mean and population mean.

When you take a single-tail test, then depending on whether it is upper-tail or lower-tail, you check for only one case. If it's lower-tail, you test whether the sample mean is less than the population mean. If it's upper-tail, you test whether the sample mean is greater than the population mean.

Read again and assimilate.

So, for a one-tail test, based on the choice of tail, the alternative hypothesis is either that the sample mean is greater than the population mean or that it is less; the null hypothesis is the opposite.


The complement of the confidence level is the level of significance. For example, if you have a 95% confidence level then your level of significance is 5%. Now, depending on whether you take a one-tail or a two-tail test, you will use the significance level accordingly.

Just to clarify what a 5% level of significance indicates – you can erroneously reject the null hypothesis in about 5% of cases. That is, 95% of the time you will not make that error.

Now, finally, the steps –

Case #1 – H1: Sample mean is more than population mean (upper-tail test)

1. Calculate the t-statistic using the formula

2. For the given t-statistic, use software to find the corresponding p-value

As mentioned, the p-value gives the area to the right of the test statistic on the t-distribution curve. That area corresponds to the probability of a type 1 error.

3. Compare the p-value with the level of significance

4. If the p-value is less than the level of significance, reject the null hypothesis

Case #2 – H1: Sample mean is less than population mean (lower-tail test)

Step 1 and 2 are same as above

3. Subtract the p-value from 1

This gives the area to the left of the t-statistic on the t-distribution curve

4. Compare this value with the level of significance and reject the null hypothesis if it is smaller

Case #3 – H1: Sample mean is not equal to population mean, it can be either greater or lesser (two-tail test)

Step 1 and 2 are same

3. Double the p-value to accommodate both the cases

4. Compare the doubled value with level of significance
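The three cases above can be sketched with R's pt() function, which gives the CDF of the t-distribution (the values of t_stat and df here are purely illustrative):

```r
t_stat <- 2.1   # an illustrative t-statistic
df     <- 14    # degrees of freedom (n - 1)

pt(t_stat, df, lower.tail = FALSE)            # Case #1: area to the right
1 - pt(t_stat, df, lower.tail = FALSE)        # Case #2: area to the left
2 * pt(abs(t_stat), df, lower.tail = FALSE)   # Case #3: two-tailed p-value
```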

Easy peasy!

Now, ladies and gentlemen, let's see how to do a t-test in "R"

Let's start with the independent t-test. The independent t-test is used to prove/disprove similarity between samples of data taken from independent populations.

To understand this, let's use the "UScrime" data set in the "MASS" package.

Here, the variable So tells whether the data point is for a southern state or not. Prob gives the probability of imprisonment.

We want to find whether the probability of imprisonment is in any way affected by the state being southern or not.

Our Null Hypothesis is that there is no impact on imprisonment based on the state.

The function to run the t-test in R is t.test(). It takes either of the following forms:

t.test(y ~ x, data)

where x is a dichotomous variable and y is numeric, or

t.test(y1, y2)

where y1 and y2 are both numeric.

By default a two-tail test is assumed. If you want to make it one-tail, you need to add another argument – alternative="less" or alternative="greater".

Let's run the test –
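A sketch of the call (assuming the MASS package is installed; the grouping variable So is 0/1 for non-southern/southern):

```r
library(MASS)

# two-sample t-test: does the probability of imprisonment (Prob)
# differ between southern (So = 1) and non-southern (So = 0) states?
t.test(Prob ~ So, data = UScrime)
```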

Here the p-value is much less than the significance level of 0.05. That is, there is indeed a difference between the data from southern and non-southern states, and the probability of this difference being just due to chance is less than 5%.

Next, Dependent t-tests

In the case of dependent t-tests, the two samples are not independent of each other – they affect each other. Let's take the case of unemployment: for a limited number of jobs, the employment of younger people will affect the employment of somewhat older people, and vice versa.

To run a dependent t-test in R we simply need to add the paired = TRUE argument along with the data inputs in the t.test() function:

t.test(y1, y2, paired = TRUE)

where y1 and y2 are numeric data.
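Sticking with the unemployment example, UScrime records unemployment rates for younger males (U1) and older males (U2) measured in the same states, so the two samples are paired (again assuming MASS is installed):

```r
library(MASS)

# paired t-test: unemployment of younger (U1) vs older (U2) males,
# measured state by state, so the two samples are dependent
t.test(UScrime$U1, UScrime$U2, paired = TRUE)
```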


With this, we should more or less be good with the t-test.






