Bucket Testing

What Is Bucket Testing?

Bucket testing (sometimes referred to as A/B testing or split testing) is a method of testing two versions of a website against one another to see which one performs better on specified key metrics, such as clicks, downloads or purchases.

Each test has at least two variations, Variation A and Variation B. Visitors are randomly placed into respective ‘buckets,’ metrics from each page variation are measured, and the data is recorded and analyzed to determine which variation performs best.

Companies that market and sell products or services online rely on bucket testing to help them maximize revenue by optimizing their websites and landing pages for conversions.

How It Works: An Example

Let’s look at a hypothetical example. Each bucket test begins with a hypothesis that a certain variation on a landing page will perform better than the control. Say you have an existing landing page for a free nutrition eBook, Eat Raw Foods and Live Longer.

The button on the bottom of your landing page’s sign-up form says ‘Submit,’ but your hypothesis is that changing the text to ‘Get Your Free Copy’ will result in more form conversions. The existing page with the ‘Submit’ button is the control, or Variation A. The page with ‘Get Your Free Copy’ on the button is Variation B. The key metric you will measure is the percentage of visitors who fill out the form, or a form completion.

Because you have an ad campaign driving several thousand visitors a day to your landing page, it only takes a few days to get the results from your bucket test. It turns out that ‘Get Your Free Copy’ has a significantly higher click rate than ‘Submit,’ but the form completion rate is basically the same. Since the form completion rate is the key metric, you decide to try something different.

Bucket Tests & Conversion Optimization

Bucket testing plays a big role in conversion rate optimization. Running a bucket test allows you to test any hypothesis that could improve a page’s conversions. You can continue to try higher-converting button text for Eat Raw Foods and Live Longer, or you can go on to test other hypotheses, such as bolder headline copy, more colorful imagery, or arrows pointing to the sign-up button – anything that might get more people to convert.

Companies spend millions of dollars to drive traffic to landing pages and websites that promote their product or service. With simple variations to page copy, imagery, and layout, you can conduct a series of bucket tests to gather data and iterate toward your highest-performing version of the page. You simply create variations of the page, changing one element at a time and measuring key metrics, then run each experiment until you reach statistically significant results.

Bucket testing can make a significant impact on conversions per page, resulting in revenue increases on your highest-trafficked pages.

Bucket testing can also help to eliminate subjective opinions as deciding factors in a page’s design or layout. The author of Eat Raw Foods and Live Longer may think that her photo will drive more customer demand – or she may insist on a rainbow palette of colors.

With bucket testing, there is no need for debate on what design or page elements will work best to convert a customer. The quantitative data will speak for itself, and drive the decision for you.

Tests should be prioritized to run on your most highly trafficked pages, since you may need hundreds or thousands of visitors to each variation to gather statistically significant data. The more traffic a page receives, the quicker you will be able to declare a winner.

Common Page Elements To Test:

  • Headlines and sub-headlines: varying the length, size, font and specific word combinations
  • Images: varying the number of images, placement, type of imagery (photography vs. illustration) and subject matter of imagery
  • Text: varying the number of words, style, font, size and placement
  • Call-to-action (CTA) buttons: varying common ones such as ‘Buy Now,’ ‘Sign Up,’ ‘Submit,’ ‘Get Started,’ or ‘Subscribe’ and varying sizes, colors and page placement
  • Logos of customers or third party sites: build credibility and convey trustworthiness (could include Better Business Bureau, TRUSTe or VeriSign logos as well as customer logos)


Split Testing

Split Testing Simplified

Split testing (also referred to as A/B testing or multivariate testing) is a method of conducting controlled, randomized experiments with the goal of improving a website metric, such as clicks, form completions, or purchases. Incoming traffic to the website is distributed between the original (control) and the different variations without any of the visitors knowing that they are part of an experiment. The tester waits for a statistically significant difference in behavior to emerge. The results from each variation are compared to determine which version showed the greatest improvement.
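As a purely illustrative aside (not from the original text), a two-proportion test is one common way to check whether the difference between the control and a variation is statistically significant. Here is a minimal sketch in R, with made-up counts:

# Hypothetical conversions and visitor counts for control (A) and variation (B)
conversions <- c(120, 150)
visitors <- c(2400, 2450)

# Two-proportion test; a small p-value suggests the difference is unlikely to be chance
prop.test(conversions, visitors)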

This marketing methodology is frequently used to test changes to signup forms, registration pages, calls to action, or any other parts of a website where a measurable goal can be improved. For example, testing changes to an online checkout flow would help to determine what factors increase conversions from one page to the next and will lead to increased orders for the website owner.

Seemingly subjective choices about web design can be made objective using split testing, since the data collected from experiments will either support or undermine a hypothesis about which design will work best. ROI (return on investment) for a testing platform can be measured easily because tests are created with a quantifiable goal in mind.

Split testing results

Split testing tools allow for variations to be targeted at specific groups of visitors, delivering a more tailored and personalized experience. The web experience of these visitors is improved through testing, as indicated by the increased likelihood that they will complete a certain action on the site.

Within webpages, nearly every element can be changed for a split test. Marketers and web developers may try testing:

  • Visual elements: pictures, videos, and colors
  • Text: headlines, calls to action, and descriptions
  • Layout: arrangement and size of buttons, menus, and forms
  • Visitor flow: how a website user gets from point A to B

Some split testing best practices include:

  • Elimination: fewer page elements create fewer distractions from the conversion goal
  • Focus on the call to action: text resonates differently depending on the audience
  • Aim for the global maximum: test with the overarching goal of the website in mind, not the goals of individual pages
  • Provide symmetric and consistent experiences: make testing changes consistent throughout the visitor flow to improve conversions at every step of the process

Habitual testing for a website owner or business helps to build a culture of data-informed decision-making that takes into account audience preferences. Each click on a website is a data point from a potential customer. Conflicting opinions can be put to the test with split testing methodology, and the visitors to the website will inform the final decision on the “best” design.

Split Testing Process

Split testing is equivalent to performing a controlled experiment, a methodology that can be applied to more than just web pages. The concept of split testing originated with direct mail and print advertising campaigns, which were tracked with a different phone number for each version. Currently, you can split test banner and text ads, television commercials, email subject lines, and web products.

Hope you liked the post.


Google Tag Manager (GTM) for mobile apps

Google Tag Manager (GTM) for Mobile Apps was first announced in August this year and has some great implications for app developers.

Perhaps most notably, the product has the potential to overcome one of the most critical challenges in the business: pushing updates to the user base without having to publish a new version on the app marketplace.

Typically, from the moment an app is shipped it is frozen, and from that point onwards the developer can only make changes to how the app behaves if the user accepts an update. By shipping an app with GTM implemented, configurations and values may be continuously updated by publishing new container versions through the web-based GTM interface.

In this post, we will cover how to get started with GTM for mobile apps and how to implement Universal Analytics tags using the GTM SDK for Android. As a heads up, this will occasionally get pretty technical; however, I believe it is important to understand the product from its fundamentals.

Initial Set Up

Before we get started, some initial configuration steps need to be completed. More detailed instructions on these are available in the Google Developers Getting Started guide, but in a nutshell they include:

  • Downloading and adding the GTM library to our app project
  • Ensuring our app can access the internet and the network state
  • Adding a Default container to our app project

We will hold back on that last part, adding a Default container, until we have created some basic tags and are ready to publish. We will revisit the Default container later in this post.

Create an App Container

We need to start off by creating a new container in Google Tag Manager and selecting Mobile Apps as the type. Typically, we will have one container for each app we manage, where the container name is descriptive of the app itself (e.g. “Scrabble App”). Take note of the container ID at the top of the interface (in the format “GTM-XXXX”), as we will need it later in our implementation.

App container for mobile app

Opening a Container

Assuming we have completed the basic steps of adding the Google Tag Manager library to our project, the first thing we need to do before we start using its methods is to open our container.

Similarly to how we would load the GTM JavaScript on a webpage to access a container and its tags, in an app we need to open a container in some main app entry point before any tags can be executed or configuration values retrieved from GTM. Below is the easiest way of achieving this, as outlined on the Google Developers site:

ContainerOpener.openContainer(
        mTagManager,                   // TagManager instance.
        "GTM-XXXX",                    // Tag Manager Container ID.
        OpenType.PREFER_NON_DEFAULT,   // Prefer not to get the default container, but stale is OK.
        null,                          // Timeout period. Default is 2000ms.
        new ContainerOpener.Notifier() {       // Called when container loads.
          @Override
          public void containerAvailable(Container container) {
            // Handle assignment in callback to avoid blocking main thread.
            mContainer = container;
          }
        }
    );

Before we talk about what this code does, let’s hash out the different container types to avoid some confusion:

  • Container from network: Container with the most recent tags and configurations as currently published in the GTM interface
  • Saved container: Container saved locally on the device
  • Fresh vs. Stale container: a saved container that is less vs. more than 12 hours old
  • Default container: Container file with default configuration values manually added to the app project prior to shipping

We will talk more about the Default container later on. Back to the code. In this implementation, the ContainerOpener will return the first non-default container available. This means that we prefer to use a container from the network or a saved container, whichever is loaded first, because they are more likely to hold our most updated values. Even if the returned container is Stale it will be used, but an asynchronous network request is also made for a Fresh one. The timeout period, set as the default (2 seconds) above, specifies how long to wait before we abandon a request for a non-Default container and fall back on the Default container instead.

We may change the open type from PREFER_NON_DEFAULT to PREFER_FRESH, which means Google Tag Manager will try to retrieve a Fresh container either from the network or disk. The main difference, then, is that a Stale container will not be used with PREFER_FRESH unless no other container is available or the timeout period is exceeded. We may also adjust the timeout period for both PREFER_NON_DEFAULT and PREFER_FRESH; however, we should think carefully about whether longer request times negatively affect the user experience before doing so.

Tag Example: Universal Analytics Tags

We have completed the initial set up and know how to access our Google Tag Manager container. Let’s go through a simple example of how to track App Views (screens) within our app using Universal Analytics tags.

Step 1: Push Values to the DataLayer Map

The DataLayer map is used to communicate runtime information from the app to GTM, where we can set up rules based on key-value pairs pushed into the DataLayer. Users of GTM for websites will recognize the terminology. In our example, we want to push an event whenever a screen becomes visible to a user (in Android, the onStart method is suitable for this). Let’s give this event the value ‘screenVisible’. If we want to push several key-value pairs, we may use the mapOf() helper method as demonstrated below. In this case, since we will be tracking various screens, it makes sense to also push a value for the screen name.

public class ExampleActivity extends Activity {

  private static final String SCREEN_NAME = "example screen";
  private DataLayer mDataLayer;

  public void onStart() {
    super.onStart(); 
    mDataLayer = TagManager.getInstance(this).getDataLayer();
    mDataLayer.push(DataLayer.mapOf("event", "screenVisible",
                                    "screenName", SCREEN_NAME));
  }
  // ...the rest of our activity code
}

We may then simply paste this code into every activity we want to track as a screen, replacing the SCREEN_NAME string value with the relevant name for each activity (“second screen”, “third screen”, etc.).

Note: the container must be open by the time we push values into the DataLayer or GTM will not be able to evaluate them.

Step 2: Set Up Macros In Google Tag Manager

Simply put, macros are the building blocks that tell GTM where to find certain types of information. Some macros come pre-defined in GTM, such as device language or screen resolution, but we may also create our own. First of all, we want to create a Data Layer Variable macro called screenName: this corresponds to the screen name value we push along with the event, as demonstrated above.

GTM will then be able to evaluate the screenName macro, which can consequently be used in our tags. If we have not done so already, we may also create a Constant String representing our Analytics property ID at this point. These macros are now at our disposal in all container tags.

Macros for Mobile Apps

Step 3: Configure an App View Tag

Let’s set up our Universal Analytics App View tag. Our configurations are visible in the screenshot below (note the use of our newly created macros). The screen name field value of the App View will be automatically populated and corresponds to what we push to the DataLayer as the value of the screenName macro. The gaProperty macro value specifies which Google Analytics property data should be sent to (by reusing it throughout our container, for every Universal Analytics tag, we can both save time and prevent some critical typos).

Tag Manager app view tag

Step 4: Configure a Firing Rule For Our Tag

Finally, we need to set up the conditions under which the tag should execute. Since we are pushing an event with the value “screenVisible” every time an activity becomes visible, this should be the condition under which our tag should fire, as demonstrated below.

Tag Manager firing rule

Step 5: Save and Publish

We can continue to create other tags at this point. It may be beneficial, for example, to create some Google Analytics Event tags to fire on certain interactions within our app. We should apply the same logic in these instances: We need to push various event values to the DataLayer as interactions occur, and then repeat the steps above to create the appropriate Universal Analytics tags. When we’re happy, all that’s left to do is to create a new version of the container and Publish.

Tag Manager version

As we ship our app with Google Tag Manager implemented, requests will be made to the GTM system to retrieve our tags and configuration values as we discussed earlier.

Hold on, there was one more thing: the Default container!

Default Containers

When we are finished with our initial Google Tag Manager implementation and feel happy with the tags we have created, we are almost ready to ship our app. One question should remain with us at this point: what do we do if our users are not connected to the internet and hence unable to retrieve our tags and configurations from the network? Enter the Default container.

Let’s back up a little bit. In the GTM world, tag creation, configuration, settings, etc. are primarily handled in the web-based GTM interface. The power of this is obvious: we no longer need to rely on our development teams to push code for every change we want to make. Instead, we make changes in the GTM interface, publish them, and our tags and values are updated accordingly for our user base. This of course relies on the ability of our websites or applications to reach the GTM servers so that the updates can take effect. Things get a bit trickier here for mobile apps, which partly live offline, than for websites.

To ensure that at least some container version is always available to our app, we may add a container file holding our configuration values to the project. This can be a .json file or a binary file, the latter being the required type to evaluate macros at runtime through GTM rules. We may access the binary file of our container through the GTM user interface by going to the Versions section. Here, we should download the binary file for our latest published container version and add it to our project.

create tag manager version

The binary file should be put in a /assets/tagmanager folder, and its filename must match our container ID. At this point, we should have both the JAR file and the binary file added to our project as shown below.

Mobile app tag manager files

Once this is done, we are ready to ship the app with our Google Tag Manager implementation. As described earlier, Fresh containers will be requested continuously by the library. This ensures that, as we create new versions of our container and publish them in the web-based GTM interface, our user base will be updated accordingly. As a back-up, without any access to a container from either the network or disk, we still have the Default container stored in a binary file to fall back on.

Summary

Let’s summarize what we have done:

  1. After completing some initial configuration steps, we created a new app container in the web-based GTM interface
  2. We figured out how to open our container as users launch our app, choosing the most suitable opening type and timeout value (taking into consideration user experience and performance)
  3. We then implemented code to push an event to the Data Layer as various screens become visible to our users, setting up a Universal Analytics App View tag in GTM to fire every time this happens
  4. We downloaded the binary file of our container and added it to our app project to be used as a Default container
  5. Lastly, we created and published our container in GTM

We are now ready to ship our application with GTM implemented!

Closing Thoughts

Google Tag Manager for mobile apps can be an incredibly powerful tool. This basic example shows how to implement Universal Analytics using this system, but it barely scratches the surface of what is possible with highly configurable apps that are no longer frozen. Simply put, getting started with GTM for mobile apps today sets businesses up for success in the future; I recommend trying it out as soon as possible.

I would love to hear your thoughts around Google Tag Manager for mobile apps. What are your plans for (or how are you currently) using it?


Adobe Analytics Configuration Variables

So, first of all, what are Configuration Variables? Let’s Google – SiteCatalyst Configuration Variables.

Webmetric.org has provided pretty exhaustive content. Much appreciated. Usually, there is just copy-paste from the SiteCatalyst PDFs, but this one is helpful.

Another good write-up can be found at AnalyticCafe.

So, one general understanding after skimming through the content is that the Configuration Variables are the fundamental settings for the Analytics JavaScript library. In most cases you will not change them at a page level (exception: link tracking). Secondly, they don’t directly collect data but rather affect data collection. You will not find a report for report suite IDs or currencies by default in SiteCatalyst.

Let’s have a look at the list of Configuration Variables –

  1. s.account
  2. currencyCode
  3. charSet
  4. trackDownloadLinks
  5. trackExternalLinks
  6. trackInlineStats
  7. linkDownloadFileTypes
  8. linkInternalFilters
  9. linkExternalFilters
  10. linkLeaveQueryString
  11. linkTrackVars
  12. linkTrackEvents
  13. cookieDomainPeriod
  14. fpCookieDomainPeriod
  15. doPlugins
  16. usePlugins
  17. dynamicAccountSelection
  18. dynamicAccountMatch
  19. dynamicAccountList

So, that brings the total to 19 configuration variables over which we have control.

Tomorrow I will tick them off one by one from this list.

Ok, Aug 21st – Time flies!

So, it’s 10:35 AM, I am back from the gym and now it’s time for some analytics heavy lifting. I start my daily learning with some blogs. I alternate between Google Analytics and SiteCatalyst. Today it’s Google Analytics.

So let’s see what Justin Cutroni has for us today.

I read Advanced Content Tracking with Universal Analytics. Go check it out. He has written a JavaScript snippet to track events for how content is being read by readers on a site. Pretty sweet and simple. I also need to learn how to write such code.

Now, coming back to Configuration Variables –

1. s.account

The report suite is the most fundamental level of classification you apply to analytics data. Each report suite has a unique ID. The s.account variable is nothing but a way to tell the Analytics data collection server which report suite the data is to be sent to, and we do that through the unique ID. Please keep in mind the report suite name is different from the report suite ID. The scope of the report suite name is limited to your account, while the report suite ID is what the Adobe data collection servers recognize.

For single-suite tagging, here is an example –

s.account = "AwesomeAnalytics"

For multi-suite tagging, use a comma as the separator –

s.account = "AwesomeAnalytics,SuperAwesomeAnalytics"

Other things to keep in mind –

  • Each ID has a total length of only 40 bytes
  • Don’t use spaces while declaring multiple IDs
  • A unique ID should contain only alphanumeric characters, the only exception being the hyphen "-"

Well, that’s pretty much it about s.account, let’s look at the next variable – currencyCode

Now, this is one variable which should ideally be set in the s_code. Why am I saying this? Coz it’s non-persistent, so you gotta send this information with every image request. Now, in most cases the currency of the website is kinda constant, and even if it changes you can write a small piece of code with certain if-else statements and allot the currency accordingly. Thus, try to declare it in the s_code itself.

But, yes, why is it used? It is basically used to facilitate currency conversions. If you were not aware, SiteCatalyst offers a base currency for each report suite, which can be different from the currency in which the item is sold on the website. Now, the currencies are different, hence the reported revenue will also be different. That is, we need something to facilitate conversions. currencyCode does just that. Example –

You have a global website, but your head office is in the US. So, to track your performance you will prefer the reports in US dollars. But across the world you are not selling in dollars. So, let’s suppose you are making sales in Japan. On the Japanese website, you will set s.currencyCode = "JPY" (the ISO code for yen) and you can have the base currency of your report suite as dollars. Now, SiteCatalyst will automatically convert the revenue from yen to dollars in your reports. That’s really sweet.

A few things to keep in mind –

There are certain currencies, like the Swedish krona, that don’t use a period "." but use a comma "," in their currency format. But periods and commas have different meanings to SiteCatalyst. So, only use periods and not commas while feeding revenue data to any variable.

The default value for this is USD. If you don’t require any conversion, you can simply ignore this.

The debugger parameter is CC

Ok, the next one – charSet

This one is only to be used when you are going to feed data into variables that have non-ASCII characters. Here is a link to the ASCII table – I AM ASCII TABLE. So, just cross-check what you are feeding to the variables.

SiteCatalyst uses two character sets – UTF-8 and ISO-8859-1

This one plays a very crucial role for tagging international websites whose character set is beyond the standard ASCII characters.

The charSet declaration is used by the Analytics servers to convert the incoming data into UTF-8.

The value in s.charSet should match the value of the charset declared in the meta tag or HTTP header.

As we know, the reports in SiteCatalyst can be populated in multiple languages. When we choose any non-English language, SiteCatalyst uses UTF-8 encoding.

Each Analytics variable has a defined length limit expressed in bytes. For standard report suites, each character is represented by a single byte; therefore, a variable with a limit of 100 bytes also has a limit of 100 characters. However, multi-byte report suites store data as UTF-8, which expresses each character with one to four bytes of data. This effectively limits some variables to as little as 25 characters with languages such as Japanese and Chinese that commonly use between two and four bytes per character.

The JS file must define the charSet variable. (All page views and traffic are assumed to be standard 7-bit ASCII unless otherwise specified.) Setting the charSet variable tells the Analytics engine what language should be translated into UTF-8. Some language identifiers used in meta tags or JavaScript variables do not match up with the Analytics conversion filter.

I will take ahead from here tomorrow.

Okay, Aug 22nd, 11:17 AM.

The one thing I believe matters most to becoming successful is consistency.

First, time for some blog read. SiteCatalyst today.

I found this post by Adam Greco really Awesome –


http://adam.webanalyticsdemystified.com/2010/05/17/crm-integration-2-passing-crm-data-to-web-analytics/

Aug 24th, 6.30 PM, Sunday

I have been a bit pressed for time for the last two days. Actually, I was pressed only on Friday; yesterday I did not do anything that productive except the gym. It was the weekend so I could train guilt free. By the time I left the gym last night, I was so tired I could have slept in the gym itself 😛

Anyway the next configuration variable – trackDownloadLinks

This one is pretty simple and sweet. It does two things –

  1. As the name suggests, it provides the ability to track download links
  2. Affects the clickMap data for the downloadable links

The implementation is pretty simple. Its value is either true or false –

s.trackDownloadLinks = true

I wonder why anyone would keep it false. So, if it’s set to true, anytime someone clicks on a downloadable link, the data is sent to the Analytics engine. What data? The name of the link and the click instance.

When it’s combined with ClickMap, the data will be shown along with the page on which the link was clicked. If you set this variable to false, the ClickMap data will be affected.

That’s pretty much it about this one. So, the next one – trackExternalLinks

I will not spend much time on it. It’s pretty straightforward. If it’s set to true it tracks clicks on external links, else it does not. All links whose domain does not match the ones you define in linkInternalFilters are external links. This one also affects the ClickMap data.

So, the next – trackInlineStats

This one is important. Again, this only takes a value of either true or false. Only if it is set to true is the ClickMap data recorded, otherwise not. It populates the ClickMap reports.

Next – linkDownloadFileTypes

This one works in tandem with trackDownloadLinks. It is simply a comma-separated list of extensions for the various file types that exist on your site. If a particular extension is not part of this list, the corresponding downloadable file will not get tracked in Analytics. A thing to keep in mind is that Analytics can track only left-click downloads and not right-click ‘save as’ downloads, as a left click is within the scope of the browser while right-click-and-save-as is beyond the scope of the browser and within the scope of the operating system. Example –

s.linkDownloadFileTypes="exe,zip,wav,mp3,mov,mpg,avi,wmv,doc,pdf,xls,xml"

Tomorrow I will start from linkInternalFilters

Ok, it’s August 25th, 10:30 AM

As usual I will start my day with some blog read. Google Analytics today. Here I come Mr. Cutroni

I read – http://cutroni.com/blog/2014/01/09/set-google-analytics-content-grouping/

Good as usual

Now, the next variable – linkInternalFilters

This variable is used in conjunction with trackExternalLinks. When we define linkInternalFilters, we basically tell SiteCatalyst, or the Analytics engine, which links or domains we don’t want to be tracked as external links – in other words, you are telling the Analytics engine that these are my internal links. So, you also need to specify the domain on which the analytics is getting implemented. Let’s look at an example –

s.trackExternalLinks=true

s.linkInternalFilters="javascript:,example.com,example1.com,exampleAffiliate.com"

In the above example I have told the Analytics engine – dude, links to example.com, example1.com and exampleAffiliate.com should not be considered external.

Again, notice that I am not using any spaces in the syntax.

Cool, let’s move ahead – linkExternalFilters

This one is used when you are very particular about the exit links you want to track. In other words there are only certain exit links you want to track and not all exit links. So, using this configuration variable you provide that list of exit links to be tracked to the analytics engine. For example, I have many exit links on my site but I want to track only BadaBoom.com

s.trackExternalLinks=true // yep bro, track 'em

s.linkInternalFilters="javascript:,example.com"

s.linkExternalFilters="BadaBoom.com"

Now, what we have done is provide two filters to the Analytics engine. One tells it what my internal links are; this will filter out all the links that I don’t consider internal. Next, I put a filter on the external links for the ones I want tracked. The rest of the external links won’t be displayed in the exit links report. If you don’t want the second filter, simply leave it blank –

s.linkExternalFilters=""

Now, the next one – linkLeaveQueryString. By the way, we are halfway done 🙂

I find the name of this dude very complex – it says leave query string, but if you set it to true, it does not leave the query string out. So, to deal with this one – tell it what you don’t want and it will give you what you want. Another thing: all the configuration variables that have “link” in them will affect, ummm… yeah… the link tracking. So, while tracking the links, do you want to consider the query string parameters or not? Just remember, example.com and example.com?query=badaboom are essentially the same page, but when you set s.linkLeaveQueryString = true, they will be tracked as two separate, different links. So, if you want only those links that lead to example.com, irrespective of whether it was example.com?query=Badaboom or example.com?query=BadaBadaBoom, set s.linkLeaveQueryString = false.

So, next one – linkTrackVars

This one is very, very important, as any goof-up with this will drastically affect your server calls and the data being captured by other variables. So, let’s dance.

This one is used to send data on exit, download and custom links. It’s very dumb and freaks out if no instructions are given to it. You have to tell it which variables the data needs to be sent to upon click events. If you don’t tell it, it panics and resends data for all the variables that have a value at the time of the click to the Analytics engine, and hence duplicates data in the other reports, causing erroneous inflation. So, never leave this one blank. I repeat, never leave this one blank!

In the Analytics s_code, define it as “None” –

s.linkTrackVars="None"

Ideally, this should be set in the onClick events. Example –

<a href="index.html" onClick="
    var s=s_gi('rsid');
    s.linkTrackVars='prop1,prop2,events';
    s.linkTrackEvents='event1';
    s.prop1='Custom Property of Link';
    s.events='event1';
    s.tl(this,'o','Link Name');
">My Page</a>

The above example I copied from the SiteCatalyst implementation guide. In this example we are doing custom link tracking. The var s=s_gi('rsid') call provides the report suite ID, which is rsid in this case. We want to send data to prop1, prop2 and event1 on click of this link.

s.tl() is used for custom link tracking

Please take notice that we don’t provide values to linkTrackVars as s.prop1 or s.prop2 or s.events. We simply write prop1,prop2,events. Again, no spaces. For tracking events we use linkTrackEvents, which I will be covering next. Ok, so tomorrow I will take it up from linkTrackEvents. I want you to smile right now. May the rest of your day be super amazing. Cheers!

August 26th, 12:07 PM. Yes, got a bit delayed today. Was working on my other project – livAwesome, it’s a self-help project I am working on. You can check that out.

Ok, let’s start with a blog. SiteCatalyst today.

Nice, so the next variable – linkTrackEvents

So, whenever you are doing custom tagging for a link, and on specific onClick events you want to fire an event as well (apart from a prop or eVar), linkTrackEvents is used there. While defining this, you first need to include “events” (notice the “s”) in linkTrackVars, and in the following statement specify the name of the event you will be using in linkTrackEvents. Again, you will not be specifying the values as s.event1 or s.event2; you will simply write event1,event2,event3…

The declaration of the event in s.events=eventX is still required. Please refer to the example above for better clarity.

Ok, now let’s look at the next one – cookieDomainPeriod

Any mistake with this one can screw up your data, particularly around the count of unique visitors. Let me explain what it does. The setting of this variable directly affects the acceptance/rejection and functionality of the Analytics cookies in the browser.

If you find this variable difficult to understand, then don’t take a lot of load. Simply put the number of periods in your DOMAIN as the input to this variable. Put only the number of periods in the domain and NOT THE SUBDOMAIN.

By default, the value of this variable is 2.

If your domain is AwesomeAnalytics.com then the cookie domain period is 2; if it is AwesomeAnalytics.co.in then the cookie domain period is 3.

Now, if the page URL is www.badaboom.AwesomeAnalytics.com then the cookie domain period is still 2 and not 3, because we are counting the periods of the domain and not the subdomain.

But what happens if my cookieDomainPeriod is 2 but my domain has three periods? For example, AwesomeAnalytics.co.in. In this case the s_code will attempt to set a cookie for the co.in domain and not AwesomeAnalytics.co.in. Now, since co.in is not a valid domain, the browser will get pissed off, ask the s_code to bugger off and reject the cookies. This will terribly affect the count of unique visitors and the functionality of many of the plugins.

Another example: you have a subdomain badaboom.AwesomeAnalytics.com. Here the cookieDomainPeriod is 2 but you give the value 3. Analytics will assume that the domain is badaboom.AwesomeAnalytics.com and not AwesomeAnalytics.com. Now, if you have another subdomain, NotBadaBoom.AwesomeAnalytics.com, it will be treated as a separate website altogether, and so will AwesomeAnalytics.com. This will affect the functionality of the cookies, and since in linkInternalFilters you have the name of your domain and not the subdomain, things will get screwed up.

One more thing – enter the value as a string, that is, 2 is “2”.

If it’s still not clear, just set the value to the number of periods in your domain, please.

Now, next fpCookieDomainPeriod.

The fp stands for first party. This is used by Analytics to set up first-party cookies OTHER THAN the visitor ID cookie (s_vi); that is, it does not control s_vi. The ClickMap cookie (s_sq) and the cookie checker (s_cc), plus cookies set up by a few plugins like getValOnce, are affected by this variable.

Again, simply put the number of periods in your domain into this variable.

Tomorrow, I will take it up from doPlugins. Cheers!

August 28th. If you did not notice, I did not blog yesterday; was really busy with some other work. It’s been two days since I have been to the gym.

Ok, so getting down to business – doPlugins

So you might be aware that there are various plugins you can use to enhance the functionality of your Analytics implementation. To use plugins, apart from adding the plugin code to the AppMeasurement.js library or the s_code, you need to configure three other things –

usePlugins

doPlugins

s_doPlugins function

First, set s.usePlugins=true

Second, define the s_doPlugins function. Third, assign the s_doPlugins function to the s.doPlugins variable –

For example, we want to feed a default value to prop1 by using some plugin. Add this code to the s_code –

s.usePlugins=true

function s_doPlugins(s){

s.prop1="Plugin output"

}

s.doPlugins=s_doPlugins

Just a piece of information – when a visitor comes to your website, the JavaScript files get cached in the visitor’s browser. Analytics is driven by that cached version until it gets updated. So, if you have made any updates to your AppMeasurement library or s_code and the change is not getting reflected, that is because the visitor’s browser is using the cached version. This can happen to you during testing as well. You might have made the correct update, but it won’t get reflected because your browser is not using the latest library. So, it’s a good practice to clear your cache before testing any updates.

September 3rd. Blogging after, let me count – 5 days! I went home, actually, and post my return I have not been able to manage my time well. But anyway, the essence lies in moving forward. *Fingers popping*

Just three left – dynamicAccountSelection, dynamicAccountMatch and dynamicAccountList. All three are used in conjunction with each other. As their names suggest, they are used to dynamically select the report suite to which the data is to be sent.

Let’s start with dynamicAccountSelection. The input is just true or false. If you want dynamic selection, simply set it to true.

Now, dynamicAccountList. In the syntax of dynamicAccountList we specify two things: one, the report suite IDs, and two, the URLs.

We basically tell the JavaScript to select particular report suite IDs based on the URL of the webpage. So, let’s have a look at the syntax.

s.dynamicAccountList="reportsuiteId1,reportsuiteId4=AwesomeAnalytics.com;reportsuiteId5,reportsuiteId3=BadaBoom.com"

Key characters to keep in mind here are the comma, the semicolon and the equals sign. Look at the way I have put them here. I so wish SiteCatalyst had been consistent with this: for products, different product declarations are separated by commas and individual properties by semicolons; here the case is the opposite. So, keep these things in mind.

Now, dynamicAccountMatch. This is kinda complex for people who are not that geeky. So, you first enable dynamicAccountSelection, then you provide the criteria in dynamicAccountList, and then you tell the JavaScript where to apply these criteria, and this is done through dynamicAccountMatch.

The values you feed to dynamicAccountMatch are the DOM properties on which you want to apply the criteria mentioned in dynamicAccountList. Following are the DOM properties to which you can apply dynamicAccountSelection –

window.location.host – to apply it to the domain; this is also the default

window.location.pathname – to apply it to the path in the URL

(window.location.search?window.location.search:"?") – to apply it to the query string

window.location.host+window.location.pathname – host name and path

window.location.pathname+(window.location.search?window.location.search:"?") – path + query string

window.location.href – the entire URL

Keep in mind that dynamic account selection is not supported by the latest AppMeasurement for JavaScript.

And voila! We are done! So, how many days did it take me? 15 days! Jeez. But the point is that this blog now exists in this world and we both learned something. Consistency matters.

Please give me feedback if you have some time.

Till then, Happy Analytics!

Cheers!

Page Naming in Digital Analytics

Page naming in digital analytics is the most fundamental requirement of implementing analytics on your website. Yet, time and again, I see a casual approach towards it.


Now, human nature is more risk-averse than reward-motivated. So, let me first tell you what you are losing with casual page naming.


In SiteCatalyst/Reports & Analytics, several reports, like your pathing reports and your ClickMap reports, are essentially based on your page names. So, if you screw up your page naming you miss out on a lot of other goodies. Plus, the Pages report is very, very special. It is an out-of-the-box s.prop (traffic variable) with special powers. You can break down all other reports by the Pages report. Thus, it is one of those reports that will be kind of ubiquitous in your Analytics reporting, and so it is extremely important for the organization to do the page naming exercise with devotion.


Page naming is not retroactive. Let me elaborate – suppose you casually named your pages, then one fine day someone in your organization is struck by lightning and the fellow declares – dude! we need to have clean page naming. Very well, dude! But the problem is that SiteCatalyst will consider the modified/updated names as entirely new pages! So, you will not get historical data combined with the fresh data for a page whose name has been modified. Instead, you will get data in two separate rows as two separate pages.


If you don’t give your webpage a name for Analytics, SiteCatalyst will automatically fetch the URL as the page name. Now, a few screw-ups here –

  • http://www.SanmeetIsSoAwesome.com and http://www.SanmeetIsSoAwesome.com/# are essentially the same page, but to Sitecatalyst they are entirely separate pages!
  • If your page is Dynamic using some swanky Rich Internet Applications, then all the pages will be clubbed as one because the content of the page changes but the URL remains the same. So, all pages will be reported as one!
  • How good are you with reading URLs and separating one from the other in a table full of 50 URLs? I am pathetic at it. Please give me names
  • In Sitecatalyst tables, the rows have a limited space to display Data. You are wasting precious real estate with http://www…….yada yada
  • ClickMap uses the s.pageName variable to identify ClickMap pages. If the implementation changes on a page where s.pageName is a different value, ClickMap does not show historical data. Thus, you end up getting partial data for links on a page.
  • The partial data issue is equally applicable to your pathing reports also. Pathing reports – Next page, Next page flow, previous page, previous page flow et al, in all these reports you select the page name as the criteria, not the URL.

I hope you now understand the importance of page naming. If you are scared that “freak! my page naming is screwed,” then dude, you should be!

Now, I will present some points from the implementation side, some strategies to implement the page naming variable and some cheats to clean your page naming implementation, yeah yeah don’t be scared.

*popping fingers*

So, now from the implementation point of view: the Pages report is populated by the s.pageName variable, which appears as the query string parameter pageName (or gn) in the Adobe Analytics image request. As I mentioned above, it is a traffic variable and thus has a limit of 100 bytes; this means input beyond 100 characters to this variable will be truncated.

Now, there are various ways through which you can feed a value to this variable –

  • Server side –

Better suited for Dynamic pages

  • Hard code –

Suitable when you have a very small number of pages and you are dealing with a team which is analytically challenged. So it’s like – dude, don’t do anything else, don’t touch anything, just place this code on your webpage and tell me once done; again, don’t touch anything. Dynamic pages are difficult to hard-code, totally avoid it.

  • Page Name plugin 

This plugin needs some configuration and it uses the structure of the URL to populate page names. Use this plugin only when you have a clean URL structure. 

I will make an entry on this plugin in some other blog.

  • document.title

This one is pretty straightforward. It takes the value in the <title> tag of your HTML page and uses that as the page name. It is one of the most common approaches to automate the process of page naming. The only drawback which I have witnessed is that at times multiple pages have the same title though their context is different. And what are you gonna do, bro, when a page does not have a <title> tag?

  • Outsource your page naming to an Indian company. eClerx does a very good job with it. We will look at each and every page of your website and give it a clean name with valid context. Hint – there is something called processing rules *bling bling*

Now, some strategies to have page names. The three Cs of Awesome page naming –

  1. Conciseness – Keep it short
  2. Clarity – Keep it easy to understand
  3. Context – Ensure that the end user understands the right thing

Now, the small s – Structure! Why structure?

Page naming is still not that simple. It very much depends on the structure of your website and how vast and deep your website is. If it is a biggie, then just one s.pageName will not be sufficient. You have to use other variables that give a detailed description of the webpage.

If you have a proper structure in your s.pageName, then you can use this structure to feed values to other variables as well, like the s.channel variable, which is used to populate the Site Sections report.

A few examples –

Now, let’s take a hypothetical website – http://SanmeetIsSoCool.com (born narcissist). Now, I sell superhero collectibles. I sell comics, costumes, merchandise, videos and hope 🙂

Now, let’s take a webpage where I am selling a Batman poster where Bruce Wayne meets Bruce Wayne. So, what should be the ideal page name?

Merchandise | Batman | Poster | Bruce meets Bruce

Now, this is concise, it is easily understood and conveys the right information and it is structured.

I can use this same variable, perform a channel extract and populate other variables, like my s.channel for the Site Sections report. What is the advantage of using the same variable to populate others? Automation, bro!

Now, some cheats-

1. How to find pages with no page name?

Simple, go to your Pages report. In the filter box, type http. This will give you the list of pages with no page name. Again, this won’t work if you are using the page name plugin. That’s why I said to use the page name plugin only when you have clean URL structures.

2. How to find multiple URLs populating the same page name?

One, you go to the Pages report and break it down by the report which is being used to capture page URLs. Here, if you can see that different URLs have the same page name, then you have your answer. Now, again, this is not advisable for massive websites.

Two, outsource it to an Indian company. eClerx does a very good job with that 🙂 We will audit each and every page of your site and then report the pages where different contexts are sharing same page name.

3. How to fix page names without altering the site?

Processing rules, bro! Use a condition like: when the URL matches something.com, overwrite s.pageName with some page name.

Now, again, as I said, the data is not retroactive and this will be the birth of a new page in your analytics reports. So, what you can do is use new variables – one prop and one eVar – to store the new page name and henceforth refer to them, and maintain a record of what has been replaced for any historical comparisons.

I hope you found what you were looking for and I was able to convey my message properly. Please feel free to reply with your thoughts. I sincerely welcome any suggestions or critique.

If you want to contribute any more information, you are most welcome. Please share it if you feel this is worth sharing.

Till then, stay thirsty, stay awesome and spread positivity 

Best,

Sanmeet

How to do Logistic Regression in R?

It’s a big post, so don’t give up! And if by any chance you feel like closing the window, please take the pain to scroll down to the bottom and comment “this post is awesome!”… ummm… yeah.

So, getting down to business – 
Regression is the “I TOLD YA SO!” dude of Data science. So, yup, it predicts!

A safe definition would be –

Regression is the statistical technique that tries to explain the relationship between a dependent variable and one or more independent variables. There are various kinds of it, like simple linear, multiple linear, polynomial, logistic, Poisson, etc.

Blah!


This post is dedicated to logistic regression. I will start by giving some theory around logistic regression. I know theory is kind of boring, but one must know the concepts. I have tried to keep the language as simple as possible, but if you find it difficult to understand then that’s great, it tells me I am getting smarter.

Then I have taken the famous Titanic data set and run a regression on it in R. In between you will come across a video on Receiver Operating Characteristics. I highly recommend you watch it.

What is Logistic regression?

When the dependent variable is categorical in nature, the regression we use is logistic regression. It can be binomial, it can be multinomial, it can be ordinal. I will restrict the scope of this post to binomial.

What is Binomial Logistic Regression?

When the dependent variable of a model is dichotomous in nature, that is, it can have only two outcomes, the regression is binomial logistic regression.

So, here we have a “Yes” or “No” kind of dependent variable. The dependent variable for logistic regression is the probability of success of an event.

Now, when probability is a dependent variable, the first thing is that it is limited between 0 and 1, and it will be extremely difficult to limit the combination of weighted independent variables to this range. So, instead of taking the probability we take the odds, which is the ratio of the probability of success of an event to the probability of its failure. Now, odds can be anything between 0 and infinity. But the problem is that they will always remain positive, which again puts a limitation on the formula. To handle this we take the logarithm of the odds. So when the odds are less than one, the log value is negative, and when the odds are more than one, the log value is positive. And when the odds are exactly 1, the log is zero.

With this we get a generalized linear model that predicts a categorical dependent variable:

ln( P / (1 − P) ) = β0 + β1x1 + β2x2 + … + βkxk

Where,

P is the probability of success and 1 – P is probability of failure.

Beta 0 is the intercept, that is the value when all the independent variables are zero

x1, x2, …, xk represent the independent variables

Beta 1, 2, 3… k represent the coefficients of the independent variables.

Now, pay attention here –

Logistic regression is NOT based on the method of LEAST SQUARES; it is rather based on MAXIMUM LIKELIHOOD. That is, repeated iterations are made till we reach a point where we stop getting marginal improvements. At this point the model is said to attain convergence.

A few things you need to keep in mind while massaging the data, else no orgasm.

1. Make sure you don’t have high multicollinearity between the independent variables. If this is the case, the model will not reach convergence.

2. Don’t have a high number of categories within any variable. The maximum number of categories you should have is five. If you have more categories you will be required to do clustering before the regression.

3. Make sure you don’t have large proportions of empty cells. If you do, you need to remove them or replace them with constants.

Alright, I will now discuss a step-by-step process on how to do a logistic regression in R with the Titanic data set.



*Popping fingers*

I have taken this data from Kaggle. I am taking the train.csv one.

Step 1: Understand the data

First, I have created an object Titanic and read “train.csv” into it. Then I have summarized it to get an overview of its composition. Here, things like PassengerId, Name, Ticket, Fare and Cabin have too many levels as factors. Moreover, there is no point clustering them.

I have an added reason to remove Cabin, coz out of 891 values, 687 are blank.
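The screenshots of the original code aren’t reproduced here, so below is a minimal sketch of what this step might look like in R, assuming the Kaggle file is named train.csv and has the standard Titanic column names:

# Read the Kaggle Titanic training data (assumes train.csv is in the working directory)
Titanic <- read.csv("train.csv", stringsAsFactors = TRUE)

# Overview of the composition of the data
summary(Titanic)
str(Titanic)

# Drop the high-cardinality / mostly blank columns discussed above
Titanic <- Titanic[, !(names(Titanic) %in% c("PassengerId", "Name", "Ticket", "Fare", "Cabin"))]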

Step 2: Convert the variables into proper data types and handle missing values

After summarizing, I found that R was taking variables like Survived and Pclass as integers. I converted them to factors.

Also, there were plenty of missing values for the Age variable. I replaced these missing values with the mean. The reason for taking the mean is that it does not change the overall mean of the variable.
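Again as a rough sketch (column names assume the standard Kaggle Titanic data):

# Convert categorical variables that R read as integers into factors
Titanic$Survived <- as.factor(Titanic$Survived)
Titanic$Pclass <- as.factor(Titanic$Pclass)

# Impute missing Age values with the mean age
Titanic$Age[is.na(Titanic$Age)] <- mean(Titanic$Age, na.rm = TRUE)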

Step 3: Check for multicollinearity

The regression model will not work fine if the data has high multicollinearity.

Next, I took a subset of the data with the variables having a non-factor data type. You can’t run a correlation on variables with a factor data type. Then I checked for multicollinearity, and this is what I get –

We can clearly see that there is no high collinearity between any of the variables. Moving on.
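A sketch of how such a check could be done (the original code isn’t shown, so this is an assumption):

# Correlate only the numeric (non-factor) columns; sapply(..., is.numeric) picks them out
num_cols <- sapply(Titanic, is.numeric)
cor(Titanic[, num_cols])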

Step 4: Using the 70:30 rule of thumb, generate training and testing data sets

The sample() function generates a random sample. set.seed() is used to ensure that we get the same sample while debugging; otherwise sample() will generate a different sample every time.
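A minimal sketch of the 70:30 split; the object names dev (training) and test are taken from how they are referred to later in the post, the rest is an assumption:

# Fix the seed so the split is reproducible while debugging
set.seed(123)
idx <- sample(1:nrow(Titanic), size = 0.7 * nrow(Titanic))
dev <- Titanic[idx, ]    # 70% training data
test <- Titanic[-idx, ]  # 30% testing data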

Step 5: Create a model on the training data set (the dev object in our case)

We are using the glm() function here, which stands for generalized linear model. Since we are doing binary logistic regression, set the family as binomial. The link is logit, which is nothing but the log of odds. The ~ . notation indicates that we are using all the other variables of the data set to predict the Survived variable. You can check the formula table for more clarity on this. Let’s interpret the results –
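The fitted call itself isn’t shown in the post; here is a minimal sketch, assuming the model object is called Model (the name used by the predict() call later) and that all remaining columns go into the formula:

# Binary logistic regression on the training data
Model <- glm(Survived ~ ., data = dev, family = binomial(link = "logit"))
summary(Model)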

We can ignore residuals in most of the cases, since the model is not based on OLS but on Maximum Likelihood

Pay attention to the last column and the asterisks. The variables which have asterisks have a statistically significant relationship with the dependent variable. So, here class, age, sex and siblings have a statistically significant relationship with survival.

As you can observe, not all the factor levels of the variables have been taken. One level from each variable has been taken as the base. The base case is always taken alphabetically, or in the case of numeric factors, in ascending order. So here Pclass1 has been taken as the base case for class and female has been taken as the base case for sex.

If you look at the first column, you will observe that all the significant variables are negatively related to survival.

The AIC is the Akaike Information Criterion. You will use this while comparing multiple models; the model with the lower AIC is better.

The null deviance corresponds to a model with only the intercept, and the residual deviance corresponds to our model with the independent variables added. These are actually chi-squared statistics with the corresponding degrees of freedom.

We will use this to test the null hypothesis that the deviance of the model with only the intercept and the deviance of the model with the independent variables added are the same.

If the resulting p-value is sufficiently small, we can reject the null hypothesis, which is what we do in our case.
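If you want to compute that test explicitly, one way (a sketch, assuming the fitted model object is called Model) is to compare the drop in deviance against a chi-squared distribution:

# Difference between null and residual deviance, with the matching
# difference in degrees of freedom, gives a chi-squared test of the model
pchisq(Model$null.deviance - Model$deviance,
       df = Model$df.null - Model$df.residual,
       lower.tail = FALSE)            # small value => reject the null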

Next we are going to test our model on the testing data set. Before we do that, I would request you to learn some basics about the ROC (Receiver Operating Characteristic) curve. This video presents it in a very clear manner –

Step 6: Predict the survivals for the Testing data set

> test$Prid <- predict(Model, test, type = "response")

We have created a new variable in the test data set and used the predict function to predict the probability of survival based on our model.

Step 7 : Let's look at the ROC curve

We need the ROCR package for this. All evaluations with ROCR start by first creating a prediction object; for this we have used the prediction() function. As you can see, the inputs are the predictions of our model and the actual results, so we are scoring the performance of the predicted values against the actual values.

Next we have used the performance() function, which computes various performance measures for the prediction object (called score in our case). With this, we first fetch the AUC, which comes to around 0.89, which is a good value.
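A sketch of those two calls, using the object names from the post (score for the prediction object, test$Prid for the predicted probabilities):

library(ROCR)

# Prediction object: predicted probabilities vs. actual outcomes
score <- prediction(test$Prid, test$Survived)

# Area under the ROC curve
performance(score, "auc")@y.values[[1]]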

Let's have a look at the confusion matrix before proceeding.


Next, we have plotted the curve of TPR against FPR at various cut-offs, which is essentially your ROC curve.
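The plot itself can be produced like this (a sketch, reusing the score object from above):

# True positive rate vs. false positive rate at all cut-offs
roc_curve <- performance(score, "tpr", "fpr")
plot(roc_curve, colorize = TRUE)     # colorize marks the cut-off values along the curve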

As you can see, we are getting a good cut-off at around 0.4.

You can also visualize the model by the following simple logic –

The output of this will be –


I leave it up to you to assimilate this 🙂

Step 8 : Choosing the cut off. (Almost there) *Phew*

Next, I have taken a cut-off of 0.4. The floor() function rounds its input down to the nearest integer. I then created the confusion matrix.
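One way to reproduce this step (a sketch; the helper column name Pred_class is mine, and floor(p + 0.6) is simply a trick that returns 1 whenever the predicted probability is at least 0.4):

# Classify as survived (1) when predicted probability >= 0.4
test$Pred_class <- floor(test$Prid + 0.6)

# Confusion matrix: actual outcome vs. predicted class
table(Actual = test$Survived, Predicted = test$Pred_class)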

Output –

So, the TPR at a cut-off of 0.4 is 78.2% and the FPR is 21.5%.

Again, TPR is how often you predict survival among passengers who actually survived. FPR is how often you predict survival among passengers who did not survive.

Let's take another cut-off.

So, at a cut-off of 0.52 we get –

The TPR now is 73% and the FPR is 12.5%.

So, you get the point: at each cut-off the TPR and FPR will vary. So how do you choose the cut-off? It is you who decides that, based on what you are looking for. Do you want a high TPR at the expense of a high FPR, or a low FPR at the expense of a lower TPR? You gotta take the call bro.

If you have reached the bottom, thank you. I hope I did not waste your time. I would sincerely appreciate it if you could share your thoughts on the post.

Till then,

Stay Awesome!

Sanmeet,
Founder, DataVinci

 

What is the difference between Correlation and Covariance?

Correlation is the standardized form of covariance. That means you can compare the correlations of two data sets that have different units.

You cannot compare the covariances of two data sets that have different units.

Now, how to find correlation and covariance in R?


Two functions –

  • cor() for calculating correlations
  • cov() for calculating covariance

The syntax for both of them is the same:

  • cor(x,y,use,method)

where x and y can be a variable (vector), matrix or data frame.

"use" basically controls how missing values are handled. It can take the following options:

  • "everything" (default) – everything is included, and if the data sets have NA values then the correlation will also have corresponding NAs
  • "all.obs" – if any NAs are present, an error is returned
  • "complete.obs" – listwise deletion of NA values is done
  • "pairwise.complete.obs" – pairwise deletion of NA values is done

method specifies which correlation we want to calculate:

  • Pearson
  • Spearman
  • Kendall

Most of the time we will be using Pearson, which is a parametric correlation, meaning it makes assumptions: that the data sets are normally distributed and that the relationship between them is linear.

Spearman is used mostly for ordinal data, when the data sets have a monotonic rather than a linear relationship. This one is non-parametric, that is, it makes no distributional assumptions.
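A quick illustration on the built-in mtcars data set (the variable choices are mine):

cov(mtcars$mpg, mtcars$wt)                       # covariance, depends on the units
cor(mtcars$mpg, mtcars$wt, method = "pearson")   # Pearson correlation
cor(mtcars$mpg, mtcars$wt, method = "spearman")  # Spearman correlation
cor(mtcars, use = "pairwise.complete.obs")       # full correlation matrix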

But the case is not solved yet. Just by knowing the coefficient of correlation we can't say that two data sets have a strong/weak/no relationship. The result could also be down to the choice of sample, and thus simply due to chance. Therefore, we also need to find the significance of the correlation.

To assess the significance of a result and reject the null hypothesis, we usually use the p-value. We reject the null hypothesis when the p-value is less than the level of significance.

So, if the coefficient of correlation is high but the p-value is more than the level of significance, we can't comment on the relationship; it simply means that the result could be due to chance, and a different sample from the same population might produce a different result.

In comes the "Hmisc" package.

It has a function that serves our purpose: rcorr(). This function provides both the strength of the correlation as well as the corresponding p-values.

The input to this function is always a matrix, so you have to apply as.matrix() to inputs that are not already matrices. Missing values are handled by pairwise deletion.

The correlation type can be either Pearson or Spearman.
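A small sketch of rcorr() on mtcars, to show the shape of the output:

library(Hmisc)

res <- rcorr(as.matrix(mtcars), type = "pearson")
res$r    # matrix of correlation coefficients
res$P    # matrix of corresponding p-values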

Again, this comes from the "Hmisc" package.

But, you know what would be really interesting? Plotting the correlations.

For this we need the corrgram() function from the corrgram package.

Let's plot the correlations of all the variables of the mtcars data set.

>library(corrgram)

>corrgram(mtcars)

The color scale goes from red to blue: the intensity of red indicates the strength of a negative correlation, and the intensity of blue the strength of a positive correlation.

In the above plot we can clearly see that miles per gallon has a negative correlation with cylinders, displacement, horsepower and weight.

Sweet, now my post has a picture.

Best,
Sanmeet,
Founder, DataVinci

What is a t-test? How to do a t-test in R?

Alright, the first thing that makes me curious about Student's t-test is why it is called that. The history is somewhat interesting.

First, why the “Student”?

"Student" is actually the pseudonym, or pen name, of the author, Mr. William Sealy Gosset. He used a pen name because, at the time he came up with the concept, his employer did not allow him to publish under his own name. Some company with bad GPTW ratings, it seems.


Here is the dude –

Second, why the “T”?

"t-test" comes from "t-distribution", which in turn comes from "t-statistic", which in turn stands for "test statistic", which in turn is short for "hypothesis test statistic".

So, brother, what is a hypothesis test statistic, man?


Hint 1: It is used to test some hypothesis!

Hint 2: A hypothesis about whether there is a difference between two data sets!


Yes, you smart ass, you got it right! The t-statistic is used to perform hypothesis testing around a sample mean. But wait a second. There is a plethora of methods out there for testing hypotheses around sample means, so when would you prefer a t-test?

Good Question!

The fundamental requirement for using the t-test is that your population should be normally distributed. If you are not sure about the distribution of the population, you can still use the t-distribution if your sample is large enough for the central limit theorem to apply. If that is not the case, go for some other test, like the Mann-Whitney U-test.

Second, you will use the t-test when the standard deviation of the population is not known. There are thousands of posts out there claiming that the t-statistic should be used when the sample size is less than or equal to 30, but this is not true. So, reiterating: use a t-test when the standard deviation of the population is not known.

Now, what is the t-statistic?

Expressing it in words:

Let's understand the t-statistic for a small sample of a variable X. The mean of the population is M and the mean of the sample is Ms.


The t-statistic is essentially a signal-to-noise ratio: the difference between the sample mean and the population mean, divided by the standard error of the sample mean.

So, the formula –

t-statistic = (Ms – M) / SE, where SE = s / √n (s is the sample standard deviation and n is the sample size)

Moving forward, let's understand the t-distribution –

Now, as we can see above, the t-statistic depends on the standard error, which in turn depends on the size of the sample. And because samples of different sizes have different standard errors and different degrees of freedom, there is a whole family of t-distributions, one for each sample size.

How does t-distribution look?

The t-distribution is also a bell-shaped curve, but it has more area in the tails than the normal distribution. As the sample size increases, the shape of the t-distribution approaches the shape of the normal distribution, and in the limit of an infinite sample size the t-distribution is exactly the normal distribution.

Now, how do you do a t-test?

Any statistical test is run at some confidence level. For a given confidence level and degrees of freedom, you use the t-statistic to calculate the p-value, with which you accept or reject the null hypothesis.

The p-value for a particular t-statistic is the area under the t-distribution curve to the right of that t-statistic.

The degrees of freedom for a sample of size n are n − 1.

Before explaining the steps, let me touch upon one-tail/two-tail tests. By default all tests are two-tail, which assumes that your sample mean can be either less or more than the population mean; that is, you test in both directions, which is why it is called two-tail.

So, the null hypothesis of a two-tail test is that there is no difference between the sample mean and the population mean.

When you run a single-tail test, then depending on whether it is upper-tail or lower-tail, you check for only one case. If it is lower-tail, you test whether the sample mean is less than the population mean. If it is upper-tail, you test whether the sample mean is greater than the population mean.

Read again and assimilate.

So, for a one-tail test, the alternative hypothesis, based on the choice of tail, is either that the sample mean is greater than the population mean or that it is less than the population mean; the null hypothesis is the opposite.

Alrighty!

The complement of the confidence level is the level of significance. For example, if you have a 95% confidence level then your level of significance is 5%. Now, depending on whether you run a one-tail or two-tail test, you will apply the significance level accordingly.

Just to clarify what a 5% level of significance indicates: when the null hypothesis is true, you would erroneously reject it in about 5% of cases. That is, 95% of the time you will not make that error.

Now, finally, the steps –

Case #1 – H1: The sample mean is greater than the population mean (upper-tail test)

1. Calculate the t-statistic using the formula

2. For the given t-statistic, use software to find the corresponding p-value.

As mentioned, the p-value gives the area to the right of the t-statistic on the t-distribution curve. That area corresponds to the probability of a type 1 error.

3. Compare p value with level of significance

4. If the p-value is less than the level of significance, reject the null hypothesis.

Case #2 – H1: The sample mean is less than the population mean (lower-tail test)

Steps 1 and 2 are the same as above.

3. Subtract the p-value from 1; this gives the area to the left of the t-statistic on the t-distribution curve.

4. Compare this value with the level of significance and reject the null hypothesis if it is smaller.

Case #3 – H1: The sample mean is not equal to the population mean; it can be either greater or lesser (two-tail test)

Steps 1 and 2 are the same.

3. Double the p-value to accommodate both tails.

4. Compare the doubled value with the level of significance.

Easy peasy!
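To make the steps concrete, here is a hand-rolled sketch in R with made-up numbers: a hypothetical sample x tested against an assumed population mean of 50.

x  <- c(52, 48, 55, 51, 49, 53, 50, 54)   # made-up sample
mu <- 50                                  # assumed population mean
n  <- length(x)

t_stat <- (mean(x) - mu) / (sd(x) / sqrt(n))            # t = (Ms - M) / SE

p_upper <- pt(t_stat, df = n - 1, lower.tail = FALSE)   # Case #1: area to the right
p_two   <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)   # Case #3: both tails

# Cross-check against the built-in function
t.test(x, mu = 50, alternative = "greater")
t.test(x, mu = 50)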

Now, ladies and gentlemen, let's see how to do a t-test in R.

Let's start with the independent t-test. The independent t-test is used to test whether two samples of data taken from independent populations have the same mean.

To understand this, let's use the "UScrime" data set in the "MASS" package.

Here, the variable So indicates whether the data point is for a southern state or not, and Prob gives the probability of imprisonment.

We want to find out whether the probability of imprisonment is in any way affected by whether the state is southern.

Our null hypothesis is that the probability of imprisonment does not differ based on the state.

The function to run the t-test in R is t.test(). It takes either of the following forms:

t.test(y~x,data)

where x is a dichotomous variable and y is numeric

or

t.test(y1,y2)

where y1 and y2 both are numeric

By default a two-tail test is assumed. If you want to make it one-tail, you need to add another argument: alternative = "less" or alternative = "greater".

Let's run the test –
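The call would look like this (assuming the MASS package is installed; So is the southern-state indicator and Prob the imprisonment probability):

library(MASS)

# Does the probability of imprisonment differ between southern and other states?
t.test(Prob ~ So, data = UScrime)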

Here the p-value is much less than the significance level of 0.05. That is, there is indeed a difference in the imprisonment probability between southern and non-southern states, and the probability of this difference being just due to chance is less than 5%.

Next, Dependent t-tests

In the case of dependent t-tests, the two groups are not independent of each other; that is, they affect each other. Take the case of unemployment: with a limited number of jobs, the employment of younger people will affect the employment of somewhat older people, and vice versa.

To run a dependent (paired) t-test in R, we simply need to add the paired = TRUE argument along with the data inputs to the t.test() function.

t.test(y1, y2, paired = TRUE)

where y1 and y2 are numeric data

Example –
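The original example output is not shown here, but a plausible paired example on the same UScrime data (an assumption on my part) compares the unemployment rates of younger and older urban males, U1 and U2:

library(MASS)

# Paired t-test: unemployment of 14-24 year olds vs. 35-39 year olds
t.test(UScrime$U1, UScrime$U2, paired = TRUE)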

With this, we should more or less be good with the t-test.

Best,

Sanmeet

Founder, DataVinci

 

 

 


What’s the Difference Between a CPU and a GPU?


The CPU (central processing unit) has often been called the brains of the PC. But increasingly, that brain is being enhanced by another part of the PC – the GPU (graphics processing unit), which is its soul.

All PCs have chips that render display images to monitors. But not all these chips are created equal. Intel's integrated graphics controller provides basic graphics that can display only productivity applications like Microsoft PowerPoint, low-resolution video and basic games.

The GPU is in a class by itself – it goes far beyond basic graphics controller functions, and is a programmable and powerful computational device in its own right.

What Is a GPU?

The GPU’s advanced capabilities were originally used primarily for 3D game rendering. But now those capabilities are being harnessed more broadly to accelerate computational workloads in areas such as financial modeling, cutting-edge scientific research and oil and gas exploration.

In a recent BusinessWeek article, Insight64 principal analyst Nathan Brookwood described the unique capabilities of the GPU this way: “GPUs are optimized for taking huge batches of data and performing the same operation over and over very quickly, unlike PC microprocessors, which tend to skip all over the place.”

Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously. The ability of a GPU with 100+ cores to process thousands of threads can accelerate some software by 100x over a CPU alone. What's more, the GPU achieves this acceleration while being more power- and cost-efficient than a CPU.

GPU-Accelerated Computing Goes Mainstream

GPU-accelerated computing has now grown into a mainstream movement supported by the latest operating systems from Apple (with OpenCL) and Microsoft (using DirectCompute). The reason for the wide and mainstream acceptance is that the GPU is a computational powerhouse, and its capabilities are growing faster than those of the x86 CPU.

In today’s PC, the GPU can now take on many multimedia tasks, such as accelerating Adobe Flash video, transcoding (translating) video between different formats, image recognition, virus pattern matching and others. More and more, the really hard problems to solve are those that have an inherent parallel nature – video processing, image analysis, signal processing.

The combination of a CPU with a GPU can deliver the best value of system performance, price, and power.

Here is a table summarizing the differences further :


Here is a brilliant YouTube video explaining the two units:

 

Hope you liked this post!

Best,
Team DataVinci