DEV Community

maxwizard01

How to group a large data set into classes using R

How to create a table for grouped data

Most of the time, as a statistician or analyst, you will be dealing with large data sets, and you may need to group the data to get a convenient view of it. Imagine your data was given like the example below.

Example

The following shows the monthly wages of employees.

88,82,96,102,104,106,104,24,26,29,86,36, 60, 23, 24,39,48, 46, 33, 36,39, 78,67,82, 32, 67, 27, 24,26,27, 30,36, 37,49, 50, 56,83, 99,68, 28, 55, 54, 26,29, 30,40, 46, 44, 99,84, 36,51, 86,88, 87,29, 40, 40, 40, 66, 45,23,26, 46,46, 96,99, 100,100, 101,103, 106, 107, 46,48, 49,48, 94,55, 56,59, 60, 70, 72,76, 79,80, 50,49, 93,86, 54,83, 89,90, 94, 96, 99,102,46

As you can see, this data set is so large that it is hard to say anything about it.
To summarize the data we need a table, but a plain frequency table is not enough here. A frequency table would have been fine if the data had a small range and the values repeated many times.

What do I need to do when there is a large range and the values hardly repeat?

All you need to do is group the data into classes by creating some intervals.
It is always advisable to construct at least 5 classes and not more than 15 classes.
Also, each class should have the same class width.

What is class width?

A class width is just the number of values that could possibly be in a class.
For example, if two classes have the ranges 21-25 and 20-29, we would say their class widths are 5 and 10 respectively. That is because the whole-number values that could be in each class are as follows:
21-25 : 21, 22, 23, 24 and 25 (5 numbers)
20-29 : 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29 (10 numbers).

What is the trick to quickly get the class width of any class?
The trick I usually use is to find the range of the class and add 1,
i.e. class width = Range + 1.
So, for 21-25 ==> Range = 25 - 21 = 4, class width = 4 + 1 = 5.
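
As a quick check of the trick above, here is a minimal sketch in R. The helper name class_width is my own (hypothetical), and it assumes the class limits are whole numbers and both limits are included in the class.

```r
# Hypothetical helper: width of a class whose limits are whole numbers
# and are both included in the class.
class_width <- function(lower, upper) {
  upper - lower + 1
}

class_width(21, 25)  # 5
class_width(20, 29)  # 10
```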

Now, if you look at the data above, how do you think we should group it? Well, this is very easy, even if your data has hundreds, thousands or even millions of values.

How to group my large data set into classes

To group your data set it is important to know the minimum and maximum values of your data, so that you know where the classes should start and end and how large your class width should be.

How to calculate the minimum and maximum values
To calculate the minimum and maximum values, make use of the min() and max() functions. For example, let us find the maximum and minimum values of the data above. We will write the following code to do that.

data=c(88,82,96,102,104,106,104,24,26,29,86,36,60,23,24,39,48,46,33,36,39,78,67,82,32,67,27,24,26,27,30,36,37,49,50,56,83,99,68,28,55,54,26,29,30,40,46,44,99,84,36,51,86,88,87,29,40,40,40,66,45,23,26,46,46,96,99,100,100,101,103,106,107,46,48,49,48,94,55,56,59,60,70,72,76,79,80,50,49,93,86,54,83,89,90,94,96,99,102,46)
min(data)
max(data)

Result>>

23
107

The code above shows that the minimum value is 23 and the maximum value is 107.
Now, to group this data we can start from 21 and end at 110, since all of our data falls within this range. Another question will be "what should we use as our class width?" That one you can work out yourself. You can decide to use 10 or 15; either is okay, but here I will be using 15.
Note: you can use any class width; it depends on how big the range of your data set is. Just make sure you end up with 5 to 15 classes.
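
To see whether a chosen class width keeps you within 5 to 15 classes, a small calculation like the sketch below can help. The variable names are my own, and it assumes whole-number data grouped from 21 to 110 as described above.

```r
# Assumed grouping range taken from the text: classes run from 21 to 110.
start <- 21
end <- 110

# Number of whole-number values covered by the range.
n_values <- end - start + 1            # 90

# Number of classes needed for each candidate class width.
classes_15 <- ceiling(n_values / 15)   # 6
classes_10 <- ceiling(n_values / 10)   # 9
```

Both candidate widths give a class count between 5 and 15, so either choice fits the guideline.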

My classes here will look like the following:
21-35, 36-50, 51-65 ... 96-110.
If you choose 10 as the class width, your classes will look like
21-30, 31-40, 41-50 ... 101-110. This seems much easier.
We can then go on to count the number of observations that fall in each class and write it as the frequency for that class.
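
Before using any grouping function, the counting described above can be done "by hand" for one class at a time. This is only a sketch of the idea (the variable name wages is my own), using the same data as the example.

```r
wages <- c(88,82,96,102,104,106,104,24,26,29,86,36,60,23,24,39,48,46,
  33,36,39,78,67,82,32,67,27,24,26,27,30,36,37,49,50,56,83,
  99,68,28,55,54,26,29,30,40,46,44,99,84,36,51,86,88,87,29,
  40,40,40,66,45,23,26,46,46,96,99,100,100,101,103,106,107,
  46,48,49,48,94,55,56,59,60,70,72,76,79,80,50,49,93,86,54,
  83,89,90,94,96,99,102,46)

# Count the observations in the class 21-35 (both limits included).
first_class <- sum(wages >= 21 & wages <= 35)   # 19

# Count the observations in the class 51-65.
third_class <- sum(wages >= 51 & wages <= 65)   # 10
```

Doing this for every class works, but it gets tedious quickly, which is why the functions in the next part are useful.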

How do I group my data set using R?
I believe you already understand the theoretical way to group the data, so now let us see how to do it using code.
Before we write the code, you should know the following.

Important functions needed to construct a grouped frequency table in R

  1. seq(): forms a sequence of numbers. It takes three important arguments: from, to and by, i.e. the start, the end and the increment.
  2. cut(): divides a data set into classes using a sequence of break points (class boundaries). Its important arguments are x (the data), breaks, labels and include.lowest.
  3. table(): tabulates a data set, i.e. counts how many times each value (here, each class) occurs.
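
To get a feel for the three functions before applying them to the wages, here is a tiny sketch on made-up numbers (the vector x is invented purely for illustration).

```r
x <- c(3, 7, 12, 18, 5, 14)        # made-up values for illustration

breaks <- seq(0, 20, 5)            # 0 5 10 15 20
groups <- cut(x, breaks = breaks)  # classes (0,5], (5,10], (10,15], (15,20]
counts <- table(groups)
counts                             # frequencies: 2 1 2 1
```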

Now let's make use of the above functions to solve the problem.
Codes>>>

myData=c(88,82,96,102,104,106,104,24,26,29,86,36,60,23,24,39,48,46,
  33,36,39,78,67,82,32,67,27,24,26,27,30,36,37,49,50,56,83,
  99,68,28,55,54,26,29,30,40,46,44,99,84,36,51,86,88,87,29,
  40,40,40,66,45,23,26,46,46,96,99,100,100,101,103,106,107,
  46,48,49,48,94,55,56,59,60,70,72,76,79,80,50,49,93,86,54,
  83,89,90,94,96,99,102,46)
# breaks 20, 35, ..., 110 give the classes (20,35], (35,50], ..., (95,110]
groupData=cut(myData,breaks=seq(20,110,15),include.lowest=FALSE)
table(groupData)

Result>>

 (20,35]  (35,50]  (50,65]  (65,80]  (80,95] (95,110] 
    19       27       10       10       16       18 

The result above shows each class and its frequency, using a class width of 15.

How does the code work?

One thing I usually emphasize is "understand how the code works if you want code to work for you". So I will explain more about how the arguments of the cut() function work.

The first argument is the data set. The second is breaks; this shows where we want to break the data into classes. Since I already know where my classes will start and end, together with the class width, it is convenient to use the seq() function to give the break point for each class.

The third argument, include.lowest, can be set to TRUE or FALSE. I use FALSE here (which is also the default) because I do not want to include 20 in the first class; remember my target is between 21 and 110.

The result above is telling you that 19 of the values fall between 21 and 35. Note that ( means greater than and ] means less than or equal to,
i.e. (20,35] means values greater than 20 and less than or equal to 35, which for whole numbers is the same as saying 21-35.
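
The bracket notation and the effect of include.lowest are easiest to see on a few values that sit on or next to a break point. The sketch below uses a small made-up vector (edge) with the same breaks as the example.

```r
edge <- c(20, 21, 35, 36)    # made-up values on or next to a break point
breaks <- seq(20, 110, 15)

# Default: each class excludes its lower break and includes its upper break,
# so 20 falls outside every class and becomes NA.
default_groups <- as.character(cut(edge, breaks = breaks))
default_groups   # NA "(20,35]" "(20,35]" "(35,50]"

# include.lowest = TRUE also lets the first class include its lower break.
closed_groups <- as.character(cut(edge, breaks = breaks, include.lowest = TRUE))
closed_groups    # "[20,35]" "[20,35]" "[20,35]" "(35,50]"
```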
Is there a way to write those classes in our own, better way?
Of course, yes! You can label each class in a more natural way so that the reader doesn't need to understand the bracket notation. Take a look at the following code.


myData=c(88,82,96,102,104,106,104,24,26,29,86,36,60,23,24,39,48,46,
  33,36,39,78,67,82,32,67,27,24,26,27,30,36,37,49,50,56,83,
  99,68,28,55,54,26,29,30,40,46,44,99,84,36,51,86,88,87,29,
  40,40,40,66,45,23,26,46,46,96,99,100,100,101,103,106,107,
  46,48,49,48,94,55,56,59,60,70,72,76,79,80,50,49,93,86,54,
  83,89,90,94,96,99,102,46)
lowerclass=seq(21,110,15)
upperclass=lowerclass+14
classInterval=paste(lowerclass,'-',upperclass)
groupData=cut(myData,breaks=seq(20,110,15),
labels=classInterval,include.lowest=FALSE)
table(groupData)

Result>>

21 - 35 36 - 50 51 - 65 66 - 80 81 - 95 96 - 110
   19      27      10      10      16      18

How the code works
If you run seq(21,110,15) you will see that it gives you exactly the lower class limit of each class. The aim here is to write the classes in our natural way instead of using brackets.
So, to get the upper class limit we add the class width to the lower limit, but remember you must subtract 1 (for example, 21 + 15 - 1 = 35).
Now we can match both together to form the class interval. paste() is the function used for that task. Then I add labels to the cut() arguments, setting it equal to the classInterval I just created. And boom! We did it.
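
If you would rather have labels without the spaces around the dash, paste() accepts a sep argument. This is just an optional variation on the labels, not something the code in this article depends on.

```r
lowerclass <- seq(21, 110, 15)   # 21 36 51 66 81 96
upperclass <- lowerclass + 14    # 35 50 65 80 95 110

# Default separator is a space, so the dash ends up surrounded by spaces.
spaced <- paste(lowerclass, "-", upperclass)
spaced    # "21 - 35" "36 - 50" ... "96 - 110"

# Using the dash itself as the separator gives compact labels.
compact <- paste(lowerclass, upperclass, sep = "-")
compact   # "21-35" "36-50" ... "96-110"
```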

This is quite interesting! But if you are someone like me, you would prefer the table as a two-column table. How do I do that? Very easy: just make use of the data.frame() function and you are done. Take a look at the following code.

Codes>>

  myData=c(88,82,96,102,104,106,104,24,26,29,86,36,60,23,24,39,48,46,
  33,36,39,78,67,82,32,67,27,24,26,27,30,36,37,49,50,56,83,
  99,68,28,55,54,26,29,30,40,46,44,99,84,36,51,86,88,87,29,
  40,40,40,66,45,23,26,46,46,96,99,100,100,101,103,106,107,
  46,48,49,48,94,55,56,59,60,70,72,76,79,80,50,49,93,86,54,
  83,89,90,94,96,99,102,46)
lowerclass=seq(21,110,15)
upperclass=lowerclass+14
classInterval=paste(lowerclass,'-',upperclass)
wages=cut(myData,breaks=seq(20,110,15),
                    labels=classInterval,include.lowest=FALSE)
groupTable=table(wages)
data.frame(groupTable)

Result>>

    wages    Freq
1   21 - 35   19
2   36 - 50   27
3   51 - 65   10
4   66 - 80   10
5   81 - 95   16
6  96 - 110   18
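
If you need to repeat these steps for other data sets or class widths, they can be wrapped in a small function. The sketch below is only one possible way to do it: the function name group_table and its arguments are hypothetical, and it assumes whole-number data. Values outside the chosen range become NA and are not counted.

```r
# Hypothetical helper wrapping the steps above (assumes whole-number data).
group_table <- function(x, start, end, width) {
  lower <- seq(start, end, by = width)   # lower class limits
  upper <- lower + width - 1             # upper class limits
  classes <- cut(x,
                 breaks = c(start - 1, upper),
                 labels = paste(lower, "-", upper))
  as.data.frame(table(interval = classes))
}

myData <- c(88,82,96,102,104,106,104,24,26,29,86,36,60,23,24,39,48,46,
  33,36,39,78,67,82,32,67,27,24,26,27,30,36,37,49,50,56,83,
  99,68,28,55,54,26,29,30,40,46,44,99,84,36,51,86,88,87,29,
  40,40,40,66,45,23,26,46,46,96,99,100,100,101,103,106,107,
  46,48,49,48,94,55,56,59,60,70,72,76,79,80,50,49,93,86,54,
  83,89,90,94,96,99,102,46)

result <- group_table(myData, start = 21, end = 110, width = 15)
result$Freq   # 19 27 10 10 16 18
```

Calling group_table(myData, 21, 110, 10) would give the nine classes 21 - 30 up to 101 - 110 instead.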

This is incredible, right? I hope you enjoyed this article. If yes, then don't forget to share it with someone else who might be interested. You can drop me a comment if you have any question or correction. Don't forget to follow me on Instagram or Facebook. You can also DM me on WhatsApp should you have anything to discuss; feel free to ask anything. Thanks!
