I recently got some hands-on experience with “Big Data”. The term is literal: the data really is big, and working with it on a low-end laptop like mine (64-bit processor and 4GB memory) is almost impossible.
Some of the competitions on Kaggle, for example, provide gigabytes of data, sometimes more than 10 or 20GB. My laptop does not have enough memory even to preview a sample of the data in Excel or to load it into R. Other activities, such as training on the data, also consume a lot of memory because the algorithms work iteratively to find the optimal parameters.
But that doesn’t mean we cannot work with the data at all. Here are six steps I follow to deal with it. Please note that I use Large Text File Viewer, SQLite and R. The method should also work if you pick other programs that do similar things, such as MySQL and Python.
Step 1: I can’t load it with Excel, but I can open it with Large Text File Viewer
Even when Notepad fails to load the data, Large Text File Viewer is a powerful replacement. A file in CSV format, for example, is read as plain text. This first step is important for inspecting the data: it tells you the number of columns in the table and the type of field in each column.
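The same quick peek can also be scripted. Since Python is one of the alternatives mentioned above, here is a minimal sketch that reads only the first few lines of a huge file without loading the whole thing into memory (the filename is a mock-up, like the column names later on):

```python
from itertools import islice

def peek(path, n=5):
    """Return only the first n lines of a file, however large the file is."""
    with open(path, "r") as f:
        return list(islice(f, n))

# Write a small mock CSV so the example is self-contained.
with open("train_mock.csv", "w") as f:
    f.write("ID,label,i1,i2,t1,t2\n")
    for i in range(1000):
        f.write(f"{i},0,1,2,a,b\n")

lines = peek("train_mock.csv", 3)
header = lines[0].strip().split(",")  # column names tell us the fields
```

Because `islice` stops after `n` lines, memory use stays flat no matter how big the file is.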
Step 2: Load the data into SQLite
Once the number of columns and the type of each column are known, load the data into SQLite. The following commands show an example of how to load the data; the column names are just a mock-up.
Create the database
C:/thepath/tosavethefile>sqlite3 datadb.db
SQLite version 3.8.2 2013-12-06 14:53:30
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
Create new table (ex:train) and load the train data into the table
sqlite> create table train(ID integer, label integer, i1 integer, i2 integer, t1 text, t2 text);
sqlite> .separator ","
sqlite> .mode csv
sqlite> .import train.csv train
Create new table (ex:test) and load the test data into the table
sqlite> create table test(ID integer, i1 integer, i2 integer, t1 text, t2 text);
sqlite> .separator ","
sqlite> .mode csv
sqlite> .import test.csv test
Check the number of rows in the table
sqlite> SELECT COUNT(*) FROM train;
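The same import can also be scripted with Python’s built-in sqlite3 module, one of the alternatives mentioned earlier. A sketch, reusing the mock-up column names (an in-memory database and a tiny generated CSV stand in for the real files):

```python
import csv
import sqlite3

con = sqlite3.connect(":memory:")  # use a path like "datadb.db" to persist
con.execute("CREATE TABLE train (ID integer, label integer, "
            "i1 integer, i2 integer, t1 text, t2 text)")

# Write a tiny mock train.csv so the example is self-contained.
with open("train_demo.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["ID", "label", "i1", "i2", "t1", "t2"])  # header row
    for i in range(100):
        w.writerow([i, i % 2, i, 2 * i, "a", "b"])

# Stream the CSV into the table row by row -- memory use stays flat.
with open("train_demo.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header line
    con.executemany("INSERT INTO train VALUES (?,?,?,?,?,?)", reader)
con.commit()

n_rows = con.execute("SELECT COUNT(*) FROM train").fetchone()[0]
```

The final query mirrors the `SELECT COUNT(*)` row check above.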
Step 3: Load partial data from SQLite into R
R is unable to load the entire dataset. The best approach is to load part of the data, for example 10%, from SQLite into R using the RSQLite package. Since the number of rows in the table is known, say 30 million rows, 10% of the data is 3 million.
Since we take only 10% of the data at a time, the following code (and Step 4) must be run ten times. The code below loads the first 10% of the data; note that SQLite’s LIMIT clause here takes the form LIMIT offset,count.
#load library
library(RSQLite)
#make connection
drv <- dbDriver("SQLite")
con <- dbConnect(drv, "datadb.db")
#select 10 percent of the data
train.data <- dbGetQuery(con, "SELECT * FROM train LIMIT 1,3000000")
The second 10% of the data uses this query:
train.data<-dbGetQuery(con,"SELECT * FROM train LIMIT 3000001,3000000")
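The ten queries need not be typed by hand. A sketch of the same LIMIT offset,count pattern as a loop, in Python with sqlite3 (again one of the alternatives named earlier), with a 30-row toy table standing in for the 30-million-row one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE train (ID integer, label integer)")
con.executemany("INSERT INTO train VALUES (?, ?)",
                [(i, i % 2) for i in range(30)])  # 30 rows, not 30 million

chunk_size = 3  # 10% of the toy table, like the 3,000,000 in the real one
chunks = []
for k in range(10):
    # LIMIT offset,count: skip k full chunks, then take one chunk.
    rows = con.execute("SELECT * FROM train LIMIT ?,?",
                       (k * chunk_size, chunk_size)).fetchall()
    chunks.append(rows)  # in the real workflow: train one model per chunk
```

Each iteration pulls only one chunk into memory, which is the whole point of the approach.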
Step 4: Train the data in R and Save
Some data manipulation might be needed before training begins. Once the data are ready, train on the subset and save the model for later use. The following code trains a linear model (regression):
lmmodel <- lm(label ~ ., data = train.data)
save(lmmodel, file = "lmmodel1.rda")
Step 5: Ensemble Method
In the end, there will be 10 models. We need to load them one by one; since each model is very light, it won’t take much memory.
load("lmmodel1.rda")
The idea is to ensemble the models by averaging, in the spirit of methods like Random Forest and boosting, which combine many weak models into one. In this case, we average the coefficients of the linear regression models.
If I run the model
lmmodel
I will get all the coefficient values. To average each coefficient across the 10 models, we need to access each value, sum them, and divide by 10. The following code shows how to access the coefficient value of a single variable.
lmmodel$coefficients["i1"]
The line above is for the i1 variable. Below is the example for the intercept.
> lmmodel$coefficients["(Intercept)"]
(Intercept)
  -7.544367
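The sum-and-divide loop over the ten models can be sketched compactly in Python. The coefficient dictionaries below are mock-ups standing in for the values pulled out of lmmodel1 through lmmodel10:

```python
# Mock-up: pretend these were extracted from the 10 saved models,
# e.g. via lmmodel$coefficients in R or pickle.load in Python.
models = [{"(Intercept)": -7.5 + 0.1 * k, "i1": 2.0, "i2": -1.0}
          for k in range(10)]

# Ensemble by averaging: sum each coefficient over the models, divide by 10.
ensemble = {name: sum(m[name] for m in models) / len(models)
            for name in models[0]}
```

The resulting dictionary holds one averaged value per coefficient name, ready to use for prediction.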
Step 6: Test the data
The last step is the test step: load the test data, then compute the label values with the ensemble model. Suppose the ensemble model is saved in "lmensemble"; the following code predicts the label values.
First load the test data
test.data <- dbGetQuery(con, "SELECT * FROM test")
Then predict the label.
predict.val <- predict(lmensemble, newdata = test.data, se.fit = TRUE)
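Under the hood, predicting with a linear model is just the linear formula applied to each test row. A Python sketch with hypothetical averaged coefficients and toy test rows:

```python
# Hypothetical averaged coefficients from the ensemble step.
ensemble = {"(Intercept)": 1.0, "i1": 3.0}

# Toy rows standing in for test.data; each maps column name to value.
test_rows = [{"i1": 0.0}, {"i1": 2.0}, {"i1": 10.0}]

# predict() for a linear model: intercept + sum of coef * value.
predictions = [ensemble["(Intercept)"] +
               sum(ensemble[c] * row[c] for c in row)
               for row in test_rows]
```

For the first toy row, for instance, the prediction is 1.0 + 3.0 * 0.0 = 1.0.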
That’s the wrap-up. Feel free to comment.