R is the mother tongue of the data science afficiondas(popularly known as "data scientists") and researchers. If you love playing with data, it is one of the most intuitive languages around but once you start playing with complex expressions and especially large datasets, performance becomes paramount consideration.
For example, suppose an expression/function, f takes 0.1 second units. If dataset has rows in order of thousand, the cost comes to about 15-20 mins which is bearable. but if dataset has rows in higher orders like in millions, it ends up in business of hours which leaves very very less room of margin for even miniscule error as it grows exponentially with size of data.
I learnt it the hard way in past few days writing a simple R function to do data processing and had to come back to my coding instincts to come around it.
The guys from coding background like me have all used profilers to do performance check on their code and optimize it as much as possible. R also provides it and voila!
The trick is to create data samples and profile concerned function using them. Once satisfied that code is optimized as much as possible, move on to bigger datasets for actual run.
> sampleData <- final.traindata[1:10, ]
> Rprof("profiler.out")
> output <- f(x = sampleData)
> Rprof(NULL)
First line here creates data sample. RProf command in second line starts the profiler with only parameter being the path of the profiler output file. It will be created in current working directory by default. Then, execute the function/expression you want to profile and close the profile with NULL parameter.
You are done with profiling and its time now for profiling result!!
> summaryRprof("profiler.out")
$by.self
self.time self.pct total.time total.pct
"as.POSIXct.POSIXlt" 0.82 24.26 0.90 26.63
"structure" 0.36 10.65 0.42 12.43
"levels" 0.26 7.69 0.30 8.88
"as.Date" 0.18 5.33 1.80 53.25
"is.na" 0.10 2.96 1.06 31.36
"sort.int" 0.10 2.96 0.16 4.73
"strptime" 0.10 2.96 0.16 4.73
"NextMethod" 0.10 2.96 0.10 2.96
"$" 0.08 2.37 0.24 7.10
"[[.data.frame" 0.08 2.37 0.16 4.73
"as.Date.POSIXlt" 0.08 2.37 0.08 2.37
"[.data.frame" 0.06 1.78 0.32 9.47
"format" 0.06 1.78 0.18 5.33
"$<-.data.frame" 0.06 1.78 0.12 3.55
"match" 0.06 1.78 0.08 2.37
"format.POSIXlt" 0.06 1.78 0.06 1.78
"is.factor" 0.06 1.78 0.06 1.78
"as.POSIXct" 0.04 1.18 0.94 27.81
"+.Date" 0.04 1.18 0.38 11.24
"as.character" 0.04 1.18 0.28 8.28
"all" 0.04 1.18 0.04 1.18
"any" 0.04 1.18 0.04 1.18
"as.POSIXlt" 0.04 1.18 0.04 1.18
"attr" 0.04 1.18 0.04 1.18
"dim" 0.04 1.18 0.04 1.18
"length" 0.04 1.18 0.04 1.18
"names" 0.04 1.18 0.04 1.18
"paste" 0.04 1.18 0.04 1.18
"as.Date.factor" 0.02 0.59 1.56 46.15
"as.Date.character" 0.02 0.59 1.44 42.60
"charToDate" 0.02 0.59 1.18 34.91
"is.na.POSIXlt" 0.02 0.59 0.98 28.99
"Ops.factor" 0.02 0.59 0.44 13.02
"+" 0.02 0.59 0.40 11.83
"[" 0.02 0.59 0.34 10.06
"$<-" 0.02 0.59 0.14 4.14
"%in%" 0.02 0.59 0.10 2.96
"<Anonymous>" 0.02 0.59 0.04 1.18
"levels.default" 0.02 0.59 0.04 1.18
".subset2" 0.02 0.59 0.02 0.59
"c" 0.02 0.59 0.02 0.59
"is.atomic" 0.02 0.59 0.02 0.59
"nargs" 0.02 0.59 0.02 0.59
"sys.call" 0.02 0.59 0.02 0.59
$by.total
total.time total.pct self.time self.pct
"f" 3.38 100.00 0.00 0.00
"as.Date" 1.80 53.25 0.18 5.33
"as.Date.factor" 1.56 46.15 0.02 0.59
"as.Date.character" 1.44 42.60 0.02 0.59
"charToDate" 1.18 34.91 0.02 0.59
"is.na" 1.06 31.36 0.10 2.96
"is.na.POSIXlt" 0.98 28.99 0.02 0.59
"as.POSIXct" 0.94 27.81 0.04 1.18
"as.POSIXct.POSIXlt" 0.90 26.63 0.82 24.26
"[<-" 0.64 18.93 0.00 0.00
"[<-.factor" 0.64 18.93 0.00 0.00
"Ops.factor" 0.44 13.02 0.02 0.59
"==" 0.44 13.02 0.00 0.00
"structure" 0.42 12.43 0.36 10.65
"+" 0.40 11.83 0.02 0.59
"+.Date" 0.38 11.24 0.04 1.18
"[" 0.34 10.06 0.02 0.59
"[.data.frame" 0.32 9.47 0.06 1.78
"levels" 0.30 8.88 0.26 7.69
"as.character" 0.28 8.28 0.04 1.18
"$" 0.24 7.10 0.08 2.37
"format" 0.18 5.33 0.06 1.78
"as.character.Date" 0.18 5.33 0.00 0.00
"format.Date" 0.18 5.33 0.00 0.00
"sort.int" 0.16 4.73 0.10 2.96
"strptime" 0.16 4.73 0.10 2.96
"[[.data.frame" 0.16 4.73 0.08 2.37
"$.data.frame" 0.16 4.73 0.00 0.00
"[.factor" 0.16 4.73 0.00 0.00
"[[" 0.16 4.73 0.00 0.00
"$<-" 0.14 4.14 0.02 0.59
"$<-.data.frame" 0.12 3.55 0.06 1.78
"NextMethod" 0.10 2.96 0.10 2.96
"%in%" 0.10 2.96 0.02 0.59
"noNA.levels" 0.10 2.96 0.00 0.00
"as.Date.POSIXlt" 0.08 2.37 0.08 2.37
"match" 0.08 2.37 0.06 1.78
".POSIXct" 0.08 2.37 0.00 0.00
"format.POSIXlt" 0.06 1.78 0.06 1.78
"is.factor" 0.06 1.78 0.06 1.78
"as.character.factor" 0.06 1.78 0.00 0.00
"all" 0.04 1.18 0.04 1.18
"any" 0.04 1.18 0.04 1.18
"as.POSIXlt" 0.04 1.18 0.04 1.18
"attr" 0.04 1.18 0.04 1.18
"dim" 0.04 1.18 0.04 1.18
"length" 0.04 1.18 0.04 1.18
"names" 0.04 1.18 0.04 1.18
"paste" 0.04 1.18 0.04 1.18
"<Anonymous>" 0.04 1.18 0.02 0.59
"levels.default" 0.04 1.18 0.02 0.59
"NROW" 0.04 1.18 0.00 0.00
".subset2" 0.02 0.59 0.02 0.59
"c" 0.02 0.59 0.02 0.59
"is.atomic" 0.02 0.59 0.02 0.59
"nargs" 0.02 0.59 0.02 0.59
"sys.call" 0.02 0.59 0.02 0.59
$sample.interval
[1] 0.02
$sampling.time
[1] 3.38
Profiler output has two sections, $by.self and $by.total. "$by.self" gives individual operations with highest recorded time. "$by.total" gives the tree-like break up of operations and their weightage in overall recorded time.
I personally find the second one more fruitful as it gives the nice flow of code execution and at each step, time consumed and its percentage to overall time.
Have fun with Data !!
For example, suppose an expression/function, f takes 0.1 second units. If dataset has rows in order of thousand, the cost comes to about 15-20 mins which is bearable. but if dataset has rows in higher orders like in millions, it ends up in business of hours which leaves very very less room of margin for even miniscule error as it grows exponentially with size of data.
I learnt it the hard way in past few days writing a simple R function to do data processing and had to come back to my coding instincts to come around it.
The guys from coding background like me have all used profilers to do performance check on their code and optimize it as much as possible. R also provides it and voila!
The trick is to create data samples and profile concerned function using them. Once satisfied that code is optimized as much as possible, move on to bigger datasets for actual run.
> sampleData <- final.traindata[1:10, ]
> Rprof("profiler.out")
> output <- f(x = sampleData)
> Rprof(NULL)
First line here creates data sample. RProf command in second line starts the profiler with only parameter being the path of the profiler output file. It will be created in current working directory by default. Then, execute the function/expression you want to profile and close the profile with NULL parameter.
You are done with profiling and its time now for profiling result!!
> summaryRprof("profiler.out")
$by.self
self.time self.pct total.time total.pct
"as.POSIXct.POSIXlt" 0.82 24.26 0.90 26.63
"structure" 0.36 10.65 0.42 12.43
"levels" 0.26 7.69 0.30 8.88
"as.Date" 0.18 5.33 1.80 53.25
"is.na" 0.10 2.96 1.06 31.36
"sort.int" 0.10 2.96 0.16 4.73
"strptime" 0.10 2.96 0.16 4.73
"NextMethod" 0.10 2.96 0.10 2.96
"$" 0.08 2.37 0.24 7.10
"[[.data.frame" 0.08 2.37 0.16 4.73
"as.Date.POSIXlt" 0.08 2.37 0.08 2.37
"[.data.frame" 0.06 1.78 0.32 9.47
"format" 0.06 1.78 0.18 5.33
"$<-.data.frame" 0.06 1.78 0.12 3.55
"match" 0.06 1.78 0.08 2.37
"format.POSIXlt" 0.06 1.78 0.06 1.78
"is.factor" 0.06 1.78 0.06 1.78
"as.POSIXct" 0.04 1.18 0.94 27.81
"+.Date" 0.04 1.18 0.38 11.24
"as.character" 0.04 1.18 0.28 8.28
"all" 0.04 1.18 0.04 1.18
"any" 0.04 1.18 0.04 1.18
"as.POSIXlt" 0.04 1.18 0.04 1.18
"attr" 0.04 1.18 0.04 1.18
"dim" 0.04 1.18 0.04 1.18
"length" 0.04 1.18 0.04 1.18
"names" 0.04 1.18 0.04 1.18
"paste" 0.04 1.18 0.04 1.18
"as.Date.factor" 0.02 0.59 1.56 46.15
"as.Date.character" 0.02 0.59 1.44 42.60
"charToDate" 0.02 0.59 1.18 34.91
"is.na.POSIXlt" 0.02 0.59 0.98 28.99
"Ops.factor" 0.02 0.59 0.44 13.02
"+" 0.02 0.59 0.40 11.83
"[" 0.02 0.59 0.34 10.06
"$<-" 0.02 0.59 0.14 4.14
"%in%" 0.02 0.59 0.10 2.96
"<Anonymous>" 0.02 0.59 0.04 1.18
"levels.default" 0.02 0.59 0.04 1.18
".subset2" 0.02 0.59 0.02 0.59
"c" 0.02 0.59 0.02 0.59
"is.atomic" 0.02 0.59 0.02 0.59
"nargs" 0.02 0.59 0.02 0.59
"sys.call" 0.02 0.59 0.02 0.59
$by.total
total.time total.pct self.time self.pct
"f" 3.38 100.00 0.00 0.00
"as.Date" 1.80 53.25 0.18 5.33
"as.Date.factor" 1.56 46.15 0.02 0.59
"as.Date.character" 1.44 42.60 0.02 0.59
"charToDate" 1.18 34.91 0.02 0.59
"is.na" 1.06 31.36 0.10 2.96
"is.na.POSIXlt" 0.98 28.99 0.02 0.59
"as.POSIXct" 0.94 27.81 0.04 1.18
"as.POSIXct.POSIXlt" 0.90 26.63 0.82 24.26
"[<-" 0.64 18.93 0.00 0.00
"[<-.factor" 0.64 18.93 0.00 0.00
"Ops.factor" 0.44 13.02 0.02 0.59
"==" 0.44 13.02 0.00 0.00
"structure" 0.42 12.43 0.36 10.65
"+" 0.40 11.83 0.02 0.59
"+.Date" 0.38 11.24 0.04 1.18
"[" 0.34 10.06 0.02 0.59
"[.data.frame" 0.32 9.47 0.06 1.78
"levels" 0.30 8.88 0.26 7.69
"as.character" 0.28 8.28 0.04 1.18
"$" 0.24 7.10 0.08 2.37
"format" 0.18 5.33 0.06 1.78
"as.character.Date" 0.18 5.33 0.00 0.00
"format.Date" 0.18 5.33 0.00 0.00
"sort.int" 0.16 4.73 0.10 2.96
"strptime" 0.16 4.73 0.10 2.96
"[[.data.frame" 0.16 4.73 0.08 2.37
"$.data.frame" 0.16 4.73 0.00 0.00
"[.factor" 0.16 4.73 0.00 0.00
"[[" 0.16 4.73 0.00 0.00
"$<-" 0.14 4.14 0.02 0.59
"$<-.data.frame" 0.12 3.55 0.06 1.78
"NextMethod" 0.10 2.96 0.10 2.96
"%in%" 0.10 2.96 0.02 0.59
"noNA.levels" 0.10 2.96 0.00 0.00
"as.Date.POSIXlt" 0.08 2.37 0.08 2.37
"match" 0.08 2.37 0.06 1.78
".POSIXct" 0.08 2.37 0.00 0.00
"format.POSIXlt" 0.06 1.78 0.06 1.78
"is.factor" 0.06 1.78 0.06 1.78
"as.character.factor" 0.06 1.78 0.00 0.00
"all" 0.04 1.18 0.04 1.18
"any" 0.04 1.18 0.04 1.18
"as.POSIXlt" 0.04 1.18 0.04 1.18
"attr" 0.04 1.18 0.04 1.18
"dim" 0.04 1.18 0.04 1.18
"length" 0.04 1.18 0.04 1.18
"names" 0.04 1.18 0.04 1.18
"paste" 0.04 1.18 0.04 1.18
"<Anonymous>" 0.04 1.18 0.02 0.59
"levels.default" 0.04 1.18 0.02 0.59
"NROW" 0.04 1.18 0.00 0.00
".subset2" 0.02 0.59 0.02 0.59
"c" 0.02 0.59 0.02 0.59
"is.atomic" 0.02 0.59 0.02 0.59
"nargs" 0.02 0.59 0.02 0.59
"sys.call" 0.02 0.59 0.02 0.59
$sample.interval
[1] 0.02
$sampling.time
[1] 3.38
Profiler output has two sections, $by.self and $by.total. "$by.self" gives individual operations with highest recorded time. "$by.total" gives the tree-like break up of operations and their weightage in overall recorded time.
I personally find the second one more fruitful as it gives the nice flow of code execution and at each step, time consumed and its percentage to overall time.
Have fun with Data !!