The easiest method is to create the captions with voice recognition. After you transcribe the video, Google's voice recognition places the captions at the right spots in the video, and I was amazed at how accurately it did this. With the following video, I noticed that the timing was off at the very end, but other than that it was pretty accurate.
Using CaptionTube is more accurate but also more time-consuming and technical. I had to wait for it to buffer multiple times while working with the following video, which is only 22 seconds long. I somehow ended up with double captions in places, but I was happy to see that CaptionTube had saved my work, so I was able to go back and make edits. Even though this method was more technical and time-consuming, I think it gets easier with a little practice. Here is a video that I captioned using CaptionTube.