How Goroutines Work
Introduction to Go
If you are new to the Go programming language, or if the sentence "Concurrency is not parallelism" means nothing to you, then check out Rob Pike's excellent talk on the subject. It's 30 minutes long, and I guarantee that watching it is 30 minutes well spent.
To summarize the difference - "when people hear the word concurrency they often think of parallelism, a related but quite distinct concept. In programming, concurrency is the composition of independently executing processes, while parallelism is the simultaneous execution of (possibly related) computations. Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once." 
Go allows us to write concurrent programs. It provides goroutines and, importantly, the ability to communicate between them. I will focus on the former.
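For readers who have not seen either feature yet, here is a minimal sketch of both: the go keyword starts a goroutine, and a channel carries results back. The function names square and sumOfSquares are made up for this illustration.

```go
package main

import "fmt"

// square computes n*n and sends the result on out.
func square(n int, out chan<- int) {
	out <- n * n
}

// sumOfSquares starts one goroutine per input and adds up the results
// in whatever order they arrive on the channel.
func sumOfSquares(nums []int) int {
	out := make(chan int)
	for _, n := range nums {
		go square(n, out) // "go" runs the call in a new goroutine
	}
	total := 0
	for range nums {
		total += <-out // receive exactly one result per goroutine
	}
	return total
}

func main() {
	fmt.Println(sumOfSquares([]int{1, 2, 3, 4})) // 1 + 4 + 9 + 16
}
```

The results can arrive in any order, which is fine here because addition does not care about order.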
Goroutines and Threads - the differences
Go uses goroutines while a language like Java uses threads. What are the differences between the two? We need to look at three factors: memory consumption, setup and teardown costs, and switching time.
Creating a goroutine does not require much memory: only 2 kB of stack space. The stack grows and shrinks as required, with the runtime allocating and freeing heap storage for it. Threads, on the other hand, typically start out at 1 MB (about 500 times more), along with a region of memory called a guard page that acts as a guard between one thread's memory and another's.
A server handling incoming requests can therefore create one goroutine per request without a problem, but one thread per request will eventually lead to the dreaded OutOfMemoryError. This isn't limited to Java - any language that uses OS threads as the primary means of concurrency will face this issue.
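To get a feel for how cheap this is, the sketch below parks ten thousand goroutines on a channel and counts them with runtime.NumGoroutine. The helper name spawnBlocked and the figure of 10,000 are arbitrary choices for this illustration, not something taken from a real server.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// spawnBlocked starts n goroutines that all park on the same channel and
// returns how many additional goroutines exist while they are parked.
func spawnBlocked(n int) int {
	before := runtime.NumGoroutine()
	release := make(chan struct{})

	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-release // parked here until the channel is closed
		}()
	}

	started.Wait() // every goroutine has been created and is running or parked
	extra := runtime.NumGoroutine() - before

	close(release) // wake all of them up
	done.Wait()
	return extra
}

func main() {
	fmt.Println("goroutines alive at once:", spawnBlocked(10000))
}
```

At roughly 2 kB of initial stack each, ten thousand goroutines need on the order of 20 MB; the same number of threads with 1 MB stacks would reserve on the order of 10 GB.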
Setup and teardown costs
Threads have significant setup and teardown costs because they have to request resources from the OS and return them once they are done. The usual workaround is to maintain a pool of threads. In contrast, goroutines are created and destroyed by the runtime, and those operations are cheap. The language doesn't support manual management of goroutines.
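Because creation and teardown are cheap, a common pattern is to start one goroutine per unit of work instead of reusing workers from a pool. The sketch below does that for a slice of strings; the function name lengths is hypothetical, and sync.WaitGroup is used only to wait for completion, since the runtime handles the rest of each goroutine's lifetime.

```go
package main

import (
	"fmt"
	"sync"
)

// lengths computes len(w) for every word, using one short-lived goroutine
// per word. No pool is kept: each goroutine is created for its task and
// cleaned up by the runtime when the function literal returns.
func lengths(words []string) []int {
	out := make([]int, len(words))
	var wg sync.WaitGroup
	for i, w := range words {
		wg.Add(1)
		go func(i int, w string) {
			defer wg.Done()
			out[i] = len(w) // each goroutine writes only its own slot
		}(i, w)
	}
	wg.Wait() // all goroutines have finished once this returns
	return out
}

func main() {
	fmt.Println(lengths([]string{"go", "routine", "thread"}))
}
```

Note that there is no handle to a goroutine here: nothing to join, kill or reuse. The WaitGroup only counts completions.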
When a thread blocks, another has to be scheduled in its place. Threads are scheduled preemptively, and during a thread switch, the scheduler needs to save/restore ALL registers, that is, 16 general purpose registers, PC (Program Counter), SP (Stack Pointer), segment registers, 16 XMM registers, FP coprocessor state, 16 AVX registers, all MSRs etc. This is quite significant when there is rapid switching between threads.
Goroutines are scheduled cooperatively, and when a switch occurs only three registers need to be saved and restored: the Program Counter, the Stack Pointer and DX. The cost is much lower.
As discussed earlier, the number of goroutines is generally much higher, but that doesn't make a difference to switching time, for two reasons. First, only runnable goroutines are considered; blocked ones are not. Second, modern schedulers have O(1) complexity, meaning switching time is not affected by the number of choices (threads or goroutines).
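One rough way to observe switching cost is to bounce a token between two goroutines over unbuffered channels, which forces a handoff on every send. This is a sketch rather than a proper benchmark: pingPong is a made-up helper, the measured time includes channel overhead, and the numbers will vary by machine and Go version.

```go
package main

import (
	"fmt"
	"time"
)

// pingPong passes a token back and forth between the calling goroutine and
// a helper goroutine. It returns the number of completed round trips and
// the total time they took.
func pingPong(rounds int) (int, time.Duration) {
	ping := make(chan struct{})
	pong := make(chan struct{})

	go func() {
		for range ping { // runs until ping is closed
			pong <- struct{}{}
		}
		close(pong)
	}()

	start := time.Now()
	completed := 0
	for i := 0; i < rounds; i++ {
		ping <- struct{}{} // hand control to the helper
		<-pong             // and wait for it to hand control back
		completed++
	}
	elapsed := time.Since(start)

	close(ping) // lets the helper goroutine exit
	return completed, elapsed
}

func main() {
	const rounds = 100000
	n, elapsed := pingPong(rounds)
	fmt.Printf("%d round trips, about %v each\n", n, elapsed/time.Duration(n))
}
```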
How goroutines are executed
As mentioned earlier, the runtime manages goroutines throughout their lifetime, from creation to scheduling to teardown. The runtime is allocated a few threads on which all the goroutines are multiplexed. At any point in time, each thread will be executing one goroutine. If that goroutine is blocked, then it will be swapped out for another goroutine that will execute on that thread instead.
Because goroutines are scheduled cooperatively, a goroutine that loops continuously can starve other goroutines on the same thread. In Go 1.2, this problem is somewhat alleviated by occasionally invoking the Go scheduler when entering a function, so a loop that includes a non-inlined function call can be preempted. Go 1.14 went further and made goroutines asynchronously preemptible, so loops without function calls can be interrupted as well.
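A goroutine can also give up its thread voluntarily with runtime.Gosched. The sketch below busy-waits on a flag and yields on every iteration so that the goroutine setting the flag gets a chance to run. The function name spinUntilReady is hypothetical, and a busy-wait is used only to make the yield visible; a channel would be the idiomatic way to wait. On a runtime without asynchronous preemption and with a single thread running Go code, I would expect the same loop without the yield to spin forever.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// spinUntilReady busy-waits for another goroutine to set a flag, yielding
// the thread on every iteration so that the other goroutine can be scheduled.
func spinUntilReady() bool {
	var ready int32
	go func() {
		atomic.StoreInt32(&ready, 1)
	}()
	for atomic.LoadInt32(&ready) == 0 {
		runtime.Gosched() // let another runnable goroutine use this thread
	}
	return atomic.LoadInt32(&ready) == 1
}

func main() {
	fmt.Println("flag observed:", spinUntilReady())
}
```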
Goroutines are cheap and do not cause the thread on which they are multiplexed to block if they are blocked on:
- network input,
- channel operations, or
- primitives in the sync package.
Even if tens of thousands of goroutines have been spawned, it's not a waste of system resources if most of them are blocked on one of these since the runtime schedules another goroutine instead.
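The sketch below illustrates two of these cases. It limits the program to one thread running Go code, parks a batch of goroutines on a mutex and on a channel, and then shows that a separate goroutine still gets to run and produce a result. The name workWhileOthersBlock and the numbers are made up for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// workWhileOthersBlock parks 2*n goroutines (half on a mutex, half on a
// channel) and then computes 1+2+...+100 in yet another goroutine, all
// while only one thread at a time is allowed to run Go code.
func workWhileOthersBlock(n int) int {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev) // restore the previous setting

	var mu sync.Mutex
	mu.Lock() // held by us, so the goroutines below block in mu.Lock
	gate := make(chan struct{})

	var started, finished sync.WaitGroup
	started.Add(2 * n)
	finished.Add(2 * n)
	for i := 0; i < n; i++ {
		go func() { // blocks on a sync primitive
			defer finished.Done()
			started.Done()
			mu.Lock()
			mu.Unlock()
		}()
		go func() { // blocks on a channel receive
			defer finished.Done()
			started.Done()
			<-gate
		}()
	}
	started.Wait() // every goroutine is at, or about to reach, its blocking call

	result := make(chan int)
	go func() {
		sum := 0
		for i := 1; i <= 100; i++ {
			sum += i
		}
		result <- sum
	}()
	sum := <-result // the worker ran even though 2*n goroutines are parked

	mu.Unlock()
	close(gate)
	finished.Wait()
	return sum
}

func main() {
	fmt.Println(workWhileOthersBlock(1000))
}
```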
In simple terms, goroutines are a lightweight abstraction over threads. A Go programmer does not deal with threads, and similarly the OS is not aware of the existence of goroutines. From the OS's perspective, a Go program will behave like an event-driven C program. 
Threads and processors
Although you cannot directly control the number of threads that the runtime will create, it is possible to set the number of processor cores used by the program. This is done by setting the GOMAXPROCS environment variable or by calling runtime.GOMAXPROCS(n). Increasing the number of cores may not necessarily improve the performance of your program, depending on its design. The profiling tools can be used to find the ideal number of cores for your program.
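As a small sketch of the call itself: runtime.GOMAXPROCS(n) applies the new value and returns the previous one, and passing a value below 1 only queries the current setting. The helper procsAfterSetting is a made-up name for this example.

```go
package main

import (
	"fmt"
	"runtime"
)

// procsAfterSetting sets GOMAXPROCS to n, reports the value now in effect,
// and restores the previous setting before returning.
func procsAfterSetting(n int) int {
	prev := runtime.GOMAXPROCS(n) // returns the old setting
	defer runtime.GOMAXPROCS(prev)
	return runtime.GOMAXPROCS(0) // an argument below 1 only queries
}

func main() {
	fmt.Println("logical CPUs:      ", runtime.NumCPU())
	fmt.Println("current GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("after setting to 2:", procsAfterSetting(2))
}
```

The same limit can be applied without touching the code by setting the GOMAXPROCS environment variable before starting the program.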
As with other languages, it is important to prevent simultaneous access to shared resources by more than one goroutine. It is best to transfer data between goroutines using channels, i.e., do not communicate by sharing memory; instead, share memory by communicating.
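Here is one way that advice can look in practice, as a sketch with made-up names (counter, countWith). A single goroutine owns the running total, and every other goroutine sends increments over a channel instead of touching a shared variable, so no lock is needed.

```go
package main

import (
	"fmt"
	"sync"
)

// counter owns the running total; no other goroutine ever touches sum.
// It adds up everything received on incs and reports the final value on
// total once incs has been closed.
func counter(incs <-chan int, total chan<- int) {
	sum := 0
	for n := range incs {
		sum += n
	}
	total <- sum
}

// countWith starts the given number of workers, each of which sends
// perWorker increments to the counter goroutine.
func countWith(workers, perWorker int) int {
	incs := make(chan int)
	total := make(chan int)
	go counter(incs, total)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				incs <- 1 // communicate the change instead of sharing the total
			}
		}()
	}
	wg.Wait()
	close(incs) // no more increments; counter can report
	return <-total
}

func main() {
	fmt.Println(countWith(10, 100))
}
```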
Lastly, I'd strongly recommend you check out Communicating Sequential Processes by C. A. R. Hoare. This man was truly a genius. In this paper (published in 1978) he predicted how the single-core performance of processors would eventually plateau and chip-makers would instead increase the number of cores. His proposal to exploit this had a deep influence on the design of Go.
1 - Concurrency is not parallelism by Rob Pike
2 - Effective Go: Goroutines
3 - Goroutine stack size was decreased from 8kB to 2kB in Go 1.4.
4 - Goroutine stacks became contiguous in Go 1.3.
5 - Dmitry Vyukov explains scheduling of goroutines on golang-nuts
6 - Analysis of the Go runtime scheduler by Deshpande et al.
7 - 5 things that make Go fast by Dave Cheney