Recording Vocals at Home: The Acoustic, Electrical & Spatial Reality

Studio Guide 01 · Cloud Atelier · Updated April 2026 · ~14 min read

Before you choose a microphone, your room has already chosen one for you. Before you compare preamps, the physics of self-noise and gain has already set a floor on how clean your recording can be. This guide treats vocal recording the way an acoustics textbook treats it — as a system of interacting physical and electrical variables, not a shopping list.

HOW WE RESEARCH · WHAT WE DO NOT CLAIM

Cloud Atelier does not run a test lab. We have not personally A/B tested every microphone, interface or monitor cited in this guide. The physics in this article (RT60, self-noise, polar patterns, latency, LUFS) come from published acoustics literature and standards. The product-specific specifications come from current manufacturer datasheets. Models are mentioned because their published spec satisfies a stated criterion — not because we declared them “best.” Where you see a product below, you will also see the source of the spec we cited and a link to an independent reviewer (Sound on Sound) where you can verify our reading against working engineers.

Pre-read: If your monitoring environment is not honest, every decision in this article is a guess. Before reading further, consider Treat Your Room — the acoustic prerequisite for every recording and mixing decision.

1. Your room is the first instrument

A microphone does not record your voice. It records the sum of your voice plus every reflection of that voice arriving at the diaphragm in the milliseconds after the direct sound. In a small bedroom, those reflections arrive within 5–15 milliseconds — well inside the temporal integration window of human hearing — and the listener perceives them not as discrete echoes but as smearing, boxiness, and an unmistakable quality the industry calls roominess.

RT60: the single most useful number

Reverberation time, written RT60, is the time it takes for a sound to decay by 60 dB after the source stops. Professional vocal booths target an RT60 of 0.15–0.25 seconds across the speech band (200 Hz – 4 kHz). Untreated bedrooms typically measure 0.4–0.7 seconds. Hardwood-floored living rooms often measure above 1.0 second. The Sabine equation explains why:

RT60 = 0.161 × V / A
where V is room volume in m³ and A is total absorption in sabins (m² of equivalent perfect absorber).

A 12 m² bedroom with a 2.5 m ceiling has a volume of 30 m³. To bring its RT60 down from 0.6 s to 0.25 s in the 1 kHz band, total absorption must rise from about 8 sabins to about 19 sabins, so you need to add roughly 11 sabins: equivalent to four 2-inch fibreglass panels (each ~3 sabins at 1 kHz) covering the first reflection points. That is the physics. Foam wedges from the hardware store provide perhaps 0.4 sabins per square metre at 1 kHz and almost nothing below 250 Hz, which is why foam-only treatment leaves the low-mid mud entirely intact.
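As a rough check on that arithmetic, the Sabine equation can be rearranged for A and evaluated at both reverberation times. This is a minimal sketch: the function names are illustrative, and the 3-sabin panel figure is simply reused from the paragraph above rather than measured.

```python
import math

SABINE_CONSTANT = 0.161  # metric Sabine constant (s/m), as in the equation above


def absorption_sabins(volume_m3: float, rt60_s: float) -> float:
    """Total absorption A implied by RT60 = 0.161 * V / A."""
    return SABINE_CONSTANT * volume_m3 / rt60_s


def extra_absorption_sabins(volume_m3: float, rt60_now_s: float, rt60_target_s: float) -> float:
    """Absorption that must be added to move from the current RT60 to the target."""
    return absorption_sabins(volume_m3, rt60_target_s) - absorption_sabins(volume_m3, rt60_now_s)


volume = 12 * 2.5  # 12 m² floor, 2.5 m ceiling -> 30 m³
extra = extra_absorption_sabins(volume, rt60_now_s=0.6, rt60_target_s=0.25)
panels = math.ceil(extra / 3.0)  # assumes ~3 sabins per panel at 1 kHz
print(f"add {extra:.1f} sabins, about {panels} panels")  # add 11.3 sabins, about 4 panels
```

The same two functions can be reused for any room volume and target, but Sabine's formula assumes a reasonably diffuse field, so treat the result as an estimate for a small bedroom.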

Room modes and the lower limit of an honest recording

Standing waves trap energy at frequencies for which a whole number of half-wavelengths fits exactly between two parallel surfaces. For a room of dimensions L × W × H (in metres), the axial mode frequencies along the length are f = n × c / (2L) for n = 1, 2, 3, …, where c is the speed of sound (~343 m/s); the same formula applies to W and H. A 4 m wall spacing produces its first mode at 43 Hz; a 3 m spacing produces one at 57 Hz. Inside a bedroom, you typically have a forest of axial, tangential, and oblique modes between 30 and 200 Hz. They cause some bass notes to ring loudly while others almost vanish, depending on where the singer and microphone happen to stand.

For vocal recording specifically, this matters less than it does for a kick drum — the fundamental of a male voice sits around 100–130 Hz and a female voice around 180–240 Hz — but resonances between 200 and 400 Hz directly stack on the warmth band of the human voice and cause the boxy quality that makes amateur recordings instantly recognisable.
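To see how quickly axial modes pile up below 200 Hz, the mode formula can be evaluated for each room dimension. The sketch below covers axial modes only (tangential and oblique modes are left out), and the 4 m × 3 m × 2.5 m room is a hypothetical example.

```python
SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature


def axial_modes(dimension_m: float, max_hz: float = 200.0) -> list[float]:
    """Axial mode frequencies n * c / (2 * d) for one room dimension, up to max_hz."""
    modes = []
    n = 1
    while n * SPEED_OF_SOUND / (2 * dimension_m) <= max_hz:
        modes.append(n * SPEED_OF_SOUND / (2 * dimension_m))
        n += 1
    return modes


# Hypothetical 4 m x 3 m x 2.5 m bedroom
for name, size in (("length", 4.0), ("width", 3.0), ("height", 2.5)):
    print(name, [round(f) for f in axial_modes(size)])
```

Frequencies that appear in more than one list, or sit close together, are the ones most likely to ring audibly.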

First reflection points and comb filtering

When a sound reflects off a wall and returns to the microphone within a few milliseconds of the direct sound, the two signals interfere coherently. Frequencies whose period fits the delay a whole number of times are reinforced; those for which the delay is an odd number of half-periods are cancelled. This is comb filtering, and it is the dominant audible signature of a small untreated room. A reflection arriving 1 ms late notches frequencies at 500 Hz, 1.5 kHz, 2.5 kHz, etc., and reinforces 1 kHz, 2 kHz, 3 kHz, etc. The ear hears this as a hollow, phasey colouration that no amount of mixing removes.
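The notch positions follow directly from the delay. The sketch below assumes a single reflection with the same polarity as the direct sound; the function names are illustrative.

```python
def comb_notches(delay_s: float, max_hz: float = 20_000.0) -> list[float]:
    """Cancelled frequencies (2k + 1) / (2 * delay) for k = 0, 1, 2, ..."""
    notches = []
    k = 0
    while (2 * k + 1) / (2 * delay_s) <= max_hz:
        notches.append((2 * k + 1) / (2 * delay_s))
        k += 1
    return notches


def reflection_delay_s(extra_path_m: float, speed_of_sound: float = 343.0) -> float:
    """Delay of a reflection that travels extra_path_m further than the direct sound."""
    return extra_path_m / speed_of_sound


# A 1 ms delay corresponds to roughly 34 cm of extra path length.
print(comb_notches(0.001, max_hz=3000))
print(round(reflection_delay_s(0.343) * 1000, 3), "ms")
```

Because the notch spacing depends only on the extra path length, moving the microphone a few centimetres relative to a wall shifts the whole comb.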

Direct sound + first reflections: every untreated parallel surface adds another comb.

PRACTICAL FIX

Place the singer in the longer dimension of the room, facing the long wall. Hang a heavy moving blanket or a 2-inch fibreglass panel directly behind the microphone (absorbing the sound that travels past the mic before it can bounce back) and another to the singer’s left and right. Cover the floor immediately under the mic with a thick rug. That is roughly 8 sabins of absorption for under $80 and removes the worst of the comb filtering before anyone presses Record.

2. The signal chain, from acoustic pressure to digital sample

Every vocal recording is a chain of energy conversions. Understanding the chain matters because every stage has a noise floor, and the chain’s overall signal-to-noise ratio is dominated by whichever stage is weakest. Most home producers spend three figures on a microphone and then feed it into a 50-dollar interface running at unity gain, never realising that the preamp stage just halved their dynamic range.

Five stages, each with its own noise floor and headroom budget.

3. Microphone physics: condenser vs dynamic

A microphone converts the pressure variations of a sound wave into a voltage. There are two dominant electroacoustic principles in studio use, and choosing between them is the most consequential decision in the chain.

The condenser principle

A condenser (capacitor) microphone places a thin gold-sputtered diaphragm a few tens of microns from a charged backplate. Sound pressure displaces the diaphragm, modulating the capacitance and therefore the voltage across the capsule. Because the moving mass is essentially negligible, condensers respond to transients within a few microseconds and their frequency response extends well above 16 kHz. They are also extremely sensitive: typical sensitivity is 10–30 mV/Pa, roughly ten to twenty times that of a dynamic microphone.

That sensitivity is why condensers capture the breath, the lip click, the chest resonance — and also the laptop fan three metres away. Sensitivity has no preference for source.

The dynamic principle

A dynamic microphone attaches a small voice coil to the diaphragm and suspends it in a magnetic field. Diaphragm motion induces a voltage in the coil. The moving mass is much higher (the coil itself is copper), so transient response is slower, frequency response above 12–16 kHz rolls off, and sensitivity typically sits between 1 and 3 mV/Pa, roughly 20 dB or more below a condenser.

That lower sensitivity is exactly why dynamics work in untreated rooms. The microphone is, by physics, less able to hear the room. It is also why a dynamic needs more preamp gain to reach a usable recording level, which leads us directly to the SM7B problem.

Self-noise: the spec almost nobody reads

The equivalent self-noise of a condenser microphone is the SPL value (in dBA) that, at the diaphragm, would produce the same output voltage as the microphone’s own electronic hiss in silence. Lower is better. Anything under 14 dBA is excellent; under 8 dBA is exceptional. In a room whose ambient level is around 30 dBA, a microphone with 18 dBA self-noise is masked by the room; its hiss becomes the dominant noise only as the ambient level drops towards 18 dBA, which is rare anywhere outside a treated booth at night. Once the room is that quiet, self-noise places a hard floor on recording cleanliness that no preamp can undo.
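To see when self-noise starts to matter, the microphone's equivalent noise and the room's ambient level can be combined as uncorrelated sources. This is a sketch under that assumption, with both figures taken as A-weighted levels at the diaphragm; the function name is illustrative.

```python
import math


def combine_levels_db(*levels_db: float) -> float:
    """Sum uncorrelated noise sources given in dB: 10 * log10(sum(10^(L/10)))."""
    return 10 * math.log10(sum(10 ** (level / 10) for level in levels_db))


# 18 dBA microphone in a 30 dBA room: the room dominates, the mic adds about 0.3 dB.
print(round(combine_levels_db(18, 30), 2))
# Same microphone in an 18 dBA booth: the mic now adds about 3 dB to the floor.
print(round(combine_levels_db(18, 18), 2))
```

The first case shows why a quieter microphone changes little in an untreated room, and the second why the spec matters once the room is treated.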

Microphone	Type	Sensitivity	Self-noise (A)	Max SPL
Audio-Technica AT2020	Condenser	14.1 mV/Pa	20 dBA	144 dB SPL
Rode NT1 (5th Gen)	Condenser	32 mV/Pa	4 dBA	142 dB SPL
Shure SM7B	Dynamic	1.12 mV/Pa	n/a (passive)	180+ dB SPL
Neumann TLM 102	Condenser	11 mV/Pa	12 dBA	144 dB SPL
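Sensitivity figures are quoted at 1 Pa, which corresponds to 94 dB SPL, so they can be scaled to other sound levels and compared in decibels. The sketch below uses the AT2020 and SM7B values from the table; the 74 dB SPL input level is an assumed example, not a measured one.

```python
import math

REFERENCE_SPL_DB = 94.0  # 1 Pa, the pressure at which sensitivity is specified


def output_mv(sensitivity_mv_per_pa: float, spl_db: float) -> float:
    """RMS output voltage in mV for a given sound pressure level at the capsule."""
    pressure_pa = 10 ** ((spl_db - REFERENCE_SPL_DB) / 20)
    return sensitivity_mv_per_pa * pressure_pa


def sensitivity_gap_db(mic_a_mv_per_pa: float, mic_b_mv_per_pa: float) -> float:
    """How many dB hotter mic A's output is than mic B's for the same sound."""
    return 20 * math.log10(mic_a_mv_per_pa / mic_b_mv_per_pa)


print(round(output_mv(1.12, 74.0), 3), "mV from an SM7B at an assumed 74 dB SPL")
print(round(sensitivity_gap_db(14.1, 1.12), 1), "dB between AT2020 and SM7B")
```

That gap is the extra preamp gain the dynamic microphone needs to reach the same recording level.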

4. Preamp gain, headroom, and the SM7B problem

Microphone-level signals are tiny. A condenser at conversational distance produces around 1–10 mV peak. A dynamic in the same position produces 0.1–1 mV. Bringing those signals up to line level (around 1 V) requires roughly 40 dB of gain for a 10 mV signal and 60 dB or more for anything below 1 mV, all in a clean amplification stage. The preamp’s job is to do that gain without adding hiss or distortion.

Two specifications matter. Equivalent input noise (EIN) tells you how much noise the preamp itself introduces, referred to its input. EIN of −128 dBu is excellent; −120 dBu is adequate; −115 dBu becomes audible behind a quiet dynamic mic at high gain. Maximum gain before clipping tells you whether the preamp can drive a low-output dynamic at all.

The Shure SM7B’s 1.12 mV/Pa sensitivity at conversational distance produces roughly 0.5 mV peak. Bringing that to line level requires about 65 dB of gain. A third-generation Focusrite Scarlett 2i2 maxes out at 56 dB; a fourth-generation 2i2 reaches 69 dB; a UA Volt 2 reaches 55 dB. If your preamp does not comfortably hit 60+ dB, an inline amplifier such as the Cloudlifter CL-1 or Triton FetHead adds roughly 25 dB of clean gain before the preamp ever sees the signal, sidestepping the issue.
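The gain figure above is a plain voltage ratio expressed in decibels. The sketch below reuses the 0.5 mV peak, the 1 V line-level target and the preamp maximums quoted in this section; the 25 dB inline-amplifier figure is treated as a round number.

```python
import math


def gain_needed_db(source_v: float, target_v: float = 1.0) -> float:
    """Voltage gain in dB required to raise source_v to target_v."""
    return 20 * math.log10(target_v / source_v)


required = gain_needed_db(0.5e-3)  # SM7B at roughly 0.5 mV peak
INLINE_GAIN_DB = 25.0              # approximate inline-amplifier gain quoted above

for preamp_max_db in (55, 56, 69):
    shortfall = required - preamp_max_db
    shortfall_with_inline = shortfall - INLINE_GAIN_DB
    print(f"{preamp_max_db} dB preamp: short by {shortfall:+.1f} dB, "
          f"{shortfall_with_inline:+.1f} dB with inline gain")
```

A negative number means the preamp has gain to spare, which is where you want to be so that it is not running at its noisiest setting.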

5. Converters, sample rate, bit depth, and 32-bit float

After the preamp, the signal hits an analog-to-digital converter. Two specs are decisive. Bit depth determines dynamic range: 16-bit gives a theoretical 96 dB; 24-bit gives 144 dB. Modern interfaces record 24-bit by default. The extra range does not improve the sound by itself; it lets you leave 12–18 dB of headroom below full scale for safe gain staging without pushing quiet passages into the noise floor, which prevents clipping when a singer suddenly belts.
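The 96 dB and 144 dB figures come from the ratio between the largest and smallest values a fixed-point sample can hold. A minimal sketch of that calculation, ignoring dither and real converter noise (which keep practical figures below the theoretical ones):

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of fixed-point audio: 20 * log10(2^bits)."""
    return 20 * math.log10(2 ** bits)


for bits in (16, 24):
    print(bits, "bit:", round(dynamic_range_db(bits), 1), "dB")
```

Each extra bit adds about 6 dB, so the eight additional bits of 24-bit audio provide roughly 48 dB of extra range.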

Sample rate determines bandwidth via the Nyquist theorem (max frequency = sample rate / 2). 44.1 kHz captures up to 22.05 kHz, exceeding human hearing. 48 kHz is the broadcast standard. 96 and 192 kHz exist primarily to leave room for internal processing without aliasing; for tracking vocals at home, 48 kHz is the correct default. Higher rates double or quadruple your file size for inaudible benefit.
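Both the bandwidth and the storage cost scale linearly with sample rate. The sketch below assumes uncompressed mono PCM; the function names are illustrative.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def megabytes_per_minute(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> float:
    """Uncompressed PCM data rate in MB (10^6 bytes) per minute."""
    return sample_rate_hz * (bit_depth // 8) * channels * 60 / 1_000_000


for rate in (44_100, 48_000, 96_000, 192_000):
    print(rate, "Hz ->", nyquist_hz(rate), "Hz bandwidth,",
          megabytes_per_minute(rate, 24), "MB/min at 24-bit mono")
```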

Some recent devices (Rode NT1 5G USB, Zoom F3, Tascam Portacapture X8) record in 32-bit float. The format has so much range that, within normal voltage ranges, the recorded file does not clip in the digital domain. You set gain casually, record the take, and adjust level after the fact without losing fidelity. For solo home recording where you cannot ride faders mid-take, this is a meaningful workflow advance.
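The difference between fixed-point and float storage can be illustrated with a single over-level sample. This is a toy model of the number formats only, not of any particular device: it assumes full scale is ±1.0 and says nothing about the analogue stage ahead of the converter, which can still overload.

```python
import struct


def store_int24(sample: float) -> float:
    """Quantise to 24-bit fixed point; anything beyond full scale (±1.0) is clipped."""
    clipped = max(-1.0, min(1.0, sample))
    full_scale = 2 ** 23 - 1
    return round(clipped * full_scale) / full_scale


def store_float32(sample: float) -> float:
    """Round-trip through IEEE 754 single precision; values above 1.0 survive."""
    return struct.unpack("<f", struct.pack("<f", sample))[0]


hot_sample = 2.0  # about 6 dB over full scale
trim = 0.25       # pull the take down by about 12 dB afterwards

print(store_int24(hot_sample) * trim)    # 0.25: the peak was flattened to 1.0 first
print(store_float32(hot_sample) * trim)  # 0.5: the original peak is recovered
```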

6. Technique: distance, axis, polar response in practice

Distance: the inverse-square law and the proximity effect

Sound pressure falls roughly 6 dB for every doubling of distance from the source. Move from 6 inches (15 cm) to 12 inches (30 cm) and your direct signal drops 6 dB while the room reflections, which are diffuse and fill the space more uniformly, drop only slightly. The direct-to-reverberant ratio collapses. This is why home producers are taught to record close: at 15 cm, the direct signal is so much louder than the room that the room nearly disappears.
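The 6 dB figure is the free-field case of a simple ratio. The sketch below applies it to the direct sound only and assumes a point-like source; the reverberant level is not modelled.

```python
import math


def direct_level_change_db(from_distance_m: float, to_distance_m: float) -> float:
    """Change in direct-sound level when the mic moves between two distances."""
    return 20 * math.log10(from_distance_m / to_distance_m)


print(round(direct_level_change_db(0.15, 0.30), 2))  # 15 cm -> 30 cm
print(round(direct_level_change_db(0.15, 0.60), 2))  # 15 cm -> 60 cm
print(round(direct_level_change_db(0.30, 0.15), 2))  # moving back in
```

Since the room contribution stays roughly constant, each of these figures is also approximately the change in direct-to-reverberant ratio.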

Cardioid microphones also exhibit the proximity effect: low frequencies are emphasised as you move closer to the capsule because pressure-gradient transducers respond to the difference between front and back arrival, and that difference grows non-linearly at close range. At 5 cm a cardioid can add +6 dB at 100 Hz. This is what makes broadcast voices sound full and chesty, and what makes amateur podcasts sound boomy. Distance is a tone control.

Axis and the off-axis problem

Polar pattern diagrams are usually drawn for 1 kHz. They look much more even than reality. At 8 kHz a cardioid is typically 6–10 dB darker off-axis than on-axis, which is why a singer who drifts off the front of the capsule loses high-frequency detail before they lose level. Always face the capsule directly, with a pop filter 5 cm in front and the singer 10 cm behind the filter. Mark the floor with tape if you need to.

7. Decision matrix by room and budget

Choosing the microphone is the last step, not the first. The matrix below sorts by what you actually have to work with: how acoustically dead the room is, and how much budget the signal chain can absorb.

Your situation	Mic type	Specific picks	Why
Untreated bedroom, hard floors, parallel walls	Dynamic, cardioid	Shure SM7B, Shure MV7, Rode PodMic	Lower sensitivity rejects room reflections; comb filtering becomes manageable.
Treated corner with 4 panels and a heavy rug	Large-diaphragm condenser	Audio-Technica AT2020, Rode NT1 5G	Detail and air without exposing untreated reflections.
Properly treated booth or whisper-quiet room	Premium LDC	Neumann TLM 102, AKG C414 XLII	You have earned the noise floor and frequency extension that justifies the price.
Fully untreated room, podcast/voiceover only	Dynamic + acoustic blanket	SM7B + Cloudlifter, MV7+ USB-C	Spoken word tolerates dynamic colouration; rejection > detail.

8. The five mistakes home producers actually make

1. Buying the microphone first

A Neumann TLM 102 in an untreated bedroom records a worse vocal than an SM7B does. Spend the first $80–$200 on absorption (panels, blankets, rugs) before upgrading the capsule.

2. Recording too far from the mic

If you can fit more than a fist between mouth and pop filter, you are too far for spoken word: a hand width (8–10 cm) is about right. For sung vocals 15–20 cm with a pop filter is standard.

3. Setting gain by peak instead of average

Set gain so that the average level sits around −18 dBFS and peaks land between −12 and −6 dBFS, not −3 dBFS. Modern 24-bit recording has so much headroom that erring on the quiet side costs nothing and protects you from sudden loud takes.
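Peak and average level can both be read off a block of samples. The sketch below assumes samples normalised so that full scale is ±1.0 and uses a synthetic test tone in place of a real take; the function names are illustrative.

```python
import math

SILENCE_FLOOR = 1e-12  # avoids log10(0) on digital silence


def peak_dbfs(samples: list[float]) -> float:
    """Highest absolute sample value, in dB relative to full scale."""
    return 20 * math.log10(max(max(abs(s) for s in samples), SILENCE_FLOOR))


def rms_dbfs(samples: list[float]) -> float:
    """Root-mean-square (average) level, in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, SILENCE_FLOOR))


# Test tone: one full cycle of a 100 Hz sine at amplitude 0.25, sampled at 48 kHz.
tone = [0.25 * math.sin(2 * math.pi * 100 * n / 48_000) for n in range(480)]
print(round(peak_dbfs(tone), 1), round(rms_dbfs(tone), 1))
```

A sung vocal has a much larger gap between peak and average than a sine wave does, which is why both numbers are worth watching while setting gain.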

4. Forgetting the high-pass filter

Anything below 80 Hz in a vocal track is HVAC, traffic rumble, or stage thump — never voice fundamental except for the lowest bass singers. Engage your interface or DAW high-pass at 80 Hz on the way in, or at minimum during mixing.
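To make the effect of an 80 Hz high-pass concrete, here is a first-order (6 dB per octave) filter in plain Python. It is a minimal sketch of the idea only: interface and DAW high-pass filters are usually steeper, and the function names and test tones are assumptions for illustration.

```python
import math


def high_pass(samples: list[float], cutoff_hz: float = 80.0,
              sample_rate_hz: int = 48_000) -> list[float]:
    """First-order RC-style high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


def sine(freq_hz: float, seconds: float = 1.0, sample_rate_hz: int = 48_000) -> list[float]:
    """Unit-amplitude sine wave used as a test signal."""
    count = int(seconds * sample_rate_hz)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(count)]


rumble_peak = max(abs(v) for v in high_pass(sine(20))[-4800:])
voice_peak = max(abs(v) for v in high_pass(sine(1000))[-4800:])
print(f"20 Hz rumble passes at {rumble_peak:.2f}, 1 kHz voice at {voice_peak:.2f}")
```

The 20 Hz tone is cut to roughly a quarter of its amplitude while the 1 kHz tone passes almost untouched; a steeper filter would remove more of the rumble.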

5. Treating reverb as a colour rather than a corruption

Reverb you add in the DAW is your choice. Reverb baked into the recording by an untreated room is not. Track dry, then add reverb as an effect — never the other way around.

SUMMARY

Vocal recording at home is governed by physics, not gear. The room sets the ceiling, the microphone choice respects the room, the preamp is sized to the microphone, the converter is forgiving by default at modern bit depths, and technique determines whether all of the above translates into a usable take. Once you understand the chain, you can build a release-quality vocal recording for under $400 in a bedroom or spend $4,000 and still sound amateur if the room is wrong.

NEXT IN THE STUDIO GUIDES

Guide 02 → Podcast & Broadcast Audio: LUFS, true-peak, the dynamic-microphone case for spoken word, and the standard voice processing chain.

EQUIPMENT THAT MEETS THE CRITERIA · VOCAL RECORDING

Models below are grouped by the physical criterion they satisfy. We list the spec source (manufacturer datasheet) and a link to an independent reviewer (Sound on Sound) so you can verify our reading against working engineers. We did not personally A/B test these models.

Criterion: Sub-$300 large-diaphragm condenser, room treatment available

Cardioid LDC, manufacturer-published self-noise low enough that the mic does not become the noise floor in a treated room. The two models below sit at the floor and the ceiling of this price band.

Audio-Technica AT2020

LDC · 20 dBA self-noise · 144 dB max SPL · ~$100