You've been lied to about the newline character
The humble newline character: \n.
You’ve seen it in countless code examples, usually something like this:
foo\n
bar\n
\n
You look at that and probably think it represents the end of a line. Or maybe you think it represents the start of a line. If you believe either of those things, I’m sorry to inform you that you’re wrong.
Fortunately, by the end of this post you’ll have a much better mental model of \n.
Please note that this post is extremely pedantic, but so are computers, and in this case it actually makes a practical difference. You’ll need to set aside your language training, and what it means to you, and think like an incredibly dumb and excessively pedantic computer. That’s what we’re giving instructions to as programmers, and that’s where this becomes relevant.
The only accurate way to think about \n is as the boundary between two lines. It exclusively marks the ending of one line and the beginning of another.
When you open your text editor you see
foo
bar
but that’s not what’s actually stored. The file is a stream of characters: foo\nbar\n. The \n marks the boundary between “lines” that don’t actually exist. How the end of the file is signaled is a separate matter that we can ignore here.
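To make the "stream of characters" idea concrete, here is a minimal sketch that looks at the raw bytes of that same content. The variable names are mine, and the `charlists: :as_lists` option is there only to stop Elixir from printing the list of integers back as text.

```elixir
# The same content an editor would display as "foo" and "bar" on two lines.
contents = "foo\nbar\n"

# Convert the binary to a flat list of byte values. There are no "lines" in
# here, just a sequence of bytes where 10 is the \n character.
bytes = :binary.bin_to_list(contents)

IO.inspect(bytes, charlists: :as_lists)
# prints [102, 111, 111, 10, 98, 97, 114, 10]
```

Nothing in that list says where a line starts or ends. The 10s just sit between the other bytes.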
So, let’s logic this out. Our ground truth is the fact that foo with no \n is a perfectly valid line. If \n indicated the start of a new line then 2 things must be true:
- lines must start with \n
- foo must represent either the middle or end of a line, but not the start or entirety of one.
Since we know that foo is a valid line, both of those must be false. You could argue that computers are making a special exception for the start of a file and have special handling for that situation, but they don’t, and that would be way more work than just treating it as a boundary indicator.
If \n indicates the end of a line then the following must be true:
- lines must end with \n
- foo must represent the start or middle of a line, but not the end or entirety of one.
Again, both statements are false.
There’s also a simple tool to prove that it must be a boundary: the split function. Before we explore that, I want you to think about “splitting” a piece of paper. If you make one cut through a piece of paper you end up with two pieces of paper. Cut it again, and you get three. You always end up with cuts + 1 pieces of paper. “Splitting” a string is the same, or it should be. Some languages have inconsistent behavior around \n.
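The cuts + 1 rule is easy to check for yourself. This is a small sketch, assuming a hypothetical PaperCut module of my own naming: one function counts the \n characters (the cuts) and the other counts the pieces that String.split produces.

```elixir
defmodule PaperCut do
  # Hypothetical helper: counts the \n characters (the "cuts") in a string.
  def cuts(string) do
    string |> String.to_charlist() |> Enum.count(&(&1 == ?\n))
  end

  # Counts the pieces produced by splitting the string on \n.
  def pieces(string) do
    string |> String.split("\n") |> length()
  end
end

for s <- ["a", "a\nb", "a\nb\nc"] do
  IO.inspect({s, PaperCut.cuts(s), PaperCut.pieces(s)})
end
# prints {"a", 0, 1}
# prints {"a\nb", 1, 2}
# prints {"a\nb\nc", 2, 3}
```

For every input, the piece count comes out exactly one higher than the cut count.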
We’re going to be working with foo\nbar\n\n which most code examples would show you on 3 lines like this:
foo\n
bar\n
\n
While the intent of the \n character may have been to indicate the start of a line (technically it was used to move a print head down a “line” worth of distance), what we actually recorded was a boundary indicator. You see, the above isn’t 3 lines. It’s 4, and here’s the proof.
iex(1)> String.split("foo\nbar\n\n", "\n")
["foo", "bar", "", ""]
Here our string is split on \n into n+1 elements, where n is the number of \n characters, just like our hypothetical paper. Personally I find it easier to visualize like this:
foo|bar||
The line is a piece of paper and each | is where I’m going to cut with my scissors. Elixir split that string exactly as you’d expect, and gives us four elements.
One of my coworkers asked (I’m paraphrasing) how you would differentiate "" from "\n" and "\n\n" if it didn’t do that.
iex(1)> String.split("", "\n")
[""]
iex(2)> String.split("\n", "\n")
["", ""]
iex(3)> String.split("\n\n", "\n")
["", "", ""]
Each \n in the initial string is an indicator of where to cut. No cuts and you still have the whole paper. One cut, and you have two pieces, and so on.
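Another way to see why those three inputs have to produce different results is to tape the paper back together. This sketch joins the pieces with \n and checks that the original string comes back. It uses only String.split and Enum.join, and the variable names are just for illustration.

```elixir
for original <- ["", "\n", "\n\n", "foo\nbar\n\n"] do
  # Cut on every \n, then put a \n back at every boundary between pieces.
  pieces = String.split(original, "\n")
  rebuilt = Enum.join(pieces, "\n")

  IO.inspect({original, pieces, rebuilt == original})
end
# prints {"", [""], true}
# prints {"\n", ["", ""], true}
# prints {"\n\n", ["", "", ""], true}
# prints {"foo\nbar\n\n", ["foo", "bar", "", ""], true}
```

The round trip only works because the empty pieces are kept. Drop them and all three of my coworker's strings would collapse into the same thing.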
If \n were truly a start of line (“new line”) character then String.split("\nfoo", "\n") should result in 1 line, because the \n would mean “start a line”, and there’s no second line started with another \n. Thus we have [explicit begin character]foo[implicit end] which gives us only one line. Splitting on \n in that string gives us ["", "foo"], thus proving it’s not an indicator of a line’s start. Or at least, the computer doesn’t think of it as one.
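Here is that check written out, together with the mirror-image case for the “end of line” reading. If \n only ended a line, "foo\n" would be one line too, but split sees a boundary with something on either side of it both times.

```elixir
# \n at the front: an empty piece before the boundary, "foo" after it.
leading = String.split("\nfoo", "\n")
IO.inspect(leading)
# prints ["", "foo"]

# \n at the back: "foo" before the boundary, an empty piece after it.
trailing = String.split("foo\n", "\n")
IO.inspect(trailing)
# prints ["foo", ""]
```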
That’s the key here. It’s irrelevant what you or I think it should represent. What’s important is what it represents to the computer. To the computer, there is no situation in which \n exists and there is only one “line”.
We’re dealing with a legacy term from a time when computers had no screens. It comes to us from the typewriter, where special gearing on the bar you slapped to shift the roller sideways also rotated it “one line” worth of distance. The “new line” term has persisted, but no longer reflects what’s actually happening.
--
All of this came about because of a very real bug in code I wrote to parse request headers. I knew how to fix the bug, but I couldn’t understand why it was a bug in the first place. This was because I kept stubbornly, and incorrectly, thinking that the following was only three lines, because I’ve always treated it as an “end of line” character.
foo\n
bar\n
\n
The most obvious practical takeaway is that splitting a string, by any means, is suddenly going to match up with your expectations if you think of \n as a boundary character instead of a “new line” character.
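To show what that looks like in something like header parsing, here is a rough sketch. It is not the code that had the bug: the HeaderSketch module, the sample header block, and the choice to split on a bare \n (real HTTP separates lines with \r\n) are all assumptions made to keep the example small.

```elixir
defmodule HeaderSketch do
  # Hypothetical parser: turns "Name: value" lines into {name, value} tuples.
  # The two trailing \n boundaries produce two empty pieces, so those are
  # dropped explicitly instead of assuming they were never there.
  def parse(block) do
    block
    |> String.split("\n")
    |> Enum.reject(&(&1 == ""))
    |> Enum.map(fn line ->
      [name, value] = String.split(line, ": ", parts: 2)
      {name, value}
    end)
  end
end

block = "Host: example.com\nAccept: text/html\n\n"

IO.inspect(String.split(block, "\n"))
# prints ["Host: example.com", "Accept: text/html", "", ""]

IO.inspect(HeaderSketch.parse(block))
# prints [{"Host", "example.com"}, {"Accept", "text/html"}]
```

If you expect three boundaries to give you four pieces, the two empty strings at the end stop being a surprise and become something you handle on purpose.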