Lies We Tell Ourselves About Email Addresses

TL;DR: Don't overthink it, just send a verification email.

2026-06-06T22:54:55-04:00

Bear with me, because some of these "lies" are going to feel obvious, or like unimportant trivia. In all honesty, that's not that far from the truth. However, I hope you'll let me try to build a detailed and fun illustration showing how something as mundane as email can break our expectations in surprising ways.

We'll cover a lot of edge cases, stumble over some small blocks, and even discover a few technically-correct things that are valid, but not commonly supported even in big systems like Gmail (probably for good reason, to be fair).

Each example isn't necessarily meant to be a meaningful use-case that you need to go make sure you're handling correctly. But, all together, they're meant to gradually build toward one main point: Email addresses are mired in old history, and the definitions of valid and invalid components of the system continue to subtly change over time.

It's easy to make a common-sense decision that is unexpectedly problematic. On top of that, older devs (and older systems) may have expectations that were correct in the past but don't work anymore. And without further ado, on to the lies...

"Email addresses can just be validated with a regex"

Let's get the low-hanging-fruit out of the way. I'm far from the first person to say this online and I certainly won't be the last. But I feel it's justified repetition because, even after a thousand blog posts posted, tweets twote, and reddit comments moderated, this antipattern stubbornly persists in 2026.

The regex approach has three main avenues for causing heartache for you, your business, and your customers:

It is relatively expensive to run, for very little-to-no real benefit
- To be fair, this point used to hit harder, before we started throwing simple queries at giant clusters of power-hogging GPUs.
Regex is hard, regex wizardry is rare, and regex engine implementations are inconsistent. It's very, very easy to accidentally get it wrong without realizing it.
- We'll go over a couple of the ways it could go wrong through the rest of this post.
Even if you copypaste one of the "good" regexes, the world is evolving. What's valid today may be legacy tomorrow.
- The internet is littered with advice that was good 20 years ago. It's also littered with regular expressions that were good 20 years ago, because legacy is forever.

Opinion alert: Input validation is more about helping the user by making it harder to make simple mistakes. It should be restrictive enough to make your user's life easier, but only just. Don't rely on input validation to protect you from your users; use it to protect your users from themselves. In that sense, there's a valid argument for using regex tests to improve UX, which is worth considering. UX is important and I won't dismiss that fact. However, to play devil's advocate for a moment, perhaps the risk outweighs the reward. In the year of our lord 2026, you can reasonably expect your users to know how to type their own email address - or even better, auto-input from their OS, browser, keyboard app, or password manager.

It's likely that more people out there are being filtered by badly-implemented form validation than there are being filtered by their own need of hand-holding. On that note, then...

Don't validate email addresses. If you simply must, use a simple client regex to help users avoid common mistakes and typos.
- Try to keep it as non-restrictive as possible. Something like ^[^@]+@[^@\s]+$, which only makes sure your user has input "something@something"
If you have validation on the API or form handler, use THE SAME regex, for consistency with the frontend.
- This comes back to that "don't use input validation to protect you from your users" point. By default, protect yourself by sanitizing inputs, not by rejecting them.
Verify the address, don't worry about validating it.
- Send an email, let your user click a verification link or input a verification code.

That's it. It doesn't need to be more complicated than that. You don't need to check the domain's MX record; your email service does this as part of the whole "sending the email" thing (Also, spoiler alert, but there's some more MX Record fun below). And you definitely don't need to do a big regex. In fact you're probably already sending a verification email anyway! If you are, this might be an excuse to delete code, which is every programmer's actual favorite thing!
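For anyone who wants to see how little that "something@something" check does, here's a minimal sketch in Python. The function name `looks_like_email` and the sample addresses are my own placeholders; the pattern is the loose one suggested above. It uses `re.fullmatch` so that a trailing newline can't sneak past the `$` anchor.

```python
import re

# The deliberately loose pattern from the advice above: "something@something".
LOOSE_EMAIL = re.compile(r"^[^@]+@[^@\s]+$")


def looks_like_email(value: str) -> bool:
    """Return True when the input has the rough shape something@something."""
    # fullmatch requires the whole string to match, including any trailing newline.
    return LOOSE_EMAIL.fullmatch(value) is not None


# Hypothetical sample inputs, only here to show what passes and what doesn't.
for candidate in ["ben+tag@example.email", "admin@net", "no-at-sign", "trailing@"]:
    print(candidate, looks_like_email(candidate))
```

Note that this check only catches obvious typos. It says nothing about whether the mailbox exists, which is the whole reason the verification email is still the real test.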

Now with the easy part out of the way, let's go over some of the specific points that make email handling fun and confusing!

"An email address needs to be valid, and email providers support any valid addresses"

This may shock you, but the Internet is made of software. And if there's any universal through-line on software, it's the fact that it very frequently diverges from your expectations.

Email server software, from big providers like Gmail to open-source projects like Postfix, has varying levels of support for the official rules of email formats. Postfix only added support for SMTPUTF8 around 2015, but it wasn't enabled by default until years later. Dovecot, on the other hand, still doesn't support it in 2026. This isn't limited to open-source; Gmail puts restrictions on allowed characters when creating an address, but it does appear to support sending with UTF8.

We'll dive into SMTPUTF8 and RFC 6531 a bit more in a later section. But let's look at another example, where we can go explicitly against the restrictions of every relevant RFC. Let's look at RFC 5321 Section 4.5.3, which defines length limits.

4.5.3.1.1.  Local-part

   The maximum total length of a user name or other local-part is 64
   octets.

That's a pretty simple restriction, and thankfully easy to understand! But there are some terms that might be unfamiliar for some readers:

"local-part."
- Simply put, the local-part is everything that comes before the @ character.
- (Let's keep it that simple for this example, but we'll come back to the local-part for some other parts of the article)
octet
- An octet is a standard 8-bit byte (Wikipedia)

So now, the email address entirelytoomanycharactersinthisemailwhatisevenhappeningblahblahdonttrythisathome@gitpush--force.com, which has an 80-byte local-part, is 100% unambiguously invalid. So naturally it doesn't work.

Right?

Yeah, remember that thing about software diverging from expectations? You can actually send email to me at that address. Your provider will (probably) allow it without complaining, my provider will happily deliver it to my inbox.
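If you want to check my math on that local-part, here's a small sketch that counts octets the way the RFC does (bytes, not characters). The helper name `local_part_octets` and the constant name are my own; the 64-octet limit is the one quoted from RFC 5321 above.

```python
RFC5321_LOCAL_PART_MAX_OCTETS = 64  # RFC 5321 section 4.5.3.1.1


def local_part_octets(address: str) -> int:
    """Count the UTF-8 octets that come before the last "@" in an address."""
    local_part, _, _ = address.rpartition("@")
    return len(local_part.encode("utf-8"))


address = (
    "entirelytoomanycharactersinthisemailwhatisevenhappening"
    "blahblahdonttrythisathome@gitpush--force.com"
)
octets = local_part_octets(address)
print(octets, octets > RFC5321_LOCAL_PART_MAX_OCTETS)  # 80 True
```

Counting encoded bytes rather than characters matters once non-ASCII characters show up, since a single character can take several octets.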

"An email address can only have ASCII characters"

This belief will probably be more commonly held in the English-speaking world, but I'm curious: If you're not in the Anglosphere, do you still expect emails to require ASCII Latin characters?

Back in 2012, which is simultaneously recent and ancient in the world of tech, Email Internationalization became a thing through a set of RFCs corresponding to different parts of the email stack. For email addresses specifically, RFC 6531 defines the SMTPUTF8 extension, which allows for non-ASCII characters in the local-part of an email address. In that sense, it's actually pretty surprising that so much of the world's population wasn't able to put their own name, in its native written form, in an email address until just 14 years ago.

International characters technically worked even before 2012 through IDNA (RFC 3490) and its Punycode encoding (RFC 3492), something of an encoding hack where Unicode characters are represented in ASCII under the hood. But at that time, it only applied to the domain name, and the local-part was still limited to ASCII.
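To make the domain-side encoding hack a bit more concrete, here's a quick sketch using Python's built-in "idna" codec, which implements the older RFC 3490 rules. The helper name `domain_to_ascii` and the sample domain are assumptions for illustration. Notice that it only ever touches the domain; there's no equivalent trick for the local-part.

```python
def domain_to_ascii(domain: str) -> str:
    """Encode an internationalized domain into its ASCII (xn--) form."""
    # Python's standard "idna" codec follows the IDNA 2003 rules (RFC 3490).
    return domain.encode("idna").decode("ascii")


print(domain_to_ascii("b\u00fccher.example"))  # xn--bcher-kva.example
print(domain_to_ascii("example.com"))          # example.com (already ASCII)
```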

"An email address needs to be human-readable"

Now that we've covered the whole Latin character thing, let's go a bit deeper down the rabbit hole. With internationalization, the local part is defined as an octet stream. You can technically put bytes that don't map to a valid character in standard unicode. I could put � in an email address, and it would be valid. But it's not human-readable in any language.

"Email addresses always have a second-level domain (SLD) and a top-level domain (TLD)"

Think of the familiar email addresses you work with every day. An @icloud.com island in a sea of @gmail.com, with a smattering of ISP, @employer.com, and @school.edu emails. All follow this familiar pattern: something-dot-something.

There are three types of valid addresses which don't follow this pattern, in descending order of importance.

Addresses which have subdomains, therefore multiple dots in the domain.
- Hopefully, at least this one isn't that surprising. If you're outside the US or Canada, you're probably used to seeing things like .co.uk, .com.au, etc.
- But many smart people have accidentally excluded these users with well-intentioned regexes.
Addresses without a dot
- In the real world, this is only important for things like intranet addresses, where each machine on your network has a hostname. It's important for email software to support, but not likely to be a consideration for most of us.
- Technically someone at ICANN or Verisign or whoever could register an address like admin@net, but let's be real.
Addresses which use an IP Address instead of a domain name. RFC 5321 Section 4.1.3 explicitly lays out support for this, which it calls "Address Literals."
- This is very uncommon, but it is valid in the real world.
- Related, you can also reach me at ben@[50.169.39.178] if your email client supports it.
  - Sadly, Gmail's web client seems to have issues with this. But I think if you use a client like Thunderbird, it'll still work through Gmail.
  - Gmail's web client sends the email, but in my testing it seems to strip away the [] square bracket characters that the "Address Literals" section of the RFC specifies, and which are needed to send to an unnamed host.
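The three shapes above can be told apart with very little code. This is only a rough sketch: the function name `classify_domain` and its return labels are made up for this example, and it leans on Python's standard `ipaddress` module rather than implementing the RFC 5321 grammar in full.

```python
import ipaddress


def classify_domain(domain: str) -> str:
    """Label the part after "@" as an address literal, dotless, or dotted name."""
    if domain.startswith("[") and domain.endswith("]"):
        literal = domain[1:-1]
        try:
            # RFC 5321 tags IPv6 literals with an "IPv6:" prefix inside the brackets.
            if literal.lower().startswith("ipv6:"):
                ipaddress.IPv6Address(literal[5:])
            else:
                ipaddress.IPv4Address(literal)
        except ValueError:
            return "invalid address literal"
        return "address literal"
    # A trailing root dot does not count as a separator between labels.
    return "dotted name" if "." in domain.rstrip(".") else "dotless name"


for sample in ["example.co.uk", "net", "[50.169.39.178]", "[IPv6:::1]"]:
    print(sample, "->", classify_domain(sample))
```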

"Email addresses always have a 'normal' TLD"

My actual personal email address is not on a .com or .net domain, it's a .email domain. This is something that I like quite a lot, because I'm a dork who loves things that are silly and weird just for the sake of being silly and weird.

But I wound up making an alias with a .net domain, because it turns out that a lot of companies run validation logic which, apparently, just doesn't allow for domains other than the normal suspects: .com, .net, .org, .edu, etc.

"An email address can only have a single @ character"

Note: I have struggled to verify this one, and it's possible I'm actually misreading the RFC. I'm really including this here in case someone is able to help point out anything I'm missing! I mentioned that you can reach me at ben@[50.169.39.178] if your client allows it, and added that Gmail seems to struggle with it. But for this next one, Gmail just outright refuses to send it.

So if Gmail explicitly doesn't support it, you probably don't need to either. But it's technically valid under RFC 5322 Section 3.2.4.

   Strings of characters that include characters other than those
   allowed in atoms can be represented in a quoted string format, where
   the characters are surrounded by quote (DQUOTE, ASCII value 34)
   characters.

   qtext           =   %d33 /             ; Printable US-ASCII
                       %d35-91 /          ;  characters not including
                       %d93-126 /         ;  "\" or the quote character
                       obs-qtext

   qcontent        =   qtext / quoted-pair

   quoted-string   =   [CFWS]
                       DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                       [CFWS]

Reading RFCs isn't always fun, but basically this is saying that any ASCII Character from 33 to 126, except for 34 (the quote character itself) and 92 (the backslash character) can be in a quoted string in the local-part of an email address.

So technically ben"@"[email protected] should be valid, but I haven't found a client that lets me do it, and I've been too lazy to script this out as a pure SMTP send.
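Whether or not any client will actually send to such an address, the practical takeaway for parsing is small: if you must split an address, split on the last @, not the first. Here's a sketch of that, where `split_address` is a made-up helper and the sample address is hypothetical. It assumes a bare address with no display name or comments.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split on the LAST "@" so a quoted "@" stays inside the local-part."""
    local_part, at, domain = address.rpartition("@")
    if not at or not local_part or not domain:
        raise ValueError(f"not of the form local-part@domain: {address!r}")
    return local_part, domain


example = '"a@b"@example.com'  # hypothetical address, for illustration only
print(example.split("@"))      # naive split: ['"a', 'b"', 'example.com']
print(split_address(example))  # ('"a@b"', 'example.com')
```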

"Dots in the username/local-part are optional"

When I was a Very Cool Teenager™, I went and registered [email protected], because that's just what you did in those days as a Very Cool Teenager™ with a Gmail beta invite. Over time, though, I got lazy and stopped using the dots when I typed my email address anywhere. [email protected] worked just the same. And it was slightly less embarrassing when early-2000s phpbb-style username aesthetics fell out of style.

Over time, it's become relatively common for people to assume that's the normal, expected behavior. But - and this is about to be a theme - RFC 5321 and the rest of the email RFC family leave a lot of leeway to how the local-part can be implemented by a server. And that includes the dots. It turns out that allowing senders to omit dots is common but by no means universal!

Also, not to brag, but "Very Cool Teenager™ with a Gmail beta invite" may forever stand as my greatest accomplishment in the field of silly oxymorons.

"There are not that many different email domain names"

Do you have access to a production database with user email addresses? If you answered yes, there's a strong chance your employer should lock down your database access. But before they do that, check out this query:

SELECT COUNT(DISTINCT SPLIT_PART(email, '@', 2)) FROM users;

Before you leave that very conspicuous query in your Postgres logs for a DBA to ask you about later, what's the result you expect? Obviously, it's not just a few. You've got your gmail, your outlook, icloud, proton, yahoo, etc. Maybe a few weirdos like that [email protected] character.

So a few dozen email providers, or maybe a hundred or so? Nope. If you have a decent set of users, you'll probably have THOUSANDS of distinct hostnames.

In fairness, a lot of these are just whitelabeled from bigger providers like Gmail or Outlook. For example, I attended the University of Baltimore (home of the UB Bees, and UB mascot Eubie the Bee. Eubie the UB Bee. Go UB Bees?) and I taught at the Community College of Baltimore County. So I got a few .edu addresses myself! But even these two "very prestigious" and "well-funded" institutions don't actually have their own dedicated email infrastructure.

> dig mx ubalt.edu +noall +answer
ubalt.edu.              350     IN      MX      10 ubalt-edu.mail.protection.outlook.com.
> dig mx ccbc.edu +noall +answer
ccbc.edu.               3398    IN      MX      20 nospam.ccbc.edu.
ccbc.edu.               3398    IN      MX      0 ccbc-edu.mail.protection.outlook.com.

They're really both using outlook.com as the provider.

This one really is mostly trivia, but just as we discovered with dots in Gmail, the local-part of an address can be implemented differently depending on the server or service provider. I've seen teams try to bake assumptions about the provider into their logic. But it's an uphill battle; the number of permutations for email servers, providers, locales, configurations, and versions is infinite. Your time isn't.

"An email address can't end in a dot"

This is more a quirk of DNS than a quirk of email, specifically. In DNS, . represents the root zone. If you've worked with DNS or traffic routing extensively, you may have already known about this. And even if you're not familiar with it, it's pretty likely you've already run into it without realizing it. In dig, the root zone "dot" is actually included in the output by default! Check it out; you can see that when I ask dig for the A record for gitpush--force.com, it shows gitpush--force.com. (with the trailing dot) in the answers.

> dig gitpush--force.com +noall +answer
gitpush--force.com.     273     IN      A       104.21.60.65
gitpush--force.com.     273     IN      A       172.67.192.184

Sending an email to an address with a trailing dot generally just works. However, it's worth noting that [email protected] and [email protected]. will be the same mailbox. It's also worth noting that this is another example of RFC5322 violations that actually just work; Section 3.2.3 disallows trailing dots.
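If you ever compare or group addresses by domain, that trailing dot is one of the few things that is reasonably safe to normalize. A minimal sketch, where `normalize_domain` is a hypothetical helper name: it assumes the domain is already plain ASCII, and it relies on DNS names being case-insensitive.

```python
def normalize_domain(domain: str) -> str:
    """Lowercase the domain and drop a single trailing root-zone dot."""
    if domain.endswith("."):
        domain = domain[:-1]
    # DNS names compare case-insensitively, so lowercasing is safe for ASCII names.
    return domain.lower()


print(normalize_domain("gitpush--force.com."))  # gitpush--force.com
print(normalize_domain("Example.COM"))          # example.com
```

This only touches the domain. The local-part is a different story.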

"Email addresses are (or are not) case-sensitive"

The hostname of an email address is never case-sensitive. However, the waters are murkier for the local-part of an address. Before I get into the details, I think Mattie B on StackOverflow nailed it, invoking Postel's principle:

It remains wrong to write software that assumes local parts of email addresses are case-insensitive, but yes, given that there is plenty of wrong software out there, it is also less than robust to require case sensitivity if you are the one accepting the mail.

Getting back to the technical definitions here, RFC 5321 includes this very specific line:

Local-part     = Dot-string / Quoted-string
               ; MAY be case-sensitive

So hey, if an email address specifically, explicitly is allowed to be case-sensitive by definition, we should just always treat it as such, right? This feels like an open-and-shut case, where I can just tell you what assumptions you should make. Unfortunately, this is a lot less clear in practice than it is on paper. I was only able to find a single remotely relevant mailserver where local-part case sensitivity is supported in any meaningful sense. In Exim, you can configure the local-part to be case-sensitive, but it's case-insensitive by default.

Every email server, config, and provider that actually matters in the real world treats the local-part as case-insensitive for very good reasons (Sorry to all you Exim users who set caseful_local_part for some reason): [email protected] and [email protected] simply should point at the same mailbox in the real world; there's no reason to expect anything different.

So then, if every inbox that "actually matters" is case-insensitive, then we don't need to worry about it... right? Well, not quite. That exact assumption is the cause of one of the most common mistakes I see with email address handling. In order to understand why, let's briefly jump into a practical example.

Let's say you're making a website where people can make an account. It's pretty common to have a rule preventing email address duplication across accounts; that is to say, no two accounts may share the same email address. Let's create a table with an oversimplified schema for the sake of our example.

create table users
(
    id    serial primary key,
    email varchar(254) unique not null
);

But even though [email protected] and [email protected] both represent the same user, that unique index doesn't prevent me from accidentally making two accounts!

-- This doesn't trigger any unique index error
insert into users (email) values ('[email protected]');
insert into users (email) values ('[email protected]');

So how do we prevent this? You might just say "force the email to uppercase," or maybe you force case in the index itself:

create unique index on users(upper(email));

This approach is very common. So common, I think, that I'd estimate >50% of systems which handle emails use some version of it. Unfortunately, it's a trap. In fairness, it does work 99% of the time. But it's not foolproof, and with how important reliable communication is, and how tightly we couple email addresses with identity these days, it can lead into messy territory.

For anyone who isn't already familiar, allow me to introduce you to the wonderful world of Unicode case folding. Put simply, forcing an email into a common case can be a destructive and non-reversible operation in some languages. If my name was Ben Weiß and my email address was benweiß@gitpush--force.com, you could run into problems if you tried to force it to uppercase, because the non-ASCII ß folds up not only into a different letter, but TWO letters, two ASCII characters at that: SS.

But forcing to lowercase doesn't solve the problem either. If you force to lowercase, the Turkish İ folds down into a normal ASCII i. And let me tell you, there are a LOT of guys named İbrahim in the world.

For even more fun, different implementations of toLower()/toUpper() actually fold differently, so you could get different bugs depending on programming language, version, database, system locale, etc (very fun and not at all a huge pain to debug!).
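Here's what that looks like in one specific implementation. This sketch uses Python's built-in string methods, which are locale-independent, so treat the output as "what CPython does" and not as what your database or a Turkish-locale system would do. In particular, Python lowercases İ to an "i" followed by a combining dot (two code points) rather than a plain ASCII i, which is a nice demonstration of implementations disagreeing.

```python
# Escapes are used so the example does not depend on the file's encoding.
SHARP_S = "\u00df"        # ß
DOTTED_CAPITAL_I = "\u0130"  # İ

samples = {
    "upper(sharp s)": SHARP_S.upper(),
    "lower(dotted capital I)": DOTTED_CAPITAL_I.lower(),
    "casefold(sharp s)": SHARP_S.casefold(),
}

for label, result in samples.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in result)
    print(f"{label}: {len(result)} code point(s): {code_points}")

# Round-tripping shows the information loss: the original spelling is gone.
print(SHARP_S.upper().lower() == SHARP_S)  # False
```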

So what's the right thing to do here? I have some bad news: There's no perfect solution. There is a general, good-enough solution, though: use citext in Postgres, or COLLATE utf8mb4_general_ci in MySQL.

Why isn't it perfect, though? To lean on Turkish again, let's say there are two people named İbrahim and for some unholy reason, their ISP allowed them to register İ[email protected] and [email protected]. This is technically allowed by the relevant RFCs, although admittedly unlikely in the real world. But they will fail a uniqueness check in both Postgres and MySQL because İ collides with i. It's fortunately highly unlikely for this to ever happen, but quite frustratingly not technically impossible.
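The general idea behind citext (store what the user typed, compare without regard to case) is easy to try locally. To be clear about the swap: this sketch uses SQLite's NOCASE collation through Python's standard `sqlite3` module as a stand-in, not citext itself, and NOCASE only folds ASCII letters, so it won't behave the same as Postgres or MySQL for characters like İ. The table and the sample addresses are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    " id INTEGER PRIMARY KEY,"
    " email TEXT NOT NULL COLLATE NOCASE UNIQUE)"
)


def try_insert(email: str) -> bool:
    """Insert an address; return False if the uniqueness check rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("Ben@example.com"))  # True
print(try_insert("ben@example.com"))  # False: same address, different case
print(conn.execute("SELECT email FROM users").fetchall())  # original casing kept
```

The part worth copying is the shape, not the engine: the stored value is never rewritten, so nothing is lost if you later need the address exactly as the user typed it.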

"Plus Tag Subaddressing assumptions that Ben doesn't really know how to write in the form of a lie."

I, in a choice that could fairly be described as "extremely in-character for the guy who wrote this whole blog post about inane email quirks," almost exclusively use plus tag subaddresses whenever I sign up for a new service. The email I use to login to any company's website is ben+thatcompany@{myemailhost}.email. I tell myself that I do it for privacy, so if a company's email database gets pwned, or they just sell my email address to marketers, I can tell where the leak happened.

But the real reason I do it is that I like to sit in anger whenever this breaks the user experience because of programming errors or inconsistencies. A few years ago, when I was setting up a United Airlines account, I input "ben+united@{myemailhost}.email" and was presented with a "please enter a valid email address" regex failure. Let me be very clear here: the + character is absolutely a valid and legal character in the local-part and it is a valid and working address. It's not even remotely weird or new for this to be allowed: I first started doing this in 2011 with Gmail!

Let's look at RFC5321 Section 4.1.2, RFC5322 Section 3.4.1 and RFC5322 Section 3.2.3 for reference:

RFC5321 specifies the following command syntax...

   Local-part     = Dot-string / Quoted-string
                  ; MAY be case-sensitive


   Dot-string     = Atom *("."  Atom)

   Atom           = 1*atext

That's basically saying that "Local part can be Dot-string or Quoted-string." We already covered Quoted-string's quirks a little bit, but let's focus on Dot-string. It's defined as Atom *("." Atom). So that means it allows any number of "Atom" separated by single dots. And Atom is defined as 1*atext, which means that an Atom can be 1 or more of "atext."

But what's "atext?" RFC 5321 inherits this definition from RFC5322, so let's hop over there where we see another similar definition for local-part...

   local-part      =   dot-atom / quoted-string / obs-local-part

And the definition for "atext", where it explicitly allows the plus character!

   atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                       "!" / "#" /        ;  characters not including
                       "$" / "%" /        ;  specials.  Used for atoms.
                       "&" / "'" /
                       "*" / "+" /
                       "-" / "/" /
                       "=" / "?" /
                       "^" / "_" /
                       "`" / "{" /
                       "|" / "}" /
                       "~"

United became the first in a long and growing list of companies that are officially on my bad side for not allowing my valid and working email address. Most recently, Motorola was added after it let me make an order with ben+motorola@{myemailhost}.email, but the order status page failed with an "invalid email address" message... when clicking the link that Motorola successfully delivered to my email address! (I really like my shiny new Razr Fold though).
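The Dot-string grammar quoted above is small enough to translate almost directly, which makes it even harder to excuse getting it wrong. This is a sketch for illustration only, not a recommendation to validate this way: `is_dot_string` is a made-up helper, and it covers just the Dot-string branch (no Quoted-string, no non-ASCII).

```python
import re

# atext from RFC 5322 section 3.2.3: letters, digits, and these symbols.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
# Dot-string = Atom *("." Atom), where Atom = 1*atext (RFC 5321 section 4.1.2).
DOT_STRING = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_dot_string(local_part: str) -> bool:
    """Return True when the local-part matches the Dot-string grammar."""
    return DOT_STRING.fullmatch(local_part) is not None


for sample in ["ben+united", "first.last", "double..dot", "trailing."]:
    print(sample, is_dot_string(sample))
```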

This kind of subaddressing delivers [email protected] to the same inbox as [email protected] on my email provider, as well as in Gmail. But it's surprisingly common for people who are aware of this kind of subaddressing to assume that's universal and they can just kinda pretend ben+whatever and ben are always the same inbox. Some advertising and marketing types even try to be clever and merge email based on this assumption. But broadly speaking, it's mostly just a Gmail implementation thing.

In Conclusion...

I'm out of breath at this point so I'll keep it short: We've explored some edge cases and peculiarities of email that are honestly entirely unnecessary for anyone to internalize. While any single one of these "lies" might seem obvious, pointless, or trivially worked around, the set as a whole creates a complex web of pitfalls to avoid. Between international characters, case sensitivity, dots, server-specific features and limitations, unexpected characters, DNS shenanigans, and other caveats, it becomes hard to make safe assumptions about anything with email.

This brings us back to our original point: Don't overthink it. For unique index, use citext in Postgres, or COLLATE utf8mb4_general_ci in MySQL. For verification, send an email with a confirmation code/link.

It really doesn't need to be any more complicated than that.
