I’ve been with Codility for over seven years. When I joint Codility, we were just a gang of eight, and now there is 100+ of us. This gave me an opportunity to watch how a startup organically grows and turns into a small company.
One of our customs that evolved as the company grew is emails introducing new employees joining us. Initially, they were sent personally to every one of us, but very quickly an alias all@… was created. And naturally, most of us replied to such emails to say ‘Hello!’. And most of us used ‘reply to all’, replying to the last email in a thread. This resulted in building up a thread with each message citing all the previous messages, like this one.
Luckily Gmail collapses such citations to a single ellipsis, that can be expanded on demand. (I thank you and curse you Google for this feature! Thank you, because it really helps reading just the essential part of the email and ignore unnecessary citations. Curse you, because it makes people carelessly cite everything there is, no matter whether they really refer to something that was written before or not.)
As the company grew, getting a new email with every person replying to a ‘Welcome‘ message became an annoyance. So we decided to use all@… alias only in BCC, not in CC. And in most cases, we actually follow this convention. I wondered about how much space do all those unnecessary citations occupy. Let’s analyze it, first using the Big-O notation, and then using concrete numbers.
Let’s say there are N employees in the company. With every new employee, there is a ‘Welcome‘ email. For simplicity, let’s assume that half of the employees reply to such an email (replying to the last message in the thread and citing all the previous messages). Let’s compare two scenarios:
- (a) the ‘Welcome‘ emails are sent to BCC: all@…
- (b) the ‘Welcome‘ emails are sent to CC: all@.. or just TO: all@…
So, what is the difference in the amount of disk space that such messages occupy, in these two cases?
Scenario (a) is simpler: There are N+1 mailboxes (including the newcomer’s one). The message lands in each of them. N/2 people reply and the replies land in their own, and one or two other mailboxes. This means c.a. 2*N, that is O(N) messages in total. Over time, as new people join the company, this cumulates to O(N^2) messages in total.
In scenario (b), the announcement message lands in N+1 mailboxes. N/2 people reply to all, which makes the total of N*(N+1)/2 = O(N^2) messages in mailboxes. However, the messages get longer and longer. As we reply to the last message in the thread, all the previous replies are cited. On average, half of the replies are cited, so each reply contains O(N) messages cited. This means, that one ‘Welcome‘ message triggers O(N^3) cited messages being stored in the mailboxes. Over the history of the company, this cumulates to O(N^4) total cited messages being stored in the mailboxes.
How bad is it? Its O(N^2) times worse than scenario (a), and O(N^2) is a lot! 😱
Really? What if I don’t believe this O(N^2) mumbo-jumbo? A lot of almost nothing can still be negligible. Then we cannot take the Big-O shortcuts and we have to do the math.
Warning: Math ahead! ⚠️Σ😱
Concrete Numbers Analysis
(Simplifying assumptions: we only hire, we send the same announcement for all new hires, half of the currently employed people respond with “welcome aboard” messages, when replying we “reply to all”, the welcome message occupies c.a. 16 KB, each reply adds 1 KB to the message size).
Let’s say there are currently N employees.
Case (a) is again simpler: There are N+1 mailboxes (including the newcomer’s one). The message lands in each of them. N/2 people reply and the reply lands in their own, and one or two other mailboxes. This means c.a. 2*N messages in total, half of them 16 KB and half of them 17 KB large. Over time this cumulates to c.a. N^2 messages in total, each 16.5 KB on average.
- For a 100 people company this is 100*100*16.5KB = 165MB — peanuts.
- For a 1000 people company this is 1000*1000*16.5KB = 16.5GB — still negligible.
- For a 10 000 people company this is 10 000 * 10 000 * 16.5KB = 1,65 TB — not to worry about, fits in a single hard drive.
Nothing to worry about. So how much worse can be scenario (b)? Let’s check it out.
An announcement message lands in N+1 mailboxes. N/2 people reply to all, which makes the total of N*(N+1)/2 messages in mailboxes. However the messages get longer and longer. Each reply adds 1 KB. So the first reply takes 17KB, next 18KB and so on. On average, half of the earlier replies are cited, so each reply takes 16 KB + N/4 KB on average. All the messages in the mailboxes related to one announcement take N*(N+1)/2 * (16 + N/4) KB ≈ 8*N^2 + N^3/8 KB. Over the history of the company it cumulates to 8/3*N^3 + N^4/32 KB.
- For a 100 people company it’s 2.7 GB for the message and 3.1GB for citations, 5.7 GB in total (negligible).
- For a 1 000 people company it’s 2.7 TB for the message and 31.3 TB for citations, 34 TB in total — well, it’s just 10-20 hard drives, what is it for Google?
- For a 10 000 people company it’s 2.7 PB for the message and 312.5 PB for citations, 315 PB in total. This is at least 100 000 hard drives, 50 tons of them (not counting the servers). 😱
Well, maybe Google can still handle it, but it’s definitely a significant load. (Google, if you’re listening, I’d love to hear your feedback.)
Moreover, 99% of this load are just cited replies …
You can also notice, that in these calculations N^4/32 KB quickly dominates 8/3*N^3. This is simply because O(N^4) always dominates O(N^3) (for large enough N).It’s the O(N^4) that makes the final result PetaBytes not GigaBytes.
Next time you send an email to all@…, please consider sending it to BCC: all@…
This blogpost has been substantially edited and reposted by Codility here.