<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[ps aux | grep hoang]]></title><description><![CDATA[Software, woodworking, and everything in between.]]></description><link>https://blog.andrewhoang.me/</link><image><url>https://blog.andrewhoang.me/favicon.png</url><title>ps aux | grep hoang</title><link>https://blog.andrewhoang.me/</link></image><generator>Ghost 1.18</generator><lastBuildDate>Wed, 11 Mar 2026 06:27:12 GMT</lastBuildDate><atom:link href="https://blog.andrewhoang.me/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Migrating The Daily Free Press]]></title><description><![CDATA[<div class="kg-card-markdown"><br>
<h3 id="howwemovedandfixedawordpresssitewith100kmonthlyviewsfrombluehosttodigitalocean">How we moved (and fixed) a Wordpress site with 100k monthly views from BlueHost to DigitalOcean</h3>
<h1 id="tldrwemadeapreviouslyunusablestudentnewpaperwebsiteusablewithminimaldowntime">tl;dr: we made a previously unusable student newspaper website usable with minimal downtime!</h1>
<br>
<p><img src="https://blog.andrewhoang.me/content/images/2018/03/cheering_minions.gif" alt="cheering_minions"></p>
<p>One of the things I really like about the software world is that there are a lot of opportunities to do</p></div>]]></description><link>https://blog.andrewhoang.me/migrating-the-daily-free-press/</link><guid isPermaLink="false">5aafc8fd308b6c056b034005</guid><dc:creator><![CDATA[Andrew Hoang]]></dc:creator><pubDate>Tue, 27 Mar 2018 05:22:03 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><br>
<h3 id="howwemovedandfixedawordpresssitewith100kmonthlyviewsfrombluehosttodigitalocean">How we moved (and fixed) a Wordpress site with 100k monthly views from BlueHost to DigitalOcean</h3>
<h1 id="tldrwemadeapreviouslyunusablestudentnewpaperwebsiteusablewithminimaldowntime">tl;dr: we made a previously unusable student newspaper website usable with minimal downtime!</h1>
<br>
<p><img src="https://blog.andrewhoang.me/content/images/2018/03/cheering_minions.gif" alt="cheering_minions"></p>
<p>One of the things I really like about the software world is that there are a lot of opportunities to do pro-bono work, often with some interesting technical challenges attached. This is one of those times :)</p>
<h4 id="background">Background</h4>
<p>Earlier this year I got an email from the Daily Free Press, Boston University's Independent Student Newspaper, requesting help with their website. Their issues were simple but severe: page loads were really, really slow, and roughly one in every four or five requests returned an annoying &quot;Error establishing a database connection&quot; page.  The database connection errors made it difficult to publish articles and created a terrible user experience for the DFP's ~40k active readers. I felt that this would be good practice for my new job and a good learning experience for all involved, so several members of the BostonHacks tech team and I decided to take ownership of the DFP's web presence and dove in.</p>
<h4 id="forensicsfightingbluehost">Forensics + Fighting BlueHost</h4>
<p>Like any other confused sysadmin, the first thing we did was gather information, reading the deployment configs and logs to diagnose what was wrong.  The existing site was running on BlueHost's &quot;Optimized Wordpress Hosting&quot; plan (oh, the irony).  We got access to DFP's BlueHost account with some difficulty - the BlueHost interface (mostly cPanel, with some extra custom configs) was very slow to load, likely because it had to query the barely-breathing server for info.  After some finagling we were able to SSH into the server and poke around.</p>
<h5 id="harddriveinvestigation">Hard Drive Investigation</h5>
<p>The site was running on a standard Linux box with 2GB of RAM and a 30GB hard drive.  Using <code>df -h</code>, we found that an additional 30GB backup drive was attached, as well as two more 30GB drives, with various (wonky) levels of usage:</p>
<p><img src="https://blog.andrewhoang.me/content/images/2018/03/Screen-Shot-2018-01-30-at-12.40.55-AM.png" alt="Screen-Shot-2018-01-30-at-12.40.55-AM"></p>
<p>DFP's Wordpress installation was in <code>/home/dailyfr3/public_html</code>, and while that drive was mostly full (25/30GB used), it wasn't actually full.  The two other 30GB drives, mounted on <code>/</code> and <code>/home/dailyfr3</code>, weren't anywhere near full either.  However, the <code>/backup</code> drive was 100% full, and upon checking the BlueHost cPanel we realized that someone had enabled four-times-daily backups of the entire <code>/home/dailyfr3</code> directory, which held ~25GB of stuff... meaning the already-full backup drive was being asked to absorb (and was rejecting) roughly 100GB of writes a day.  We turned off the backups and tried to clear out <code>/backup</code>, but since BlueHost doesn't give you root access on their Optimized Wordpress plan, we weren't able to delete anything from the full drive.  Site performance stayed the same, though, so we moved on.</p>
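<p>If you ever need to run the same triage yourself, here's a minimal sketch - a scratch directory stands in for the real <code>/backup</code> mount, and the file and sizes are made up:</p>

```shell
# Scratch directory standing in for the full /backup mount (illustrative only).
mkdir -p demo_backup
dd if=/dev/zero of=demo_backup/snapshot.tar.gz bs=1024 count=2048 2>/dev/null

df -h demo_backup                           # how full is the volume itself?
du -sh demo_backup/* | sort -rh | head -5   # which entries are the biggest?
```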
<h5 id="cpumemoryinvestigation">CPU/Memory Investigation</h5>
<p>Using <code>top</code>, we inspected the server's memory and CPU usage. Here's a screenshot:</p>
<p><img src="https://blog.andrewhoang.me/content/images/2018/03/Screen-Shot-2018-01-30-at-12.36.37-AM.png" alt="Screen-Shot-2018-01-30-at-12.36.37-AM"></p>
<p>Three things stand out here: the high RAM usage (1.6/1.9GB), high CPU usage by mysqld (spiking up to 95% - this screenshot caught a relatively high 83%) and, most importantly, huge virtual memory usage by mysqld - the &quot;VIRT&quot; column shows it using 2.2GB, more than the server's physical RAM, which means part of it was living in swap.  Swapping is what an OS does when it runs out of memory and falls back to using disk space as emergency memory.  For a MySQL database powering Wordpress it's especially bad, because Wordpress/PHP has no concept of DB &quot;connection pooling&quot; like some other stacks - it creates and destroys a database connection for every query.  That means a nontrivial memory allocation on every page load, and if the OS is out of memory and falling back to swap, that connection state has to be paged between RAM and disk - a horribly, horribly slow process that makes queries take forever.  Once the queue of pending queries gets long enough, Wordpress gives up waiting for a connection and kicks back an &quot;Error establishing a database connection.&quot;  Voila.  The culprit.</p>
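<p>The checks behind that diagnosis don't need anything fancy - reading <code>/proc</code> directly works on any Linux box, even without root (mysqld is the suspect process here; on a machine without mysqld the loop simply prints nothing):</p>

```shell
# System-wide memory and swap headroom:
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo

# Per-process swap usage -- VmSwap > 0 for a busy database is a red flag:
if command -v pgrep >/dev/null; then
  for pid in $(pgrep mysqld); do
    grep VmSwap "/proc/$pid/status"
  done
fi
```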
<h5 id="mysqlinvestigation">MySQL Investigation</h5>
<p>With the swap issue in mind, we poked around the MySQL installation to look for anomalies there.  In doing so, we found that DFP had three active websites (dailyfreepress.com, blog.dailyfreepress.com, and hockey.dailyfreepress.com), each with its own separate database and its own separate Wordpress installation.  Basically, DFP was running three separately managed Wordpress sites on one box, which contributed to the high memory/swap usage.  We also couldn't tell whether this was a shared server instance - other users may have been hitting the same MySQL instance and server - because we didn't have root access to find out.</p>
<p>A sample MySQL connection issue (we saw both this and the dreaded &quot;Error establishing a database connection&quot;):</p>
<p><img src="https://blog.andrewhoang.me/content/images/2018/03/Screen-Shot-2018-02-14-at-1.31.13-PM.png" alt="Screen-Shot-2018-02-14-at-1.31.13-PM"></p>
<p>We also weren't able to look at the MySQL logs (or anything in <code>/var/log</code>) because of the lack of root access.  We did notice that while configs for both Apache and nginx were present in <code>/etc/</code>, only nginx was running.  With all this in mind, we decided to migrate to a clean install on DigitalOcean, where we would have root access, better monitoring tools, and overall a much more reputable hosting service.  (We would also save DFP a moderate amount of $$$ - cutting their bill from $30 to $7/month.)</p>
<h3 id="thegreatmigration">The Great Migration</h3>
<br>
<p><img src="https://blog.andrewhoang.me/content/images/2018/03/1kcxie.jpg" alt="Bird Migration"></p>
<br>
<p>It was important to DFP that we do the migration with minimal downtime, so we decided to set up a staging environment on DigitalOcean with a clean Wordpress install and copy five years' worth of content over to it.  After configuring everything, we would shift traffic to the DigitalOcean server and turn off the old one (bye-bye, BlueHost!).</p>
<h5 id="switchingdnstocloudflare">Switching DNS to Cloudflare</h5>
<p>There were a few access logs from the old server showing bots attempting to pwn the Wordpress admin login page, which also didn't have TLS/SSL enabled.  To protect the new server (and provide some cover to the old one in the process), we imported DFP's DNS settings and started proxying traffic through CloudFlare.  This gave us the ability to quickly shift traffic later on - e.g., when promoting the staging server to production - as well as some basic TLS/SSL protection.</p>
<p>Previous site login page w/o SSL:<br>
<img src="https://blog.andrewhoang.me/content/images/2018/03/Screen-Shot-2018-01-30-at-12.40.24-AM.png" alt="Screen-Shot-2018-01-30-at-12.40.24-AM"></p>
<h5 id="backingupthedbandtheimages">Backing up the DB and the Images</h5>
<p>We tried to get a DB backup via phpMyAdmin, but the BlueHost control panel failed us.  Instead, we generated a database dump for each of DFP's sites using <code>mysqldump</code>.  Altogether the dumps came to just three files totaling about 1GB, so we simply scp'd them over to the new server.</p>
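<p>The dump-and-copy step, written out as a small script - the database names, user, and destination host below are hypothetical stand-ins for DFP's real ones:</p>

```shell
# Write the migration script (sketch; placeholders throughout):
cat > dump_and_copy.sh <<'EOF'
#!/bin/sh
set -e
for db in dfp_main dfp_blog dfp_hockey; do
  # --single-transaction dumps a consistent snapshot without locking InnoDB tables
  mysqldump --single-transaction -u backup_user -p "$db" > "$db.sql"
done
scp dfp_main.sql dfp_blog.sql dfp_hockey.sql deploy@new-droplet:/var/backups/
EOF

# Syntax-check only -- actually running it needs the real servers:
sh -n dump_and_copy.sh && echo "parses cleanly"
```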
<p>The images were a bit trickier - there were 23GB worth of images/gifs that DFP had collected and uploaded over the years.  After attempting and failing to compress them into a tarball, we decided to just <code>rsync</code> the images between the servers, which took a few hours.</p>
<h5 id="settingupnginxwordpressandletsencrypt">Setting up nginx, Wordpress and LetsEncrypt</h5>
<p>While the content was being copied over, we pushed a clean install of Wordpress to the staging server.  Because nginx is a bit more memory-efficient (and we had more experience with it), we chose it over Apache for our web server.  We also set up TLS/SSL certs on staging using the awesome LetsEncrypt/Certbot project.  TLS/SSL would help protect the admin login pages as well as improve DFP's SEO ranking.</p>
<h5 id="importingthemaindfpsite">Importing the Main DFP site</h5>
<p>With a clean Wordpress install on the new server, it was time to import the content from the dump files.  With a little trial and error, we were able to import the main DFP site.  For everything to work, we needed to change some prefixes (the original site had a Wordpress table prefix of &quot;wp_main&quot;, which we had to change to &quot;wp_&quot;), and we needed to change the image URLs, which were prefixed with &quot;<a href="http://dailyfreepress.com">http://dailyfreepress.com</a>&quot; instead of the new, secure &quot;<a href="https://dailyfreepress.com">https://dailyfreepress.com</a>&quot;.</p>
<p>I had the bright idea of using Visual Studio's find-and-replace tool to convert these prefixes, which promptly froze my computer for ten minutes when I ran find-and-replace on a 0.7GB text file.  Fun fact: <code>sed</code> will do the job in under 10 seconds without crashing your computer.</p>
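<p>Here's the same <code>sed</code> trick on a two-line stand-in for the 0.7GB dump (table prefix <code>wp_main_</code> to <code>wp_</code>, plain-HTTP URLs to HTTPS; the rows are made up for the demo):</p>

```shell
# Tiny stand-in for the real dump file:
cat > dump.sql <<'EOF'
INSERT INTO wp_main_posts VALUES (1, 'http://dailyfreepress.com/a.jpg');
INSERT INTO wp_main_options VALUES ('siteurl', 'http://dailyfreepress.com');
EOF

# Rewrite table prefixes and image/site URLs in place:
sed -i 's/wp_main_/wp_/g; s|http://dailyfreepress\.com|https://dailyfreepress.com|g' dump.sql
cat dump.sql
```

(The <code>-i</code> flag edits the file in place; GNU sed syntax shown.)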
<p>With the prefixes converted, we ran a MySQL import on the modified dump file and everything worked!</p>
<h5 id="creatingawordpressnetworkandconnectingthedfpbloganddfphockeysites">Creating a Wordpress network and connecting the DFP Blog and DFP Hockey sites</h5>
<p>Instead of having three separate Wordpress installs (yes, three separate folders with the same source code) like the old server, we wanted to use Wordpress' &quot;network&quot; feature, which lets you manage multiple sites from the same admin panel.  Through a little more trial and error, we were able to enable this feature and create linked DFP Blog and Hockey sites via the Wordpress admin.  After creating the linked sites, we had to get the content from the old server into the DB, which we managed after more finagling with <code>sed</code> and Wordpress prefixes. Ta-da!</p>
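<p>For reference, the network feature is switched on with a one-line <code>wp-config.php</code> change - sketched here against a scratch copy rather than a live install:</p>

```shell
# Write the multisite switch to a scratch wp-config fragment:
cat > wp-config-fragment.php <<'EOF'
/* Exposes the Network Setup screen under Tools in wp-admin: */
define( 'WP_ALLOW_MULTISITE', true );
EOF

grep 'WP_ALLOW_MULTISITE' wp-config-fragment.php
```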
<h3 id="postmortem">Post-Mortem</h3>
<br>
<p><strong>Root cause:</strong> Wordpress/PHP had issues connecting to a shared MySQL database instance because of high RAM and swap usage.</p>
<p><strong>Solution:</strong> Clean LEMP (Linux, nginx, MySQL, PHP) and Wordpress install. Also added logging and site backups to our new DigitalOcean setup!</p>
<p>If you've read this far, thank you :) and if you're a BU student interested in working on fun problems like this for other BU students, shoot us an email at <strong>dfp [at] bostonhacks.io</strong> :)</p>
<p>Thanks to Noah Naiman and Ken Garber for their help in troubleshooting the above.</p>
</div>]]></content:encoded></item><item><title><![CDATA[How to fix "can't access Wordpress admin (/wp-admin) after changing url or migrating database"]]></title><description><![CDATA[<div class="kg-card-markdown"><p>Every once in a while, I get the <s>horrible misfortune</s> opportunity to fix a bug that is very small but very annoying. This one took an entire Sunday afternoon.</p>
<p>If you are ever migrating a Wordpress site across servers, and you did a database dump, and you <em>think</em> you renamed</p></div>]]></description><link>https://blog.andrewhoang.me/how-to-fix/</link><guid isPermaLink="false">5a3c096b2e18333d1f8a8227</guid><dc:creator><![CDATA[Andrew Hoang]]></dc:creator><pubDate>Mon, 19 Feb 2018 02:03:55 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>Every once in a while, I get the <s>horrible misfortune</s> opportunity to fix a bug that is very small but very annoying. This one took an entire Sunday afternoon.</p>
<p>If you are ever migrating a Wordpress site across servers, and you did a database dump, and you <em>think</em> you renamed the tables correctly, and you CAN log in using the old admin credentials on the new site but CAN'T access the Wordpress admin console (yourblog.com/wp-admin), then make sure that the three queries below return the expected results.</p>
<p>More specifically, your user needs to 1) be an administrator in the DB, 2) have user level 10, and 3) have the wp_user_roles option set correctly in the <strong>wp_options</strong> table.</p>
<p>Re: the third query: sometimes, if your original database had a different prefix, i.e. <strong>old: wp_site, new: wp_</strong>, then the entry in wp_options will NOT be valid (option_name will be <strong>wp_site_user_roles, not wp_user_roles</strong>), and your admin will be able to log in but won't be able to see /wp-admin. Fix it by changing the prefix in the database.</p>
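<p>A sketch of that fix, with sqlite3 standing in for MySQL and a made-up old prefix of <strong>wp_site_</strong> (on a real install you'd run the UPDATE against your MySQL database):</p>

```shell
sqlite3 demo.db <<'EOF'
CREATE TABLE wp_options (option_name TEXT, option_value TEXT);
INSERT INTO wp_options VALUES ('wp_site_user_roles', 'a:1:{...serialized roles...}');

-- Rename the option so Wordpress finds it under the new wp_ prefix:
UPDATE wp_options SET option_name = 'wp_user_roles'
 WHERE option_name = 'wp_site_user_roles';

SELECT option_name FROM wp_options;
EOF
```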
<p>Queries below:</p>
<pre><code>-- 1) The admin user must have the administrator capability:
SELECT meta_value FROM wp_usermeta WHERE user_id = [admin user_id] AND meta_key = 'wp_capabilities';

-- Should return something like this:
+---------------------------------+
| meta_value                      |
+---------------------------------+
| a:1:{s:13:&quot;administrator&quot;;b:1;} |
+---------------------------------+

-- 2) The admin user must have user level 10:
SELECT meta_value FROM wp_usermeta WHERE user_id = [admin user_id] AND meta_key = 'wp_user_level';

-- Should return 10:
+------------+
| meta_value |
+------------+
| 10         |
+------------+

-- 3) The serialized roles option must exist under the current prefix:
SELECT option_name, option_value FROM wp_options WHERE option_name = 'wp_user_roles';

-- Should return 1 row, with a massive serialized hash.
</code></pre>
<p>Hope this saves someone a Sunday afternoon :)</p>
</div>]]></content:encoded></item><item><title><![CDATA[How to be Emily]]></title><description><![CDATA[<div class="kg-card-markdown"><p>I read a lot of <a href="http://news.ycombinator.com">Hacker News</a>, and I was scrolling along one day when I stumbled upon a blog post from the wonderful people over at <a href="https://interviewing.io">interviewing.io</a>.  The post is titled <a href="https://blog.interviewing.io/if-you-care-about-diversity-you-should-stop-hiring-from-the-same-five-schools/">&quot;If you care about diversity, don’t just hire from the same five schools&quot;</a>, and</p></div>]]></description><link>https://blog.andrewhoang.me/how-to-be-emily/</link><guid isPermaLink="false">5a2ef5052e18333d1f8a821c</guid><dc:creator><![CDATA[Andrew Hoang]]></dc:creator><pubDate>Sun, 17 Dec 2017 04:49:21 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown"><p>I read a lot of <a href="http://news.ycombinator.com">Hacker News</a>, and I was scrolling along one day when I stumbled upon a blog post from the wonderful people over at <a href="https://interviewing.io">interviewing.io</a>.  The post is titled <a href="https://blog.interviewing.io/if-you-care-about-diversity-you-should-stop-hiring-from-the-same-five-schools/">&quot;If you care about diversity, don’t just hire from the same five schools&quot;</a>, and compares the experiences of three (fictional) college students looking for a job in software engineering after graduation: <strong>Mason,</strong> a student at an elite target school, <strong>Emily,</strong> a student attending a mid-tier institution, and <strong>Anthony,</strong> a student at a local state college.</p>
<p>The tldr; of the blog post is that Emily and Anthony find themselves at a big disadvantage in <strong>getting interviews</strong> and also in <strong>passing interviews.</strong>  I won't rehash the specifics here, but I encourage everyone - but especially students at non-target schools - to check out the post, as it effectively explains many of the struggles that students like Emily and Anthony face when looking for a job.</p>
<p>The blog post really spoke to me because I found a lot of similarities to Emily's experience in my time at <a href="http://bu.edu">Boston University</a>.  BU ranks somewhere between 35th-40th in US News' annual national rankings and is pretty much non-existent in any CS/CE University rankings. It's a good school, and I'm glad that I chose to attend, but it's definitely not a target school, and that's made some parts of the job/internship hunt very, very difficult.  To help my classmates and other students in a similar bind, I wanted to share a few realities of our situation and some tips for <strong>being Emily: how to break into elite software companies from non-elite schools.</strong></p>
<h3 id="unfortunaterealitiesofjobhuntingatamidtierschool">Unfortunate Realities of Job Hunting at a Mid-Tier School</h3>
<h4 id="youwillberesumescreenedbecauseofyourschool">You Will Be Resume Screened Because of Your School</h4>
<p>This sucks, and it happens all the time. Here's why: when recruiters at GoogSoft are doing the ten-second screen of your resume, they are looking for something that says to them &quot;this person will pass the interview.&quot; If the &quot;Education&quot; section lists a mid-tier school they are probably not familiar with, they will start looking for something else in the &quot;Experience&quot; section, such as an internship at a well-known competitor like AmaBook - which is just as hard for students at mid-tier schools to get into, for the same reasons.</p>
<p>For some companies, it's even more difficult - many boutique startups or hedge funds will pass on you if you don't go to a top school, unless you demonstrate some truly remarkable accomplishments (math/computing olympiad, topcoder red, significant contributions to a popular open-source project, etc.)</p>
<p>The worst part of this is that you cannot change your resume quickly (except the &quot;Projects/Extracurriculars&quot; section, which counts for much, much less), so the battle is over before it even starts - you never get a chance to show what you actually know.</p>
<h4 id="yourpeersattargetschoolswillhavemuchbetterpreparationandinformation">Your Peers at Target Schools Will Have Much Better Preparation and Information</h4>
<p>Doing technical interviews is like taking the SAT - it's a game, and if you do it enough times, you'll keep doing better and better on average. As interviewing.io points out, students at top schools have easier access to interviews.  They have the luxury of being able to take an interview (or three) with a company they might not be so interested in for practice, which really, really helps with being calm under pressure and getting a feel for the kinds of questions asked - two things that you can't really simulate outside of a live interview environment.</p>
<p>I was chatting with one of my friends who is finishing up undergrad at Georgia Tech. He mentioned that after four years of college (four job searches - three intern, one full-time) he had done 53-ish interviews. Fifty. Three. Most seniors I know at BU barely crack double digits in their four years. It's much, much more difficult to perform at a comparable level when the competition has done three to ten times as many practice reps - whether that's crossfit, baking cupcakes or coding interviews.</p>
<p>Students at target schools are also more likely to have classmates or know alumni that have interviewed/worked at top companies. This helps on two fronts: with referrals and with company-specific information, both of which help immensely in the job-hunt process. Referrals can be the difference between getting an interview and ending up in the recycle bin. Company-specific information can mean even more than that: at the offer stage, many elite companies make a lowball offer, expecting the candidate to negotiate (you should <a href="https://www.kalzumeus.com/2012/01/23/salary-negotiation/">ALWAYS negotiate</a>, by the way). Not being able to ask alumni/classmates what a fair offer looks like can literally cost candidates tens of thousands of dollars a year. (It almost happened to me.)</p>
<h4 id="impostersyndrome">Imposter Syndrome</h4>
<p>You got the job! Yay! You beat the odds!</p>
<p>You get to the office on the first day, and after a quick orientation someone hands you a laptop and tells you to check out the master branch of the microservice that you're supposed to be working on and add a simple GET endpoint to the elephants model. Eager to please, you get right to work.</p>
<p>Wait...what the hell is docker? And what is a git branch? Nervous, you get <em>something</em> running, but there are red error statements coming out of your terminal. Your mentor takes one look and says &quot;the service can't find the database - point it to localhost:8900.&quot; You wait until he/she steps away from the desk, and then you anxiously google what the hell they just said. You did some SQL in your databases class, but these people are using an eight-node Cassandra cluster &quot;because we have too much data for postgreSQL.&quot; Your mentor is watching you out of the corner of his/her eye, hoping that you will get up and running soon - it's been four hours and you're still trying to set up your dev environment.  Meanwhile, the kid across the aisle from Carnegie Institute of Technology has ten terminal prompts open and is typing at 120wpm while head-banging to Hans Zimmer.</p>
<p>(This is me on my first day at a well known NYC company that makes terminals.)</p>
<p>It's easy to get <strong>imposter syndrome</strong> at a top company, especially when many of your peers come from top schools. Because there is a higher concentration of experienced software engineers at those schools, students get more exposure to good practices and cutting-edge tooling - things that don't necessarily show up in interviews but definitely do on the job. Target schools are also willing to spend more resources on faculty and competitive curricula. Because of this, students at these universities are likely to be more comfortable transitioning to real-life software engineering - which can make their coworkers from mid-tier schools feel like they're playing catch-up.</p>
<h3 id="howtoeventheplayingfield">How to Even the Playing Field</h3>
<p>So - how do you find success when the odds are against you?</p>
<h4 id="startsmallandstartearly">Start Small, and Start Early</h4>
<p>For the vast majority of people like Emily, getting into the &quot;Big Four&quot; or a comparable elite startup/hedge fund is a 2-4 year project. It's going to be very difficult to jump straight to one of those companies because you'll likely need to demonstrate some work experience before getting a chance at an interview.</p>
<p>The easiest way to get into an elite company for full-time is as a returning intern. This is because full-time interview processes have many more rounds and many more applicants - you're now competing with not only other college students but people with 1-4 years of experience. This means that you should try to get into an elite internship the summer <strong>before senior year.</strong></p>
<p>That gives you two years to build your resume up so that you can get those target interviews. Try to do something the summer after your freshman year, if you can - you'll likely have to find work at a smaller local company or do on-campus research. Literally any relevant work experience is good at this point. The next summer, shoot for a more well-known company. By the time junior year rolls around, you should hopefully have two relevant entries in the &quot;Experience&quot; section. In the meantime, buy <a href="https://www.amazon.com/Cracking-Coding-Interview-Programming-Questions/dp/0984782850/ref=sr_1_1?ie=UTF8&amp;qid=1513484815&amp;sr=8-1&amp;keywords=cracking+the+coding+interview">Cracking the Coding Interview</a> and work through it. Try to take your Intro to Algorithms class as soon as possible, too, so that you have some practice analyzing and implementing algos.</p>
<h4 id="gotohackathonsandbuildsideprojects">Go to Hackathons and Build Side Projects</h4>
<p>This is the &quot;gym&quot; of software engineering - you have a chance to learn modern languages/frameworks, as well as meet a bunch of people who are interested in tech. Many events will pay your travel expenses to attend. There are also free meals, t-shirts and in many cases companies looking to recruit. What's not to like?</p>
<p>Have fun at these events, and try to meet people - company reps, attendees, organizers, everyone. Most importantly, build some projects - you can put these on your resume (and if you stay up all night wondering why your node.js app is throwing an invisible error and <em>still</em> like coding, this is definitely the career for you). Need a starter project? Build your own personal website (like this one!) and deploy it. You'll learn a ton about things that are important for real life but that no one will teach you in school: server configs, DNS settings, SSL certs, and more.</p>
<h4 id="berelentlessandthinkoutsidethebox">Be Relentless, and Think Outside the Box.</h4>
<p>This sounds like one of those corporate mottos that you put on a poster to motivate people. I apologize.  However, it's true. To be successful, you'll need to put yourself out there and be relentless - perhaps even straight-up aggressive - when looking for a job. Don't just apply online - your resume will likely not even be looked at. Go to LinkedIn, find a university recruiter, and email them. Don't have their email? Guess it (blog post on how to do this reliably coming soon). Go to career fairs, get business cards, and email. Ask your friends, family, professors - anyone that might have a recruiting connection - to help you out. Have a presence on relevant social networks - obviously <a href="https://linkedin.com">LinkedIn</a>, but also <a href="http://angel.co/">AngelList</a>, and (the civil) parts of <a href="https://www.reddit.com/r/cscareerquestions/">Reddit</a>. Read <a href="https://news.ycombinator.com">Hacker News</a> and <a href="https://producthunt.com">Product Hunt</a> so you know what the tech community is buzzing about. Have a public <a href="https://github.com">GitHub</a> profile, and know what's trending over there.</p>
<h3 id="conclusion">Conclusion</h3>
<p>I was fortunate to be introduced to programming through an intro to Java course in high school. Many of my classmates from that course attended target schools as CS/CE majors, and a large number of them ended up at GoogSoftAmaBook for an internship (or two, or even three). Meanwhile, I found it very difficult to get an interview at those companies, and it took me my entire college career to get in.</p>
<p>The above is a summary of what I learned along the way. My difficulty getting into those companies wasn't because I was less talented or qualified than my high school classmates - it was because, coming from a non-target school, I had to take a different approach than my target-school peers.</p>
<p>It's not easy to be Emily. But you can do it. Good luck!</p>
</div>]]></content:encoded></item><item><title><![CDATA[How API Request Signing Works (and how to implement HMAC in NodeJS)]]></title><description><![CDATA[<div class="kg-card-markdown">
<h5 id="intro">Intro:</h5>
<p>Web APIs are notoriously hard to secure.  As a developer, anytime you expose endpoints/resources to The Internet™ for others to use, it's important to make sure that the people who use those endpoints are who they say they are.  We can accomplish this with API Request Signing.</p>
<h5 id="backgroundknowledge">Background</h5></div>]]></description><link>https://blog.andrewhoang.me/how-api-request-signing-works-and-how-to-implement-it-in-nodejs-2/</link><guid isPermaLink="false">5a2a4cf9fa34c33bf2405812</guid><dc:creator><![CDATA[Andrew Hoang]]></dc:creator><pubDate>Sun, 05 Mar 2017 03:24:48 GMT</pubDate><content:encoded><![CDATA[<div class="kg-card-markdown">
<h5 id="intro">Intro:</h5>
<p>Web APIs are notoriously hard to secure.  As a developer, anytime you expose endpoints/resources to The Internet™ for others to use, it's important to make sure that the people who use those endpoints are who they say they are.  We can accomplish this with API Request Signing.</p>
<h5 id="backgroundknowledge">Background knowledge:</h5>
<p>This article assumes familiarity with the <a href="https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol">HTTP protocol</a> and web APIs, and a basic understanding of hash functions. I provide a sample implementation in NodeJS/Express. It is also important to use <a href="https://en.wikipedia.org/wiki/Transport_Layer_Security">HTTPS</a> to protect requests between the client and server - but even with HTTPS in place you should implement request signing, since HTTPS cannot defend against attackers looking to masquerade as real users.</p>
<h5 id="requestsigningbasics">Request Signing Basics</h5>
<p>Consider the following system. Tim is our &quot;server&quot; and listens for requests. Alice is our real user, who wants to send a message to Tim. David is the attacker. Alice tries to identify herself to Tim by sending her name in the request, but if David is able to see her message on the network or otherwise gain access to her &quot;name&quot; field, he can pretend to be her and send false messages to Tim:</p>
<p><img src="https://blog.andrewhoang.me/content/images/2018/02/Initial.png" alt="Network Diagram"></p>
<p>This is a common problem that we face when building web APIs - since users' IDs/usernames are relatively well known, we cannot use these as identifiers. To combat this, we need to implement a <a href="https://en.wikipedia.org/wiki/Message_authentication_code">MAC (Message Authentication Code)</a>, an algorithm which confirms that a given message came from its sender and that the data in the message hasn't been altered in transit.</p>
<p>There are several flavors of MAC algorithms, but the one we'll focus on for now is called <a href="https://en.wikipedia.org/wiki/Hash-based_message_authentication_code">HMAC</a>, which stands for Hash-Based Message Authentication Code.</p>
<h5 id="whatishmac">What is HMAC?</h5>
<p>HMAC is a MAC algorithm that depends on a cryptographic <em>hash function</em>. (You may remember from your CS texts that a hash function takes input data and maps it to standardized output data, and that good hash functions produce as few <em>collisions</em> as possible, which means that different input is rarely mapped to the same output.)</p>
<h6 id="hashfunctions">Hash functions</h6>
<p>In the case of HMAC, we need a hash function that takes a variable-size String input and generates a fixed-size String output. The input cannot be easily recovered from the output, and different inputs <em><strong>must</strong></em> map to different outputs - <em>hash(&quot;hi tim&quot;)</em> must produce a different result than <em>hash(&quot;hi timo&quot;)</em>.</p>
<p><em>(If you were able to write a function that satisfies all of the above constraints, please contact me. You'll probably win a <a href="https://en.wikipedia.org/wiki/Turing_Award">Turing Award</a> and I would love a share of the prize money.)</em></p>
<p>Fortunately, several hash functions (written by some very smart people) already exist. We'll use <strong><a href="https://en.wikipedia.org/wiki/SHA-2">SHA-256</a></strong> for this exercise (thanks, NSA). SHA-256 gives us a 256-bit output that is, for all practical purposes, unique for any string input. In hexadecimal, each character represents 4 bits, so our 256-bit output corresponds to a hexadecimal string that is <strong>64 digits</strong> in length.</p>
<h5 id="secretssecretsandmoresecrets">Secrets, Secrets and more Secrets</h5>
<p>Now that we've got our hash, we need to assign each user a <strong>public key</strong> and a <strong>private key (secret)</strong>. The public key will be sent in the request as an identifier. (The private key <em><strong>cannot</strong></em> be exposed in the request, and if an attacker were able to gain a user's public and private keys, they would be able to send requests as that user.) If you've worked with web APIs before, this will sound familiar - most APIs assign an identifier key that matches to a secret.</p>
<p>The diagram below shows this revised state - Alice and David each know their own ID and secret, but not each other's. As the server, Tim is the only one who knows everyone's ID and secret.</p>
<p><img src="https://blog.andrewhoang.me/content/images/2018/02/Secrets.png" alt="Secrets"></p>
<p>It doesn't really matter how we create these IDs/secrets - they just have to be unique and matchable to each user by the server.</p>
<h5 id="hmacsigning">HMAC Signing</h5>
<p>Now that we have everything we need, let's sign our request!</p>
<h6 id="firsttryatagoodhmac">First try at a good HMAC</h6>
<p>Our first version of the signature will be fairly simple: we'll concatenate the secret key + the message and make a hash of that. We'll attach our signature and the public key to our HTTP request as a header. The server will look for the public key in its database, find the corresponding private key, and calculate the same <em>hash(secret_key + message).</em> If this hash is the same as the signature that we sent in the request, we know 1) the message could have only come from someone with the secret key, and 2) the data in the message hasn't been altered in transit. Wahoo!</p>
<p>Below is our original example, this time with request signing. Alice signs her request with her private key. Tim checks the signature by computing his own hash of the information. Notice how the private key is not passed in the request at all - only the public key is sent, so that Tim can find the corresponding private key. David is angry, because to imitate Alice, he has to 1) successfully guess Alice's private key 2) make his own hash of the message and send it to Tim, which is a lot of work. (In real life, Alice's private key would be much longer than three characters, making the process of guessing the key nearly impossible.)</p>
<p><img src="https://blog.andrewhoang.me/content/images/2018/02/Signing.png" alt="Signing"></p>
<h5 id="issues">Issues</h5>
<p>This is pretty dandy, and solves the initial problem of proving that the message comes from Alice. But we have a small problem. Since the signature comes from a simple concatenation of the private key + message body, the boundary between key and message is ambiguous: you can move characters from the end of the key to the front of the message without changing the hash. Suppose Alice's key was '123' and her message was 'hi'. The pair key = '12', message = '3hi' concatenates to the same string, so <em>hash('12'+'3hi') = hash('123'+'hi')</em>, giving us the same (valid) signature for a different message! This is no good.</p>
<p>To combat this, the authors of the HMAC algorithm used a pretty nifty trick involving bit math and hashing twice to produce a secure document signature. (I won't go into the proof here, but if you're interested, you can read more about the algorithm in <a href="https://www.ietf.org/rfc/rfc2104.txt">RFC 2104</a>). The final 'secure' algorithm, from the RFC text, is:</p>
<pre><code>1) B is the block size, in bytes, of the underlying hash function. For SHA-256, the block size B = 64 bytes.

2) The key must be at most B bytes in length. (Per the RFC, keys longer than B bytes are first hashed, and that hash is used as the key.) If the key is shorter than B bytes, zero bytes are appended to the key until it is B bytes long.

ipad = the byte 0x36 repeated B times
opad = the byte 0x5C repeated B times
key = private key (zero-padded)
text = message to be signed (request body)

signature = hash((key XOR opad) + hash((key XOR ipad) + text))
</code></pre>
<p>And that's it. Now we've got a signature that pretty much guarantees that Alice is Alice, and is extremely difficult for David to crack.</p>
<h5 id="implementation">Implementation</h5>
<p>We don't need to implement this ourselves. Most modern languages/frameworks have crypto libraries that have an HMAC implementation already included, or utility functions which you can quickly stitch together: <a href="https://nodejs.org/api/crypto.html">NodeJS</a>, <a href="http://docs.oracle.com/javase/7/docs/api/javax/crypto/Mac.html">Java</a>, <a href="https://golang.org/pkg/crypto/">Golang</a>, etc.</p>
<p>Here's a sample signature in NodeJS using the NodeJS crypto library. It's extremely simple.</p>
<pre><code>var crypto = require('crypto');

var hmac = crypto.createHmac('sha256', secret_key);
hmac.update(request.body.message);
var signature = hmac.digest('hex');
</code></pre>
<h5 id="conclusion">Conclusion</h5>
<p>I hope this was a helpful introduction to the HMAC algorithm and API request signing! As I mentioned earlier, request signing is only one of the strategies that you should use to protect APIs/the integrity of messages. You should explore other strategies to help secure your application, especially HTTPS for encryption during transit - production APIs usually use both request signatures and HTTPS to provide a minimum level of security. Feel free to reach out to me at ahoang18 [at] bu.edu with any questions!</p>
</div>]]></content:encoded></item></channel></rss>