Ubuntu: solving the Broadcom STA slowdown on battery power

I found myself forced to replace a fairly new -9 months old- D-Link router, model DIR-835, because all its Ethernet ports stopped working. I will never buy another D-Link device, ever in my life. Well, past that point, I bought a Linksys N750 router, even though I was aware of all the security issues that the Cisco Cloud software has raised.

After doing the setup and everything, I found that all YouTube videos on my MacBook 6,2 running Ubuntu Raring 13.04 were loading at a painfully slow speed. Without a doubt, I blamed the new router, which received a big amount of resets, setting changes, etc. I tried plugging in the Ethernet cable and testing the speed that way, only to find that it was working great.

After some time of testing and testing, the battery on the laptop drained. So I plugged in the charger -while running a speed test- only to find that the download speed increased the moment I plugged it in.

I said to myself “no, this cannot be true” and kept plugging and unplugging the charger from the laptop, just to watch the speed go up and down. I issued the iwconfig eth1 command and here’s what I got:

With charger:

eth1 IEEE 802.11abgn ESSID:"JuanRomanV2"
Mode:Managed Channel:38 Access Point: C8:D7:19:21:3F:E0
Bit Rate=162 Mb/s Tx-Power:24 dBm
Retry min limit:7 RTS thr:off Fragment thr:off
Power Management:off
Link Quality=5/5 Signal level=-56 dBm Noise level=-88 dBm
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:0 Missed beacon:0

Without charger:

eth1 IEEE 802.11abgn ESSID:"JuanRomanV2"
Mode:Managed Channel:38 Access Point: C8:D7:19:21:3F:E0
Bit Rate=243 Mb/s Tx-Power:24 dBm
Retry min limit:7 RTS thr:off Fragment thr:off
Power Management mode:All packets received
Link Quality=4/5 Signal level=-59 dBm Noise level=-88 dBm
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:0 Missed beacon:0

Oh… so this nice card has a “Power Management” feature that, in order to manage the power, manages to make me feel like I’m back in the 90’s with those dial-up (DUN) connections, huh?

I investigated the issue and found that Broadcom included such a feature in the recent Broadcom STA Linux driver for the BCM43224 -and possibly many more models. On my other laptop, a Dell Vostro 3360 also running Raring, I also have a Broadcom wireless card and it doesn’t seem to have that problem.

So, how to solve this?

It seems there’s a daemon taking care of the battery status and changing the wireless card settings accordingly. After some tries, my solution was to put this content in the /etc/pm/power.d/wireless file:

#!/bin/sh
# Give the power-management daemon time to apply its own setting first
sleep 2
# Then force power saving off on the wireless interface
/sbin/iwconfig eth1 power off
exit 0
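One extra note: as far as I know, pm-utils only runs hooks that have the execute bit set, so if the script doesn’t seem to have any effect, make sure it is executable, for example:

sudo chmod +x /etc/pm/power.d/wireless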

Why the sleep? Because somewhere there’s another daemon taking care of the change. With the sleep 2, I give the other daemon enough time to be happy, to think the change has been done and to quit; and then my script saves the day by letting the card eat all the battery power again. But well, at least YouTube works decently.

Lastly, sorry Linksys! I blamed you hard for nothing…

Coding with style. Excessive style, actually.

We were having a bug where a DIV was being hidden by an asynchronous JavaScript bar at the top of the page. So one of my front-end developers came up with this solution: move the DIV a bit lower so it shows completely. That’s coding with a lot of style, isn’t it?

Double style tag? Are you kidding me, bro?

Weirdest reason in a technical request email ever: Porn is not allowed in Moldova

Some time ago I was working on a porn search engine project -now with more than 10 thousand visits a day and growing- and I found myself writing the weirdest reason ever for a domain change request. The site was www.porn.md, supposed to be “the porn doctor”: you would write your problem in the search box, and the doctor would give you some remedies. Everything was going well, traffic was going up and everything, until we received a notification from the managers of the .md TLD: Moldova does not allow porn websites.

We needed to move the project to a different domain: pornmd.com. There was no choice.

I went to the email client and sent my hosting company an email with the weirdest reason ever:

I will need to change the domain name from porn.md to pornmd.com because porn is not allowed in Moldova.

Thanks!

Nicolás Andrade

 

“Porn is not allowed in Moldova”. Have you ever written a weirder reason than that?

Recover millions of rows in a production MySQL server from a backup

That’s what we did during the last week. The first try was done by the sysadmins: they tried to insert everything in one go. Imagine an INSERT statement with 20 million values; of course it failed!

The second try took 2 days, using split files, but the sysadmin decided to restart from scratch because of a missing index. I would have kept that unindexed table at least and tried an alternative way to deliver results to the users as soon as possible. But sysadmins are just that, sys admins, not business admins, so the guy decided to simply drop the table, again. For the third massive insert I had the experience from the previous two, so I decided to define some rules for the sysadmin in charge of the import so things would be done differently.

The first thing was already known by both Troy and me: the 20 million rows should be split into many files and not a single one, as they had tried to do once. So we agreed on splitting the SQL into 19024 files of 1000 rows each.

Here I’ll document different aspects you could consider if you find yourself in the same situation.

 

Preparing the table to receive the inserts

I’ve read many times that it is convenient to insert the information before the indexes are created, to speed up the process, and only then create the indexes on the table. This is true for inactive servers -meaning staging or not-in-production servers- but it’s not good at all for live servers. Creating an index implies moving the whole table to a temporary one, creating the index itself, and then moving all the information back to the indexed table. Of course the table is locked during the whole process, so all your clients will be stuck: no inserts, no updates, no selects. The load on the server will go through the roof, so probably not only that table but the others will suffer too.

Even if the server is not active, it is better to have at least the PRIMARY KEY / UNIQUE index created. The insert process will be slower, true. But you’ll be safe if for any reason you need to re-insert a file that hung in the middle. If you don’t have the UNIQUE index created you’ll end up with duplicated rows, which will be more complicated to remove afterwards.

Having the UNIQUE index created and using the IGNORE option in the INSERT statement is the best way to be able to re-run a file in case you need to.

 

Avoid extended inserts.

While extended inserts are good in terms of speed and dump file size, you cannot resume the process from the middle if it breaks. A quick example:

INSERT INTO table (a,b,c) VALUES (1,2,3), (4,5,6), (7,8,9);

INSERT INTO table (a,b,c) VALUES (1,2,3);
INSERT INTO table (a,b,c) VALUES (4,5,6);
INSERT INTO table (a,b,c) VALUES (7,8,9);

The first insert will be much faster than the three inserts in the second block, because MySQL will update the indexes only once at the end of the insert. But both blocks do the exact same operation (assuming no duplicated keys).

The second block with the 3 statements will be slower because of the three writes and three index updates; but in terms of safety it is much better, since if a statement fails due to an existing key, the others will still be valid.

Another option is to use INSERT IGNORE to ignore errors while inserting. If you’re trying to insert a value that already exists in the index and you are using IGNORE, the error becomes a warning and the statement is executed anyway. This means that the whole insert will succeed, except for the particular rows that were repeated.

The tool that makes one INSERT per row pay off is called split, and it is a standard Linux command. First, generate a dump with one INSERT per line:

mysqldump (tablename, options, etc) --skip-extended-insert > dump.sql

At this point we had a nice dump.sql file with 20 million INSERTs, one per line. So we can move to the next step.

 

Split the big file into many

By running the split command on a file with one insert per line you get multiple files that are still valid SQL. The advantage of this is that you can run a few of them while monitoring the server status -and all the slaves- and see how they are affected. Also, you can stop the process for any reason and then continue from the breakpoint.
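As a rough sketch of how the dump can be chunked (assuming GNU split; the chunk_ prefix is hypothetical, the 1000-line size is the one discussed above):

# cut dump.sql into numbered chunks of 1000 INSERTs each
split --lines=1000 --numeric-suffixes --suffix-length=5 dump.sql chunk_
# give the chunks a .sql extension so the import loop can pick them up
for f in chunk_*; do mv "$f" "$f.sql"; done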

A good balance between 100% individual inserts and one single insert with all the values is to use one extended insert per file. For comparison, the 1-insert-per-row approach, split across two files, would look like this:

INSERT INTO table (a,b,c) VALUES (1,2,3);
INSERT INTO table (a,b,c) VALUES (4,5,6);
INSERT INTO table (a,b,c) VALUES (7,8,9);

INSERT INTO table (a,b,c) VALUES (10,11,12);
INSERT INTO table (a,b,c) VALUES (13,14,15);
INSERT INTO table (a,b,c) VALUES (16,17,18);

A faster and still safe approach, with one extended INSERT IGNORE per file, could be:

 

INSERT IGNORE INTO table (a,b,c) VALUES (1,2,3), (4,5,6), (7,8,9);

INSERT IGNORE INTO table (a,b,c) VALUES (10,11,12), (13,14,15), (16,17,18);

 

The chance of restarting from the last breakpoint is still there, while the speed increases significantly: MySQL will update the indexes only once every 3 rows, while in the previous example it updates them once per row.

During the operation we decided to use single inserts so as not to lock the table for long periods of time -inserting many rows at once locks the table for longer- so clients could keep reading while we were doing the import.

 

Inserting one file after the other

This is a simple step: loop over the files in the directory and execute them one by one. The script will (a sketch follows the list):

1-Get a list of files not starting with done_
2-Loop over that list, executing mysql < $file
3-Rename $file to done_$file.
4-Sleep for a small period of time to allow the server to stabilize and keep serving SELECTs to the clients.
5-Loop again, until no files remain.
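A minimal sketch of that loop could look like the following; the database name is hypothetical, the credentials are assumed to live in ~/.my.cnf, and the fixed pause will become configurable later in the post:

#!/bin/sh
# Import every pending chunk, marking each one as done_ when it finishes
for f in *.sql; do
    case "$f" in done_*) continue ;; esac   # skip chunks already imported
    mysql mydatabase < "$f" || exit 1       # stop at the first failure
    mv "$f" "done_$f"                       # mark the chunk as done
    sleep 2                                 # give the server (and slaves) a breather
done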

The speed of the whole process depends on many variables; one of the most important is the pause you leave between files. How long that pause should be depends mainly on:
1-As it’s a live environment, you should still be serving SELECTs from that table. While you’re inserting, the table is not locked, but the indexes on it are.
2-If you’re working in a replication environment, you should monitor the slaves’ SBM (seconds behind master). A big SBM means that the slave server is not fast enough to receive all the updates from the master. In that case you should try some of the things explained later in this post.
3-The load on the server, beyond MySQL connections and processes: you don’t want to melt your live server.

 

Let’s discuss the sleep

According to the plan we’re going to insert 19K files into the live server, with 1K rows each. That’s an enormous amount of information that has to be not only inserted in the master, but also transferred to the slaves, while they also receive the INSERTs, UPDATEs and SELECTs from the site itself and its normal activity. This means the stress on the servers will increase. How much? We don’t know yet; let’s try.

Pick the first file and throw it at MySQL. Measure the time the server takes to process it. Let’s say the server processed the file in one second. In a normal situation the slaves will take about the same time to process such an insert, so a safe value for the sleep between inserts can be one second too. But as we’re working in a live environment where many other things have to be propagated to the slaves, we should leave a small window of time to let the propagation finish. I would use 1.2 seconds in that case.

So create a file named sleep.txt and write 1.2 in it. Then update the script from before to read sleep.txt and sleep for the amount of time specified there. This lets us change the sleep value while the insertion is running. Say you see the number of MySQL processes start to climb, or the files being processed more slowly after some minutes: you can edit sleep.txt and set the value to 2, 3, 10, whatever you need. The server will be very thankful for that. Our updated script will now:

1-Get a list of files not starting with done_
2-Loop over that list, executing mysql < $file
3-Rename $file to done_$file.
4-Sleep for a small period of time to allow the server to stabilize and keep serving SELECTs to the clients. This value comes from sleep.txt in this way:
SLEEP=`cat /tmp/sleep.txt`; sleep $SLEEP;
5-Loop again, until no files remain.

Rename or remove any already-parsed files, then call the insertion script.

 

Monitoring: what to check?

Once the process started, open whatever monitoring tool you have to watch the server activity. I would have liked a command-line utility to check these values, but I didn’t have one. I used Sysmon, which shows graphs updated every few minutes, so the information I got was a bit outdated. But considering the process took 2 days, a delay of a few minutes is not an impediment.

As we said, open Sysmon and start to monitor the following things:

  • MySQL processes: how many MySQL processes are being executed on each server. You’ll notice an increase in this when you start doing the inserts, but try to keep it as close to a horizontal line as possible. You should know how the server behaves and how much margin you have under normal load. In my case I was able to push it to 1.5 times the normal load and stay safe.
  • Seconds behind master: if you see the SBM on the slaves start to increase beyond normal, increase the delay between operations. We’ll see later how to play with the sleep.txt file to get the best balance between speed and stability. When the sleep period goes up you should see the SBM start to decrease. If it does not, set a higher sleep time until you see the SBM drop. In my case I had 2 very slow servers; one of them had a huge SBM, and we decided to exclude it from the application.
  • Statements per second: shows 3 lines representing INSERTs, UPDATEs and SELECTs. You’ll see the INSERTs line go higher, while the UPDATEs and SELECTs stay roughly as before.

While watching the graphs you should try to identify issues before they happen, or at least identify them as fast as possible.

 

Identifying possible problems while stressing the server

Open the mysql client on the master server and on 2 or 3 of the slaves. Keep running SHOW PROCESSLIST; on all of them to identify possible issues (a small helper for this is sketched after the list below). During the process I found that the master server was receiving SELECT queries from the front end, even though it was not supposed to. That made me realize that, because of issues with the slaves, a previous developer had changed the code to run SELECTs against the master. As soon as I saw that, I changed the code to avoid those kinds of selects and updated the live environment right away. The operation was basically:

  • Run SHOW PROCESSLIST on the master and identify the selects.
  • Match the selects with your source code, in your development environment, making sure that the current version really does query the master server and not the slaves.
  • Double check that the query is the one you’ve found.
  • Download the offending file from the live environment to your local machine, and recheck that the connection is being made to the master.
  • Fix the code to connect to the slaves instead.
  • Run PHP lint on the affected file.
  • Upload the file to only one of the live servers, and open the error log right after the upload.
  • Depending on the number of front-end webservers you have, you may or may not see the impact of the change. In my case I have seventeen webservers, so the change was not noticeable.
  • If you don’t get any errors in the error log, propagate the file to the remaining servers, and keep watching the logs.
  • See how the processlist returns fewer and fewer SELECTs on the master each time.
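To avoid retyping the query all the time, something like this can help (a sketch; the host name is hypothetical and the credentials are assumed to be configured in ~/.my.cnf):

# refresh the processlist every 5 seconds and highlight the SELECT statements
watch -n 5 'mysql -h master.db.example -e "SHOW FULL PROCESSLIST" | grep -i select'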

It was amazing how the SELECTs line on the master decreased and, at the exact same time, the INSERTs line increased. The INSERTs were being slowed down by the SELECTs running on the master, that was clear. So, to do this kind of operation, you should know the code as if it were 100% written by you. You should know everything, everything, everything about your code if you plan to do something this stressful.

The graphic below shows how the INSERTs started to increase once the server stopped receiving SELECTs, around 18h30.

INSERTs per second increase once the master no longer receives SELECTs

If for any reason -usually SBM problems, and usually in very specific situations- you need to SELECT from the master, make sure you actually need to do it. How? Try to read from the slaves first. If the slave has the information, use it. If the slave does not have the row, then you can query the master. This simple protection decreases the load on the master at a very low price, which is querying twice instead of once in the worst case. But most of the time the slave will have the required information.

This can work as a patch, but your application should be designed to read only from the slaves, and never, never from the master.

Another issue we found was related to the 2 slow slave servers: one of them wasn’t able to receive the big amount of INSERTs and serve SELECTs at the same time, so it started leaving the INSERTs in a buffer. Again, good knowledge of the code will help you in this situation. My project loads the server addresses from a PHP array. The DB layer shuffles that array and tries to connect to the servers sequentially in a loop, until a connection is established. When a server is too loaded it rejects the connection with “Too many connections”, so the next server is tried. If no server is able to handle the connection, an exception is thrown and logged.

To sum up: we have an array of servers, and we shuffle it before connecting. If we have 5 servers, each server has a 20% chance of being chosen:

$slaveServers = array(
 '10.0.0.101',
 '10.0.0.102',
 '10.0.0.103',
 '10.0.0.104',
 '10.0.0.105'
);

shuffle($slaveServers);

foreach($slaveServers as $slave){
 $conn = tryToConnect($slave);
 if($conn){ return $conn;}
}
die('Can not connect to any slave server');

In this example all servers have a 20% chance of being selected. Let’s suppose that 10.0.0.101 is the slow server; we would like to send fewer clients to that server than to the others. To do so we can repeat the other servers in the array, increasing their probability of being selected.

$slaveServers = array(
 '10.0.0.101' , // Only once
 '10.0.0.102', '10.0.0.102', // Twice, increasing their chances to be selected
 '10.0.0.103', '10.0.0.103',
 '10.0.0.104', '10.0.0.104',
 '10.0.0.105', '10.0.0.105',
);

Our array now has 9 entries: every slave except the first has a 22% chance of being chosen, and the first one has 11%. The graphic below shows the impact this update had.

As you can see, the 3rd server stopped receiving SELECTs between 10h and 12h, and the other servers took that load. During that period I had put server 6 twice in the configuration (7 and 8 too, but those are not in the image), so it received double the load of the other servers.

After 13h I put servers 3 and 4 back to serving SELECTs, but only once in the array, while the others appeared twice or three times. The impact of these changes is much more evident when watching the SBM graphs (seconds behind master, that is, the time the slaves need to catch up and be a 100% faithful replica).

See how between 9h30 and 11h the server caught up with the master, just because it stopped receiving SELECTs. The same operation was done on server 4 between 12h30 and 13h: taking it out of the application and leaving it working on replication only.

 

Seconds behind master in the 3rd server goes up to one hour and a half

Overnight -unmonitored- work

As said before, this process took 2 days to complete, and of course we weren’t there checking the graphs and the activity during the whole process. During the day I was in the office, so while doing other tasks I checked the status very often. But from home I wasn’t able to check that closely. Before leaving the office I reviewed the graphs to try to identify possible issues, and looked at the servers’ behavior during the previous nights to find a pattern of how the public behaves during those hours and how the server is affected. I realized that the best way to complete the process was to alternate the sleep time every few minutes. The idea is to run with a minimal sleep time of 1 second for 200 or 300 inserts, and then give the server a 30-second pause to become stable again. After the pause, the sleep goes back to 1 second.

To do that we used the sleep command and the sleep.txt file again. The script does, more or less, this (a sketch follows the list):

  1. Loop while true
  2. Write “1” in sleep.txt
  3. Sleep for 300 seconds
  4. Write “30” in sleep.txt
  5. Sleep for 30 seconds
  6. Repeat the loop.
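A minimal version of that sleeper script might look like this (a sketch; /tmp/sleep.txt is the file the import loop reads its pause from):

#!/bin/sh
# Alternate between fast runs and short breathers, all night long
while true; do
    echo 1  > /tmp/sleep.txt   # 1-second pauses: roughly 200-300 files...
    sleep 300                  # ...during these 300 seconds
    echo 30 > /tmp/sleep.txt   # then ask the import loop for a 30-second breather
    sleep 30                   # wait for it before going fast again
done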

Since the inserting script reads the value from sleep.txt, we can adjust the sleep in real time, according to our needs, without stopping the process. If I had had access to the real statistics from MySQL on the command line, I could have written a script to increase or decrease the sleep time according to the server load. But that was outside our possibilities.

The sleeper script, which writes different values to sleep.txt depending on the time, was left running the whole night. Its effect during the night hours is easy to spot in the graphics above: in the first graphic, between 20h and the end; and in the other graphics, between midnight and 4h. All the little peaks are the very short sleeps.

If you, like me, ever asked “what is the sleep command useful for?”, there you have a good answer. In this situation the sleep command was the key to the whole thing. Without sleep this would have been much more complicated.

Receiving status of the task 24 hours a day

When the process started I asked the sysadmins to write a small, simple script that would:

  1. Count how many .sql files are in the directory, and store the value as TOTAL
  2. Count how many of them start with “done_”, and store the value as DONE
  3. Send me an email saying “$DONE of $TOTAL Completed” in the body, with “Import process” as the subject.

Just call that script once every 5 minutes, and there you have your status updates. That is the script we had; but after some time I realized that a much better approach would be the following (sketched in shell right after the list):

Step 0, create an empty file named total.txt

  1. Count how many .sql files are in the directory, and store the value as TOTAL
  2. Count how many of them start with “done_”, and store the value as DONE
  3. echo “$DONE of $TOTAL Completed” > last.txt (Write the phrase to a file)
  4. cat last.txt total.txt | head > new_total.txt && mv new_total.txt total.txt (Prepend the new phrase to the previous ones, keep only the 10 most recent lines, and save the result back as total.txt; the temporary file is needed because redirecting straight into total.txt would truncate it before cat reads it)
  5. Mail the content of total.txt to my email account (or make it accessible through NFS, http, etc)
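A rough version of that reporter could be the following; the directory, the email address and the use of mail are assumptions:

#!/bin/sh
# Report import progress; meant to be run from cron every 5 minutes
cd /data/import || exit 1
touch total.txt                                    # step 0: make sure the history file exists
TOTAL=$(ls *.sql 2>/dev/null | wc -l)              # every chunk, imported or not
DONE=$(ls done_*.sql 2>/dev/null | wc -l)          # the chunks already imported
echo "$DONE of $TOTAL Completed" > last.txt
cat last.txt total.txt | head > new_total.txt && mv new_total.txt total.txt
mail -s "Import process" me@example.com < total.txt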

Now use cron to execute that script every 5 minutes. You’ll get an email like this:

910 of 20000 Completed
830 of 20000 Completed
770 of 20000 Completed
710 of 20000 Completed
660 of 20000 Completed
600 of 20000 Completed
580 of 20000 Completed
520 of 20000 Completed
460 of 20000 Completed
390 of 20000 Completed

This is easily parseable, so it’s easy to extract good statistical data and see how the process reacts to the different changes we went through earlier in the post. You can clearly see the rate you’re working at by comparing consecutive reports.
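For example, a quick sketch to turn the report into files-per-interval numbers (remember the newest line is at the top, so each line is subtracted from the one above it):

# print how many files were completed in each 5-minute window
awk '{done[NR] = $1} END {for (i = 1; i < NR; i++) print done[i] - done[i+1], "files in 5 minutes"}' total.txt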

 

At the end of the recovery

To finish, I would like to mention that the work the sysadmins did during this week was amazing. I learned a lot while discussing the planning with them and considering all the different options to get the task done. After such a difficult experience I can say I’m happy with the result we got. Now that it’s over, I feel it has been a good experience.

Anyway, there’s still a lot of work to do to optimize the site the way I want. The goal is to migrate the whole table to a new one with the proper indexes. Yes, we’ll migrate the 20 million rows to another table, in the live environment, but that’s a good story for another post.

Thanks for reading!