Downtime and servers and bears, oh my!

Posted by: pulse [x] - (Moderator)
Date: July 10, 2012 09:59AM
Slight problem on the server this week as most of you will have noticed. We had a physical failure on the DB server, causing an outage - especially bad as we also lost power to the box at the same time.

The failure was a HD in the array, which should normally not have taken down the server, but when power was lost too, the system decided it wouldn't boot with a degraded RAID array. Ho hum.

Cue hours of work from DK and me restoring the array and then the boot partition and devices on the server, and it's back online.

The good news is, in the interim, I've ordered another server to replace this one as the core DB server. It's much more powerful, and should be much faster. And now we have this box back up, when I've got time I'll also replace the failed disk in the array. I'll probably split the sites onto different hardware so an outage will only take out 1 site, and so we have the ability to fail over to alternate systems in case of downtime.

So aside from the downtime (which was, what, 1.5 days? Or was it 2.5 in the end? I can't remember!) it's all good!
Posted by: BlahX3 [x] - (96.39.185.---)
Date: July 10, 2012 01:05PM
Glad it's back. I only noticed 1 day gone and was laughing because it coincided with the FBI pulling the DNSChanger servers off the net.
Posted by: DarkKlown [x] - (Moderator)
Date: July 10, 2012 03:49PM
Damn it.. i knew they'd figure it out pulse.. i told you..
Posted by: BlahX3 [x] - (96.39.185.---)
Date: July 10, 2012 11:51PM
Hahahaha! You dorks. I don't know who to believe. Especially about the goats. I am sure I speak for all the others when I say the site was missed during the outage, as crazy as that may seem...
Posted by: pulse [x] - (Moderator)
Date: July 20, 2012 03:09PM
There'll be an outage for probably around 2-3 hours tomorrow while I migrate things to the new server. The new server is built and set up, so it just needs a final DB/image synchronisation & change of DNS.

I don't know exactly what time I'll do it yet, depends on my availability tomorrow. It'll either be morning AEST (say 10am Melbourne time) or around 10-11pm Melbourne time.

[] for those who don't know when that is :)
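For anyone converting the two candidate times by hand, here is a minimal Python sketch. It assumes "tomorrow" means July 21, 2012 and treats AEST as a fixed UTC+10 offset; the function name `aest_to_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: AEST is a fixed UTC+10 offset (no daylight saving in July).
AEST = timezone(timedelta(hours=10), "AEST")

def aest_to_utc(year, month, day, hour, minute=0):
    """Return the UTC datetime for a wall-clock time given in AEST."""
    local = datetime(year, month, day, hour, minute, tzinfo=AEST)
    return local.astimezone(timezone.utc)

# The two candidate windows mentioned in the post (date is an assumption).
morning = aest_to_utc(2012, 7, 21, 10)  # 10am Melbourne time
evening = aest_to_utc(2012, 7, 21, 22)  # 10pm Melbourne time

print("Morning window starts (UTC):", morning.isoformat())
print("Evening window starts (UTC):", evening.isoformat())
```

From UTC, subtract or add your own offset to get local time.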

We'll see though. Looking forward to it, it's better in every single way. Double the CPU, 4x the RAM, double the disk space. When that's done, I'm going to rebuild the current core DB server and maybe split the sites over physical servers which will massively increase performance across the board. Should be good!
Posted by: pulse [x] - (Moderator)
Date: July 21, 2012 02:57PM
Welcome to the new server!
Posted by: BlahX3 [x] - (96.39.185.---)
Date: July 21, 2012 03:49PM
Yes sir. Looking good so far and was only down a little while. When was the last time you heard "Good Job!"?
Posted by: pulse [x] - (Moderator)
Date: July 21, 2012 03:54PM
It doesn't happen very often. :)

Load is starting to come in again, and I'm redirecting connections from the old server to the new one so it should be back for everybody now.

I'm actually a bit shocked at the CPU usage on the new one though.. 334% utilised right now (there are 8 cores, so it's still less than half). There's loads of RAM left and a lot of CPU still free, so things are pretty good.
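The "less than half" figure comes from dividing the summed per-core percentage by the capacity of all cores. A small Python sketch of that arithmetic, using the 334% and 8 cores quoted above (the function name is hypothetical):

```python
def utilisation_fraction(summed_percent, cores):
    """Convert a summed per-core CPU percentage into a fraction of total capacity.

    With 8 cores, full load reads as 800%, so 334% is 334 / 800 of the box.
    """
    return summed_percent / (cores * 100.0)

frac = utilisation_fraction(334, 8)
print("Fraction of total CPU in use:", round(frac, 4))
print("Less than half the box:", frac < 0.5)
```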

Please let me know if you find anything weird, on any of the sites
Posted by: pulse [x] - (Moderator)
Date: July 21, 2012 03:55PM
Actually it's settled down now, might just be as the database finishes paging out the transactions from the restoration.

Whatever, anyway all good I think :)
Posted by: fossil_digger [x] - (64.251.18.---)
Date: July 21, 2012 06:16PM
the forum loads faster and switches faster, but the top list loads about the same speed
Posted by: woberto [x] - (144.136.97.---)
Date: July 22, 2012 12:21AM
DK could fix that but writing code is such a drag, man.
Posted by: pulse [x] - (Moderator)
Date: July 22, 2012 12:50AM
Yeah DK's a bitch. The top list is such a query hungry monster. It's even worse on the porn side because there's so many more records for it to count.

I did some mysql reconfig last night after putting the site live and it went from 350% CPU use to cruising around 20% now. So we're definitely not CPU bound!

There's still about 30GB of RAM free too.

Pretty happy with how things are going on the new box. I'm going to rebuild the old core as mentioned earlier and maybe look at hosting just porn on the new one, and plus on the old one, or something along those lines - which will make things even faster.
Posted by: BlahX3 [x] - (96.39.185.---)
Date: July 22, 2012 11:49AM
It seems fine from here Pulse. I don't do the other sites much so can't comment on those.
Posted by: pulse [x] - (Moderator)
Date: July 22, 2012 01:38PM
I'm running the weekly backup now, which typically brings the sites to their knees.

Right now the backup is using 100% of 1 CPU core but there's plenty more where that came from, and things seem pretty responsive. Could probably speed up the backup by letting it run on multiple cores but I don't care how long it takes, so long as it completes without killing the sites :)
Posted by: DarkKlown [x] - (Moderator)
Date: July 22, 2012 01:40PM
The backup script isn't multi threaded ;) that's what the --single-transaction will do
Posted by: pulse [x] - (Moderator)
Date: July 22, 2012 01:52PM
It uses mk-parallel-dump with --threads 1. That's what I meant, I could change it to 2, 3, 4... but it's fine like this. Happy for it to take its time, anyway. As I said, so long as it finishes, don't give a toss how long it takes, especially if it doesn't bring the server to its knees while doing so :)
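As a rough illustration of what a thread-count knob like that trades off, here is a Python sketch of a dump loop with a configurable number of workers. This is not how mk-parallel-dump is implemented; `dump_table`, `parallel_dump` and the table names are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def dump_table(table):
    """Hypothetical stand-in for dumping one table; returns a fake file name."""
    return f"{table}.sql"

def parallel_dump(tables, threads=1):
    """Dump each table using at most `threads` workers at a time.

    threads=1 processes tables one after another (slow but gentle on the box);
    higher values finish sooner at the cost of more concurrent load.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in input order regardless of worker count.
        return list(pool.map(dump_table, tables))

tables = ["posts", "users", "images"]  # hypothetical table names
print(parallel_dump(tables, threads=1))
print(parallel_dump(tables, threads=4))
```

The output is the same either way; only the amount of work in flight at once changes.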

Funny though, the backup script is using twice as much RAM as the database. Might have to look at some further mysql settings which take advantage of the extra RAM, but this box is flying through things compared to the old one.
Posted by: BlahX3 [x] - (96.39.185.---)
Date: July 25, 2012 03:26AM
Very cool. Good job guys. Being an old server guy I know how much work and how under-appreciated an upgrade such as this really is. thumbs down
Posted by: pulse [x] - (Moderator)
Date: August 08, 2012 12:51PM
Mmm! So the new DB server crashed before. Not sure what happened, but I took the outage as an opportunity to do a little software update and housekeeping.

I'll have to check it out some more.
Posted by: BlahX3 [x] - (96.39.185.---)
Date: August 09, 2012 03:36AM
I noticed some downtime. Oh well. Chalk it up to burn in time or sun spots. It's not borked now :)
Posted by: pulse [x] - (Moderator)
Date: April 29, 2013 03:10AM
And again. It's actually stopped talking to the network 3 times in the last month, requiring a reboot each time.

I'm currently doing some more software updates (as I type) and I'll reboot it one more time to see if it helps. I'm not sure what the issue is though.
Posted by: BlahX3 [x] - (96.39.185.---)
Date: April 29, 2013 01:41PM
I hate those undocumented features. Hopefully one of the updates will exorcize it.
Posted by: pulse [x] - (Moderator)
Date: April 30, 2013 08:53AM
Yeah. We run Ubuntu server for it. I did two release upgrades and we're bang up to date now so fingers crossed. It's pissing me off. The server isn't going down, I can still get to it via its alternate connection via the image servers, so why this is happening I have nfi.
Posted by: BlahX3 [x] - (96.39.185.---)
Date: May 02, 2013 05:32AM
How many NICs and in what configuration?
Posted by: pulse [x] - (Moderator)
Date: May 03, 2013 04:34AM
Each server has 4 NICs but only 2 in use (+2 spares). They're Sun X4100 M2s with dual quad core CPUs and 32GB RAM.

Each has a connection to the public internet via switch and a private non-routable interconnect at the back to each other, so all server-server communication (like transferring images, or database queries) doesn't happen over public channels.
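A quick way to sanity-check which side of that split an address belongs to is Python's standard `ipaddress` module. This is only a sketch: the addresses below are made-up examples, not the servers' real ones, and the function name is hypothetical.

```python
import ipaddress

def on_private_interconnect(addr):
    """Return True if addr is in a private (non-routable) range."""
    return ipaddress.ip_address(addr).is_private

# Hypothetical example addresses, not the real server addresses.
backend_addr = "10.0.0.2"  # e.g. a back-end interconnect address
public_addr = "8.8.8.8"    # a well-known publicly routable address

print(backend_addr, "private:", on_private_interconnect(backend_addr))
print(public_addr, "private:", on_private_interconnect(public_addr))
```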

I'd like to put in more servers, maybe a couple on the west coast, but it's all too expensive to do much in the way of expansion. These sites haven't only run at a loss, they've had $0 earnings for about 2 years. Even the other site which used to at least make half the hosting costs makes nothing at all these days.

Maybe I should look at refreshing the advertising over there.. ah well. One day :)
Posted by: BlahX3 [x] - (96.39.185.---)
Date: May 03, 2013 05:46AM
So the problem is with the public NIC.
Posted by: pulse [x] - (Moderator)
Date: May 03, 2013 06:30AM
Yeah, which is slightly frustrating. Though that said, if the interconnect NIC wasn't working then new images wouldn't display properly either.. but at least the site would be up.

It's pretty annoying. Hopefully all the upgrades have fixed it, they broke a bunch of other shit but that's all fixed now ...
Posted by: BlahX3 [x] - (96.39.185.---)
Date: May 03, 2013 02:41PM
That's always the way it goes. People would ask me why I didn't upgrade the servers to whatever yet and I'd say because I don't feel like wasting a weekend fixing all the shit the upgrades will break.
Posted by: BlahX3 [x] - (96.39.185.---)
Date: May 04, 2013 04:28PM
Looks like some recent pics have vanished from the gallery. Another malfunction?
Posted by: pulse [x] - (Moderator)
Date: May 10, 2013 09:36AM
Not sure, I haven't noticed anything. I'll have a look in a bit, but it's not impossible.

The network issue happened for ~5 hours again on the 8th, and I noticed it just spitting errors today. I've changed the NIC on the server now - hopefully that'll fix it. The server we have has 2 x Intel NICs and 2 x nVidia controlled ones.. so we've changed switch ports, and changed from an Intel to an nVidia NIC..

So fingers crossed NOW it's fixed, whatever it was, whatever the incompatibility. If it's not then I'm out of ideas.
Posted by: quasi [x] - (184.240.107.---)
Date: May 10, 2013 12:47PM
