Annnnddd... we're back...
Posted: Mon Dec 07, 2009 4:06 pm
Christ... well that was a balls up and a half...
Sorry for so much downtime guys, what started out as a minor issue snowballed, about this time last week, and by Wednesday had become four or five separate problems with the server / software. Not all of them were immediately apparent either; I had to do some digging and hit quite a few red herrings, which meant the real problems were disguised and I was off troubleshooting non-existent issues.
It initially started out as a driver problem, crashing the machine about once a day, so I thoroughly updated them all, flashed a new BIOS onto the motherboard etc. Then it developed memory issues (or so I thought) so I replaced the RAM. It then exhibited strange behaviour and disk activity, would reboot for no reason, with differing error messages each time, leading to a few WTF moments. At this point it was still serving out the site pages, though it was down more often than up. I finally traced the cause of this to a very nasty viral infection, a 'rootkit', which had infected the OS at a low level using a 'code injection' method on some Windows system files, and was causing the instability. Luckily, having identified and removed it, I'm satisfied that no site data was compromised, as this thing was more concerned with peddling spam and desktop shite than anything else, thankfully. How it got onto the machine in the first place remains a mystery, because it isn't in contact with anything on my internal network, or directly with the web, other than serving out the forum pages. The possibility remains that it was some kind of vulnerability in the webserver which let it in; in light of that, the OS and webserver have been updated and patched to the latest versions available.
After that was sorted, the machine then began experiencing intermittent network disconnections, which I initially thought was a dodgy LAN card, so that, and the cable, were changed. That was dandy for a whole half day or so, then it started again, this time looking like software problems on the machine. After I wasted a day chasing down that one, I discovered that it was actually dodgy firmware on the router, which was failing to route packets in from the internet to the webserver. Just as I resolved that, and thought we were all good to go again (it was about Friday / Saturday night by this point), the motherboard in the server died without warning. It was running, I shut it down to change a DVD drive, and it never booted again. I subsequently discovered that a few capacitors on the motherboard had swollen and one or two had actually burst and were leaking fluid. Presumably this had been happening for some time and caused its inevitable demise.
At this point I was losing the plot, so I just ripped out what salvageable components were left in the machine, stuffed them into a new motherboard and case, and installed a fresh OS, copying over the site data.
That takes us up to this morning, when the site was working briefly then suddenly stopped. After spending today troubleshooting weird PHP permission issues, it's finally working properly!
Lesson learned from this: in addition to the regular backups of the database and the site itself, I'm planning to keep a redundant machine that can be booted up and kicked into action should this ever happen again in future... hopefully not!
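For anyone curious, the regular backup amounts to little more than a database dump plus a tarball of the site files, roughly along these lines (the paths and database name here are placeholders, not the actual setup):

```shell
#!/bin/sh
# Rough sketch of a nightly forum backup. SITE_DIR / BACKUP_DIR and
# the database name "forum_db" are placeholders, not the real config.
backup_site() {
    site_dir=$1
    backup_dir=$2
    stamp=$(date +%Y%m%d)
    mkdir -p "$backup_dir"

    # 1. Dump the forum database, if mysqldump is available on this box.
    if command -v mysqldump >/dev/null 2>&1; then
        mysqldump --single-transaction forum_db \
            > "$backup_dir/db-$stamp.sql" 2>/dev/null || true
    fi

    # 2. Archive the site files themselves (-C keeps paths relative).
    tar -czf "$backup_dir/site-$stamp.tar.gz" \
        -C "$(dirname "$site_dir")" "$(basename "$site_dir")"

    # Print the archive path so a wrapper script can pick it up.
    echo "$backup_dir/site-$stamp.tar.gz"
}
```

Stick a call to that in cron and the spare machine only ever needs the latest pair of files restored onto it.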
We also now have a very kind offer of temporary hosting from Jim on FCF, so there's that to fall back on, too, should the very worst ever happen.
Sorry for the extended messing about, it really was an utter, utter bollox!
Ciarán