
This forum server's data has been ROLLED BACK 18.5 hours :(



What a nightmare!  :furious::crying:  Several hours ago I noticed that the forum was down, so I submitted an urgent support ticket to Invision Power Services.  They are the company that makes this forum software and also hosts it for us.  They replied promptly (within minutes) that they were looking into it.  Unfortunately, after about four hours they replied that the server had to be rolled back 18.5 hours to their latest backup snapshot to restore the forum.  In other words, and highly unfortunately, everything that was posted in the last 18.5 hours (as of an hour ago) had been lost.

The perplexing thing is that I have made no configuration changes to the server nor had any FTP activity with the server at all leading up to that.  I have no idea what caused this and I will be questioning them as to exactly what happened and how to prevent it from happening again in the future.  The CEO of Invision Power Services, Lindy Throgmartin, made an apology to me for this mishap and likewise I would like to apologize to all our members for this highly unfortunate incident.

UPDATE:

Just an update from an Invision staff member prior to my closing the support ticket:

Yes, we're reasonably certain the root cause was identified and corrected. We aren't able to provide detailed specifics, but it was an issue with the SQL engine on the server.

Link to comment

I had posted this week's winning lottery numbers for you all, and an XXXX rated Supermodels on EUCs video for @Hunka Hunka Burning Love...

Such a shame I didn't keep copies of either of them.  :huh: 

Link to comment

Sounds like a pretty serious server farm harddrive failure, if they have to revert to back-ups... either they aren't using hot-swappable RAID-stacks or something went boom big time. Probably not the only forum hit by this?

Link to comment

 

28 minutes ago, esaj said:

Sounds like a pretty serious server farm harddrive failure, if they have to revert to back-ups... either they aren't using hot-swappable RAID-stacks or something went boom big time. Probably not the only forum hit by this?


Actually they said:
"We're very sorry for the issues. I'm afraid part of our infrastructure at Amazon Web Services has experienced technical difficulties, which we've been working diligently to resolve."

The interesting thing is that I remembered that this forum was hosted at Dreamhost before and now after doing an nslookup followed by a whois (by ip) I found that the server is now indeed on Amazon servers.  Perhaps they stealth migrated us somewhere along the line or perhaps even just several hours ago which caused this whole thing to happen.  Do you remember when the forum was on dreamhost and we were supposedly sharing the same server with another forum?  Do you (or anyone) remember the name of that other forum?
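For anyone who wants to repeat that kind of check, here is a rough Python sketch of the same idea. It is not the exact nslookup and whois procedure described above: it swaps the whois step for a reverse-DNS lookup and a hostname suffix test, which is only a heuristic (not every Amazon-hosted IP has a reverse name ending in `amazonaws.com`). The function names and the sample hostnames are made up for illustration.

```python
import socket
from typing import Optional


def resolve_ip(hostname: str) -> str:
    """Forward DNS lookup: roughly the address nslookup would report."""
    return socket.gethostbyname(hostname)


def reverse_name(ip: str) -> Optional[str]:
    """Reverse DNS (PTR) lookup; returns None when no name is published."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None


def looks_like_aws(ptr_name: Optional[str]) -> bool:
    """Heuristic only: EC2 reverse names usually end in .amazonaws.com."""
    if ptr_name is None:
        return False
    return ptr_name.rstrip(".").lower().endswith(".amazonaws.com")


def check_host(hostname: str) -> bool:
    """Chain the steps above; needs network access when called."""
    return looks_like_aws(reverse_name(resolve_ip(hostname)))
```

Calling `check_host` with the forum's hostname would give a quick yes or no, but a whois query on the IP remains the more reliable way to see who owns the address block.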

Here's some more info from their last reply to me about an hour ago (after this forum was restored):

Hello, we are still working on this issue actually, once we have more information we would be happy to share this, rest assured that we do anything and everything we can however to protect our communities and ensure this type of thing doesn't happen, however nothing is 100% and why we do take backups as well. I'm very sorry for this issue, I can say this is one of only two times in about 10 years we have ever had to restore a few customers from a backup for an issue such as this. 

The key phrase is "a few customers".  So although this is their "second time", I'm thinking maybe you're right that it hit more than one forum.

Anyways, at this point in time I'm actually afraid that whatever caused it will happen again.  So I asked them if they can increase the frequency of their backup snapshots until they figure out the cause.  They have not replied to this yet.

Link to comment

3 hours ago, John Eucist said:


Actually they said:
"We're very sorry for the issues. I'm afraid part of our infrastructure at Amazon Web Services has experienced technical difficulties, which we've been working diligently to resolve."

So glad it is back up and running! Thanks for all your help! Did you have a few drinks after getting the forum back on line? :cheers: 

Amazon Web Services (AWS) is a big deal! If you’ve tuned into this year’s World Series between the Cubs and Indians you may have noticed some branding for AWS, Amazon’s cloud computing arm. That’s because AWS powers the back-end of Statcast, a high-tech player tracking system used by MLB that measures every single play during a baseball game with radar equipment and HD optical cameras at each stadium. It produces new stats like pitching velocity, launch angles of home runs, acceleration of base runners and more!

Sounds like the EUC Forum's platform is operated by a very reputable company and we have an awesome moderator to keep us updated and straighten out the mess :thumbup:

 

Link to comment

5 hours ago, John Eucist said:

What a nightmare!  :furious::crying:  Several hours ago I noticed that the forum was down, so I submitted an urgent support ticket to Invision Power Services.  They are the company that makes this forum software and also hosts it for us.  They replied promptly (within minutes) that they were looking into it.  Unfortunately, after about four hours they replied that the server had to be rolled back 18.5 hours to their latest backup snapshot to restore the forum.  In other words, and highly unfortunately, everything that was posted in the last 18.5 hours (as of an hour ago) had been lost.

The perplexing thing is that I have made no configuration changes to the server nor had any FTP activity with the server at all leading up to that.  I have no idea what caused this and I will be questioning them as to exactly what happened and how to prevent it from happening again in the future.  The CEO of Invision Power Services, Lindy Throgmartin, made an apology to me for this mishap and likewise I would like to apologize to all our members for this highly unfortunate incident.

John,

Thank you for all you have done for us; we appreciate what explanation you do have.

It is funny how much we have all come to depend upon this wonderful gathering of like-minded folks, and how isolated and sad most of us felt when the forum was down.  Happy that we are back up and running again... And it is funny that you mention whether or not we will care about the loss of 18.5 hours... because after the election in the United States today I'm sure we will want to roll the clocks back a lot further than that, no matter who wins :w00t2:

Link to comment

14 hours ago, John Eucist said:

The interesting thing is that I remembered that this forum was hosted at Dreamhost before and now after doing an nslookup followed by a whois (by ip) I found that the server is now indeed on Amazon servers.  Perhaps they stealth migrated us somewhere along the line or perhaps even just several hours ago which caused this whole thing to happen.  Do you remember when the forum was on dreamhost and we were supposedly sharing the same server with another forum?  Do you (or anyone) remember the name of that other forum?
 

I think almost everyone uses AWS anymore, and they are generally very reliable, plus fast (since they own a lot of the pipeline in the US).  But the big ISPs are being hit with unprecedented levels of DDoS attacks, and without knowing the exact cause, it would be hard to say whose fault the outage is.   AWS should be very redundant, but I guess no one knows... I'm surprised that they couldn't find the missing 18.5 hours of EUC Forum transactions somewhere in their system.

Link to comment

On 8.11.2016 at 8:19 AM, John Eucist said:

Actually they said:
"We're very sorry for the issues. I'm afraid part of our infrastructure at Amazon Web Services has experienced technical difficulties, which we've been working diligently to resolve."

Usually backups or snapshots are made in at least two copies on at least two physically separate media, if you are a pro. Maybe they had a malicious intruder with a privilege escalation who messed up a few filesystems before they pulled the power plug. I would not want to disclose such detail to the public since it might hurt your reputation.

Anyway, whatever happened, thanks to all who helped to get it back online :)!

Now we're left with this other, U.S. home made world wide problem ...

Link to comment

Just an update from an Invision staff member prior to my closing the support ticket:

Yes, we're reasonably certain the root cause was identified and corrected. We aren't able to provide detailed specifics, but it was an issue with the SQL engine on the server.

Link to comment

On 10.11.2016 at 5:57 PM, John Eucist said:

Just an update from an Invision staff member prior to my closing the support ticket:

Yes, we're reasonably certain the root cause was identified and corrected. We aren't able to provide detailed specifics, but it was an issue with the SQL engine on the server.

Broken table data can be a real pain in the ass to fix, and usually the "normal" backup occurs at filesystem-level, which doesn't help (large table-files, backups of different files are from slightly different points in time and don't "act nice" with each other). Surprised that they don't run a real-time readonly-replica for backup-purposes, or if they do, the data got mangled and the errors got replicated on the backup too... hope I won't have to deal with such in the future. Or have switched jobs before it happens :P
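To illustrate the difference between copying table files and letting the database engine produce the backup, here is a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in, since nothing in this thread says which SQL engine the forum actually runs on; the `posts` table and the `snapshot` helper are invented for the example. The point is only that an engine-level copy captures one consistent point in time, while the live database keeps accepting writes.

```python
import sqlite3


def snapshot(live: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the live database into a fresh in-memory database via the engine."""
    copy = sqlite3.connect(":memory:")
    live.backup(copy)  # engine-level copy, not a raw file copy
    return copy


# Hypothetical live database with a single table of posts.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
live.execute("INSERT INTO posts (body) VALUES ('before snapshot')")
live.commit()

backup = snapshot(live)

# Writes after the snapshot land only in the live database.
live.execute("INSERT INTO posts (body) VALUES ('after snapshot')")
live.commit()

live_count = live.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
backup_count = backup.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(live_count, backup_count)  # 2 1
```

The row that exists only in the live copy is the small-scale version of the 18.5 hours lost here: everything written after the last good snapshot is gone once you restore from it.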

 

On 8.11.2016 at 11:55 PM, Chris Westland said:

I think almost everyone uses AWS anymore, and they are generally very reliable, plus fast (since they own a lot of the pipeline in the US).  But the big ISPs are being hit with unprecedented levels of DDoS attacks, and without knowing the exact cause, it would be hard to say whose fault the outage is.   AWS should be very redundant, but I guess no one knows... I'm surprised that they couldn't find the missing 18.5 hours of EUC Forum transactions somewhere in their system.

Yes, very redundant...

Jun 6, 2016 - The torrential rains and winds that hit New South Wales, also took out Amazon Web Services to the cost of many cloud-based businesses.

Sep 23, 2015 - This morning's #AWS outage reminds us that we all have a single point of failure

Sep 21, 2015 - Amazon Web Service's cloud experienced a 6-hour service disruption early Sunday morning.

Aug 10, 2015 - Amazon Web Services, the world's largest public cloud provider, suffered a rare outage in Monday's early-morning hour

Aug 26, 2013 - Amazon blamed the outage on glitches with a single networking device—what it called a “'grey' partial failure” that resulted in data loss

On December 24, 2012, AWS suffered another outage, causing websites such as Netflix instant video to be unavailable for customers

Jun 29, 2012 - An outage of Amazon's Elastic Compute Cloud in North Virginia has taken down Netflix, Pinterest, Instagram, and other services.

Apr 25, 2011 - Last Thursday's Amazon EC2 outage was the worst in cloud computing's history.

;)

Granted, it still is probably one of the most (if not the most) reliable cloud-service.

Link to comment

Archived

This topic is now archived and is closed to further replies.
