Adventures in systems administration

I would like to post slightly more frequently than this, but was interrupted by some major issues on some systems I work on.

I’m going to detail those problems now, since I couldn’t find much of anything on the Internet and finally solved the issue by cobbling together what bits of information I could find with a lot of trial and error.

Hopefully this will be useful to someone else in the future.

Also, this post is highly apropos today

How it all began: 

There was a login problem with a web application running on our Windows Server 2003 machines.  After attempting a few fixes to no avail, a reboot was suggested.

I figured it couldn’t hurt.

I figured wrong.

Upon reboot, a large number of services failed to start properly, and the system was mostly useless.  I tried changing a bunch of configuration options to restore services, then rebooted again.

Now the SAM (Security Account Manager) database was corrupted and login was impossible.  Fun.

So, I restored a backup and got back to a state where all the services were jacked up, but at least I could log in.

The event log was loaded with errors, but the most important one for our purposes was Event ID 333 (Microsoft should have just multiplied this by 2 and made it Event ID 666, since it indicates that your life is going to be hell.)  Here’s the message:

“An I/O operation initiated by the Registry failed unrecoverably. The Registry could not read in or write out or flush one of the files that contain the system’s image of the Registry.”

The system log was filled with this error.  Basically, nothing could be written to the Registry at all.

We. Were. Screwed.

After a lot of research, I settled on the idea that the pagefile (the on-disk backing store for virtual memory) was corrupted.  As I understand it, Windows keeps the Registry in paged memory, which can be written out to the pagefile, and if your pagefile goes bad... well, very bad things happen.

How did I get out of this mess?

For a while, I felt like Bill Murray in Groundhog Day.  No matter what I changed, as soon as I rebooted the changes were forgotten, and I was right back where I started.  I tried to follow recommendations to wipe out and recreate the pagefile, but the file never went away. (Side note: the pagefile is stored at the root of the drive as pagefile.sys.  However, it is a hidden system file, and the only method I found for seeing that it exists and how big it is was to open a command prompt and enter “dir /a”.)
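For anyone who would rather script the “dir /a” check, here is a minimal Python sketch. It assumes a Python interpreter is available on the machine (it was not part of the original fix), and the function name find_pagefiles is made up for illustration. It reads sizes from the directory listing rather than opening the file, since the active pagefile is locked by Windows.

```python
import os


def find_pagefiles(root):
    """Return {name: size_in_bytes} for pagefile.sys entries directly under root.

    Hypothetical helper: sizes come from the directory listing itself,
    so the (locked) pagefile never has to be opened.
    """
    found = {}
    with os.scandir(root) as entries:
        for entry in entries:
            is_plain_file = entry.is_file(follow_symlinks=False)
            if is_plain_file and entry.name.lower() == "pagefile.sys":
                found[entry.name] = entry.stat(follow_symlinks=False).st_size
    return found


if __name__ == "__main__":
    # Assumption: the pagefile lives at the root of C:, as in the post.
    root = "C:\\" if os.name == "nt" else "."
    for name, size in find_pagefiles(root).items():
        print(f"{name}: {size / 2**30:.2f} GiB")
```

Running this before and after a reboot is one way to see whether the file was really recreated or is still sitting there at its old size.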

I finally had my eureka moment while messing around with the pagefile settings.   I couldn’t remove a currently used pagefile, because that command has to occur at reboot, and the command couldn’t be saved to the Registry.  But there was something I could do to immediately impact the pagefile.

I made it bigger.

The corrupted file was pegged at 8GB.  When I set a larger custom size, 10GB, Windows immediately allocated uncorrupted disk space to the pagefile, and that space was retained after rebooting.  That gave me enough functionality to set up a separate drive dedicated to a new pagefile, and then I could completely remove the corrupted pagefile.
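The resize above was done through the virtual memory settings dialog. As far as I know, Windows records that setting in the PagingFiles value (a multi-string) under the Memory Management key, with each entry written as a path followed by the initial and maximum sizes in megabytes. The sketch below only formats such an entry so the arithmetic is visible; paging_files_entry is a hypothetical helper, and using 10GB for both the initial and maximum size is an assumption, since the post only says "10GB".

```python
MB_PER_GB = 1024


def paging_files_entry(path, initial_mb, maximum_mb):
    """Format one PagingFiles entry: '<path> <initial MB> <maximum MB>'."""
    if initial_mb <= 0 or maximum_mb < initial_mb:
        raise ValueError("sizes must be positive and maximum >= initial")
    return f"{path} {initial_mb} {maximum_mb}"


# Old size from the post (8GB) versus the enlarged size (10GB).
old_entry = paging_files_entry(r"C:\pagefile.sys", 8 * MB_PER_GB, 8 * MB_PER_GB)
new_entry = paging_files_entry(r"C:\pagefile.sys", 10 * MB_PER_GB, 10 * MB_PER_GB)
print(old_entry)
print(new_entry)
```

This does not change anything on the system; it just shows what the before and after settings would look like in that format.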

In addition, I set the following entry to ensure Windows would allocate enough space for the Registry:

HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management

set PagedPoolSize = 0xffffffff (REG_DWORD)
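If you want to apply this setting to more than one server, one option is to generate a .reg file and import it on each machine. The following is a minimal sketch, not what was done during the incident: reg_file_for_dword is a hypothetical helper, and the key path is written with backslashes and the space in "Memory Management", which is how the key appears in the Registry editor. Back up the Registry before importing anything.

```python
# Full name of the key that holds the memory manager settings.
MEMORY_MANAGEMENT_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
    r"\Control\Session Manager\Memory Management"
)


def reg_file_for_dword(key, name, value):
    """Build the text of a .reg file that sets one REG_DWORD value."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        f"[{key}]\r\n"
        f'"{name}"=dword:{value:08x}\r\n'
    )


if __name__ == "__main__":
    text = reg_file_for_dword(MEMORY_MANAGEMENT_KEY, "PagedPoolSize", 0xFFFFFFFF)
    # Write the file next to the script; importing it is a separate, manual step.
    with open("pagedpoolsize.reg", "w", newline="") as handle:
        handle.write(text)
    print(text)
```

The script only writes the file. Double-clicking pagedpoolsize.reg (or importing it from the Registry editor) is what actually changes the setting, and a reboot is needed for it to take effect.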

I got my systems back online and running better than they were prior to this incident.  Hooray!
