Jeremy's Blog

Where is a checked out file saved in SharePoint before it is checked back in?

Consider the following:

‘A-Document.doc’ already exists in SharePoint.
Sally checks out ‘A-Document.doc’, makes changes, saves the file, and does not check the file back in.
Scott opens the file and sees it as it was prior to Sally’s edits.

So, I know what you are thinking… “SharePoint is working as designed.”

Great, but do you know the answer to the question? Where are Sally’s edits saved?

Perhaps you said, “She’s probably using local drafts and her edits are likely saved in ‘My Documents’ in a folder called ‘SharePoint Drafts’”.

Nope. Wrong answer. (although a copy of the document before the edits was likely placed in that location when she checked the file out)

You see, Sally can log on to Scott’s computer (with her user id), open the file (remember, it’s still checked out to her) and see her edits. Clearly, the updated file is stored in SharePoint and visible only to her.

Site collection administrators and even farm administrators cannot access Sally’s edits (still checked out to her).

I got it! If we want to access Sally’s edits (and she has not checked in the file) we can go to "Manage checked out files" in the library settings. Nice try… That only works on a file that has never been checked in (More info here: http://bit.ly/kKxU8l). Sweet.

So, knowing a little something about SharePoint, I’d be willing to bet this file is in the content database.

Diving into the details:

Disclaimer: Never work directly with the content database (as I do in this post). Even though I access it read only, this can still cause unexpected locking. Furthermore, I did this testing in a lab environment with no active users and a content database that will never need to “be supported” by Microsoft.

Ok, looking at a content database, I see 3 tables of interest:

AllDocs, AllDocVersions & AllDocStreams

In my test environment, I created a new content database and then created 1 site collection in that content database (using blank template). Next, I created 1 test library (added no content to the TEST library yet)

I started running queries against the content database.

“SELECT * FROM dbo.AllDocs” returns 108 records.

Why so many records with one blank site in this database? Well, the AllItems.aspx in the library I just created counts as one record. There are many artifacts needed to support a blank site.

“SELECT * FROM dbo.AllDocVersions” returns 0 records.

“SELECT * FROM dbo.AllDocStreams” returns 0 records.

OK, on with my testing…

I log on as User A and create a document called ‘A-Document.doc’ in my test library. I save the file and close it (the file is not checked out)

“SELECT * FROM dbo.AllDocs” now returns 109 records.

Think we have found the document…? Think again. We have actually found information about the document, but not the document.

Let’s look at the other tables before we jump to conclusions.

“SELECT * FROM dbo.AllDocVersions” still returns 0 records.

“SELECT * FROM dbo.AllDocStreams” returns 1 record. (Nothing as ‘obvious’ as above… a few id related columns with what appears to be guid’s, etc…)

I examined the column types in the AllDocStreams table, noticed that there is a column named ‘Content’ and the data type is ‘image’. (in my query, this looked like a string of random characters)

Well, this is exactly the type of column where BLOB data (documents) would get stored. Armed with additional information, I did some more targeted ~~googling~~, I mean binging. I was able to confirm that this table stores the actual file contents.

The field highlighted above contains the document.

We will disregard the AllDocVersions table throughout the rest of this test as it is not relevant in this scenario.

I’ll share the results of one more query to “tie this together” before we proceed. We’ll call this “Big Ugly Query”, I’ll use it more later.

SELECT dbo.AllDocs.LeafName, dbo.AllDocStreams.Content FROM dbo.AllDocs INNER JOIN dbo.AllDocStreams ON dbo.AllDocs.ID = dbo.AllDocStreams.ID AND dbo.AllDocs.Level = dbo.AllDocStreams.Level where LeafName=’A-Document.doc’

The above query returns one row.

So, to be clear, in the above results ’LeafName’ and ‘Content’ are coming from different tables. I am just presenting the results like this for clarity. Also, it’s worth noting that the AllDocs table contains information about the document while the AllDocStreams table contains the actual document.

Now, let’s test the scenario described at the beginning of this post.

I logon to another machine as User B. As that user, I checkout the document, make some edits, save the document and do not check it back in. As we would expect, if User A opens the document they do not see User B’s edits. They see just the original document contents.

We now have two rows in the AllDocs table for this document. The 2nd row is the one associated with User B’s changes (but is not the actual document, remember?)

Ok, remember “Big Ugly Query”? Now, it returns two rows. I have highlighted the document containing User B’s edits.

Ok, I answered the question. It’s probably safe to say that we know exactly where that copy of the document is stored (before it is checked in).

I, however, require even more validation. So, I decided to learn how to use c# to read blob data from SQL and verify that I can gain access to User B’s changes (even though the document is checked out and I am not User B).

Proof of concept

How to get User B’s edits without logging in as User B.

Disclaimer: Never blindly use code you “found on the Internet”. This is especially true if the code accesses SharePoint databases. This would qualify as a CLM (career limiting maneuver). Even though I understand this code, I would not use it in my live SharePoint environments (remember, I performed this work in a lab setup that users do not access and that does not need to be supported by Microsoft)

Also, this code would barely qualify as alpha code. This could be made much more dynamic and robust. I just wanted to quickly validate the findings above.

Essentially, the code below performs a SQL query. For each row returned, it reads the ‘Content’ field (the file) into a byte array and dumps that out to a file on the local file system. Simple? Yes. Supported? No.

This compiles fine in VS2008.

using System;
using System.IO;
using System.Data;
using System.Data.SqlClient;

namespace tmpReadBlob
{
    class Program
    {
        static void Main(string[] args)
        {

            string fileName = "A-Document.doc";           
            String myQuery = "SELECT dbo.AllDocStreams.Content FROM dbo.AllDocs INNER JOIN dbo.AllDocStreams ON dbo.AllDocs.ID = dbo.AllDocStreams.ID AND dbo.AllDocs.Level = dbo.AllDocStreams.Level where LeafName='" + fileName + "'";
            SqlConnection con = new SqlConnection("Server=SQLSERVER;Integrated security=true;database=CONTENTDB");            
            SqlDataAdapter da = new SqlDataAdapter(myQuery, con);
            SqlCommandBuilder MyCB = new SqlCommandBuilder(da);
            DataSet ds = new DataSet("AllDocStreams");
            byte[] MyData = new byte[0];
            da.Fill(ds, "AllDocStreams");
            int i = 1;
            FileStream fs;
            int ArraySize = new int();
            foreach (DataRow myRow2 in ds.Tables["AllDocStreams"].Rows)
            {
                MyData = (byte[])myRow2["Content"];
                ArraySize = MyData.GetUpperBound(0);
                fs = new FileStream(@"c:\test" + i + ".doc", FileMode.OpenOrCreate, FileAccess.Write);
                fs.Write(MyData, 0, ArraySize);
                fs.Close();
                i++;
            }
            Console.WriteLine("done.");
            Console.ReadKey();
        }
    }
}

The results are shown below:

Note: I opened both files from the file system. ‘test1.doc’ contained the original file and ‘test2.doc’ contained User B’s edits.

Q: Is there a supported way to get to get User B’s edits (without logging in as User B)?

A: I *think* this would be supported. Get a full backup of the content database and attach to a test server. Use a process similar to what I have described above to extract the file(s) from that test server. Browse to your SharePoint site using Internet Explorer and upload the files you extracted. (discard the copy you made of the content database on the test server). This seems like a supported approach to me, as the only time you interact with production is when you upload files from your local file system using the web browser (supported operation).

Final thoughts:

Really? Even as a SharePoint 2007 farm administrator I can’t access data stored in SharePoint because it is “checked out”? Also, "Manage checked out files" is really this weak?

I believe that "Manage checked out files" should let owners and/or administrators access User B’s saved content that is not checked in. (Take ownership?)

Now, to be fair, I did this testing on SharePoint 2007. I decided to explore whether "Manage checked out files" was improved in SharePoint 2010.

Nope… It’s still a weak tool. It did, however, get renamed. It is now called “Manage files which have no checked in version” 😉 Wow, seems like allowing someone to access or take ownership of saved content that is not checked in would have required minimal effort.

Anyway, this is not really a huge issue and I can imagine one’s of people out there really desiring this functionality. It’s just odd when, as a systems administrator, you do not have the access or visibility into a system to clearly articulate where data is stored (in all situations) and how to access it if/when needed.

If you’ve made it through this post, you’re probably a lot like me. You like to clearly understand details of how the system works. Also, I added two new categories to my blog as of this this post: pathfinding & minutiae 😉

Thanks!

Jeremy

@jjclore

Well, it was a fun day when this error was reported to me. A bit of research would lead one to believe there are numerous potential causes. As it turns out, in this case the stack trace and ULS log information was not a good indicator of where the problem was.

As it relates to this *very generic* error, LOOK AT THE CONTEXT before you proceed! There are several blog posts where this error is related to custom SharePoint development, list configuration, etc and *IS* fixable in a supported fashion.

In the case I encountered, I was not so lucky.

Problem

Error was occurring in an out of the box site collection using no customization. (stock Teamsite collaboration sites)
Error only occurred when browsing 3 sub-webs deep in one site collection (other site collections in the same content database not impacted)
Examples:
1. http://sharepoint/sites/rootweb/sub1 works
2. http://sharepoint/sites/rootweb/sub1/sub2 fails

Approach

No luck googling, I mean BINGing
Opened Microsoft case (lucky to have patient users)
Was told that I had “Corruption” in the SharePoint content database and I needed to delete/recreate the site collection
I asked what this corruption was? What tables/rows impacted? Can you please be more specific? (It turns out that this is a really great way to get the engineer stuttering & and stammering… This “corruption” is the equivalent of an “I don’t know” answer)
I could see that pressing for a clear description of the corruption was going to be a long process
I had a DBA get a full copy of the affected database in a test environment and I opened an RCA case with Microsoft
Also, while this was going on, I rebuilt the customers site collection and restored their data (used a great 3rd party tool here… AvePoint)
An RCA case is a “Root Cause Analysis” case, and when you attempt to escalate your issue to an RCA it is likely that the support staff will try to talk you out of it. This is due to the low percentage of success rates with RCA cases (BTW: I do consider my RCA case a success, although with a slightly altered scope)

I was of the opinion that I really didn’t need Microsoft support to tell me to rebuild my site collection (I can handle that on my own, and had already read blog posts that inferred that I would need to to that). I did, however, need Microsoft to clearly articulate the problem for me. I asked Microsoft to define this “Content Database Corruption” and identify where that corruption occurred and whether or not it is fixable in a supported way. I mentioned that if we could do that, I did not need to burn hours on when/how this happened and would agree to close the case at that point.

Solution (Update Queries & Supportability)

Never, Never, Never modify your SharePoint databases… period. (unless, of course, you are directed to do so by Microsoft support with product team consent)

I got a “rock star” for my escalation engineer (serious, this guy was GOOD) and was able to get him a copy of the content database. Within a few days, the engineer informed me that the “corruption” was incorrect settings in the contentDB in the webs table in the Ancestry column. (BTW: This would have gone much quicker if escalated to a Sev A, which I did not need to do)

There is an unsupported fix (can be made supportable).

Technically, you can resolve the Ancestry column issue we experienced with an update query. The escalation engineer does not have the authority to sign off on that fix, but does have the authority to escalate to the product team. The product team has actually approved a few cases for customers to fix this issue with an update query and remain supported. This is a case by case basis and the escalation engineer cannot “sign off” because your case is identical to previously approved cases. Any request for ContentDB modification would need product team approval to remain supported. Also, running an update query against a copy of your content database in a test environment and then using stsadm backup/restore to move that site to production is not supported so don’t go there.

Recommendation: I believe, should you encounter this, you now have an opportunity to exercise your SharePoint restore processes (you do test these, right?). Delete/recreate your site collection and restore your site data. Assuming you have solid processes in place, this option should be faster than gaining the approvals you need to fix your content database.

If you find that you have this specific issue, and need to repair your database in production, open a support case with Microsoft. You need to use the information above to ask for the appropriate escalations. You need an escalation engineer to properly validate your issue and, if it applies to you, seek the product team approvals necessary to fix your issue and remain supported.

Additional Notes:

I had some dialogue with the escalation engineer about the fact that some cases had the update query approved. In Microsoft internal documentation, the engineer found that this ancestry tables issue was a known issue of some sort (I forget the acronym he used for the internal MS article). Anyway, the engineer was going to recommend that Microsoft credit back the premier hours we spent on the case as it was a product issue. This got me thinking (and I mentioned to the engineer) that the issue may very well be more pervasive than the few exceptions he found that made it to the product team. I wonder how many people had this issue and stopped at the “your content database is corrupt, rebuild your site collection” answer that the Tier1/Tier2 Microsoft support engineers would suggest (without ever clearly articulating the issue).

Final Thought:

If you run into this issue, please leave a comment. I am very interested how many others have encountered this. Also, I do know the update query but I will not post it or share it privately. If you need to talk update queries, stay off the blogs and talk to Microsoft.

Thanks!

Jeremy

Jeremy's Blog

Where are files saved in SharePoint while checked out?

Exception occurred. (Exception from HRESULT: 0×80020009 (DISP_E_EXCEPTION))

Jeremy’s Tweets

Categories

Blogroll