Sunday, March 22, 2009

Selecting GSA targets – Part Three: Quantification, Revision and Finalization

There is a great deal of work proceeding behind the scenes and several key project elements are converging as we move toward finalizing this public project. Among other things, we have been working through modest but practical solutions for better placement and targeting of our existing 300,000+ repository documents, while solidifying all the additional Google Search Appliance (GSA) targets in our enterprise search sights, described in Part Two. At the same time, we are in our own March (and April and May) Madness as we mount a rapid-fire round of trainings for each of our eight remote offices (spread out over 50,000 square miles of Northern California) on their new role in making Google Enterprise search work for them, which is to say for all of us. And as mentioned in earlier posts, we are working every bit as earnestly on our latest in-house build of the Pika CMS.

The infusion of new, future content into the simplified structural taxonomy we created is a separate challenge we will be posting about later. Dealing with our existing files is more immediate, more concrete. Grokking those files has been one of the more interesting, at times hilarious, parts of this project.

For those legal services programs interested in how the existing files from our eight local offices break out, here are the percentages for the seven most common “document” types:

  • 67% – WordPerfect (WPD)
  • 18% – Word (DOC)
  • 9% – Portable Document Format (PDF)
  • 3% – Excel (XLS)
  • 1% – Text (TXT)
  • 0.9% – Rich Text Format (RTF)
  • 0.6% – PowerPoint (PPT)
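For programs that want to run a similar tally on their own file shares, here is a minimal sketch of one way to do it. It is not the tool we used to produce the numbers above; the function name `extension_breakdown` and the `top_n` parameter are made up for illustration, and it simply groups files by extension, which will miscount anything that has been renamed or has no extension.

```python
from collections import Counter
from pathlib import Path


def extension_breakdown(root, top_n=7):
    """Return (extension, percent) pairs for the most common file types under root.

    Hypothetical helper: file type is inferred from the extension alone.
    """
    # Walk the whole tree and count each file by its lowercased extension.
    counts = Counter(
        path.suffix.lower().lstrip(".") or "(none)"
        for path in Path(root).rglob("*")
        if path.is_file()
    )
    total = sum(counts.values())
    if total == 0:
        return []
    # Percentages are of all files found, rounded to one decimal place.
    return [
        (ext, round(100 * n / total, 1))
        for ext, n in counts.most_common(top_n)
    ]
```

Pointing it at the top of an archive share (for example, a mounted copy of one office's server) gives a list you can paste straight into a report. Because the percentages are of all files, not just the top seven, they will not necessarily sum to 100.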

(Discretion prohibits us from detailing the other file flotsam discovered on local office servers. That said, allow us to observe that some within our organization have extraordinarily good taste in photos taken by National Geographic, and not such good taste in music.)

We are totally on track for targeting most of our planned GSA targets: The existing office archive files listed above have long been targeted (although we still have a lot of work left to fit them into our structural taxonomy); over the last several months we have worked very hard to refresh and update (and remove, as warranted) the targeted content at LSNC’s various public websites; and we are very pleased with the quality of the GSA results we are getting out of Google Sites.

This is all good news. In addition, we are putting in place a few more content channels: targeting the content in our organization’s seven private Google discussion groups, and conducting a program-wide canvass for select hard-copy training resource materials for digital conversion and addition to the shared repository. It’s all good.

We have had one major disappointment: We discovered that there are significant, unanticipated technical challenges unique to the Pika CMS that thus far have prevented effective use of the GSA to target Pika content. The problem is not the GSA itself or configuring the GSA to target Pika. The GSA by design performs wholly benign, non-destructive crawls as it indexes targeted records. We did a huge amount of target testing and SERP evaluation, and we were very pleased — actually, thrilled is a better word — with the results we were getting from Pika. The unanticipated problem is that the current version of Pika is not well optimized for use as an enterprise search target. There are code anomalies in Pika that, among other things, cause it to auto-generate new case intakes and case records when it is crawled by the GSA.

After an assessment by Pika Software, it is now apparent it will take something in the neighborhood of 200 hours of work to make Pika more receptive, shall we say, to an external crawl by the GSA. (Ouch!) So for now, we have to put that part of the project to the side. File under: Lessons Learned.
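For readers wondering how a read-only crawl can create records at all: a crawler fetches pages with plain GET requests, so any page that writes to the database just by being requested will write every time it is crawled. The sketch below illustrates that general principle only. It is not Pika's code and not the fix Pika Software has in mind; `IntakeStore` and `handle_intake_request` are hypothetical names invented for the example.

```python
# HTTP methods that are defined as "safe": they should never change server state.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class IntakeStore:
    """Hypothetical in-memory stand-in for a case management database."""

    def __init__(self):
        self.records = []

    def create(self, data):
        self.records.append(dict(data))
        return len(self.records)  # use the record count as a simple id


def handle_intake_request(method, form, store):
    """Create an intake only on an explicit POST; safe methods never write."""
    method = method.upper()
    if method in SAFE_METHODS:
        # A crawler lands here: it sees the form but nothing is saved.
        return 200, "intake form"
    if method == "POST":
        record_id = store.create(form)
        return 201, f"created intake {record_id}"
    return 405, "method not allowed"
```

With a guard like this, a crawler can request the intake page as often as it likes and the record count stays at zero; only a person submitting the form creates a record.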


2 responses

  1. Thanks for the information about Pika integration. As our organization works with Pika, we have some interesting plans for targeting the data via SharePoint, and you have highlighted the need to be cautious in planning what we do.

  2. Brian Lawlor says:

    Aaron Worley at Pika Software, who worked directly with us on the problems with Pika integration, understands the technical issues and has definite ideas of how to resolve them, so you should be sure to check in with them for an assessment on your enterprise search solution. This project has completely convinced us of the need for enterprise search, and integration of Pika and other case management systems commonly used by legal services programs needs to be a part of that search solution.

    As for SharePoint… well, we are going to weigh in on that fairly soon and it may not be what you expect to hear. Our use of SharePoint on this project was largely driven by our perception of the need for an enterprise search metadata model, and our views on that have shifted. And at the time we launched the project, Google Apps had not yet released Google Sites, and that has shifted our views about shared content management. More on all of this in coming weeks. Stay tuned.