Posts tagged: gsa

  • Thursday
  • July 8
  • 2010

Our final 15 minutes of Google fame

It was a pretty nice surprise for LSNC several months back to be asked by Google to present Advancing Knowledge Sharing with Google: The LSNC Story, with its focus on what we accomplished with The Findability Project.

Prior to but independent of that webinar, Google interviewed LSNC about The Findability Project and LSNC’s larger experience of integrating a Google Search Appliance with Google Apps and the Pika case management system. Google now features its LSNC case study at its Google Enterprise customer solutions site. Sure, it’s a marketing stroke but, still, it’s great to be included.

  • Friday
  • April 2
  • 2010

Pika and the Google Search Appliance make nice

For those who have followed The Findability Project, I am pleased to report we have surmounted the basic technical problems of targeting our Pika CMS with the Google Search Appliance.

The back story is one I have purposefully repeated whenever giving a presentation about the project, namely, that our Pika Plan A did not work. We encountered code anomalies in Pika that, among other things, cause it to auto-generate new case intakes and case records when it is crawled by the GSA. As a result, we were unable to use the GSA to crawl the Pika client case content dynamically generated as web pages. Plan A would have been the easiest, no-brainer way to go but we were not able to do so. So Plan B was to have the GSA target the Pika MySQL database directly. Status report: Mission accomplished.

There are GSA capacity issues for us, since our particular GSA’s one million “record” capacity means one million web pages or database records, inclusive, and these database records are not the same thing as the count for client case records. At any given time, we may have some 130,000 to nearly 200,000 client cases in our Pika system (and even more in archival data storage), but from a database perspective, these add up to multi-millions of “records,” e.g., various types of time records, case notes, contacts, and so on. Part of the challenge for us was to sort out which pieces of those millions of database records were the ones most needed and useful to our users.

The solution? Using a well-tailored query, we have the GSA do a selective crawl of the Pika MySQL database to return the most commonly sought and used Pika content: case numbers, client names, office designations and case notes… tons of case notes. The basic technical explanation is that the GSA performs a database query, returns the result as an XML feed, and indexes that feed; the user’s search terms are then run against that index and the matches are ultimately returned as viewable HTML.
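For readers who want something concrete, here is a minimal sketch of the "selective query, then XML feed" idea. It is written in Python with an in-memory SQLite database standing in for MySQL, purely for illustration. The GSA does this work itself through its database crawl configuration, so none of this is the code we ran. The table names, column names and XML element names are all hypothetical and are not Pika's actual schema.

```python
import sqlite3
import xml.etree.ElementTree as ET


def demo_connection():
    """Create a tiny in-memory database with hypothetical, Pika-like tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cases (case_id INTEGER PRIMARY KEY, case_number TEXT,
                            client_name TEXT, office TEXT);
        CREATE TABLE case_notes (note_id INTEGER PRIMARY KEY, case_id INTEGER,
                                 note_text TEXT);
        CREATE TABLE time_records (time_id INTEGER PRIMARY KEY, case_id INTEGER,
                                   hours REAL);
        INSERT INTO cases VALUES (1, '90-10-123456', 'John Client', 'Sacramento');
        INSERT INTO case_notes VALUES (10, 1, 'Called client about eligibility.');
        INSERT INTO case_notes VALUES (11, 1, 'Filed fee waiver request.');
        INSERT INTO time_records VALUES (100, 1, 0.5);
        """
    )
    return conn


def build_feed(conn):
    """Select only the fields users search for and return them as XML text."""
    rows = conn.execute(
        """
        SELECT c.case_number, c.client_name, c.office, n.note_id, n.note_text
        FROM case_notes AS n
        JOIN cases AS c ON c.case_id = n.case_id
        ORDER BY n.note_id
        """
    ).fetchall()
    root = ET.Element("records")
    for case_number, client_name, office, note_id, note_text in rows:
        record = ET.SubElement(root, "record", id=str(note_id))
        ET.SubElement(record, "case_number").text = case_number
        ET.SubElement(record, "client_name").text = client_name
        ET.SubElement(record, "office").text = office
        ET.SubElement(record, "note").text = note_text
    return ET.tostring(root, encoding="unicode")


feed = build_feed(demo_connection())
```

The point of the join is what it leaves out: the time records table is never touched, which is how a selective query keeps the indexed record count well under the appliance's capacity.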

What does the search result look like? A Google search result. The clickable link displays the case number, client name, LSNC office and primary advocate name, e.g., “90-10-123456 ~ John Client ~ Sacramento ~ Jane Advocate.” Below that it displays in-context text with the search terms highlighted in bold, essentially like a regular Google search result. Clicking the link dynamically displays the actual Pika case note shown in context. Assuming there are multiple possible matches for a particular Pika case record, there is a link to display all the “omitted results,” akin to how regular Google searches work, so the users can see all possible, not just probable, matches. Clicking through the GSA search result link also takes the user directly to the actual Pika client case record.

That’s the name of that tune.

  • Sunday
  • January 17
  • 2010

Coda re 2010 TIG Knowledge Management session

Last Wednesday at the 2010 LSC TIG Conference, Chicago-Kent’s Ron Staudt and I did a joint session, Knowledge Management – What It Is, Why It Matters, and (Google) Options For Making What You Know Findable. Ron, of course, was cogent, concise and charismatic and stayed within his presentation window and hit all his marks. Me? Regrettably, after all these years, I still haven’t figured out how to squeeze 10 pounds of cement into a 5-pound bag, and didn’t even get to several key points I had hoped to make about enterprise search and The Findability Project. To make matters worse on my end, at the beginning of my segment the Flash demo of how LSNC’s enterprise search front end works faltered badly since it displayed so poorly when projected. (More than one person mentioned to me afterwards that they were simply not able to see accurately what I was describing at the moment. Uh, it seemed like a good idea at the time.)

With those apologies out of the way, allow me to annotate a few points now to make up for at least a few things that I did not cover during the presentation:

The LSNC “portal,” “intranet” and “document repository”

I feel I successfully got across the point that there is a broader sense of “search” at play that is important to grok, as an organization works toward enterprise or so-called “universal” search. However, because I ran out my clock and didn’t have time to talk at length, I didn’t quite get to describing the varied content targets that LSNC has identified as valuable, useful and usable and therefore all that which we wanted to make readily, easily findable. In going over all that, in passing I mentioned that The Findability Project originally included a SharePoint component which is now being abandoned, in favor of our relying on components of the Google Apps platform, specifically, Google Sites.

The LSNC Shared Portal demo’d but not successfully displayed during the presentation is itself not part of Google Sites. The portal is itself a point-of-entry front end built on a WordPress PHP installation, and designed to complement our Pika 4.0 installation, which is also a PHP application. The portal is a point-of-entry but not a strictly controlled one, in the sense that users are not required to go through it to access either Pika or their Google Apps. But the portal is a custom user-interface that affords our users quick, efficient access to all the core web-based applications they need to do their work, plus a program calendar and a slew of LSNC-specific newsfeeds. And then there is the portal’s killer app: The enterprise search box, the findability trigger that searches all of the valued, useful, usable shared content. The enterprise search box initially gives you what I described in the session as “horizontal” search; at the (poorly displayed) search result page our users then have access to “vertical” filtering options.

And, as illustrated with the search for my personnel information and photo, our users can use the enterprise search box to do special data queries to get specially tailored search results. For example, when I did the demo search for “staff brian,” here’s what was basically happening: Triggered by the keyword “staff,” the Google Search Appliance (GSA) activated a OneBox module that ran a query against our Pika CMS database and returned the query result as XML, which in turn was processed through XSLT and output for display as HTML.
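The keyword trigger is easy to picture in code. The GSA does this matching itself once a trigger is configured for a module, so the Python below is only an illustration of the idea, and the function name is made up for the example.

```python
def split_trigger(search_text, triggers=("staff",)):
    """Return (trigger, remainder) when the first word is a known trigger.

    When no trigger matches, return (None, search_text) so the caller can
    treat the text as an ordinary search.
    """
    words = search_text.strip().split(None, 1)
    if words and words[0].lower() in triggers:
        remainder = words[1].strip() if len(words) > 1 else ""
        return words[0].lower(), remainder
    return None, search_text.strip()
```

With this sketch, `split_trigger("staff brian")` hands back the trigger `"staff"` and the remainder `"brian"`, while an ordinary search such as `"immigrant eligibility"` falls through untouched.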

The other private content areas I described are all now, or soon will be, part of our domain’s Google Sites. All of our organization’s “official” intranet content is now positioned at a Google Sites location, as is our new “shared document repository.” The GSA works very well with the Google Apps platform, and natively integrates with Google Analytics, among other things. Great stuff.

SharePoint issues

My observation at the beginning of my segment that LSNC was the first legal services field program to adopt the Google Apps platform and the first to abandon SharePoint was not intended to be provocative. It was intended to be transparent about what we are doing and why. Unfortunately, I never got around to explaining our organization’s views on SharePoint.

The short version is this: Given what we want and need to do with our shared work and collaboration space, we simply no longer see any advantages to using SharePoint. Zero. Zip. Nada. At launch of The Findability Project we viewed SharePoint as a key component for hosting and building and sharing content. And SharePoint is a great option for that. It is a very impressive product. But about six months into The Findability Project, Google unleashed Google Sites as part of the Google Apps platform, and for us it was a game changer. Google Apps is free (for non-profits, for the foreseeable future), we don’t have to host, maintain, secure, update or fix it, and Google continues to aggressively improve its features, along with everything else within Google Apps. And we are able to do pretty much everything we need to be able to do with it. True, SharePoint has an enormous mindshare within corporate America. And organizations do need to evaluate whether SharePoint has features or functionality that are unique or indispensable to it. For us, it has none.

Oh, and did I mention that the GSA works natively with Google Apps?

While these are not the reasons why we have bailed out on SharePoint, here are some views questioning what role SharePoint has in your future: Peter Campbell’s article, Why SharePoint Scares Me; and more contrariness from Dion Hinchcliffe, Sharepoint and Enterprise 2.0: The good, the bad, and the ugly.

More self-criticism: What we don’t like about our user interface

Perhaps I spent too much time trying to drive home the importance of usability as a concept and how it relates to findability. I am fascinated by usability concepts and, after years of practical experience now, sobered by the reality of how challenging it is to do well. We are very pleased with what we have accomplished with our portal (and related search result page and Pika CMS) designs shown in the slides. But I also had planned on taking a few minutes to highlight the remaining problems with our design, and “usability” thoughts about improving or fixing them. For example, we already plan on altering how we use tags as part of the portal page, and will soon be modifying the vertical filtering options on the enterprise search result page, to expand those options and make them more intuitive. I think we have done good. I think we can do better. And we will.

Knowledge management as poetry

I wasn’t entirely irresponsible about keeping within my allotted time. One thing I considered doing but dropped from my presentation to save time, was my giving a dramatic reading of the most famous poem ever about knowledge management. Yes, there is such a thing:

“The Unknown” by Donald Rumsfeld

As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don’t know
We don’t know.

[U.S. Department of Defense news briefing, February 12, 2002]

As far as I can tell, this was someone who never actually grasped basic concepts of findability.

But that’s me. What do I know.

  • Thursday
  • September 17
  • 2009

Revised: What the LSNC Shared Portal now looks like

We have now posted a further revised Jing video with audio providing a brief, 4-minute overview of the LSNC Shared Portal. This is the actual intro overview video we circulated internally to provide all staff with a basic visual and feature orientation, before our more extended, in-house live demos to be conducted next week.

It’s not so easy to do a public video demo of our new Pika 4.0 case management system design changes, because of confidentiality issues, but we will post select screenshots reasonably soon so you can get a visual idea of changes we have made to that application.

  • Monday
  • August 3
  • 2009

TIG final evaluation report for The Findability Project

For those interested, here is the recently approved TIG final evaluation report for The Findability Project.

This TIG project was funded for an 18-month period from January 2008 through June 2009. Much of the report will ring familiar to those who have followed the project here, since much of what has already been posted mirrors what would be required in a TIG evaluation report. Essentially, this public project site enabled us to give others in the legal services community an ongoing, if lagging, report of progress on the project, while at the same time considerably easing the process of writing up the evaluation report at the end of the project since we had already written most of it as we went along.

We’re winding things down here, but we will continue to post here at least through the next TIG conference in early 2010. Among other things, we will be detailing how in finalized form we are integrating our project’s GSA test frontend functionality into a more expansive shared organization portal, part of our current deployment of a heavily customized version of Pika 4.0. We have finished the LSNC redesign of Pika 4.0 as well as a new LSNC shared portal “front door” (built on WordPress), both of which are scheduled to be in place and in use by LSNC staff the day after the Labor Day break.

Stay tuned, people!

  • Sunday
  • July 5
  • 2009

Getting Google-y with the enterprise

As a coda to the post yesterday about findability, the pervasiveness of the Google search paradigm, and what that means for the non-profit enterprise, I want to take a moment to put focus on a question during the session about an online post screenshot highlighted in one of the slides: “Why Enterprise Search Will Never Be Google-y.” I fear I did a poor job of answering the question about how it is that the author viewed Google enterprise search as different from other types of enterprise search. Mea culpa.

A couple of follow-up observations, to better respond:

As mentioned during the presentation, one point of the slide was to draw attention to The Noisy Channel, a very search geeky, characteristically Google-contrary, but always interesting, worthwhile blog helmed by Daniel Tunkelang, chief scientist at Endeca, a high-end direct competitor with Google in the enterprise market. Agree or not, there is a lot to learn about search from The Noisy Channel. It is one of my must-reads.

The title of Daniel Tunkelang’s highlighted post derives directly from Chris Sherman’s pithy, two-page online article with the same name, Why Enterprise Search Will Never Be Google-y (from the Enterprise Search Sourcebook 2008). The gist of Daniel’s post and Chris’ article that prompted it is this: The “simple search” or “known item” search we all commonly associate with Google (the noun and the verb) shortchanges what enterprise search can or should be for those who use it. The tension between these two enterprise search models is why I highlighted these two paragraphs from Daniel’s post:

The upshot? There is no question that Google is raising the bar for simple search in the enterprise. I wouldn’t recommend that anyone try to compete with the GSA on its turf.

But information needs in the enterprise go far beyond known-item search. What enterprises want when they ask for “enterprise search” is not just a search box, but an interactive tool that helps them (or their customers) work through the process of articulating and fulfilling their information needs, for tasks as diverse as customer segmentation, knowledge management, and e-discovery.

The irony here is that, contrary to the entertainingly provocative “never will be Google-y” in the title, for some market segments enterprise search is already Google-y. In some respects, Daniel’s post and Chris’ article both actually make the case for, not against, the Google enterprise model, which is to say that for some segments of the enterprise market Google and its search appliance may very well be the way to go. Our experience is that it is a particularly viable way for a non-profit legal services program.

Why do I say that? Even assuming arguendo that Google Search Appliance (GSA) improvements “should be seen in the context of state of the art,” for many organizations this state-of-the-art is a rarified and unobtainable reality. One has to wonder, after costing out a solution with one of the three major market leaders in enterprise search (Autonomy, Endeca and FAST), whether a Google box doesn’t look pretty damn good and pretty damn doable, given what it does. As Daniel himself observes, “I wouldn’t recommend that anyone try to compete with the GSA on its turf.” Is that turf a real solution for some market segments? While Chris invokes a clever if overstated “oil and water” metaphor about the differences between web and enterprise search, he follows it by suggesting the exact opposite: Some enterprise search segments are well served by the Google paradigm, notably including “intranet search” –

Many organizations are encouraging employees to communicate internally via blogs, or to participate in community-based knowledge repositories such as internal wikis. This is one area where there is a genuine parallel between enterprise information systems and web content, and Google excels at understanding and surfacing this type of content.

Tell me about it.

  • Saturday
  • July 4
  • 2009

Findability and the Google search paradigm

Following up on an NTAP presentation I gave last Thursday, Findability and the Google Search Paradigm: Integrating Search as an Organizational Solution, here is a publicly viewable set of the presentation slides, which are in a Google Docs presentation format and include embedded links to a lot of the material I discussed during the presentation. You can find the New York Times article I mentioned about Twitter as an example of “crowd-sourcing” at David Pogue’s post, The Twitter Experiment.

I painted with a broad brush during the presentation. The goal of the presentation was to offer the legal services community a broader view, and an emerging view, of what it means to search, to search on the enterprise, and to suggest what it means to Google search on the enterprise. These are just the slides. While I gave a brief live demonstration of how our GSA installation actually functions when generating and filtering search results, you’ll have to come to the upcoming 2010 LSC Technology Initiative Grants Conference to get a more expansive demonstration and technical explanation of our implementation, including a solution (hopefully) to the problems we’ve had with Pika CMS integration into our enterprise search solution.

As is my bad habit, I went long and so the discussion at slide 72 about the real and imagined obstacles to implementing enterprise search in a non-profit environment got short shrift, and for that I apologize. I promise to do a better job with those issues at the TIG conference. In our experience, getting our “stuff” organized and hammering out practices and protocols was a much larger time commitment on this project than the strictly technical stuff. And then there are the paralysis-against-progress problems that large organizations may experience since, in my view, they mistakenly think they have to have everything about taxonomy, vocabularies, folksonomies and metadata in place. For example, I have argued here, with our somewhat novel Google Search Appliance implementation in a non-profit environment, that we could do fine for now without relying significantly on metadata to make our project work. Others beg to differ.

In any event, I hope the presentation last Thursday was helpful. Let’s all talk again at TIG in January 2010.

  • Friday
  • June 5
  • 2009

A quick and dirty OneBox using PHP

Arguably the most common, if not first, Google Search Appliance (GSA) OneBox module that organizations implement is a module that returns personnel information or listings of some kind. It is one of the most obviously useful OneBox results one can come up with. As we ramped up to implement our version of it, we were surprised to discover that most publicly available examples or models for creation of OneBox modules rely on technologies (ASP and Java being among the most prevalent) that we do not use. We could not find an example of such a OneBox using simple PHP/MySQL.

Our goal was to build an easily replicable OneBox module that does work with PHP, which we do use. A lot. PHP is at the heart of the Pika CMS as well as our public websites built on WordPress.

Here’s an example of what our OneBox special query result looks like, with the first keyword “staff” being the trigger and the second keyword “ukiah” being the name of one of our local office locations. The query returns a OneBox result listing all the active staff in that office:

Clicking on the link for each person’s name triggers a new display with a photo of the person and his or her vitals. This module also works using the same initial trigger with a staff person’s particular name.

Most simply put, this OneBox module works by querying the MySQL database “users” table in the Pika CMS, the application used by all our active employees, across all positions, to record their time and work activity. More specifically, the module breaks down into five basic steps:

  • the OneBox module sends a query to a targeted PHP file
  • the PHP code runs a query against the targeted MySQL database
  • the PHP code then outputs the returned data as XML
  • the GSA reads that XML output
  • the GSA then formats that output for display as a search result
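The middle steps above (run the query, output XML) can be sketched as follows. This is not the PHP from our download. It is an illustrative Python version with an in-memory SQLite database standing in for MySQL. The `users` columns and the example URL are hypothetical. Apart from `OneBoxResults`, the element names (`provider`, `MODULE_RESULT`, `U`, `Title`, `Field`) follow our recollection of Google's OneBox results format and should be checked against Google's OneBox developer documentation before use.

```python
import sqlite3
import xml.etree.ElementTree as ET


def demo_users():
    """Build a throwaway 'users' table; the columns are assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT,
                            last_name TEXT, office TEXT, phone TEXT,
                            active INTEGER);
        INSERT INTO users VALUES (1, 'Ana', 'Example', 'Ukiah', '555-0101', 1);
        INSERT INTO users VALUES (2, 'Ben', 'Sample', 'Sacramento', '555-0102', 1);
        INSERT INTO users VALUES (3, 'Cy', 'Former', 'Ukiah', '555-0103', 0);
        """
    )
    return conn


def staff_onebox_xml(conn, term):
    """Look up active staff by office or name and return OneBox-style XML."""
    root = ET.Element("OneBoxResults")
    ET.SubElement(root, "provider").text = "Staff directory (example)"
    term = term.strip().lower()
    if not term:
        # Nothing after the trigger keyword: return an empty result set.
        return ET.tostring(root, encoding="unicode")

    like = f"%{term}%"
    rows = conn.execute(
        """
        SELECT user_id, first_name, last_name, office, phone
        FROM users
        WHERE active = 1
          AND (LOWER(office) LIKE ? OR LOWER(first_name) LIKE ?
               OR LOWER(last_name) LIKE ?)
        ORDER BY last_name, first_name
        """,
        (like, like, like),  # bound parameters, never string concatenation
    ).fetchall()

    for user_id, first, last, office, phone in rows:
        result = ET.SubElement(root, "MODULE_RESULT")
        # Hypothetical detail-page URL for the person.
        ET.SubElement(result, "U").text = (
            f"https://intranet.example.org/staff?id={user_id}"
        )
        ET.SubElement(result, "Title").text = f"{first} {last}"
        ET.SubElement(result, "Field", name="office").text = office
        ET.SubElement(result, "Field", name="phone").text = phone
    return ET.tostring(root, encoding="unicode")
```

Whatever language the provider is written in, the two habits worth copying are the bound query parameters and the filter on active staff, since the search text arrives straight from the user.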

Within the GSA console, one creates a module by selecting OneBox Modules > Create Module Definition, selecting the Trigger (in our case, “staff”), and then identifying the Provider, which in this example is the PHP file we created and attached to the module as an External Provider, by inserting the URL to the PHP file.

You can download as a ZIP file the PHP code and related GSA stylesheet template used in this example.

The PHP file is annotated, but has select information edited or removed (host, passwords, etc.), for obvious reasons. Looking at the PHP code, in sequence the PHP submits the query, connects to the database, joins data from a combination of data tables in our case management system, then takes the results from the MySQL query and outputs them as XML, i.e., the “OneBoxResults” in the code.

Once the GSA has the query results as XML, it can then publish the results to a OneBox Stylesheet Template, which one can edit by clicking on the Edit XSL link at the bottom of the console page for the particular module.
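To give a feel for what the stylesheet step accomplishes, here is a rough stand-in. The real template is XSLT running on the GSA. The sketch below uses plain Python instead, simply to show the XML-to-HTML transformation. The sample XML, the element names and the output markup are all assumptions made for the example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical provider output, shaped like the staff example in this post.
SAMPLE = (
    "<OneBoxResults>"
    "<MODULE_RESULT>"
    "<U>https://intranet.example.org/staff?id=1</U>"
    "<Title>Ana Example</Title>"
    '<Field name="office">Ukiah</Field>'
    "</MODULE_RESULT>"
    "</OneBoxResults>"
)


def render_onebox(xml_text):
    """Turn OneBox-style XML into a simple HTML list of linked names."""
    root = ET.fromstring(xml_text)
    items = []
    for result in root.findall("MODULE_RESULT"):
        url = result.findtext("U", default="")
        title = result.findtext("Title", default="")
        fields = ", ".join(
            f"{html.escape(field.get('name', ''))}: {html.escape(field.text or '')}"
            for field in result.findall("Field")
        )
        items.append(
            f'<li><a href="{html.escape(url, quote=True)}">'
            f"{html.escape(title)}</a> ({fields})</li>"
        )
    return "<ul>" + "".join(items) + "</ul>"


rendered = render_onebox(SAMPLE)
```

An XSLT template does the same job declaratively: match each result element, emit a link from its URL and title, then list its fields.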

  • Sunday
  • May 17
  • 2009

How we organized our targeted Google Sites content

Since we’re on the subject of revisions and updates today, here’s another about how we finalized our Google Sites content.

As noted earlier, The Findability Project planned integration of select Google Sites content as a GSA target. How we created LSNC’s “official” intranet site with Google Sites was covered (briefly) as part of a recent NTAP presentation.

Since that presentation, we have pretty much completed the migration of all our intranet content over to what LSNC calls its “Shared Private Network” (SPN). For those curious, here is a screenshot of the current site’s home page; and here’s a screenshot of the top levels of the sitemap. As you can see, we have worked to keep the hierarchy simple, which means manageable, especially given the number of different folks who have responsibility to maintain its content. Also, we have created a large number of Google Sites file cabinet “upload” pages to make management of those files easier, for the same reasons. So far, so good.

What is great about all this is that the GSA easily targets this selected Google Site, and returns great results from the site. Users can have it both ways, by searching from the GSA frontend but with equal ease from the native search function within the Google Site itself. It’s all good.

  • Sunday
  • April 26
  • 2009

Google Apps, SharePoint and this project

At the outset, let it be acknowledged that SharePoint is a great product. For good reason, many in the legal services community have either adopted or are at least seriously looking at SharePoint as a core component of their network infrastructure. A notable example of this trend from earlier this year is Tom Winter’s video collection of SharePoint Resources for Legal Aid. Impressive.

That said, observant followers of The Findability Project may have noticed our chronic inattention to, and now outright de-emphasis of, SharePoint. There’s a reason. Actually, several reasons.

When we submitted our TIG proposal in 2007, we proposed SharePoint as a key component of the technical specifications for this project. Once we received the grant in 2008, that is exactly how we proceeded as we put together our so-called blunt-instrument build. At the time, we put in place an open-source Google SharePoint connector that plays nicely with the Google Search Appliance (GSA). (We have documented how we configured the SharePoint side of things; we will eventually document how the Google connector configurations work.)

From the get-go we recognized the basic promise of SharePoint, i.e., it offers an array of enterprise platform options for creating and maintaining organizational portals and managing content. All stuff we wanted as we built out our project, moved toward positioning our content in very purposeful ways, and worked out optimal ways for our organization to communicate, share and find content. True, we were less sanguine about SharePoint’s enterprise search features. Not because it is not effective. It is. But we had greater confidence in the algorithms and effectiveness of Google enterprise search, which natively works with most everything Google, and SharePoint does not. But we will put that tribal view aside, for the moment. We give SharePoint its due: Impressive.

That was late 2007, early 2008. This is now, a little more than a year later. What happened in the interim? Google Apps happened … way more, way better Google Apps including an increasingly impressive array of collaboration features … including domain Google Sites … integration of Google Analytics into Google Apps … and then at the end of 2008 some serious happy with the version 5.2 update for the Google Search Appliance, which now integrates with Google Apps, including Google Sites.

Way impressive.

Even though we had SharePoint in place and could have built out our intranet using it, we all but immediately and instinctively moved on to Google Sites once it became available to us in 2008 and, in short order, built things out that way. (See Google Apps Redux for more about how LSNC currently uses Google Apps, including Google Sites.) It is not that SharePoint is not useful to accomplish many of the same things. It is. But at what cost and at what loss in usability?

For a modestly sized non-profit like ours (about 130 employees and two actual IT staff, not wannabees), the Google Apps platform has proven to be a phenomenal, secure, essentially zero-cost, zero-maintenance way to have access to pretty much all the basic collaborative and communication technologies now deemed baselines for the legal services community. (Oh, yeah, the baselines happened in 2008, also.)

And all this stuff works very nicely with the Google Search Appliance. SharePoint, not so much.

  • Sunday
  • March 22
  • 2009

Selecting GSA targets – Part Three: Quantification, Revision and Finalization

There is a great deal of work proceeding behind the scenes and several key project elements are converging as we move toward finalizing this public project. Among other things, we have been working through modest but practical solutions for better placement and targeting of our existing 300,000+ repository documents, while solidifying all the additional Google Search Appliance (GSA) targets in our enterprise search sights, described in Part Two. At the same time, we are in our own March (and April and May) Madness as we mount a rapid-fire round of trainings for each of our eight remote offices (spread out over 50,000 square miles of Northern California) on their new role in making Google Enterprise search work for them, which is to say for all of us. And as mentioned in earlier posts, we are working every bit as earnestly on our latest in-house build of the Pika CMS.

The infusion of new, future content into the simplified structural taxonomy we created is a separate challenge we will be posting about later. Dealing with our existing files is more immediate, more concrete. Grokking those files has been one of the more interesting, at times hilarious parts of this project.

For those legal services programs interested in how the existing files from our eight local offices break out, here are the percentages for the seven most common “document” types:

  • 67% – WordPerfect (WPD)
  • 18% – Word (DOC)
  • 9% – Portable Document Format (PDF)
  • 3% – Excel (XLS)
  • 1% – Text (TXT)
  • 0.9% – Rich Text Format (RTF)
  • 0.6% – PowerPoint (PPT)
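For programs that want to run a similar tally on their own file shares, a breakdown like the one above can be produced with a few lines of code. This is a generic Python sketch, not the method we used; it assumes you already have a list of file paths (for example, from a directory walk), and the function name is made up for the example.

```python
from collections import Counter
from pathlib import PurePath


def extension_shares(paths):
    """Return {extension: percent of files}, rounded to one decimal place."""
    counts = Counter(
        PurePath(p).suffix.lower().lstrip(".") or "(none)" for p in paths
    )
    total = sum(counts.values())
    return {ext: round(100 * n / total, 1) for ext, n in counts.most_common()}
```

Extensions are lower-cased so that `LETTER.WPD` and `letter.wpd` land in the same bucket, and files with no extension are grouped under `(none)`.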

(Discretion prohibits us from detailing the other file flotsam discovered on local office servers. That said, allow us to observe that some within our organization have extraordinarily good taste in photos taken by National Geographic, and not such good taste in music.)

We are totally on track for targeting most of our planned GSA targets: The existing office archive files listed above have long been targeted (although we still have a lot of work left to fit them into our structural taxonomy); over the last several months we have worked very hard to refresh and update (and remove, as warranted) the targeted content at LSNC’s various public websites; and we are very pleased with the quality of the GSA results we are getting out of Google Sites.

This is all good news. In addition, we are putting in place a few more content channels: Targeting the content in our organization’s seven private Google discussion groups, and a program-wide canvass for select hard-copy training resource materials for digital conversion and addition to the shared repository. It’s all good.

We have had one major disappointment: We discovered that there are significant, unanticipated technical challenges unique to the Pika CMS that thus far have prevented effective use of the GSA to target Pika content. The problem is not the GSA itself or configuring the GSA to target Pika. The GSA by design performs wholly benign, non-destructive crawls as it indexes targeted records. We did a huge amount of target testing and SERP evaluation, and we were very pleased — actually, thrilled is a better word — with the results we were getting from Pika. The unanticipated problem is that the current version of Pika is not well optimized for use as an enterprise search target. There are code anomalies in Pika that, among other things, cause it to auto-generate new case intakes and case records when it is crawled by the GSA.

After an assessment by Pika Software, it is now apparent it will take something in the neighborhood of 200 hours of work to make Pika more receptive, shall we say, to an external crawl by the GSA. (Ouch!) So for now, we have to put that part of the project to the side. File under: Lessons Learned.

  • Friday
  • March 6
  • 2009

Updated GSA open source CSS

Related to an earlier post today, we have also updated our annotated GSA open-source-CSS list of class and id selectors generated in GSA search results markup when using the Google Code open source GSA XHTML stylesheet. This CSS list is a bit better organized than our earlier version and includes about a half dozen additions. It is organized into the three basic portions of a GSA search result page: the form and navigation elements at the top; the search results themselves; and the portions of the page below the search results.

Hope this is as helpful to others as we have found it to be. And please do comment if we have missed anything.

  • Friday
  • March 6
  • 2009

Version 05 of the TFP search result page

We now have in place version 05 of our GSA test frontend. The linked screenshot shows a basic search for the keywords “immigrant eligibility,” with 1090 results in varied file types returned from various GSA target locations. It looks very similar to an earlier version we posted, but we have made changes to advance several purposes. We have simplified the overall page design to make it less distracting, so that the eye is drawn more directly to the search results themselves. As part of the design refresh, we dialed back in a major way on some color and font choices we made in our earlier design that proved distracting for users. We also got user feedback that told us folks here like knowing and being reminded that these are “Google” results, so we’ve gone retro in a deliberate way to make the search results look more like classic Google and Google Sites search results, while standardizing link color with a nice retro blue, relying on bold and normal rather than alternate colors for link emphasis and de-emphasis.

On the right side of the search result page, there are now filters for narrowing the displayed results by file type or by special collections. For example, here is the same result filtered for only PowerPoint files, returning 10 total results, with only four displayed because the Google algorithm discerns the others are likely duplicates.

So, what’s with the window dressing at the top with links to Pika CMS, Gmail and so on? Those are unadorned placeholders for other things we expect to integrate into the user interface as we finalize the GSA frontend. The goal is to integrate GSA-based search and its results as part of a shared portal or point of entry to Pika CMS, the PHP-based case management system we use, and the array of Google Apps and other web-based applications most commonly used at Legal Services of Northern California.

As if we don’t have enough to do already… but at the same time we have been working on this GSA project, we have been updating our existing, heavily customized Pika 3.04 installation to Pika 4.0. We have completed those code changes to Pika, but are a few months away from rebuilding all the Pika template structures and folding in a new visual design we are adopting. Ultimately, the GSA search form will be added into what is now the Pika home page, and the GSA search results will morph into a visual design that conforms to our new Pika design.

  • Thursday
  • February 26
  • 2009

Comparing Google Sites and GSA search results with release 5.2 in place

All went well with the GSA version 5.2 update. The update itself is a humongous 1.53 GB ISO file that, once burned to a DVD disc and loaded, took about 6 hours to install. As recommended, we did a complete crawl refresh which, in our case, took another 72 hours. Other than this considerable but necessary time investment, we had no real problems with the update process.

As mentioned in an earlier post, the principal attraction of this most recent GSA update was the integration of Google Apps, which enables targeting of domain-hosted Google Docs and Google Sites. In that regard we are pleased to report no problema, as well.

In GSA version 5.2 the administrator now sees a menu option for “Google Apps Integration” with a single field for enabling or disabling one’s Google Apps domain as a GSA target:

With Google Apps targeted generally, it is then a matter of constructing URL patterns to include or exclude more specifically what you want targeted within your Google Apps. In our case, that meant our selection of specific Google Sites now serving as our organization’s intranet content platform. More specifically, our search goal was to have the GSA index not just pages within those Google Sites but, as importantly, files uploaded to those Google Sites.

There are differences in how search results display between those performed from within Google Sites and those from a GSA frontend. If a search is done from within Sites, it will find and return a search result for keywords or phrases within an uploaded file, but not display the context of the keywords or phrase. For example, using the search law school+"reimburse me" one gets this specific PDF search result from within Google Sites:

The same search done from our test GSA frontend, which returns results from everything targeted by our GSA, yields the same search result while showing the keywords and phrase in context:

So, the basic differences in how search results display are these:

An internal Google Site search will find and return results based on keywords and/or phrase within a file uploaded to Google Sites, display the filetype as an icon (in the above example, with a PDF icon), display the link using the file name, but not display the keywords or phrases in context.

In contrast, the GSA search result will find and return the same result but display the keywords and/or phrase in context, display the filetype as an acronym (e.g., “PDF”), and display the link as what the algorithm discerns as the document’s title (in this example, “Law School Loan Reimbursement Request Form”).

  • Friday
  • January 23
  • 2009

Why we like GSA release 5.2: Google Apps integration

A few weeks ago Google Enterprise issued Release 5.2, the latest software update for model GB-1001, the one we are using on this project.

There are a slew of new features in Release 5.2, but there are a couple that are making for some serious happy on this project. The most significant is that, with the update, the GSA now integrates with Google Apps. For those interested, the Google Code site has a detailed explanation of how that works.

For us this is huge. Our modest non-profit organization two years ago adopted Google Apps as a basic building block for a functional, practical, web-based enterprise environment, something we never really had before. (Hey, there are intranets and then there are intranets.) The Google Sites and Google Docs pieces of the no-cost non-profit Google Apps service are a big part of that. And as part of this project, we have moved pretty much all of our existing intranet content over to Google Sites, and use of shared Google Docs throughout the organization is increasing steadily. (Use of the forms features in Google Docs is especially popular among our office managers.)

Before the Release 5.2 update, we had made valiant stabs at getting the GSA to index our Google Sites content, but with muddled success; and with uploaded files, at best it would only return results with keywords that showed up in uploaded file names, not the file content. Now the GSA integrates directly and we can target any “public” Google Sites or Google Docs content we share with others within our domain. (There is some ambiguity in how Google describes GSA integration with so-called “public Google Sites and Google Docs.” To clarify the point, in this context “public” is a Google term-of-art. If you create a Google Site or Google Doc within your domain’s Google Apps and share it with everyone within your domain, then it is “public.” It is not necessary to make that content public to the world.)

The other Release 5.2 feature we are especially excited about is its enhanced advanced search reporting. Now we have a built-in tool that enables us to analyze user search behavior, with reports that “list every query and click made by every user,” plus whether users are finding what they search for within three clicks, or not at all, and which part of the search interface the users, uh, actually use. Aces!

One caveat we are aware of from GSA groups discussions: Release 5.2 is a significant update with features that may warrant a serious review of one’s existing XSLT modifications, to exploit new GSA feature sets. And we have been advised to do a complete crawl refresh. We’ll report back here how it all goes.

  • Tuesday
  • December 9
  • 2008

GSA virtual edition + VMware = experience enterprise search

This post is primarily targeted at my IT allies within the legal services community and other non-profit organizations following this project, who may be interested in trying out the Google Search Appliance but aren’t prepared to make the hardware investment as yet, even for a more modest Google Mini.

Here’s the deal: Now there’s a software alternative to get the full-on GSA experience for free. Sort of. Pretty much. At least up to 50,000 documents.

Early last month Google announced the availability of the fully functional (albeit limited to 50,000 documents) Google Search Appliance virtual edition. The GSA virtual edition is “a developer platform designed for the enterprise development community to build and test applications that use the Google Search Appliance. The Google Search Appliance virtual edition is for non-commercial, development purposes only. Developers can simply download the software-only product, try out the new features and explore the various programmatic interfaces supported by the Google Search Appliance.”

To exploit this opportunity, you need a server that is supported by VMware virtualization. Here are the rudimentary technical FAQs about the virtual edition. The Google Code site does include a page about installing the virtual edition and access to basic GSA documentation, but does not include the extensive additional online documentation that comes with the fully licensed GSA hardware or any direct support.

  • Monday
  • December 8
  • 2008

TFP search portal ~ version 04

We debated internally how best to roll out to LSNC staff the various incremental changes to our project “search portal” and continuing modifications to our GSA-generated search results page. Given enough IT resources, we could have held off debuting enterprise search to staff until we could unload it all in one fell swoop. The practical challenge is that we have modest IT resources for a modest-sized (fewer than 150 employees) non-profit organization, cannot give undivided attention to any one technology project, and the project was geared to roll out over an 18-month period. And we have six more months to go. In any event, since typically we cannot present everything in a finished state, we have to roll things out incrementally. Fortunately, the internal culture and camaraderie at LSNC supports this practice. Demonstrating incremental progress pays dividends here, and staff seem to appreciate seeing any usable technological progress we can offer them, even if incomplete.

At this juncture, we are on our fourth version of the GSA frontend for The Findability Project. The version 04 search portal page is an incremental, intentional step toward a shared portal page, the access point of entry for all staff that eventually will be integrated with the core web applications essential to our work, including the Pika CMS, various Google Apps, an organizational news feed, and so on. We’re not there, but within the next few months we should be able to close the deal on a basic, first-generation solution for our organization.

As part of all this, this is what our version 04 search result page looks like, with our recent addition of file-type filters that can be applied to GSA generated search results. Upcoming will be our addition of select filtering by collections, a basic OneBox search result prototype, tweaks à la related-queries (the GSA equivalent of synonyms) and GSA keymatch, and more.

  • Thursday
  • December 4
  • 2008

CSS selectors for GSA open source XHTML stylesheet

Recently we described how we built our GSA XSLT stylesheet to give us external CSS control over our search result page. A key element of that build was to rely on the open-source Google Code GSA XHTML Stylesheet, which is both way more web standards compliant and also makes it way easier to modify the XSLT so that you can generate search results without any embedded or inline styles.

The CSS styling or presentational characteristics of a search result page are, of course, personal and particular to the designer who codes the CSS. But to make that process at least a touch easier, we’ve created a GSA open-source-CSS list of class and id selectors based on attributes generated in GSA search results markup when specifically using the open-source Google Code GSA XHTML Stylesheet. (This list of selectors will not work with the default GSA XSLT stylesheet.)

What we’ve done is carefully comb through the search result page markup, list all the CLASS and ID attributes in their order of page flow, annotate each with a description, and then put them into a reusable CSS stylesheet that can be adapted for use with new GSA frontends, as you will. Here’s an example:


Fairly straightforward stuff, but how many times over do you want to “view source” in your browser or invoke Firebug to recall the particular selector and attribute for, say, the definition list markup used in GSA search results? Seriously, unless you did it about 10 minutes ago, can you recall what the dd p.st span.rc a.f selector refers to on the search result page? We didn’t think so.

Here’s the full GSA open source CSS selector stylesheet.

Hope this is helpful for those using the open source GSA stylesheet.

  • Wednesday
  • November 12
  • 2008

Converting hard-copy documents for addition to the shared repository

A late October post at the Official Google Blog entitled A picture of a thousand words? prompts me to draw attention to an analogous TFP document protocol we worked out a few months ago. It is worth highlighting because it is so practical and will be an invaluable source of additional knowledge content targeted by our GSA.

But first, the Google post: Read it and you’ll discover that “In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document, so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format.” (As lawyers are so fond of saying, emphasis added.) As the post illustrates by example, do a Google search for repairing aluminum wiring and at the top you’ll see a PDF listed. If you download the PDF and open it, you’ll discover it is an image of a text document. The downloaded file is itself not text searchable. But click View as HTML for that same result and you’ll discover that the text is actually indexed and searchable via Google.

Essentially, we are doing the same thing within our own enterprise search ecosystem, but with an added advantage. Not only have we adopted a document handling protocol for using our networked printers/scanners to convert select hard-copy text documents to PDF image files, we also process the resulting PDF images through Adobe Acrobat’s native “OCR Text Recognition” tool, and then save each file with some basic metadata added.

Once added to the shared document repository, the scanned and OCR’d text document is then fully indexed and searchable by the GSA. And when the user finds and downloads the file, it is fully text searchable itself when opened in Adobe Acrobat or Adobe Reader. One better than what Google itself now does, superbly.

  • Monday
  • November 10
  • 2008

Selecting GSA targets – Part Two: The Practical Realities

In an earlier post about selecting Google Search Appliance (GSA) targets for this project, the narrative definitely edged toward the more abstract. We highlighted four principal sets of GSA targets: files on our newly created “shared document repositories”; repurposed intranet content being moved over from a MediaWiki installation on an old server; cherry-picked content available on LSNC’s varied public websites (LSNC maintains 13 distinct public websites and subsites); and select records in Pika CMS, the secured, web-based case management system used by all our advocates.

As far as it goes, this abstract list of GSA target sets fairly summarizes what we, as an organization, want to make transparent via enterprise search, which is to say make “findable” in ways not practicable without the GSA. This abstract list of GSA targets, however, fails to convey what we have done at a non-abstract, practical level to make those targets useful to our larger search goals.

So, let me hit a few notes about several practical decisions we’ve made at launch as we target the GSA at real files offering real search results.

As described in Part One, when we first unpacked our GSA and aimed it, uh, somewhat aimlessly at any and every file on one of our local servers, the GSA did its job in killer fashion… and blew out our file limit. While one can proceed that way, we were always mindful that we had to sort out how to organize and structure the shared content that we seriously wanted to make searchable and findable. So, one of the first tasks we confronted on this project was to work out our thinking about “taxonomy,” resulting in the basic directory structures we have adopted.

That taxonomic “organization” step was essential to this project, but completing that particular project objective doesn’t translate directly into searchable content organized in a particular way. You see, there is this pesky little detail: Real people need to actually identify the existing and/or newly created files to be included and then somehow get the files in the directories on the shared document repository that are the target of the GSA.

Easier said than done.

In our case — particularly given the limited IT and support staff resources available to us as a typical legal services field program — we had to come up with some practical approaches to move existing files from any number of different locations to the designated shared locations or document repositories. (I will discuss how we handle adding newly created files, in a later post.) Here’s what we did with our existing files to fold them into the content targeted by our GSA:

1. Initially, include all existing “staff-specific” content, with an opt-out

We did find ourselves on the receiving end of a lot of staff enthusiasm about this project. Truly. But it is impractical and unrealistic to expect your individual legal services advocates and other staff to comb through all their thousands of files and then move them over to a different file server location. (Maybe it should be realistic to expect them to do this, but in our experience it just ain’t gonna happen. No way.) But there are tons of content gold in them thar files, so we had to figure out a way to initially get all that good stuff in place, even if not parsed out in a taxonomic sense, so we could target it.

To accomplish this, we first vetted with, and got buy-in from, all our local offices to do the following:

On the local project file server for each local office, we created a special project “archive” directory. Then each local office manager copied each individual staff member’s files wholesale over to a user-specific directory in this so-called archive directory. Having an unequivocal “opt-out” option was important to the success of this approach. Again and again, in formal meetings and informal discussions, we reminded office staff that they could ask that any or all files be removed from these initial archive-file targets. No questions asked. There were a few such requests, but not a lot: One advocate asked that her files not be targeted at all, so we removed all her files; two others had fewer than a dozen files they wanted removed as targets, so we did so. No biggie.

The net effect is that this makes the targeted advocate files initially non-taxonomic, but in short order you have a huge repository that has a (allow me to exaggerate here, for literary effect) 99% chance of including pretty much everything the individual staff members would add if they “woulda, coulda, shoulda,” so to speak. In our case, this initially amounts to about 300,000 document files, the vast majority of which are advocate-generated files.

At launch, this does mean that these office-specific, bulk compilations of existing files added as targets include a significant number of drafts and duplicates that one would normally not include as a shared file if it were being added as a newly created file. For example, within our office culture it is not only common but actually expected that advocates not work in isolation on major cases. (We discourage the “lone eagle” model.) So, our early search-results testing shows that often the same file shows up in more than one target location because more than one advocate has a copy of the file in their archive.

It bears mentioning two other factors we kept in mind as part of this initial targeting of shared files: We double- and triple-checked with all management staff to assure nothing management-sensitive or -confidential was moved to a location where it could be targeted. Also, before we moved anything over wholesale, as described, we asked all staff to remove certain types of files that no one would reasonably expect to be part of the searchable content. Examples: Family photos, MP3 music downloads, YouTube videos, yada yada yada. Enough said.

We do have an approach in mind for “peeling off” these office-specific archives over time, to separate out the drafts and duplicates and place them within our taxonomic directory structures. More about that later.

2. Using Google Sites as the platform for our existing intranet content

I recall having a passing conversation with Gabrielle Hammond at last year’s TIG conference about how we were holding off on further intranet development while waiting to see how Google implements its JotSpot-based wiki application, now known as Google Sites.

Well, people, we now know what Google Sites is all about and we love it! For the last several years we had been using MediaWiki as the publication platform for our intranet, but we are in the process of replacing it for our internal wiki needs. We are about halfway through that process, which should be completed shortly after the first of the year.

One big bonus of moving our intranet content to Google Sites is that it is quasi-tailor made to work with both the GSA and Google Analytics. I say “quasi” because the interactions between them are good but hardly optimal at the moment. For example, only days ago Google Analytics for Google Apps was rolled out, but the quality of the data we are getting so far is not so easy to get a handle on. More importantly, Google promises GSA integration with Google Sites, but it is still a buggy implementation. We have easily targeted test site pages within our domain’s Google Sites, but have hit a wall with getting the GSA to properly return search results on the indexed content within files uploaded to Google Sites. Turns out we are one of several organizations that have identified this problem and Google Enterprise support assures it will have a fix with its next software upgrade, in about a month or two. We (and our GSA consultant) are confident this will work in due time, but it’s one of those details we have to wait on for now.

3. Updating our public web content

Over the last 10 years, LSNC has placed an enormous amount of its advocate content out on the public Web. But one recent example is the California Food Stamp Guide, a prime example of public content that our advocates can search at that site, but would want to be able to search directly via our GSA shared portal. It is also one example of a content cluster that can be part of or its own GSA collection. (“Collections are logical views of information in the index, as defined by URL patterns. This allows you, for example, to index the entire contents of your intranet, but then divide it up into logical groups of content.”)

Implementation of The Findability Project has prompted some public housekeeping. Our target testing of our public content, predictably, reveals that we have stuff out there that is, well… past its shelf-life, shall we say. So we are working on a systematic way to thoroughly review and clean up that public content. It is obvious but important: Current and correct public content means better search results via the GSA. (Apologies to the larger legal services community for not doing it sooner.)

4. Targeting our case management system

We consider our Pika case management system a key, long-term GSA target. But we are not there yet. We have prioritized getting all the other targeted content organized and in position, with clear protocols in place. We also are busy reworking our shared portal, which will integrate the GSA search functions and provide users with (hopefully) intuitive ways to filter their search results, search select content collections, and provide the users with some nice Google GSA touches like OneBox searches, among other features.

That all said, being able to target our case management system is a total no-brainer and perhaps the most practical of necessities. In a given day, there is likely nothing more common or more vital to our work for clients than the search for information within our case management system. The native search functions built into the current version 3.07 of Pika are good. But we are optimistic that we can exploit the GSA to make those searches even better. And certainly more integrated with everything else in our new enterprise search universe.

  • Monday
  • October 27
  • 2008

How we built our GSA XSLT stylesheet with 100% external CSS

One of the more confounding challenges for those new to the Google Search Appliance (GSA) — at least for the XSLT-inexperienced such as ourselves — is figuring out how to build a search-results page such that one has 100% external stylesheet control. To be sure, the GSA includes a very handy “page layout helper” that provides interactive dialogs for creating a custom GSA “frontend” output page and/or search result page. One can use the page layout helper to incorporate your own custom header and footer markup, as well as add or exclude or modify a limited set of default GSA output page elements. Essentially, what the GSA native page layout helper does is provide a simple way to customize the underlying XSLT stylesheet without having to code directly in XSLT.

And as far as it goes, the GSA page layout helper is pretty handy. But what it is not so handy at is providing a way, without coding the XSLT directly, to give you complete, 100% external CSS control of how the search results display. True, you can add an external stylesheet link to the header markup you pop into the editor, but you are still stuck with all the native, embedded and inline styles that the GSA, by default, adds to the search results themselves. As those familiar with the order of importance in the cascade know well, inline styles trump both embedded and external styles, and the GSA’s embedded styles can in turn override the rules in your external stylesheet. What’s a CSS coder who is not XSLT-savvy to do in this situation?

Here’s what we’ve done, in six (mostly) easy steps:

1. Build a design mockup of your search results page

The first thing we did is build a static XHTML page, with conventionally linked external CSS stylesheets, so that we had in-hand the markup we would fold in later to the GSA XSLT stylesheet. How you do this is entirely up to you. Here is a screenshot of our search-result page design. In this design, all the page elements at the top with a black background and to the right in the side-bar are what we will later add to the XSLT as part of the so-called “header,” described below. The markup that comprises the simple, light-gray footer at the very bottom will go into the “footer” section of the same XSLT. The “lorem ipsum” section of the markup in this mockup is just a placeholder, to show where the GSA search results will eventually appear.

2. “Process” your header and footer XHTML through the GSA page layout helper

This is akin to the “preparation” stage of cooking a recipe. What you need to do at this step is use the GSA page layout helper to convert the XHTML “header” and “footer” markup you created in the prior step, into a format that will play nice with the custom version of your GSA XSLT stylesheet described below.

To do that, log in to your GSA admin panel, select “Serving” and then create a new GSA frontend. Click on the “Edit” link for the new frontend and then select the “Output Format” tab to display the Page Layout Helper. Click on “Global Attributes” and paste into the “Header” field all the XHTML code from the very top of your design mockup page (starting with the DOCTYPE) down to where the “header” markup ends. Do the same with your “footer” XHTML to the very end of the page markup (including the closing “body” and “html” tags) by pasting it into the “Footer” field. Click the “Save Page Layout Code” button, which prompts the GSA Page Layout Helper to process your markup and add it to the raw XSLT code.

You need to put your newly processed “header” and “footer” code to the side temporarily, for a later stage in this recipe. To do that, click on the “Edit underlying XSLT code” link and then copy the raw XSLT code into your code editor of choice. Search for the line of code that begins <xsl:template name="my_page_header"> and you’ll see that your header markup has been converted for use in the XSLT stylesheet. Copy everything between the opening and closing “my_page_header” template tags. Ditto, for the code between the “my_page_footer” template tags. Set these two code excerpts aside for a few minutes while you work on the next few steps. (Don’t worry about the default GSA XSLT stylesheet you just created, since it is going to be replaced, as described below.)

[Update: See the comments below, about the importance of assuring the header and footer XHTML "processed" via the GSA page layout helper are enclosed in each case by xsl:text tags.]

3. Download the Google Code open-source version of the GSA XSLT stylesheet

Download the open-source Google Code GSA XHTML Stylesheet. The advantage of this open-source XSLT stylesheet is that it offers a web-standards compliant version that generates well-formed, valid markup, something that natively generated Google pages are infamous for not doing. This open-source version of the GSA XSLT also makes it easier to modify the XSLT so that you can generate search results without any embedded or inline styles, and therefore subject completely to an external stylesheet. (A tip-of-the hat here to our GSA consultant, Michael Cizmar and his trusty sidekick Igor Taran for giving us the heads-up on this XSLT option.)

4. Edit the open source GSA XSLT stylesheet to turn off embedded and inline styles

Only two simple sets of edits are required to give you the external CSS style controls you want. The first is to edit the open source GSA XSLT stylesheet at line 67 to turn off embedded styles, by turning on external CSS styles; and at line 68 to provide a pointer to the specific directory (but not the CSS stylesheet itself) where your external CSS stylesheet resides.

Without modification, these two lines look like this:

<xsl:variable name="style_include">0</xsl:variable>
<xsl:variable name="style_include_prefix"></xsl:variable>

Modified for this example, the same two lines would look something like this, with the “style_include” set to true and the “style_include_prefix” set to the path of the directory that contains the external CSS stylesheet:

<xsl:variable name="style_include">1</xsl:variable>
<xsl:variable name="style_include_prefix">https://yourdomain.com/css/</xsl:variable>

The second edit is an optional edit. Sort of. At line 507 you’ll see this reference:

<link href="{$style_include_prefix}search.css" rel="stylesheet" type="text/css" media="screen,print"/>

Line 507 refers to the specific external CSS “screen” stylesheet that line 68 points toward. Actually, you can simply use that same “search.css” name for your external stylesheet, or change it to another name. It’s your call.
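To see how the two pieces fit together, here is a minimal, standalone sketch. It is not an excerpt from the open source stylesheet, just an illustration that assumes the same variable name and the placeholder yourdomain.com path used above. The curly braces in the href are an XSLT attribute value template, so the prefix and the file name are simply joined end to end:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: a tiny stylesheet showing how the prefix
     variable and the CSS file name combine into one href. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Directory only; placeholder domain, note the trailing slash -->
  <xsl:variable name="style_include_prefix">https://yourdomain.com/css/</xsl:variable>

  <xsl:template match="/">
    <head>
      <!-- Resolves to href="https://yourdomain.com/css/search.css" -->
      <link href="{$style_include_prefix}search.css" rel="stylesheet"
            type="text/css" media="screen,print"/>
    </head>
  </xsl:template>
</xsl:stylesheet>
```

Because the two values are concatenated as-is, leaving the trailing slash off the prefix would produce a broken path ending in “csssearch.css”.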

5. Edit the open source GSA XSLT stylesheet to add your custom “header” and “footer” template markup

Remember the processed XHTML “header” and “footer” template code you set aside, above? Here’s where you use it. In the same open-source GSA XSLT stylesheet, go to line 244 and add your processed “header” XHTML code and at line 248 add your processed “footer” XHTML code. The former goes between the “my_page_header” template tags:

<xsl:template name="my_page_header">
    <!-- add your processed xhtml here -->
</xsl:template>

And the latter goes between the “my_page_footer” template tags:

<xsl:template name="my_page_footer">
    <!-- add your processed xhtml here -->
</xsl:template>
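As the update note in step 2 mentions, the processed markup needs to sit inside xsl:text tags. Here is a hedged sketch of what the filled-in templates might look like. The div ids are hypothetical placeholders, the surrounding stylesheet is a stripped-down stand-in for the real one, and the output of your own Page Layout Helper may escape the markup rather than use CDATA as shown here:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: hypothetical header/footer markup. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template name="my_page_header">
    <!-- The header opens a wrapper div that only the footer closes, so it
         is not well-formed on its own and has to be carried as text -->
    <xsl:text disable-output-escaping="yes"><![CDATA[
<div id="wrapper">
  <div id="site-header">Search portal</div>
]]></xsl:text>
  </xsl:template>

  <xsl:template name="my_page_footer">
    <xsl:text disable-output-escaping="yes"><![CDATA[
  <div id="site-footer">Footer links</div>
</div>
]]></xsl:text>
  </xsl:template>

  <!-- Stand-in for the real stylesheet, which emits the search results
       between the header and footer -->
  <xsl:template match="/">
    <xsl:call-template name="my_page_header"/>
    <p>search results appear here</p>
    <xsl:call-template name="my_page_footer"/>
  </xsl:template>
</xsl:stylesheet>
```

The likely reason the xsl:text wrapper matters: header and footer markup each contain unbalanced tags when viewed alone, and an XSLT stylesheet must itself be well-formed XML, so the markup is passed through as literal text with output escaping disabled.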

6. Substitute your edited version of the same XSLT stylesheet for the default GSA XSLT stylesheet

Now go back to the Page Layout Helper. If the XSLT code is not open for your custom frontend, click the “Edit underlying XSLT code” link, paste your edited version of the open source GSA XSLT stylesheet into the editor and then click the “Save XSLT Code” button to save your changes.

You should be good to go!

Now you can go to your external CSS stylesheet and change how any and all elements of your search results display, without any interference from GSA native embedded or inline styles.

It worked for us. After using {display:none} in the external CSS stylesheet to selectively hide some of the page elements in the search results page, and then tweaking the styles to get the results to display the way we wanted, we ended up with a very nice, customized look to our search results.

  • Tuesday
  • October 7
  • 2008

"GSA Blunt-Instrument" Tour begins

We are still most definitely in “blunt-instrument” mode, which is to say we have our enterprise search platform built out, the local and centralized content servers are all talking nicely to each other and, most importantly, to the Google Search Appliance (GSA). But we have no refinements in place. Hence, it is at this stage just a blunt instrument for searching throughout our organization.

Even so, we want to actively draw folks into this grand adventure. Tomorrow, we are doing an on-site orientation and training for our Sacramento Office, a k a “The Mothership.” In almost all things tech here at LSNC, the Sacramento Office is almost always the guinea pig, the office where we first test things out. At that training, we are for the first time formally introducing folks to what is essentially an alpha version of a shared portal page that is pretty self-explanatory (hopefully). But even before the training, we have circulated to everyone in that office the URI to the portal and have encouraged them to play with its search functions and examples. And bring their sense of search wonder and befuddlement. Our motto for the training: “There are no wrong searches, only better ones.”

This test portal page includes some modest jQuery scripting so that if users click on “Show|Hide sample searches,” they will see a reveal with active links to all the examples used during the training (illustrated, at right), so they can easily reconstruct what was shown. Assuming all goes well, we will then promote the test portal to the rest of our core offices, concurrent with a rapid-fire series of LegalMeetings sessions to introduce-orient-train users on how basic organization-wide search works.

Wish us luck.

  • Tuesday
  • September 23
  • 2008

The Findability Project Taxonomy – Part One: The Theory

First, a recommendation. Get your hands on a copy of Information Architecture for the World Wide Web (also linked on the right, under “Biblio”) and read chapter 5 about “Organization Systems.”

Why? Well, let me put it to you this way.

We did a lot of homework and scoured a lot of books and, of course, talked to our GSA consultant on what is popularly (if imprecisely) referred to as “taxonomy.” You know, how should we organize all the “stuff” we want our users to be able to find? How hard is that?

As we canvassed widely to get an answer to that basic, practical question, we discovered you can get totally befuddled and sidetracked, not only by any number of levels of abstraction, for example, should you choose to wallow in construction of controlled vocabularies; but also by all too “inside-baseball” discussions by the taxonomy community; or, by yielding to the dark side and joining a formal organization for this sort of thing. Of course, there is also the emerging school of “social organization” of content referred to as folksonomy, more popularly known as tagging. And then there is the school of thought within some sectors of the search community that, after all is said and done, taxonomy may not be particularly useful for enterprise search design.

Needless to say, these initial forays into this subject prompted the thought bubble … “Just shoot me now.”

On this point, the GSA consultant was not as directly helpful as I thought he would be. The short story is that he was supportive of what we thought we needed, but at the end of the day he was essentially agnostic on this point, a view that mirrors Google’s online GSA resources. In discussing how to plan for a GSA implementation, Google says not much more on this point other than “analyze your business’s content and decide which directories and files you want indexed.” (In fairness to our GSA consultant — whose name, by the way, is Igor — you should be sure to read below, for his helpful guidance on simplifying the taxonomy we adopted, and the reasons for doing so.)

Which raises the question: how should we do that?

There are online articles that are straightforward and helpful in grasping, at a rudimentary level, the basics of information architecture, one recent example being Better Living Through Taxonomies, at Digital Web Magazine. But based on our experience, I recommend you pass Go and head straight for Peter Morville and Louis Rosenfeld’s Information Architecture for the World Wide Web, a book that is part of the IA canon, and deservedly so. It is a superbly clear-headed, well written overview of what information architecture is all about, and Chapter 5 on organization systems, specifically, is a model of how to explain a technical and complex subject like “taxonomy,” among other things, in plain, accessible language. And it will hit the mark on the main issues you need to think through to get “stuff” organized.

What are those practical issues? Indulge me a bit, since several of my observations here simply echo what I am recommending you read, but for LSNC we distilled our theoretical approach to taxonomy or organizing our content to these four basic precepts:

1. The directory structures need to be a hierarchical or “top-down” organization of simplified, familiar categories.

In the broadest sense of “organizing” things on a file server, and how that same “organization” is reflected in page menus or page navigation or dialog boxes, users need to know where they are and what the folders or subfolders mean. Lawyers, by training and practice, work in an especially pronounced hierarchical environment. (Can you say, “I, II-A, etc.”?) While the work environments of legal services programs are famously “anti-hierarchical,” the practical truth is that almost everyone in that environment organizes their work in some hierarchical fashion. (Certainly, there are exceptions.) Simply put, this is the most common way in which most people organize things, lawyers and non-lawyers alike.

2. Names for content folders, subfolders or categories need to be consistent with the shared vocabulary of your organization.

This may seem self-evident, but in practice may not be what users in your program do or are accustomed to. I actually took the time to look at the folder organization of about a dozen advocates in our Sacramento Office, and while there were predictable folder organizations (for example, organizing files by case or project or substantive area), much of the naming was ambiguous. While no doubt obvious to the advocate who created the directory or subdirectories, to others the same structure or organization may be too subjective, ambiguous or confusing to be useful to anyone other than the person who created it — and even possibly for him or her at some later time, when the subjective rationale for the organization has been long forgotten. So, when working out the naming conventions for folders and subfolders, it was important to focus on commonly understood, familiar shared vocabulary or terminology.

From the perspective of the GSA, the particular names, as such, of directory folders or subfolders are of no consequence. The GSA does not care what you call things, which explains the agnosticism of Google and our GSA consultant on this point. At the blunt-instrument level, all it cares about is the URL, the path to where the content resides. You deal with the Tower of Babel; that’s your problem. The GSA will ferret out the content wherever it resides, regardless.

To be detailed in the next post on this subject, LSNC has adopted the most conventional names for its directories it could come up with, including … I pause, for the pain it causes me to say this … the LSC substantive problem code categories, which comprise roughly half of the directories on our shared document repository. If one were organizing legal services practice today, I am confident it would be organized differently than how LSC organizes it. But roughly 40 years in, LSC still uses an extraordinarily unsubtle and somewhat uninformed organization of legal services practice. But it is what it is, and it is what field programs must use, and it is what users within those organizations know and understand, after decades of use. For better or worse, it is the “shared vocabulary” of our organization, and its use offers consistency with how other information and data is handled, most notably client case data.

3. “Lean toward a broad-and-shallow rather than narrow-and-deep hierarchy.”

That’s a quote from Morville and Rosenfeld’s book, and the observation is consistent with the advice our GSA consultant gave us. The consultant advised us not to go more than two levels down, and really pushed for only one level down. The rationale was two-fold. First, the more subfolders you have, the less likely users will locate or use content in those folders whenever they are navigating the directory structure, in whatever form it is viewed. From the user side, a deeper vertical hierarchy actually reduces findability.

Second, from the GSA side, deeper hierarchy does little or nothing to improve search results. While the search algorithms baked into the GSA exploit the URL path at the directory, subdirectory and sub-subdirectory levels to improve search results, having third or fourth or more levels does essentially nothing to improve those results. There’s no harm in doing so. It just doesn’t help you.

A counterpart to this issue is the importance of striking a balance. By going broad-and-shallow, one gets the practical advantage of being able to add content without the need for major restructuring. Assuming you have figured out a set of top-level directories that pretty much covers, in a broad sense, the content your users will want and need to search for, from there on out you can focus on adding content below that level, as warranted.

But if you go too broad, from the user side, things get more cumbersome and impractical. Think about it. Whether your users are advocates or office managers or volunteers, whatever, it is going to be more practical and useful if they can visually and cognitively grok the organization scheme. So it needs to be broad enough to cover the bases, but not so broad that it becomes incomprehensible.

Sure, we could have gone totally nuts with the taxonomy and, say, adopted the thousands-of-points-of-substantive-light offered by the well-intentioned but ill-fated National Subject Matter Index. (Don’t get me started.) We’re more practical. As detailed in the next article, LSNC is going with a simplified structure of 29 top-level directories, each going only one level deeper. Works for the users. And works for the GSA.
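If you want to check an existing file share against the broad-and-shallow rule, a rough audit is easy to script. The following is a minimal Python sketch, not something from our actual build: the function name is hypothetical, and the default of two folder levels below the root is an assumption you should adjust to your own rule.

```python
import os


def find_deep_dirs(root, max_depth=2):
    """Return directories under root that sit more than max_depth levels down.

    Depth counts folder levels below root: a top-level directory is depth 1,
    its subfolder is depth 2, and so on.
    """
    root = os.path.abspath(root)
    too_deep = []
    for current, dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else len(rel.split(os.sep))
        if depth > max_depth:
            too_deep.append(rel)
            dirnames[:] = []  # already flagged; no need to walk further down
    return sorted(too_deep)
```

Calling `find_deep_dirs(r"\\server\share")` (a hypothetical path) would list only the first folder on each branch that breaks the rule, which keeps the report short enough to act on.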

4. It’s not all about taxonomy.

Having a basic, practical, commonly shared taxonomy or organization structure is essential to a project like this. LSNC content needs to be located somewhere to be targeted by the GSA, and those who add or contribute or remove that content need to be able to comprehend what is where. The practical side of what that all means will make more sense in later articles about the document protocols we have come up with for LSNC users to locate and add content and how to add metadata to that content.

But having a traditional taxonomy is not the whole picture. There are other types of content you may want to target that don’t fit the taxonomic model: targeted database content (case management systems come to mind, but are not the only example); external site content (such as select public website content to which your organization has access or permission); and alternate content sites that you would want to target but over which you don’t have the same level of control (a current example would be domain-hosted Google Sites, a subset of Google Apps, which you can “organize” in a superficial way but which at the level that matters to the Google Search Appliance, not so much).

What this means for LSNC is that we are targeting the GSA at more than just a nominal taxonomy on our shared document repository.

  • Sunday
  • August 31
  • 2008

Selecting GSA targets – Part One: Four abstract targets

It is, of course, not enough to simply build an enterprise search platform. Sure, you can do what we did on day one, when our Google Search Appliance (GSA) arrived and we gleefully hooked it up to our local Sacramento Office network and did a global target of everything. You know, just to see if our GSA worked. It did. And in short order, as we blew out its one-million file crawl limit, we discovered the obvious: LSNC has a whole lot of documents and other files strewn about on various file servers and desktops, like so much digital flotsam. Needless to say, we did not need a TIG-funded GSA to reveal that fact. To know that, all one has to do is invoke Windows Explorer and peruse one’s local office file server. Enough said.

From the perspective of our enterprise search goals, most of these files do not contain content that has what we refer to as “shared value.” Namely, advocacy or other work-related content or information that LSNC staff would want to search for because they want it or need it to get the job done.

This observation does not suggest that all the other individual documents or files have no worth. They do, but for other purposes. For example, on a practical level, an advocate may have any number of drafts or versions of a document or file, but what the organization will want to target and what users will want to get their hands on is the final or more polished version of that content. And that is likely what the original author will intend to share.

But if the organization targets everything, well, in the broadest sense what those who search will get is a lot of extraneous or incorrect or incomplete content. And a less serious but real-world challenge is the organization’s need to separate the true wheat (even if marginal) from the inevitable digital chaff on local office file servers and desktops. (Oh, come on — you know what we’re talking about here! All those personal photos, MP3s, YouTube videos, recipes from the Food Network, National Geographic wallpapers, long forgotten software downloads, … need I go on?)

There is a separate set of challenges to initially identify existing content that one would want to target with a GSA that has, after all, a set file limit. And then one has to work out practical policies and protocols for how to handle new content to be added to those targets. In upcoming posts, we will document how LSNC has approached both of these challenges.
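To get a feel for how much of a file server is wheat and how much is chaff before aiming a crawler at it, a quick triage by file extension can help. This is a minimal Python sketch under stated assumptions: the list of "chaff" extensions is ours to invent here, the function name is hypothetical, and none of this is GSA configuration syntax.

```python
from collections import Counter
from pathlib import PurePath

# Assumption: extensions treated as "chaff"; adjust to local policy.
SKIP_EXTENSIONS = {".mp3", ".jpg", ".jpeg", ".png", ".gif", ".exe", ".msi", ".flv"}
CRAWL_LIMIT = 1_000_000  # the one-million file limit mentioned above


def triage(paths, skip=SKIP_EXTENSIONS):
    """Split file paths into crawl candidates and a tally of skipped extensions."""
    keep, skipped = [], Counter()
    for path in paths:
        ext = PurePath(path).suffix.lower()  # compare extensions case-insensitively
        if ext in skip:
            skipped[ext] += 1
        else:
            keep.append(path)
    return keep, skipped
```

Comparing `len(keep)` against `CRAWL_LIMIT` gives a first, crude read on whether a target will fit. Extension alone says nothing about drafts versus final versions, which is where the policies and protocols come in.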

But for now, here is a macro breakdown of what content we value and are initially targeting with the GSA. It is actually simpler to do than we initially thought it would be:

  • Designated document repository master directory structures – that’s a mouthful, but it turns out that’s how we refer to it. We have worked out what we consider to be a basic, workable “taxonomy” for organizing files, to be detailed in an upcoming post. The short version is that both existing and new content that has been identified as valued will reside on project-specific file servers that have purposefully organized directory structures. This will make more sense once we explain (fairly soon) why we are adopting the structures or organizations we have worked out, and how they will serve the overarching goal of “findability.” Stay tuned.
  • Shared intranet content – within LSNC, we refer to our intranet as the “secured network,” the lingua franca here for what other organizations refer to as their intranet. At this juncture, most legal services programs have some sort of intranet structure already in place, with varied user-side implementations to give staff access to its content. (Currently, ours is built out with MediaWiki as the principal content management tool, but it is soon to be supplanted by WordPress and/or Google Sites. I have posted details on that side story at LSNC’s tech blog, Webdogs 2.0.) By historical definition, everything on our existing intranet is valued. It’s fairly lean, mean, to the point, well organized and includes, among other things, in no particular order:
    • Administrative manual
    • Case management manual
    • Development and fund-raising resources
    • LSC policy archive
    • LSNC forms (administrative and case-related)
    • LSNC policy archive
    • MCLE – Training resources and forms
    • Personnel and other shared human resource information
    • Specialized Regional Counsel content (content subject to gatekeeper function)
    • Specialized client content (content targeted for LawHelp access)
  • Select LSNC public web content – LSNC is now reaping dramatic benefits from its decade-long focus on using its public web presence to create and share usable content for advocates. We are still in the process of parsing out those portions of the LSNC public content we want to target with the GSA, but these include our rich reservoir of advocate content on CalWorks (the name of California’s TANF program) and Food Stamps, and special project-specific content that derives from our Race Equity Project and housing and economic development work. The point here is that our enterprise search model will include not just valued content behind our firewall but also select public content that is every bit as valuable to our staff in getting the job done.
  • Pika Case Management System – this will likely be the last piece of the enterprise search puzzle for us, but a major chunk of our GSA file limit will be devoted to exploiting the GSA to alter dramatically how LSNC staff search and locate data within Pika. We have already run some initial targeting tests on Pika and we really, really liked what the search results looked like. It is not a technical challenge to target Pika with a GSA, not at all, but there are some significant challenges in sorting out how best to limit the GSA crawl to target precisely what we really want to make searchable, without blowing out our GSA file limit. Once we work out those kinks, we will likely replace the native Pika search functions (which are little more than a raw SQL search) with a customized subset of GSA functions.

In the scheme of this project, content is king, knowledge content rules, and the Google Search Appliance is Gandalf, the wizard asking “What do you see? Can you see anything?” Indeed.

  • Thursday
  • August 21
  • 2008

Basic tech specs for the Findability Project

We have completed the initial “blunt instrument” build of the hardware and software infrastructure for the Findability Project. Refinements will be detailed here later — and there are many that must and will be done — but here is a breakdown of our basic enterprise search platform:

  • Windows Server 2003 – this is the software platform backbone, so to speak. To be detailed later in a separate posting, all local LSNC office-specific network shared file servers have been built out using Windows Server 2003, which allows installation of SharePoint Server 2007, the open source Google SharePoint Connector and Windows Server 2003 Active Directory, described below. Windows Server 2003 is a robust, secure server which allows local and subdomain user authentication for multiple locations. It also provides centralized authentication for SharePoint access as part of the universal, shared domain login available to all LSNC offices.
  • Windows Server 2003-certified ASUS PM52-M motherboard – if you’re going to build a platform suited to working with Windows Server 2003, it is not just a matter of purchasing a server that nominally meets basic system requirements. You also need to ensure it is certified to do just that. It is not the only such option, but the ASUS PM52-M motherboard is so certified. There are numerous other Intel chipsets that are not. This is a distinction that matters.
  • SharePoint Server 2007 – this will be detailed in later posts as well, but the elevator pitch is that SharePoint Server 2007 has been installed on a single server in the LSNC Sacramento office (a k a “The Mothership”). SharePoint Server provides an array of features and functionality, some but not all of which will be exploited as part of the Findability Project through their integration with the Google Search Appliance (GSA). Among other things, SharePoint Server provides a web-based interface for sharing and managing domain-site content across a common network. (You know, like documents and files and web pages and stuff like that.) Out of the box, SharePoint also provides a capable, basic document management system for data indexing, creating collaborative sites, and adding metadata.
  • Google SharePoint Connector – available as an open source project at the Google Code site, the Google SharePoint Connector provides a seamless connection between the GSA and the Windows and SharePoint servers. This connection uses Active Directory authentication for managing content permissions as well as the interface for the GSA to access and crawl all the domain content.
  • Google Search Appliance (GSA) – think of the GSA as “He-Who-Must-Be-Named.” The Google Search Appliance is, quite literally, the brain, the cerebral cortex, the heart and the soul of the Findability Project. Some call it the “Godhead.” Others call it “Big Yellow.” Whatever you call it, it is Google search mojo and magic in a high-end computer with the capacity to provide enterprise-level search capabilities for crawling and indexing any targeted content within our organization as well as external sites. If the organization has permission to a site, the GSA can crawl and index it, purée and shake and bake it six ways to Sunday and then return search results in all sorts of different ways. (Obviously, we will be writing up more about all that, later.)
  • Microsoft Office 2003 – huh? Why are we not listing Microsoft Office 2007 here? We know this will come as a shocker to all, but it turns out that some of the key features in Microsoft SharePoint work better with the Microsoft Office 2003 suite than they do with the 2007 version. (Insert your own Microsoft joke here.) We expected issues with non-Microsoft apps (WordPerfect, Acrobat, and so on) but did not expect some of the hurdles we have encountered with Microsoft Office. Not to worry. We will get into the details and work-arounds in later posts, but the short version is that Microsoft Office 2003 allows all but universal and seamless integration of all the above technologies, using Active Directory authentication with the document management features available in SharePoint, most notably desktop application integration of features that enable users to add metadata.
  • Microsoft Internet Explorer 6.0 or later – IE is listed here because, at the system level, it is required for the type of enterprise search platform we’ve built. Not to worry. From the user side, everything works just fine with Firefox. But the network, system-level issue here is that IE 6.0 or later must be installed and configured on each workstation in a way that allows intranet and SharePoint integration with trusted domain-site functionality. The short version is that adding a network local site to trusted sites binds Active Directory authentication to the intranet web interface in SharePoint. Once properly configured, users can then rely on either IE or Firefox as their browser of choice. (We’ll explain it all later.)
  • Microsoft ISA Server 2006 – still with us? OK, one more piece: the Microsoft Internet Security and Acceleration (ISA) Server allows offsite access to content returned in search results by providing an Active Directory authenticated external web interface. (If you got this far, odds are you know what that means.)