Posts tagged: tfp

Thursday, July 8, 2010

Our final 15 minutes of Google fame

It was a pretty nice surprise for LSNC several months back to be asked by Google to present Advancing Knowledge Sharing with Google: The LSNC Story, with its focus on what we accomplished with The Findability Project.

Prior to but independent of that webinar, Google interviewed LSNC about The Findability Project and LSNC’s larger experience of integrating a Google Search Appliance with Google Apps and the Pika case management system. Google currently features its LSNC case study at its Google Enterprise customer solutions site. Sure, it’s a marketing stroke but, still, it’s great to be included.

Wednesday, January 27, 2010

Findability slides and video from 2010 TIG conference

I’m not sure what happened with the slides or recording of the Knowledge Management session at the recent 2010 TIG Conference. The session doesn’t show up in the LSC documentation of the event.

In any event, here’s a set of the slides for my “findability” segment, about search paradigms, findability as a concept and what we’ve done to implement enterprise search using a Google one-two punch: the Google Search Appliance in combo with the Google Apps platform. Also, here’s the brief Flash video of our portal front end and search result/filtering examples that I ran during the presentation but that displayed so poorly. The point of the video was to give the audience a real-world feel for how it all works. Again, my apologies for how badly the video displayed in that setting. Lesson learned.

Sunday, November 29, 2009

The Findability Project Archive

Within the next few days the domain-specific site for The Findability Project will disappear. With completion of the project we have moved all of its content over to this site into The Findability Project Archive.

We are currently rebuilding some of the Webdogs 2.0 site infrastructure, a piece of which is the tag cloud that normally would appear to the right but is MIA at the moment. But it will reappear soon and when it does you can simply click on the “tfp” tag to view all The Findability Project posts.

Monday, October 26, 2009

The Findability Project site goes dark

The domain-specific Findability Project site will go dark the first week of December 2009. The formal aspects of the TIG-funded project were completed months ago. We have posted a few items since then, but we are now, in a purposeful way, winding down the public aspects of the project.

The project content will endure but in a different location, here at Webdogs 2.0, the LSNC technology blog where we have long archived all of our public web development projects. The Findability Project is the latest, no doubt not the last, to find its archival home at Webdogs 2.0. For now, we have simply duplicated the site over to a subdirectory there. Eventually all TFP content will be integrated natively into the Webdogs 2.0 site.

From time to time, we will continue this conversation about search, enterprise search, and making organizational content findable, and therefore authentically usable, over at Webdogs 2.0.

Watch the skies, people. Or at least your search patterns. We do.

TFP, out.

Thursday, September 17, 2009

Revised: What the LSNC Shared Portal now looks like

We have now posted a further revised Jing video with audio providing a brief, 4-minute overview of the LSNC Shared Portal. This is the actual intro overview video we circulated internally to provide all staff with a basic visual and feature orientation, before our more extended, in-house live demos to be conducted next week.

It’s not so easy to do a public video demo of our new Pika 4.0 case management system design changes, because of confidentiality issues, but we will post select screenshots reasonably soon so you can get a visual idea of changes we have made to that application.

Monday, August 3, 2009

TIG final evaluation report for The Findability Project

For those interested, here is the recently approved TIG final evaluation report for The Findability Project.

This TIG project was funded for an 18-month period from January 2008 through June 2009. Much of the report will ring familiar to those who have followed the project here, since much of what has already been posted mirrors what would be required in a TIG evaluation report. Essentially, this public project site enabled us to give others in the legal services community an ongoing, if lagging, report of progress on the project, while at the same time considerably easing the process of writing up the evaluation report at the end of the project, since we had already written most of it as we went along.

We’re winding things down here, but we will continue to post here at least through the next TIG conference in early 2010. Among other things, we will be detailing how in finalized form we are integrating our project’s GSA test frontend functionality into a more expansive shared organization portal, part of our current deployment of a heavily customized version of Pika 4.0. We have finished the LSNC redesign of Pika 4.0 as well as a new LSNC shared portal “front door” (built on WordPress), both of which are scheduled to be in place and in use by LSNC staff the day after the Labor Day break.

Stay tuned, people!

Sunday, July 5, 2009

Getting Google-y with the enterprise

As a coda to the post yesterday about findability, the pervasiveness of the Google search paradigm, and what that means for the non-profit enterprise, I want to take a moment to put focus on a question during the session about an online post screenshot highlighted in one of the slides: “Why Enterprise Search Will Never Be Google-y.” I fear I did a poor job of answering the question about how it is that the author viewed Google enterprise search as different from other types of enterprise search. Mea culpa.

A couple of follow-up observations, to better respond:

As mentioned during the presentation, one point of the slide was to draw attention to The Noisy Channel, a very search geeky, characteristically Google-contrary, but always interesting, worthwhile blog helmed by Daniel Tunkelang, chief scientist at Endeca, a high-end direct competitor with Google in the enterprise market. Agree or not, there is a lot to learn about search from The Noisy Channel. It is one of my must-reads.

The title of Daniel Tunkelang’s highlighted post derives directly from Chris Sherman’s pithy, two-page online article with the same name, Why Enterprise Search Will Never Be Google-y (from the Enterprise Search Sourcebook 2008). The gist of Daniel’s post and Chris’ article that prompted it is this: The “simple search” or “known item” search we all commonly associate with Google (the noun and the verb) shortchanges what enterprise search can or should be for those who use it. The tension between these two enterprise search models is why I highlighted these two paragraphs from Daniel’s post:

The upshot? There is no question that Google is raising the bar for simple search in the enterprise. I wouldn’t recommend that anyone try to compete with the GSA on its turf.

But information needs in the enterprise go far beyond known-item search. What enterprises want when they ask for “enterprise search” is not just a search box, but an interactive tool that helps them (or their customers) work through the process of articulating and fulfilling their information needs, for tasks as diverse as customer segmentation, knowledge management, and e-discovery.

The irony here is that, contrary to the entertainingly provocative “never will be Google-y” in the title, for some market segments enterprise search is already Google-y. In some respects, Daniel’s post and Chris’ article both actually make the case for, not against, the Google enterprise model, which is to say that for some segments of the enterprise market Google and its search appliance may very well be the way to go. Our experience is that it is a particularly viable way for a non-profit legal services program.

Why do I say that? Even assuming arguendo that Google Search Appliance (GSA) improvements “should be seen in the context of state of the art,” for many organizations this state-of-the-art is a rarified and unobtainable reality. One has to wonder, after costing out a solution with one of the three major market leaders in enterprise search (Autonomy, Endeca and FAST), whether a Google box doesn’t look pretty damn good and pretty damn doable, given what it does. As Daniel himself observes, “I wouldn’t recommend that anyone try to compete with the GSA on its turf.” Is that turf a real solution for some market segments? While Chris invokes a clever if overstated “oil and water” metaphor about the differences between web and enterprise search, he follows it by suggesting the exact opposite: Some enterprise search segments are well served by the Google paradigm, notably including “intranet search” –

Many organizations are encouraging employees to communicate internally via blogs, or to participate in community-based knowledge repositories such as internal wikis. This is one area where there is a genuine parallel between enterprise information systems and web content, and Google excels at understanding and surfacing this type of content.

Tell me about it.

Saturday, July 4, 2009

Findability and the Google search paradigm

Following up on an NTAP presentation I gave last Thursday, Findability and the Google Search Paradigm: Integrating Search as an Organizational Solution, here is a publicly viewable set of the presentation slides, which are in a Google Docs presentation format and include embedded links to a lot of the material I discussed during the presentation. You can find the New York Times article I mentioned about Twitter as an example of “crowd-sourcing” at David Pogue’s post, The Twitter Experiment.

I painted with a broad brush during the presentation. The goal of the presentation was to offer the legal services community a broader view, and an emerging view, of what it means to search, to search on the enterprise, and to suggest what it means to Google search on the enterprise. These are just the slides. While I gave a brief live demonstration of how our GSA installation actually functions when generating and filtering search results, you’ll have to come to the upcoming 2010 LSC Technology Initiative Grants Conference to get a more expansive demonstration and technical explanation of our implementation, including a solution (hopefully) to the problems we’ve had with Pika CMS integration into our enterprise search solution.

As is my bad habit, I went long and so the discussion at slide 72 about the real and imagined obstacles to implementing enterprise search in a non-profit environment got short shrift, and for that I apologize. I promise to do a better job with those issues at the TIG conference. In our experience, getting our “stuff” organized and hammering out practices and protocols was a much larger time commitment on this project than the strictly technical stuff. And then there are the paralysis-against-progress problems that large organizations may experience because, in my view, they mistakenly think they have to have everything about taxonomy, vocabularies, folksonomies and metadata in place. For example, I have argued here, with our somewhat novel Google Search Appliance implementation in a non-profit environment, that we could do fine for now without relying significantly on metadata to make our project work. Others beg to differ.

In any event, I hope the presentation last Thursday was helpful. Let’s all talk again at TIG in January 2010.

Friday, June 5, 2009

A quick and dirty OneBox using PHP

Arguably the most common, if not first, Google Search Appliance (GSA) OneBox module that organizations implement is a module that returns personnel information or listings of some kind. It is one of the most obviously useful OneBox results one can come up with. As we ramped up to implement our version of it, we were surprised to discover that most publicly available examples or models for creation of OneBox modules rely on technologies (ASP and Java being among the most prevalent) that we do not use. We could not find an example of such a OneBox using simple PHP/MySQL.

Our goal was to build an easily replicable OneBox module that does work with PHP, which we do use. A lot. PHP is at the heart of the Pika CMS as well as our public websites built on WordPress.

Here’s an example of what our OneBox special query result looks like, with the first keyword “staff” being the trigger and the second keyword “ukiah”, the name of one of our local office locations. The query returns a OneBox result listing all the active staff in that office:

Clicking on the link for each person’s name triggers a new display with a photo of the person and his or her vitals. This module also works using the same initial trigger with a staff person’s particular name.

Most simply put, this OneBox module works by querying the MySQL database “users” table in the Pika CMS, the application used by all our active employees, across all positions, to record their time and work activity. More specifically, the module breaks down into five basic steps:

  • the OneBox module sends a query to a targeted PHP file
  • the PHP code runs a query against the targeted MySQL database
  • the PHP code then outputs the returned data as XML
  • the GSA reads that XML output
  • the GSA then formats that output for display as a search result
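The first three steps above can be sketched in a few lines. This is a hedged Python illustration, not our PHP file: it uses an in-memory SQLite table as a stand-in for the Pika MySQL “users” table, and the column names, the sample people, and the child elements under OneBoxResults (MODULE_RESULT, Title, Field) are assumptions made for illustration that should be checked against the GSA OneBox documentation before use.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_demo_db():
    """Build an in-memory stand-in for the case management "users" table.

    The columns and rows are hypothetical; the real table will differ.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (name TEXT, office TEXT, phone TEXT, active INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?, ?)",
        [
            ("Ada Example", "ukiah", "555-0100", 1),
            ("Ben Sample", "ukiah", "555-0101", 1),
            ("Cy Former", "ukiah", "555-0102", 0),
            ("Di Other", "chico", "555-0103", 1),
        ],
    )
    return conn


def staff_onebox_xml(conn, keyword):
    """Look up active staff by office or name and return the rows as XML."""
    rows = conn.execute(
        "SELECT name, office, phone FROM users "
        "WHERE active = 1 AND (office = ? OR name LIKE ?) ORDER BY name",
        (keyword.lower(), "%" + keyword + "%"),
    ).fetchall()

    # Element names below the root are assumed, not confirmed by this post.
    root = ET.Element("OneBoxResults")
    for name, office, phone in rows:
        result = ET.SubElement(root, "MODULE_RESULT")
        ET.SubElement(result, "Title").text = name
        for field_name, value in (("office", office), ("phone", phone)):
            field = ET.SubElement(result, "Field", name=field_name)
            field.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(staff_onebox_xml(make_demo_db(), "ukiah"))
```

Run against the demo table, the keyword “ukiah” yields one result element per active staff member in that office, and a person’s name works as the keyword as well. A real provider would serve this XML from the URL registered with the module, leaving the last two steps (reading and formatting) to the GSA.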

Within the GSA console, one creates a module by selecting OneBox Modules > Create Module Definition, selecting the Trigger (in our case, “staff”), and then identifying the Provider, which in this example is the PHP file we created and attached to the module as an External Provider, by inserting the URL to the PHP file.

You can download as a ZIP file the PHP code and related GSA stylesheet template used in this example.

The PHP file is annotated, but has select information edited or removed (host, passwords, etc.), for obvious reasons. Looking at the PHP code, in sequence the PHP submits the query, connects to the database, joins data from a combination of data tables in our case management system, then takes the results from the MySQL query and outputs them as XML, i.e., the “OneBoxResults” in the code.

Once the GSA has read the query results as XML, it can then publish the results to a OneBox Stylesheet Template, which one can edit by clicking on the Edit XSL link at the bottom of the console page for the particular module.

Sunday, May 17, 2009

How we organized our targeted Google Sites content

Since we’re on the subject of revisions and updates today, here’s another about how we finalized our Google Sites content.

As noted earlier, The Findability Project planned integration of select Google Sites content as a GSA target. How we created LSNC’s “official” intranet site with Google Sites was covered (briefly) as part of a recent NTAP presentation.

Since that presentation, we have pretty much completed the migration of all our intranet content over to what LSNC calls its “Shared Private Network” (SPN). For those curious, here is a screenshot of the current site’s home page; and here’s a screenshot of the top levels of the sitemap. As you can see, we have worked to keep the hierarchy simple, which means manageable, especially given the number of different folks who have responsibility to maintain its content. Also, we have created a large number of Google Sites file cabinet “upload” pages to make management of those files easier, for the same reasons. So far, so good.

What is great about all this is that the GSA easily targets this selected Google Site, and returns great results from the site. Users can have it both ways, by searching from the GSA frontend but with equal ease from the native search function within the Google Site itself. It’s all good.

Sunday, May 17, 2009

TFP Taxonomy – Part Four: Revisions to the project's structural taxonomy

We’ve made minor revisions to the project’s structural taxonomy described earlier. With only slight changes in wording, we’ve retained the same basic 29 top-level project directories but we have more significantly, although not dramatically, revised the second-level subdirectories so that they conform a bit better to how most of our advocates organize and think of substantive categories in our line of work. Here they are:

To recap, our original thinking was to keep the structural taxonomy sufficiently broad (horizontal), to be reasonably inclusive of the content categorizations in common use by a legal services program, and purposefully shallow in depth (vertical), to offer modest granulation so as to keep the structural organization and navigation simple and practical.

We struck a balance between using all ten of the very familiar LSC legal problem categories as top-level directory names, and adding additional categories to address obvious gaps. The ten LSC legal categorizations are definitely part of the shared, commonly understood “vocabulary” of the organization. But we added another 19 top-level directories that are consistent with the broader range of topics and tasks at play in our work environment. For example, we have “Housing,” yes, but LSNC does a huge amount of work in “Land Use” and related issues (e.g., housing element, inclusionary zoning, etc.), so we added that category and related sub-categories to the structure. (The existing LSC “Other Housing” subcategory just doesn’t cut it. Land use is not a catch-all category for us, if you get the drift.)

There are several categories in this revised structural taxonomy that reflect this shift in our thinking. A good example is under the LSC “Income Maintenance” category, where we retained the basic LSC sub-categories but added new ones for “Child Care,” “General Assistance” and “Refugee Cash Assistance.” We also tweaked the wording of many of the sub-categories to correspond more accurately to how users here refer to things, for example, by changing “Unemployment Compensation” to “Unemployment Insurance.” Another example is where we retained the LSC category for “Individual Rights,” but concluded that the LSC sub-categories are somewhat muddled, so we created a different if still simple subset. We also dropped some of the LSC sub-categories that have little or no anticipated use. (Really, you do a lot of “name changes” in your program?) We then simplified the directory and subdirectory names by eliminating the redundant references to “LSC Code,” eliminated the LSC problem code numbers, and dropped the cumbersome “Not_*” labeling also used with some LSC problem code names.

Basic housecleaning stuff.

Sunday, April 26, 2009

Google Apps, SharePoint and this project

At the outset, let it be acknowledged that SharePoint is a great product. For good reason, many in the legal services community have either adopted or are at least seriously looking at SharePoint as a core component of their network infrastructure. A notable example of this trend from earlier this year is Tom Winter’s video collection of SharePoint Resources for Legal Aid. Impressive.

That said, observant followers of The Findability Project may have noticed our chronic inattention to, and now outright de-emphasis of, SharePoint. There’s a reason. Actually, several reasons.

When we submitted our TIG proposal in 2007, we proposed SharePoint as a key component of the technical specifications for this project. Once we received the grant in 2008, that is exactly how we proceeded as we put together our so-called blunt-instrument build. At the time, we put in place an open-source Google SharePoint connector that plays nicely with the Google Search Appliance (GSA). (We have documented how we configured the SharePoint side of things; we will eventually document how the Google connector configurations work.)

From the get-go we recognized the basic promise of SharePoint, i.e., it offers an array of enterprise platform options for creating and maintaining organizational portals and managing content. All stuff we wanted as we built out our project, moved toward positioning our content in very purposeful ways, and worked out optimal ways for our organization to communicate, share and find content. True, we were less sanguine about SharePoint’s enterprise search features. Not because it is not effective. It is. But we had greater confidence in the algorithms and effectiveness of Google enterprise search, which natively works with most everything Google, and SharePoint does not. But we will put that tribal view aside, for the moment. We give SharePoint its due: Impressive.

That was late 2007, early 2008. This is now, a little more than a year later. What happened in the interim? Google Apps happened … way more, way better Google Apps including an increasingly impressive array of collaboration features … including domain Google Sites … integration of Google Analytics into Google Apps … and then at the end of 2008 some serious happy with the version 5.2 update for the Google Search Appliance, which now integrates with Google Apps, including Google Sites.

Way impressive.

Even though we had SharePoint in place and could have built out our intranet using it, we all but immediately and instinctively moved on to Google Sites once it became available to us in 2008 and, in short order, built things out that way. (See Google Apps Redux for more about how LSNC currently uses Google Apps, including Google Sites.) It is not that SharePoint is not useful to accomplish many of the same things. It is. But at what cost and at what loss in usability?

For a modestly sized non-profit like ours (about 130 employees and two actual IT staff, not wannabes), the Google Apps platform has proven to be a phenomenal, secure, essentially zero-cost, zero-maintenance way to have access to pretty much all the basic collaborative and communication technologies now deemed baselines for the legal services community. (Oh, yeah, the baselines happened in 2008, also.)

And all this stuff works very nicely with the Google Search Appliance. SharePoint, not so much.

Wednesday, April 1, 2009

More CSS code for open source GSA XHTML stylesheet

In an earlier post a few weeks back we updated our list of class and id selectors generated in GSA search results markup when using the Google Code open source GSA XHTML stylesheet. What we hadn’t noticed before at that Google Code site was a subpage with the nine different corresponding stylesheets used to style the display examples at the project site. You can download them for use as examples of how to do CSS coding of screen, print and handheld media. These optional CSS files include some for targeting style presentation of GSA search results in IE and IE7.

Hidden in plain sight, and akin to what we did with our list, the Google Code site also provides an annotated list of all the Classes and IDs generated by the open source GSA XHTML stylesheet. It is organized differently than ours, which is organized by the three principal vertical areas of a search result page, while the Google Code annotated list is organized by “XSL template/description.” The Google Code annotated list also would need curly brackets added to each selector before use as an actual CSS file, but it is exhaustively comprehensive, which ours is not.
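Adding the curly brackets to a plain list of selectors is a small scripting job. Here is a minimal Python sketch, assuming a list with one selector at the start of each line and optional annotation text after it. The sample selector names are made up for illustration and are not taken from the Google Code list, whose exact layout may call for different parsing.

```python
def selectors_to_css(lines):
    """Return CSS text with an empty rule block for each class or id selector."""
    rules = []
    seen = set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # blank line
        selector = parts[0]
        # Keep only class (.) and id (#) selectors; skip headings, notes
        # and any selector that has already been emitted.
        if selector[0] not in ".#" or selector in seen:
            continue
        seen.add(selector)
        rules.append(selector + " {\n}\n")
    return "\n".join(rules)


# Hypothetical input in the assumed "selector, then annotation" layout.
sample = [
    "#search-form   top-of-page form wrapper",
    ".result-title  link for each result",
    "",
    "Search results section",
    ".result-title  (listed twice)",
]

if __name__ == "__main__":
    print(selectors_to_css(sample))
```

The output is a skeleton stylesheet with one empty rule per unique selector, ready to be filled in with declarations.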

Sunday, March 22, 2009

Selecting GSA targets – Part Three: Quantification, Revision and Finalization

There is a great deal of work proceeding behind the scenes and several key project elements are converging as we move toward finalizing this public project. Among other things, we have been working through modest but practical solutions for better placement and targeting of our existing 300,000+ repository documents, while solidifying all the additional Google Search Appliance (GSA) targets in our enterprise search sights, described in Part Two. At the same time, we are in our own March (and April and May) Madness as we mount a rapid-fire round of trainings for each of our eight remote offices (spread out over 50,000 square miles of Northern California) on their new role in making Google Enterprise search work for them, which is to say for all of us. And as mentioned in earlier posts, we are working every bit as earnestly on our latest in-house build of the Pika CMS.

The infusion of new, future content into the simplified structural taxonomy we created is a separate challenge we will be posting about later. Dealing with our existing files is more immediate, more concrete. Grokking those files has been one of the more interesting, at times hilarious parts of this project.

For those legal services programs interested in how the existing files from our eight local offices break out, here are the percentages for the seven most common “document” types:

  • 67% – WordPerfect (WPD)
  • 18% – Word (DOC)
  • 9% – Portable Document Format (PDF)
  • 3% – Excel (XLS)
  • 1% – Text (TXT)
  • 0.9% – Rich Text Format (RTF)
  • 0.6% – PowerPoint (PPT)

(Discretion prohibits us from detailing the other file flotsam discovered on local office servers. That said, allow us to observe that some within our organization have extraordinarily good taste in photos taken by National Geographic, and not such good taste in music.)
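For programs that want to run the same kind of count on their own servers, a file-type breakdown like the one listed above can be tallied with a short script. This is a minimal Python sketch and not the method we used: it simply groups files by extension, and the demo paths are invented for illustration.

```python
import os
from collections import Counter


def extension_breakdown(paths):
    """Return {EXTENSION: percent of all given files}, most common first."""
    counts = Counter(
        os.path.splitext(p)[1].lstrip(".").upper() or "(none)" for p in paths
    )
    total = sum(counts.values())
    return {ext: round(100.0 * n / total, 1) for ext, n in counts.most_common()}


def walk_files(root):
    """Yield the path of every file found under the directory root."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            yield os.path.join(folder, name)


# Invented sample paths; in practice pass walk_files("/path/to/share").
demo = ["a/brief.wpd", "a/memo.WPD", "b/letter.doc", "b/form.pdf"]

if __name__ == "__main__":
    print(extension_breakdown(demo))
```

Counting by extension treats “.wpd” and “.WPD” as the same type but will not catch files saved without an extension or with the wrong one, so the percentages are only as good as the naming habits on the server.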

We are totally on track for targeting most of our planned GSA targets: The existing office archive files listed above have long been targeted (although we still have a lot of work left to fit them into our structural taxonomy); over the last several months we have worked very hard to refresh and update (and remove, as warranted) the targeted content at LSNC’s various public websites; and we are very pleased with the quality of the GSA results we are getting out of Google Sites.

This is all good news. In addition, we are putting in place a few more content channels: Targeting the content in our organization’s seven private Google discussion groups, and a program-wide canvass for select hard-copy training resource materials for digital conversion and addition to the shared repository. It’s all good.

We have had one major disappointment: We discovered that there are significant, unanticipated technical challenges unique to the Pika CMS that thus far have prevented effective use of the GSA to target Pika content. The problem is not the GSA itself or configuring the GSA to target Pika. The GSA by design performs wholly benign, non-destructive crawls as it indexes targeted records. We did a huge amount of target testing and SERP evaluation, and we were very pleased — actually, thrilled is a better word — with the results we were getting from Pika. The unanticipated problem is that the current version of Pika is not well optimized for use as an enterprise search target. There are code anomalies in Pika that, among other things, cause it to auto-generate new case intakes and case records when it is crawled by the GSA.

After an assessment by Pika Software, it is now apparent it will take something in the neighborhood of 200 hours of work to make Pika more receptive, shall we say, to an external crawl by the GSA. (Ouch!) So for now, we have to put that part of the project to the side. File under: Lessons Learned.

Friday, March 6, 2009

Updated GSA open source CSS

Related to an earlier post today, we have also updated our annotated GSA open-source-CSS list of class and id selectors generated in GSA search results markup when using the Google Code open source GSA XHTML stylesheet. This CSS list is a bit better organized than our earlier version and includes about a half dozen additions. It is organized into the three basic portions of a GSA search result page: the form and navigation elements at the top; the search results themselves; and the portions of the page below the search results.

Hope this is as helpful to others as we have found it to be. And please do comment if we have missed anything.

Friday, March 6, 2009

Version 05 of the TFP search result page

We now have in place version 05 of our GSA test frontend. The linked screenshot shows a basic search for the keywords “immigrant eligibility,” with 1090 results in varied file types returned from various GSA target locations. It looks very similar to an earlier version we posted, but we have made changes to advance several purposes. We have simplified the overall page design to make it less distracting, so that the eye is drawn more directly to the search results themselves. As part of the design refresh, we dialed back in a major way on some color and font choices we made in our earlier design that proved distracting for users. We also got user feedback that told us folks here like knowing and being reminded that these are “Google” results, so we’ve gone retro in a deliberate way to make the search results look more like classic Google and Google Sites search results, while standardizing link color with a nice retro blue, relying on bold and normal rather than alternate colors for link emphasis and de-emphasis.

On the right side of the search result page, there are now filters for narrowing the displayed results by file type or by special collections. For example, here is the same result filtered for only PowerPoint files, returning 10 total results, with only four displayed because the Google algorithm discerns the others are likely duplicates.

So, what’s with the window dressing at the top with links to Pika CMS, Gmail and so on? Those are unadorned placeholders for other things we expect to integrate into the user interface as we finalize the GSA frontend. The goal is to integrate GSA-based search and its results as part of a shared portal or point of entry to Pika CMS, the PHP-based case management system we use, and the array of Google Apps and other web-based applications most commonly used at Legal Services of Northern California.

As if we don’t have enough to do already… but at the same time we have been working on this GSA project, we have been updating our existing, heavily customized Pika 3.04 installation to Pika 4.0. We have completed those code changes to Pika, but are a few months away from rebuilding all the Pika template structures and folding in a new visual design we are adopting. Ultimately, the GSA search form will be added into what is now the Pika home page, and the GSA search results will morph into a visual design that conforms to our new Pika design.

Thursday, February 26, 2009

Comparing Google Sites and GSA search results with release 5.2 in place

All went well with the GSA version 5.2 update. The update itself is a humongous 1.53 GB ISO file that, once burned to a DVD disc and loaded, took about 6 hours to install. As recommended, we did a complete crawl refresh which, in our case, took another 72 hours. Other than this considerable but necessary time investment, we had no real problems with the update process.

As mentioned in an earlier post, the principal attraction of this most recent GSA update was the integration of Google Apps, which enables targeting of domain-hosted Google Docs and Google Sites. In that regard we are pleased to report no problema, as well.

In version GSA 5.2 the administrator now sees a menu option for “Google Apps Integration” with a single field for enabling or disabling one’s Google Apps domain as a GSA target:

With Google Apps targeted generally, it is then a matter of constructing URL patterns to include or exclude more specifically what you want targeted within your Google Apps. In our case, that meant our selection of specific Google Sites now serving as our organization’s intranet content platform. More specifically, our search goal was to have the GSA index not just pages within those Google Sites but, just as importantly, files uploaded to those Google Sites.

There are differences in how search results display between those performed from within Google Sites and those from a GSA frontend. If a search is done from within Sites, it will find and return a search result for keywords or phrases within an uploaded file, but not display the context of the keywords or phrase. For example, using the search law school+"reimburse me" one gets this specific PDF search result from within Google Sites:

The same search done from our test GSA frontend, which returns results from everything targeted by our GSA, yields the same search result while showing the keywords and phrase in context:

So, the basic differences in how search results display are these:

An internal Google Site search will find and return results based on keywords and/or phrase within a file uploaded to Google Sites, display the filetype as an icon (in the above example, with a PDF icon), display the link using the file name, but not display the keywords or phrases in context.

In contrast, the GSA search result will find and return the same result but display the keywords and/or phrase in context, display the filetype as an acronym (e.g., “PDF”), and display the link as what the algorithm discerns as the document’s title (in this example, “Law School Loan Reimbursement Request Form”).

  • Friday
  • January 23
  • 2009

Why we like GSA release 5.2: Google Apps integration

A few weeks ago Google Enterprise issued Release 5.2, the latest software update for model GB-1001, the one we are using on this project.

There are a slew of new features in Release 5.2, but there are a couple that are making for some serious happy on this project. The most significant is that, with the update, the GSA now integrates with Google Apps. For those interested, the Google Code site has a detailed explanation of how that works.

For us this is huge. Our modest non-profit organization two years ago adopted Google Apps as a basic building block for a functional, practical, web-based enterprise environment, something we never really had before. (Hey, there are intranets and then there are intranets.) The Google Sites and Google Docs pieces of the no-cost non-profit Google Apps service are a big part of that. And as part of this project, we have moved pretty much all of our existing intranet content over to Google Sites, and use of shared Google Docs throughout the organization is increasing steadily. (Use of the forms features in Google Docs is especially popular among our office managers.)

Before the Release 5.2 update, we had made valiant stabs at getting the GSA to index our Google Sites content, but with muddled success; and with uploaded files, at best it would only return results with keywords that showed up in uploaded file names, not the file content. Now the GSA integrates directly and we can target any “public” Google Sites or Google Docs content we share with others within our domain. (There is some ambiguity in how Google describes GSA integration about so-called “public Google Sites and Google Docs.” To clarify the point, in this context “public” is a Google term-of-art. If you create a Google Site or Google Docs within your domain’s Google Apps and share it with everyone within your domain, then it is “public.” It is not necessary to make that content public to the world.)

The other Release 5.2 feature we are especially excited about is its enhanced advanced search reporting. Now we have a built-in tool that enables us to analyze user search behavior, with reports that “list every query and click made by every user,” plus whether users are finding what they search for within three clicks, or not at all, and which part of the search interface the users, uh, actually use. Aces!

One caveat we are aware of from GSA groups discussions: Release 5.2 is a significant update with features that may warrant a serious review of one’s existing XSLT modifications, to exploit new GSA feature sets. And we have been advised to do a complete crawl refresh. We’ll report back here how it all goes.

  • Thursday
  • December 18
  • 2008

Presumptive Shareability

After the first of the year, we’ll be cranking up as we complete porting of our existing target documents into our new taxonomic organization, resolve some filtering and usability touches we want to integrate into our default GSA front end, primp and polish the layout and presentation of the front end, implement a few basic OneBox modules, and set in motion what we’re now referring to as the “Rolling Thunder Roadshow” to all our eight office locations.

The RTR will be our way to recognize and promote among all our staff the changes in how documents and other files are made easily and intuitively findable, and given a new level of access and usability throughout our non-profit organization. After all, that is the core purpose of enterprise search. And a key element of all this is changing deeply rooted individual notions or assumptions about what can or should be “shareable.”

In working on this project within a non-profit environment, we have learned that most employees have an inclination to undershare, not overshare. Not because they are selfish or secretive; rather, because the type of transparent sharing that enterprise search makes possible is foreign to most of them. It is familiar to them to be asked to provide a document to others on request in person, by phone or by email. It is foreign to them to decide in advance that a document they created or have received from someone else should be transparent to the rest of the entire organization. The concepts of creation and possession are severed from the concept of findability.

To be sure, the increasing use of collaborative web-based document tools within our organization — principally our adoption two years ago of Google Apps — has helped us on this journey. Most staff at this point are familiar with the concept, if not the practice in their individual work, of creating or editing or uploading documents that can be “shared” from a common web location. They get that, even if they don’t do it themselves, because increasingly others demand they do so… when they get a “share” message email from Google Docs about a document someone created or edited there; when they get an email with a link to something someone else posted in our domain’s Google Sites; or when they get a message to fill out a Google Docs form for, well, whatever.

As we prepare for the RTR, the team working on this project have brainstormed about what we can say or demonstrate to the staff in each office, to prompt them to rethink (OK, in some cases just think) what types of documents should be shared with others by adding them to the new document repositories.

We now refer to this as “presumptive shareability.” In particular situations, it may not be appropriate to make the document or file transparent through enterprise search, but in most cases it will be, because these are all situations where the document or file has served a shareable purpose, i.e., use by more than one person or re-use by one or more persons.

Among the situations we think should trigger staff to think to add the document or file in question to the shared repository are the following:

  • An attachment to an email message you send or forward to someone else.
  • You request or receive a file as an email attachment from someone within the organization.
  • You receive a non-confidential file attachment from someone outside the organization.
  • Every time you re-use a document or form as part of your work.
  • You learn that the PowerPoint (or other presentation format) for a training or conference event you attended is now available for viewing or downloading.
  • You lug home substantive hard-copy handouts distributed from a training or conference.
  • Can you say, “presentation” and/or “portable”? Whatever it is, if it is a PDF or PPT file it is presumptively shareable.
  • If it is the “final” version of a case-related pleading, memorandum, exhibit or correspondence and you think others may find it usable, share it.
  • Usable documents you discover and think to save to your desktop as part of research on the Web, regardless of file type (PDF, DOC, XLS, etc.)
  • Similarly, when doing work-related research on the Web, anytime you think to bookmark a web page or save the page to your desktop as an HTML or TXT file.
  • … you get the drift.

Shareability promotes findability. That’s our story and we’re stickin’ to it.

  • Tuesday
  • December 9
  • 2008

GSA virtual edition + VMware = experience enterprise search

This post is primarily targeted at my IT allies within the legal services community and other non-profit organizations following this project, who may be interested in trying out the Google Search Appliance but aren’t prepared to make the hardware investment as yet, even for a more modest Google Mini.

Here’s the deal: Now there’s a software alternative to get the full-on GSA experience for free. Sort of. Pretty much. At least up to 50,000 documents.

Early last month Google announced the availability of the fully functional (albeit limited to 50,000 documents) Google Search Appliance virtual edition. The GSA virtual edition is “a developer platform designed for the enterprise development community to build and test applications that use the Google Search Appliance. The Google Search Appliance virtual edition is for non-commercial, development purposes only. Developers can simply download the software-only product, try out the new features and explore the various programmatic interfaces supported by the Google Search Appliance.”

To exploit this opportunity, you need a server that is supported by VMware virtualization. Here are the rudimentary technical FAQs about the virtual edition. The Google Code site does include a page about installing the virtual edition and access to basic GSA documentation, but does not include the extensive additional online documentation that comes with the fully licensed GSA hardware or any direct support.

  • Monday
  • December 8
  • 2008

TFP search portal ~ version 04

We debated internally how best to roll out to LSNC staff the various incremental changes to our project “search portal” and continuing modifications to our GSA-generated search results page. Given enough IT resources, we could have held off debuting enterprise search to staff until we could unload it all at one fell swoop. The practical challenge is that we have modest IT resources for a modest-sized (less than 150 employees) non-profit organization, cannot give undivided attention to any one technology project, and the project was geared to roll out over an 18-month period. And we have six more months to go. In any event, since typically we cannot present everything in a finished state, we have to roll things out incrementally. Fortunately, the internal culture and camaraderie at LSNC supports this practice. Demonstrating incremental progress pays dividends here, and staff seem to appreciate seeing any usable technological progress we can offer them, even if incomplete.

At this juncture, we are on our fourth version of the GSA frontend for The Findability Project. The version 04 search portal page is an incremental, intentional step toward a shared portal page, the access point of entry for all staff that eventually will be integrated with the core web applications essential to our work, including the Pika CMS, various Google Apps, an organizational news feed, and so on. We’re not there, but within the next few months we should be able to close the deal on a basic, first-generation solution for our organization.

As part of all this, this is what our version 04 search result page looks like, with our recent addition of file-type filters that can be applied to GSA generated search results. Upcoming will be our addition of select filtering by collections, a basic OneBox search result prototype, tweaks à la related-queries (the GSA equivalent of synonyms) and GSA keymatch, and more.

  • Thursday
  • December 4
  • 2008

The search box as a findability (design) concept

Fair to say that without a “search box” there is no enterprise search? That being true, consider Designing The Holy Search Box: Examples And Best Practices, yet another interesting design compilation/distillation article from Smashing Magazine. True, this is not The Big Wroblewski (the form abides, dude), but it’s a pretty good read on what to think about when designing, labeling and positioning a basic search form.

  • Thursday
  • December 4
  • 2008

CSS selectors for GSA open source XHTML stylesheet

Recently we described how we built our GSA XSLT stylesheet to give us external CSS control over our search result page. A key element of that build was to rely on the open-source Google Code GSA XHTML Stylesheet, which is both way more web standards compliant and also makes it way easier to modify the XSLT so that you can generate search results without any embedded or inline styles.

The CSS styling or presentational characteristics of a search result page are, of course, personal and particular to the designer who codes the CSS. But to make that process at least a touch easier, we’ve created a GSA open-source-CSS list of class and id selectors based on attributes generated in GSA search results markup when specifically using the open-source Google Code GSA XHTML Stylesheet. (This list of selectors will not work with the default GSA XSLT stylesheet.)

What we’ve done is carefully comb through the search result page markup, list all the CLASS and ID attributes in their order of page flow, annotate each with a description and then put them into a reusable CSS stylesheet that can be adapted for use in new GSA frontends, as you will. Here’s an example:


Fairly straightforward stuff, but how many times over do you want to “view source” in your browser or invoke Firebug to recall the particular selector and attribute for, say, the definition list markup used in GSA search results? Seriously, unless you did it about 10 minutes ago, can you recall what the dd p.st span.rc a.f selector refers to on the search result page? We didn’t think so.

Here’s the full GSA open source CSS selector stylesheet.

Hope this is helpful for those using the open source GSA stylesheet.

  • Wednesday
  • November 19
  • 2008

Rough-cut evaluation of our GSA test-bed installations

As part of the current reporting cycle for our LSC TIG milestones, the primary funding source for this project, we did three things to evaluate our TFP test-bed installations in Sacramento and Chico. First, we talked to staff in both offices to hear them out about how the basic orientation to enterprise search and the temporary search portal went for them. We’ve already written up some about that here.

Second, in Sacramento we also took the time to observe some staff actually using the temporary search portal to see what their experience was actually like. Why? Well, not only is there simply no substitute for actually talking to people about an application you are striving to evaluate, more importantly, there is no substitute for observing them while using it. Jakob Nielsen aptly sums up the differences in this regard as the contrast between “what people say” and “what people do.”

Third, we know that Bristow Hardin at LSC is going to want data because he has a Ph.D. in social science (from my alma mater; go Slugs!) and data is his life. So we also used SurveyMonkey.com to conduct a survey with an intentionally light touch to get some basic data, to see how users in our two test-bed locations were dealing with this whole enterprise (pun intended).

We put the survey up for one week and got a modest 17 respondents from the two offices. The survey had 18 questions and participation dropped off from 17 to 11 respondents from the beginning to the end of the survey. Among the results, in the order asked:

  • 86% knew how to locate the TFP search portal; 14% did not.
  • Asked their reaction to the elements on the temporary search portal page, they said they were “organized” (33%), “well organized” (42%) or “superbly organized” (17%); no one reported the portal page as anything less than organized.
  • 92% were confident they understood what the elements on the search portal are and why they are there.
  • When asked what they would want added to a search portal, only one person made a suggestion: “Can you add something so I can sort by file type”. (We have already added this filter feature to our next iteration of the GSA search results page.)
  • When asked about locating the file servers where shared documents reside, 100% of 12 respondents said they knew how to locate the local file server directories; 11 of 12 said they also knew how to locate the program-wide shared file server.
  • 100% knew how to locate their personal user directory and the user directories of others on the shared file server in the local office. (Okay, not a surprising result but we thought we should ask. You never know.)
  • Asked how well they understood how the directory structures are organized on the program-wide shared document repository, one person out of 12 respondents said they understood some but not all of the directory names. All other respondents said they “get it” (50%); the organization “makes sense but I would do some of them differently” (17%); or the directory structure “makes perfect sense” (25%).
  • Asked whether they are familiar with how to search using Google, 100% selected either “yes” (83%) or “are you serious?” (17%). (We have yet to encounter anyone within our organization who does not know what Google is or how to use it for search. Hey, but again, we thought we should ask.)
  • We also asked whether the use of the TFP search portal was similar or dissimilar to using Google on the Web. Of 11 respondents, two thought it was “nothing like Google” and two others thought it was “something like Google.” The remainder thought it was “way Googley” (56%) or “exactly like Google” (9%).
  • A key question, of course, is whether there is proof in the search pudding. Again, of 11 people who responded, one person felt searching was “not successful at all” and two were only “partially successful.” Of the remainder, four reported their searches were “successful” (37%); two reported their searches were “very successful” (18%); and two felt their searches were “dead-on successful” (18%), with an overall success rate of 73%.
  • Finally, we asked staff whether they understood the search results. One person said “not at all.” Two said they “only partially understand” the search results. The rest said they either “do understand the search results” (9%); have a “very good understanding” of the results (55%); or “absolutely understand” the search results (9%), again an overall rate of 73% for those “understanding” the search results.

That’s the rough-cut survey, without any actual polish of our installation. We will be doing other surveys once we get everything in place, including the newly tweaked shared portal, and complete our roadshow to assure everyone knows how to exploit the bells and whistles we will be adding in over the next few months. Stay tuned.

[Update: In response to a question I got after posting this, there are 31 total staff in our local Sacramento and Chico offices. So, about 55% (17/31) of the staff participated in the survey and 35% (11/31) finished it, for a completion rate of 65% (11/17).]
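For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the rates from the counts reported above. The function and variable names are ours, made up for illustration; the counts come straight from the post.

```python
def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


total_staff = 31  # staff in the Sacramento and Chico offices
started = 17      # respondents at the beginning of the survey
finished = 11     # respondents at the end of the survey

participation = pct(started, total_staff)   # share of staff who took part
finish_share = pct(finished, total_staff)   # share of staff who finished
completion = pct(finished, started)         # finishers among those who started

# Search success: 4 "successful" + 2 "very successful" + 2 "dead-on successful"
success = pct(4 + 2 + 2, finished)

print(participation, finish_share, completion, success)
```

The four values correspond to the 55%, 35%, 65% and 73% figures quoted in the post.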

  • Wednesday
  • November 19
  • 2008

Advocate search anecdotes

We are at the last stages of our GSA “blunt instrument” mode. Within a few weeks, at the outside, we should have a spiffier, notably more user-friendly shared search portal in place for all LSNC staffers. By January, we should (hopefully) have a few select Google Sites properly targeted and the basic beginnings of a nice set of user-friendly file-type and collections filters in place. With those in place, we then plan on rolling out a formal roadshow to all our local branch offices to train everyone on the new shared search portal and the new search enhancements, and to work directly with the offices to get targeted files in place and properly organized on our shared document repositories.

Before getting to that point, to be sure, we have had some modest hesitation about unleashing the GSA on our staffers in its current, incomplete state of implementation. But the staff experiences with our temporary shared portal have been very positive, as detailed in the earlier post, Talk about zero learning curve. (And see the next post for the data results of our “test-bed” evaluations in our Sacramento and Chico offices.)

My instinctive temperament is to remain skeptical, even when faced with some measure of success. The day after we did the “blunt instrument” training, one of the younger, hot-shot lawyers stopped by my office and raved about how effective the GSA was at finding stuff. (It is sooooo hackneyed for any of us to emote this way, but at least a couple of times in the conversation he exclaimed, “This is sooooo awesome.”) Okay… but this is also an advocate who is only two doors down from me and the two IT staff I supervise, so he’s always picking up the positive, can-do, tech vibe that radiates out from the paid IT staff. (Remember, we’re a 140+ employee but still seriously tech-understaffed legal aid program.) He saw some of the stuff we were doing with the GSA even before we showed it to anyone else, so he was already on board with the project, big time. It works for him. But what about everyone else?

This real-world example from today tells me it works for them, too:

On the other side of the building, one of our stalwart paralegals was working on a client’s case involving an alleged overpayment in the CalWORKs cash-aid program. [CalWORKs is California's name for the TANF program.] He needed to quickly find a good example of a request for administrative rehearing to review the alleged overpayment. The paralegal went to the temporary search portal and, quite sensibly, typed in the search box the following, without quotes: “request for rehearing overpayment”.

He got about 500 search results and found a good, usable example on the first page of results. He also made this observation to me in a subsequent email: “I used several of the documents in the search. I did not have to download them. I just used the text version. Which is very useful and quick.”

Tell me about it.

  • Wednesday
  • November 12
  • 2008

Converting hard-copy documents for addition to the shared repository

A late October post at the Official Google Blog entitled A picture of a thousand words? prompts me to draw attention to an analogous TFP document protocol we worked out a few months ago. It is worth highlighting because it is so practical and will be an invaluable source of additional knowledge content targeted by our GSA.

But first, the Google post: Read it and you’ll discover that “In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document, so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format.” (As lawyers are so fond of saying, emphasis added.) As the post illustrates by example, do a Google search for repairing aluminum wiring and at the top you’ll see a PDF listed. If you download the PDF and open it, you’ll discover it is an image of a text document. The downloaded file is itself not text searchable. But click View as HTML for that same result and you’ll discover that the text is actually indexed and searchable via Google.

Essentially, we are doing the same thing within our own enterprise search ecosystem, but with an added advantage. Not only have we adopted a document handling protocol for using our networked printers/scanners to convert select hard-copy text documents to PDF image files, we also process the resulting PDF images through Adobe Acrobat’s native “OCR Text Recognition” tool, and then save each file with some basic metadata added.

Once added to the shared document repository, the scanned and OCR’d text document is then fully indexed and searchable by the GSA. And when the user finds and downloads the file, it is fully text searchable itself when opened in Adobe Acrobat or Adobe Reader. One better than what Google itself now does, superbly.

  • Monday
  • November 10
  • 2008

Selecting GSA targets – Part Two: The Practical Realities

In an earlier post about selecting Google Search Appliance (GSA) targets for this project, the narrative definitely edged toward the more abstract. We highlighted four principal sets of GSA targets: files on our newly created “shared document repositories”; repurposed intranet content being moved over from a MediaWiki installation on an old server; cherry-picked content available on LSNC’s varied public websites (LSNC maintains 13 distinct public websites and subsites); and select records in Pika CMS, the secured, web-based case management system used by all our advocates.

As far as it goes, this abstract list of GSA target sets fairly summarizes what we, as an organization, want to make transparent via enterprise search, which is to say make “findable” in ways not practicable without the GSA. This abstract list of GSA targets, however, fails to convey what we have done at a non-abstract, practical level to make those targets useful to our larger search goals.

So, let me hit a few notes about several practical decisions we’ve made at launch as we target the GSA at real files offering real search results.

As described in Part One, when we first unpacked our GSA and aimed it, uh, somewhat aimlessly at any and every file on one of our local servers, the GSA did its job in killer fashion… and blew out our file limit. While one can proceed that way, we were always mindful that we had to sort out how to organize and structure the shared content that we seriously wanted to make searchable and findable. So, one of the first tasks we confronted on this project was to work out our thinking about “taxonomy,” resulting in the basic directory structures we have adopted.

That taxonomic “organization” step was essential to this project, but completing that particular project objective doesn’t translate directly into searchable content organized in a particular way. You see, there is this pesky little detail: Real people need to actually identify the existing and/or newly created files to be included and then somehow get the files in the directories on the shared document repository that are the target of the GSA.

Easier said than done.

In our case — particularly given the limited IT and support staff resources available to us as a typical legal services field program — we had to come up with some practical approaches to move existing files from any number of different locations to the designated shared locations or document repositories. (I will discuss how we handle adding newly created files, in a later post.) Here’s what we did with our existing files to fold them into the content targeted by our GSA:

1. Initially, include all existing “staff-specific” content, with an opt-out

We did find ourselves on the receiving end of a lot of staff enthusiasm about this project. Truly. But it is impractical and unrealistic to expect your individual legal services advocates and other staff to comb through all their thousands of files and then move them over to a different file server location. (Maybe it should be realistic to expect them to do this, but in our experience it just ain’t gonna happen. No way.) But there are tons of content gold in them thar files, so we had to figure out a way to initially get all that good stuff in place, even if not parsed out in a taxonomic sense, so we could target it.

To accomplish this, we first vetted with, and got buy-in from, all our local offices to do the following:

On the local project file server for each local office, we created a special project “archive” directory. Then each local office manager copied each individual staff member’s files wholesale over to a user-specific directory in this so-called archive directory. Having an unequivocal “opt-out” option was important to the success of this approach. Again and again, in formal meetings and informal discussions, we reminded office staff that they could ask that any or all files be removed from these initial archive-file targets. No questions asked. There were a few such requests, but not a lot: One advocate asked that her files not be targeted at all, so we removed all her files; two others had fewer than a dozen files they wanted removed as targets, so we did so. No biggie.

The net effect is that this makes the targeted advocate files initially non-taxonomic, but in short order you have a huge repository that has a (allow me to exaggerate here, for literary effect) 99% chance of including pretty much everything the individual staff members would add if they “woulda, coulda, shoulda,” so to speak. In our case, this initially amounts to about 300,000 document files, the vast majority of which are advocate-generated files.

At launch, this does mean that these office-specific, bulk compilations of existing files added as targets include a significant number of drafts and duplicates that one would normally not include as a shared file if it were being added as a newly created file. For example, within our office culture it is not only common but actually expected that advocates not work in isolation on major cases. (We discourage the “lone eagle” model.) So, our early search-results testing shows that often the same file shows up in more than one target location because more than one advocate has a copy of the file in their archive.

It bears mentioning two other factors we kept in mind as part of this initial targeting of shared files: We double- and triple-checked with all management staff to assure nothing management-sensitive or -confidential was moved to a location where it could be targeted. Also, before we moved anything over wholesale, as described, we asked all staff to remove certain types of files that no one would reasonably expect to be part of the searchable content. Examples: Family photos, MP3 music downloads, YouTube videos, yada yada yada. Enough said.
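For IT folks who would rather script a bulk copy like this than do it by hand, here is a minimal sketch of the idea. It is not what our office managers actually ran; the opt-out name, the skipped file types and the directory layout (one subdirectory per user) are all assumptions for illustration.

```python
import shutil
from pathlib import Path

# Hypothetical values, for illustration only.
OPTED_OUT = {"jdoe"}                      # staff who asked not to be targeted
SKIP_SUFFIXES = {".jpg", ".mp3", ".mov"}  # photos, music, videos


def copy_to_archive(users_root: Path, archive_root: Path) -> int:
    """Copy each user's files into archive_root/<user>/ and return the count copied.

    Assumes archive_root is not located inside users_root.
    """
    copied = 0
    for user_dir in sorted(p for p in users_root.iterdir() if p.is_dir()):
        if user_dir.name in OPTED_OUT:
            continue  # honor the opt-out: none of this user's files are copied
        for src in user_dir.rglob("*"):
            if not src.is_file() or src.suffix.lower() in SKIP_SUFFIXES:
                continue  # skip directories and excluded file types
            dest = archive_root / user_dir.name / src.relative_to(user_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy, not move: originals stay put
            copied += 1
    return copied
```

A call such as `copy_to_archive(Path("/srv/users"), Path("/srv/tfp-archive"))` (both paths hypothetical) would mirror each user's directory tree under the archive, minus the skipped types.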

We do have an approach in mind for “peeling off” these office-specific archives over time, to separate out the drafts and duplicates and place them within our taxonomic directory structures. More about that later.
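In the meantime, one common way to spot exact duplicates across archive directories is to group files by a hash of their contents. The sketch below is a generic technique, not necessarily the approach we have in mind, and it only catches byte-for-byte copies, not drafts that differ by a few edits.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for a sketch.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the groups that actually contain more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists every location of one duplicated file, which is the situation described above where two advocates each hold a copy of the same pleading.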

2. Using Google Sites as the platform for our existing intranet content

I recall having a passing conversation with Gabrielle Hammond at last year’s TIG conference about how we were holding off on further intranet development while waiting to see how Google implements its JotSpot-based wiki application, now known as Google Sites.

Well, people, we now know what Google Sites is all about and we love it! For the last several years we had been using MediaWiki as the publication platform for our intranet, but we are in the process of replacing it for our internal wiki needs. We are about half way through that process, which should be completed shortly after the first of the year.

One big bonus of moving our intranet content to Google Sites is that it is quasi-tailor made to work with both the GSA and Google Analytics. I say “quasi” because the interactions between them are good but hardly optimal at the moment. For example, only days ago Google Analytics for Google Apps was rolled out, but the quality of the data we are getting so far is not so easy to get a handle on. More importantly, Google promises GSA integration with Google Sites, but it is still a buggy implementation. We have easily targeted test site pages within our domain’s Google Sites, but have hit a wall with getting the GSA to properly return search results on the indexed content within files uploaded to Google Sites. Turns out we are one of several organizations that have identified this problem and Google Enterprise support assures us it will have a fix with its next software upgrade, in about a month or two. We (and our GSA consultant) are confident this will work in due time, but it’s one of those details we have to wait on for now.

3. Updating our public web content

Over the last 10 years, LSNC has placed an enormous amount of its advocate content out on the public Web. Just one recent example is the California Food Stamp Guide: public content that our advocates can search at that site, but would want to be able to search directly via our GSA shared portal. It is also an example of a content cluster that can be part of a larger GSA collection or a collection of its own. (“Collections are logical views of information in the index, as defined by URL patterns. This allows you, for example, to index the entire contents of your intranet, but then divide it up into logical groups of content.”)
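
To make the idea of a collection a bit more concrete, here is a minimal Python sketch of sorting URLs into named collections by URL pattern. It is only an illustration: the collection names, the example.org URLs and the use of shell-style wildcards are our assumptions, and the GSA has its own pattern syntax that this does not try to reproduce.

```python
from fnmatch import fnmatchcase

# Hypothetical collection definitions: name -> list of URL patterns.
# Shell-style globs stand in here for the GSA's own URL-pattern syntax.
COLLECTIONS = {
    "food_stamp_guide": ["http://www.example.org/foodstamps/*"],
    "intranet": ["http://intranet.example.org/*"],
}


def collections_for(url, collections=COLLECTIONS):
    """Return the sorted names of every collection whose patterns match the URL."""
    return sorted(
        name
        for name, patterns in collections.items()
        if any(fnmatchcase(url, pattern) for pattern in patterns)
    )


if __name__ == "__main__":
    for url in (
        "http://www.example.org/foodstamps/eligibility.html",
        "http://intranet.example.org/wiki/Main_Page",
        "http://www.example.org/about.html",
    ):
        print(url, "->", collections_for(url))
```

The point of the sketch is that a collection is just a view: the same index holds everything, and the patterns decide which slice a given search looks at.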

Implementation of The Findability Project has prompted some public housekeeping. Our target testing of our public content, predictably, reveals that we have stuff out there that is, well… past its shelf-life, shall we say. So we are working on a systematic way to thoroughly review and clean up that public content. It is obvious but important: Current and correct public content means better search results via the GSA. (Apologies to the larger legal services community for not doing it sooner.)

4. Targeting our case management system

We consider our Pika case management system a key, long-term GSA target. But we are not there yet. We have prioritized getting all the other targeted content organized and in position, with clear protocols in place. We also are busy reworking our shared portal, which will integrate the GSA search functions, give users (hopefully) intuitive ways to filter their search results and search select content collections, and add some nice Google GSA touches like OneBox searches, among other features.

That all said, being able to target our case management system is a total no-brainer and perhaps the most practical of necessities. In a given day, there is likely nothing more common or more vital to our work for clients than the search for information within our case management system. The native search functions built into the current version 3.07 of Pika are good. But we are optimistic that we can exploit the GSA to make those searches even better. And certainly more integrated with everything else in our new enterprise search universe.

  • Monday
  • October 27
  • 2008

How we built our GSA XSLT stylesheet with 100% external CSS

One of the more confounding challenges for those new to the Google Search Appliance (GSA) — at least for the XSLT-inexperienced such as ourselves — is figuring out how to build a search-results page such that one has 100% external stylesheet control. To be sure, the GSA includes a very handy “page layout helper” that provides interactive dialogs for creating a custom GSA “frontend” output page and/or search result page. One can use the page layout helper to incorporate your own custom header and footer markup, as well as add or exclude or modify a limited set of default GSA output page elements. Essentially, what the GSA native page layout helper does is provide a simple way to customize the underlying XSLT stylesheet without having to code directly in XSLT.

And as far as it goes, the GSA page layout helper is pretty handy. But what it is not so handy at is providing a way, without coding the XSLT directly, to give you complete, 100% external CSS control of how the search results display. True, you can add an external stylesheet link to the header markup you pop into the editor, but you are still stuck with all the native, embedded and inline styles that the GSA, by default, adds to the search results themselves. As those familiar with the order of importance in the cascade know well, inline styles trump both embedded and external styles, and embedded styles in turn trump external rules of equal specificity whenever they come later in the page. What’s a CSS coder who is not XSLT-savvy to do in this situation?

Here’s what we’ve done, in six (mostly) easy steps:

1. Build a design mockup of your search results page

The first thing we did is build a static XHTML page, with conventionally linked external CSS stylesheets, so that we had in-hand the markup we would fold in later to the GSA XSLT stylesheet. How you do this is entirely up to you. Here is a screenshot of our search-result page design. In this design, all the page elements at the top with a black background and to the right in the side-bar are what we will later add to the XSLT as part of the so-called “header,” described below. The markup that comprises the simple, light-gray footer at the very bottom will go into the “footer” section of the same XSLT. The “lorem ipsum” section of the markup in this mockup is just a placeholder, to show where the GSA search results will eventually appear.

2. “Process” your header and footer XHTML through the GSA page layout helper

This is akin to the “preparation” stage of cooking a recipe. What you need to do at this step is use the GSA page layout helper to convert the XHTML “header” and “footer” markup you created in the prior step into a format that will play nice with the custom version of your GSA XSLT stylesheet described below.

To do that, log in to your GSA admin panel, select “Serving” and then create a new GSA frontend. Click on the “Edit” link for the new frontend and then select the “Output Format” tab to display the Page Layout Helper. Click on “Global Attributes” and paste into the “Header” field all the XHTML code from the very top of your design mockup page (starting with the DOCTYPE) down to where the “header” markup ends. Do the same with your “footer” XHTML to the very end of the page markup (including the closing “body” and “html” tags) by pasting it into the “Footer” field. Click the “Save Page Layout Code” button, which prompts the GSA Page Layout Helper to process your markup and add it to the raw XSLT code.

You need to put your newly processed “header” and “footer” code to the side temporarily, for a later stage in this recipe. To do that, click on the “Edit underlying XSLT code” link and then copy the raw XSLT code into your code editor of choice. Search for the line of code that opens the “my_page_header” template and you’ll see that your header markup has been converted for use in the XSLT stylesheet. Copy everything between the opening and closing “my_page_header” template tags. Ditto, for the code between the “my_page_footer” template tags. Set these two code excerpts aside for a few minutes while you work on the next few steps. (Don’t worry about the default GSA XSLT stylesheet you just created, since it is going to be replaced, as described below.)

[Update: See the comments below, about the importance of assuring the header and footer XHTML "processed" via the GSA page layout helper are enclosed in each case by xsl:text tags.]
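
Why the xsl:text wrapper matters: the “header” markup opens the “html” and “body” elements that only the “footer” closes, so neither piece is well-formed XML on its own and neither can sit in the XSLT as literal markup. Here is a rough Python sketch of the kind of conversion involved. It assumes the processed form is escaped markup inside an xsl:text element with disable-output-escaping="yes"; that is our reading of the update above, not a documented description of what the page layout helper does, and the function name and sample header are made up for illustration.

```python
from xml.sax.saxutils import escape


def wrap_for_xslt(markup):
    """Escape raw header/footer markup and wrap it in xsl:text.

    Assumption: the XSLT processor writes the text back out unescaped
    because of disable-output-escaping="yes", so the browser ends up
    with the original markup.
    """
    return (
        '<xsl:text disable-output-escaping="yes">'
        + escape(markup)
        + "</xsl:text>"
    )


if __name__ == "__main__":
    # Hypothetical, shortened header: note that html and body are left open.
    header = '<html>\n<body>\n<div id="header">LSNC</div>'
    print(wrap_for_xslt(header))
```

If your processed header or footer is missing that wrapper, the stylesheet itself will not parse, which is worth checking before you move on.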

3. Download the Google Code open-source version of the GSA XSLT stylesheet

Download the open-source Google Code GSA XHTML Stylesheet. The advantage of this open-source XSLT stylesheet is that it offers a web-standards compliant version that generates well-formed, valid markup, something that natively generated Google pages are infamous for not doing. This open-source version of the GSA XSLT also makes it easier to modify the XSLT so that you can generate search results without any embedded or inline styles, and therefore subject completely to an external stylesheet. (A tip-of-the hat here to our GSA consultant, Michael Cizmar and his trusty sidekick Igor Taran for giving us the heads-up on this XSLT option.)

4. Edit the open source GSA XSLT stylesheet to turn off embedded and inline styles

Only two simple sets of edits are required to give you the external CSS style controls you want. The first is to edit the open source GSA XSLT stylesheet at line 67 to turn off embedded styles, by turning on external CSS styles; and at line 68 to provide a pointer to the specific directory (but not the CSS stylesheet itself) where your external CSS stylesheet resides.

Without modification, these two lines look like this:

<xsl:variable name="style_include">0</xsl:variable>
<xsl:variable name="style_include_prefix"></xsl:variable>

Modified for this example, the same two lines would look something like this, with the “style_include” set to true and the “style_include_prefix” set to the path of the external CSS stylesheet:

<xsl:variable name="style_include">1</xsl:variable>
<xsl:variable name="style_include_prefix">https://yourdomain.com/css/</xsl:variable>

The second edit is an optional edit. Sort of. At line 507 you’ll see this reference:

<link href="{$style_include_prefix}search.css" rel="stylesheet" type="text/css" media="screen,print"/>

Line 507 refers to the specific external CSS “screen” stylesheet that line 68 points toward. Actually, you can simply use that same “search.css” name for your external stylesheet, or change it to another name. It’s your call.
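
Line numbers can shift from one release of the open source stylesheet to the next, so if you would rather script these edits than make them by hand, one option is to find the variables by name. What follows is a minimal Python sketch under that assumption. The helper name is ours, the regular expression only handles variables written the way they appear above, and you would still want to eyeball the result.

```python
import re


def set_xsl_variable(xslt_source, name, value):
    """Replace the text content of <xsl:variable name="NAME">...</xsl:variable>.

    Raises ValueError unless the variable is found exactly once, so a
    stylesheet laid out differently fails loudly instead of silently.
    """
    pattern = re.compile(
        r'(<xsl:variable\s+name="' + re.escape(name) + r'"\s*>)(.*?)(</xsl:variable>)',
        re.DOTALL,
    )
    # A function replacement avoids any backslash surprises in the new value.
    new_source, count = pattern.subn(
        lambda match: match.group(1) + value + match.group(3), xslt_source
    )
    if count != 1:
        raise ValueError(f"expected exactly one {name!r} variable, found {count}")
    return new_source


if __name__ == "__main__":
    original = (
        '<xsl:variable name="style_include">0</xsl:variable>\n'
        '<xsl:variable name="style_include_prefix"></xsl:variable>\n'
    )
    edited = set_xsl_variable(original, "style_include", "1")
    edited = set_xsl_variable(
        edited, "style_include_prefix", "https://yourdomain.com/css/"
    )
    print(edited)
```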

5. Edit the open source GSA XSLT stylesheet to add your custom “header” and “footer” template markup

Remember the processed XHTML “header” and “footer” template code you set aside, above? Here’s where you use it. In the same open-source GSA XSLT stylesheet, go to line 244 and add your processed “header” XHTML code and at line 248 add your processed “footer” XHTML code. The former goes between the “my_page_header” template tags:

<xsl:template name="my_page_header">
    <!-- add your processed xhtml here -->
</xsl:template>

And the latter goes between the “my_page_footer” template tags:

<xsl:template name="my_page_footer">
    <!-- add your processed xhtml here -->
</xsl:template>

6. Substitute your edited version of the same XSLT stylesheet for the default GSA XSLT stylesheet

Now go back to the Page Layout Helper. If the XSLT code is not open for your custom frontend, click the “Edit underlying XSLT code” link, paste your edited version of the open source GSA XSLT stylesheet into the editor and then click the “Save XSLT Code” button to save your changes.

You should be good to go!

Now you can go to your external CSS stylesheet and change how any and all elements of your search results display, without any interference from GSA native embedded or inline styles.

It worked for us. After using {display:none} in the external CSS stylesheet to selectively hide some of the page elements in the search results page, and then tweaking the styles to get the results to display the way we wanted, we ended up with a very nice, customized look to our search results.

  • Saturday
  • October 25
  • 2008

What Wordle says TFP is saying

A weighted word cloud of the current feed for The Findability Project, courtesy of the wonderful Wordle:


Search? Google? GSA? Go figure.

  • Thursday
  • October 23
  • 2008

Going Forward: Document “best practices” and protocols

In earlier posts I have shared memoranda distributed at a recent organization-wide meeting, including an explanation of our taxonomic structures and details of various file-naming conventions adopted for this project. Attached to this post are two additional memos:

As I so often like to say, allow me to explain:

In making practical decisions about handling files targeted by our Google Search Appliance (GSA), we look both backward and forward in time. This dichotomy between the past and the future is one that Google Enterprise itself promotes with its cursory recommendations that its customers decide for themselves where to locate existing content and new content.

Based on our experience working on this project, there are considerable differences in how to handle the “past.” A separate post on those issues will be coming forth, soon. But going forward into the “future,” we have thrashed out the practices and protocols detailed in the two memos linked, above. While there are institutional contexts for some things described in the memos that may be lost on those not part of our organization, the memos are (hopefully) self-explanatory. There are other practical details about the document protocols that will be expanded on in later posts, including how the Shared Repository web interface works, the integration of our metadata models, and so on. (All good things come to those who wait, at least with this project.)

Among the most practical observations in these memos, I think, is breaking through the common but incorrect perception that one needs to save a document to “the” correct directory, as opposed to “a” correct directory, however it is done. And while staff are instinctively bewildered, somewhat, by concepts of “taxonomy” and “metadata” and wonder how they will be able to find things if they are not located in “the” correct directory, it is also extraordinarily reassuring to them to know that we’re talking about Google here. Even if they do not understand how it all works, typically they have great faith that Google search will find it for them, as described in an earlier anecdote.

The memos also attempt to address some of the practical realities and limits of a non-profit, legal services work environment. LSNC has neither the resources nor motivation to micro-manage how users organize their own file directories. Life is too short. But as detailed in the “advocate-user directories” memo, we do now require all LSNC staff to have a user-specific, user-named directory, and that the name used be the user’s full name. The primary motivation for this requirement is a practical need to standardize directory-name conventions throughout the organization, so that the location and targeting of files is predictable, manageable and findable. And, if for no other reason, doing so eliminates the need to guess whether something is located in the Shareen, Shari, Shelly or Sherri directory — a real-world example from our Auburn office, illustrated to the right.

  • Tuesday
  • October 14
  • 2008

Talk about zero learning curve

The “blunt instrument” training in our Sacramento office last week really drove home — and vindicated — one of the core rationales we made in our TIG proposal, namely, that the simplicity and pervasive familiarity of Google search would enable LSNC staff to hit the enterprise-search deck running. What an understatement.

We anticipated doing a presentation that would last about 30 minutes to explain the temporary project portal page and then demonstrate how to do basic searches, using standard Google search syntax. It took 30 minutes alright, but only because we insisted on walking through all the examples while the staff sat there respectfully listening and watching. Personally, I got a touch nervous because there were no questions. When we debriefed folks after the presentation, we learned that we could have done it all in about 10 minutes or less.

How so? Here is what we learned:

  • “We get it.” We spoke personally with about a half dozen of those attending. All said they already understood how to do all the basic Google search syntax. They know from years of using Google the effect of using unadorned keywords, i.e., that Google finds the pages that include all those words; the effect of using “OR” to search for alternate keywords; the effect of placing keyword phrases in quotes; and the pruning effect of a minus sign before a keyword. And that adding keywords narrows the search.
  • To quote more directly one of the folks who attended: “We already know how to do this with Google. It seems to me that if there was someone in the room who doesn’t know how to do this, it’s probably someone who would never use the new search portal anyway.”
  • While familiar with how Google search results display file types, those attending were not familiar with the Google “filetype” parameter. They really liked how search results could be filtered that way, and we explained (but could not yet demonstrate) how the search portal would include clickable filters so they did not have to key in a filetype parameter.
  • Other than maybe having done an occasional OR search, few ever use the Google advanced search options. So little, in fact, that most were somewhat amazed to view Google’s advanced search dialog page that dynamically constructs the search syntax for you.
  • One correction: Actually, there was one question after we finished the presentation: “Where do I go to find the link to this findability portal thing?”
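
For anyone building the clickable filters mentioned above, the search syntax the staff described can be assembled mechanically. Here is a minimal Python sketch of that. The function and parameter names are our own invention, the sample search terms are made up, and the sketch only builds the query text; it says nothing about how the portal or the GSA would be called.

```python
def build_query(all_words=(), any_words=(), phrases=(), exclude=(), filetype=None):
    """Assemble a Google-style query string from its parts."""
    parts = list(all_words)  # unadorned keywords: every one must appear
    if any_words:
        parts.append(" OR ".join(any_words))  # alternate keywords
    parts.extend(f'"{phrase}"' for phrase in phrases)  # exact phrases in quotes
    parts.extend(f"-{word}" for word in exclude)  # minus sign prunes results
    if filetype:
        parts.append(f"filetype:{filetype}")  # limit results to one file type
    return " ".join(parts)


if __name__ == "__main__":
    print(
        build_query(
            all_words=["eviction"],
            any_words=["notice", "complaint"],
            phrases=["unlawful detainer"],
            exclude=["draft"],
            filetype="pdf",
        )
    )
```

A clickable “PDF only” filter, for instance, would simply pass filetype="pdf" behind the scenes so nobody has to key it in.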

What we concluded from this experience is that we can do our upcoming online presentations with the rest of the staff in just a few minutes, just hitting five marks:

  • Tell them “Here’s the temporary search portal” and explain briefly what’s up with its layout.
  • Tell them it works just like Google because, uh, it is Google.
  • Walk through one example to demonstrate that this is, indeed, Google land and leave it at that.
  • Explain that we are still experimenting with Google Sites, which is part of the whole Google Apps thing, and how to search or navigate it.
  • “Questions?”

Ten minutes, tops.

  • Tuesday
  • October 7
  • 2008

"GSA Blunt-Instrument" Tour begins

We are still most definitely in “blunt-instrument” mode, which is to say we have our enterprise search platform built out, the local and centralized content servers are all talking nicely to each other and, most importantly, to the Google Search Appliance (GSA). But we have no refinements in place. Hence, it is at this stage just a blunt instrument for searching throughout our organization.

Even so, we want to actively draw folks into this grand adventure. Tomorrow, we are doing an on-site orientation and training for our Sacramento Office, a.k.a. “The Mothership.” In almost all things tech here at LSNC, the Sacramento Office is the guinea pig, the office where we first test things out. At that training, we are for the first time formally introducing folks to what is essentially an alpha version of a shared portal page that is pretty self-explanatory (hopefully). But even before the training, we have circulated to everyone in that office the URI to the portal and have encouraged them to play with its search functions and examples. And bring their sense of search wonder and befuddlement. Our motto for the training: “There are no wrong searches, only better ones.”

This test portal page includes some modest jQuery scripting so that if users click on “Show|Hide sample searches,” they will see a reveal with active links to all the examples used during the training (illustrated, at right), so they can easily reconstruct what was shown. Assuming all goes well, we will then promote the test portal to the rest of our core offices, concurrent with a rapid-fire series of LegalMeetings sessions to introduce-orient-train users on how basic organization-wide search works.

Wish us luck.

  • Friday
  • October 3
  • 2008

Don't overthink file naming conventions

This is a quick post to discuss some basics about file naming conventions worked out as part of this project. At a recent program-wide meeting to discuss details of this project with all our office managers, among the memos distributed was the following:

If you take a look at the files in your advocate staff directories, you are likely to see individualized albeit typical patterns in the file names. There is a discernible Darwinism to the conventions individual users adopt: They use both directory structures and name files in a way that makes them “findable” for them later, if not for others. (OK, basically “unfindable” by others, in a lot of instances.) Common naming patterns include, in almost all instances, at least a generic name descriptive of the type of document (e.g., petition or complaint or writ), plus other descriptive elements that help the user to later locate it, such as a client or project name, the date of the document, and/or whether the file is a draft or a final or a version copy.

There are any number of ways one can go with file naming conventions, as well illustrated in the article at CompuJurist, Are there any recognized “best” practices for file naming conventions? Akin to what is discussed in that article, we’ve adopted the following template for use by advocates to name their files:

[draft/final]  [document type]  [party/case]  [subject]  [date] [file extension]
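
To show how the template plays out, here is a minimal Python sketch that assembles a file name from those fields, joined with underscores rather than spaces, which is the choice the memo examples make. The function name, the sample values and the ISO date format are all our assumptions; the template itself does not dictate a date format.

```python
import re
from datetime import date


def build_filename(status, doc_type, party, subject, when, extension):
    """Join the template fields with underscores and append the extension.

    Spaces inside a field become underscores. The ISO date format is an
    assumption, since the template does not specify one.
    """
    if status not in ("draft", "final"):
        raise ValueError("status must be 'draft' or 'final'")
    fields = [status, doc_type, party, subject, when.isoformat()]
    cleaned = [re.sub(r"\s+", "_", field.strip()) for field in fields]
    return "_".join(cleaned) + "." + extension.lstrip(".")


if __name__ == "__main__":
    print(
        build_filename(
            "final", "complaint", "Doe v Roe", "unlawful detainer",
            date(2008, 10, 3), "wpd",
        )
    )
```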

Yes, there are other longstanding concerns about how files are named, beyond what may concern your attorneys and other advocates. These concerns include naming conventions driven by the demands of particular operating systems, all too often non-intuitive but technical project requirements, or the recommendations of 800-pound gorillas like Google (explaining why Google favors dashes over underscores).

All that said, if you look at the memo linked above, you may notice that the examples promote the use of underscores, rather than spaces, between words in the filenames.

Our thinking is this:

  • As a practical matter, using spaces between words in file names creates file transfer problems when moving the file from one server to the other. Not a good thing. Especially when you are dealing with relocating files that count in the hundreds of thousands.
  • Using spaces creates readability problems when viewing the path of the file in a GSA search result, because the GSA normalizes the URL by percent-encoding it: every space in a file or directory name shows up as “%20”. Even if you didn’t know what to call it, you’ve seen this phenomenon. Here’s a real world example in a GSA search result from one of our test bed sites:

    /Ukiah%20Office/Former%20Staff/Kan's%20Transfer%20File/letter%20to%20jake.wpd

    Look familiar?
  • File names with underscores are easier to read than files with dashes. They just are, OK? To be fair, not everyone is going to agree with that proposition. But we did some admittedly unscientific user testing (hey, Glenn, you get what you pay for), where we asked staff to read the same file name listed three ways: With spaces, with underscores, and with dashes. Without exception, our crack team of testers said they found it easiest to read the file names if they had spaces (duh!); less easy to read if there were underscores; and least easy where dashes were used.
  • A usability corollary: If you use underscores, a linked file name in a search result is easier to read because as a link the file name appears underlined, so words appear as if they have spaces, which is the easiest of the three formats to read (see test results, above).
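
You can reproduce the effect yourself. This small Python sketch percent-encodes the path from the example above with the standard library’s urllib.parse.quote and then does the same for an underscore version of the path. Leaving the slash and the apostrophe untouched is our assumption, made so that the output mirrors what the search result displayed.

```python
from urllib.parse import quote


def as_url_path(path):
    """Percent-encode a file path roughly the way it shows up in a result URL.

    The slash and apostrophe are kept as-is to mirror the example above;
    that choice is an assumption about how the GSA displays paths.
    """
    return quote(path, safe="/'")


if __name__ == "__main__":
    with_spaces = "/Ukiah Office/Former Staff/Kan's Transfer File/letter to jake.wpd"
    with_underscores = with_spaces.replace(" ", "_")
    print(as_url_path(with_spaces))       # every space becomes %20
    print(as_url_path(with_underscores))  # nothing needs encoding
```

The underscore version comes back unchanged, which is exactly why it stays readable in a search result.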

This is fairly prosaic stuff and bears some thought, but is not worth overthinking. Or much of an enforcement regime. The project goal is not to have nice, neat, compliant file names. That’s an objective. It is not the goal. We are not investing a lot of time worrying about those who paint outside the lines. The point of the project is to get the files targeted properly so that users can find the content they contain.

We are confident that, as users see file names displayed in the GSA search results, it will sink in why it matters how one names the files, and they will adapt.

  • Tuesday
  • September 30
  • 2008

The Findability Project Taxonomy – Part Three: The Anecdotes

This is a non-extra credit read, somewhat tangentially related to “taxonomy.” But, hey, this project is hard work and I’m entitled, as are you, to have some fun, no?

I previously alluded to how I sat down with each of the advocates in our flagship Sacramento office to view and discuss how each organized their files. I’m not suggesting you need to do this with everyone in your organization. But doing so with at least a fair cross-section of your people will teach lessons not likely learned otherwise. Let’s call it “reality.”

Three particular experiences in doing this are favorites of mine.

The first relates to the same advocate who, the good sport that he is, agreed to let me post a photo of his hard-copy file organizational scheme. When I sat down with him to take a look at how he had organized his files on a local server, it was a gloriously indulgent vision of horizontal organization. The guy (who is one of our best welfare lawyers) had 595 MB in 2,623 document files … wait for it … in one folder. Whew, talk about going “broad-and-shallow”! Because his file-naming conventions include the relevant client name, I really can’t give you a screenshot of this Ripley’s moment. There was something really extraordinary about this encounter, almost anthropological about it, akin to witnessing an indigenous tribe in the primeval, untouched by the outside world.

The second involves the polar opposite, another highly regarded lawyer who is hyper-organized. And well he should be, with 3,616 work files totalling 3 GB tightly organized in 671 folders and subfolders. Peter Morville would not likely approve of his organization scheme, I don’t think, since this advocate went for a “narrow-and-deep” hierarchy, with nine top levels and 662 subfolders, as many as five levels down.

And then there is the third anecdote, my favorite of all. As I went through this attorney’s files, I was authentically impressed by how sensible and well organized her directory structure was. While I organize my personal directories differently, hers were organized much the way many advocates in the program do (by cases or projects or substantive area), easily understood and well suited to how she works. Broad enough to hit all her bases, yet with enough subfolder depth for her to “navigate” to find particular files. A good, functional result for her.

As I showed her the project taxonomy, she was fine with the top-level selections. She understood immediately and instinctively why those choices had been made and had no quarrel with them. But when I showed her that the subfolder organization only went one-level deep, her facial expression changed noticeably. She said nothing but I could see her anxiety. So I asked her, “You look worried, a bit. What are you thinking?”

She paused and then she asked, “If you organize the shared directories this way, with only one level below the topics, how can anyone ever find anything? I don’t work that way.”

This was my response:

“Have you ever used Google?” (She good-naturedly looks back at me with her best “give me a break” smirk on her face.) “Well,” I continued, “when you search with Google you are usually able to find what you are looking for, right?”

“Of course,” she answered.

Then I said, rhetorically, “Do you think Google ‘organizes’ the Web in subfolders like you do?”

“Oh, I get it,” she said. “Everything doesn’t have to be organized that way if I have a way to Google it, right?”

  • Tuesday
  • September 30
  • 2008

The Findability Project Taxonomy – Part Two: The Practice

We’ve laid out our take on the theoretical approach to the TFP taxonomy. But in practice, how is LSNC actually implementing those organizational concepts or principles? That is what this post is about.

I’ll just give you the end-product upfront and then explain how LSNC sorted out the basic taxonomic structures for its shared document repository. The two PDF files linked below are copies of what was distributed at a program-wide meeting a few weeks ago to address and resolve what the basic organization structures would look like.

It was actually quite easy to come up with an initial (if bloated) proposed list of likely substantive advocacy content targets, their location, and how the content would be organized, but even that required process.

LSNC has what it calls a “regional counsel” model, which means there are three designated advocacy leaders with senior substantive, litigation and advocacy experience who are expected to provide just that, “leadership.” (One of the three, by the way, is Mona Tawatao who is the recipient of the 2007 NLADA Reginald Heber Smith award.) The regional counsel, with feedback from other management leadership (including the executive director, a few local office managing attorneys interested in this particular project detail, and the senior office manager representing support staff interests) worked up the list, later vetted more broadly with the entire management team, who in turn vetted it with each of their local offices or other program unit.

In the initial proposal, the substantive advocacy content was organized based on the ten LSC Problem categories in current use by legal services programs, plus roughly an additional 30 or so other general categories. The latter included additional substantive categories (economic development, disaster relief, etc.), practice matters (e.g., federal and state court practice issues, discovery, etc.), and other work-related content (self-help clinic content, specialized training materials, etc.) that reflect what LSNC and other legal services field programs actually do for a living. In response to any number of discussions and comments by the smaller group thrashing the details out, the list at times expanded and contracted, went deeper and then sometimes more shallow. This initial organization structure also included targeted content related to local office and central administrative office work. A similar vetting process was undertaken by the senior office manager with all the other office managers in all the core local offices, as well as administrative and business office managers. As mentioned earlier, each of those, in turn, was asked to vet the structures with their respective staff.

This process did not operate in a project vacuum. As not only one of the three regional counsel but also the person responsible for managing this project, I also did what I think managers should always do: I talked to the people affected. I took the time, a lot of it, to speak directly and individually with all of the foregoing to explain the overall project and its technical demands and, in a non-technical fashion (well, at least I tried), the significance of developing an organizational structure and other related issues, such as the use of metadata models to attribute value to the targeted content, and so on. The point was to take the time to ensure leadership understood from more than a memo what the project is about and why it matters, and to answer their questions or concerns. In response to the vetting and these dialogs, real changes were made in the proposed organization and additional content targets were identified. Time investments paid dividends, at least in this case.

By the time our GSA consultant showed up for a scheduled three-day thrashing of our test-bed installation in Sacramento, we had a taxonomy with over 40 top-level directories and a lot of two- and three-level deep subdirectories. He looked at this, in a non-committal fashion said “that’s fine,” and then began to suggest reasons why it should be simplified. This push by the GSA consultant was prompted by notions of usability and manageability of the content areas. As mentioned in the prior post on project taxonomy, there are not significant advantages or improvements to search results in a repository structure beyond a second-level directory. The consultant also emphasized that most users are not likely to locate or use a directory substructure below the second level. (This has to do with users navigating directory structures to add, remove or modify files, for whatever reason.)

Since a significant portion of the metadata models we are adopting rely on the organization structures in order to build logical, searchable “collections,” we simplified the structures in response to the consultant’s recommendation in this regard. Hence, the 29 top-level directories and the “simplified” taxonomy you can see in the memos linked at the beginning of this post, and the reliance on only one-level deeper for those directories.
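
If you want to hold a repository to that kind of shallow structure, a quick check is easy to script. Here is a minimal Python sketch that flags any directory nested more than one level below a top-level directory. The directory names are hypothetical and the two-level limit is simply our reading of the structure described above.

```python
from pathlib import PurePosixPath


def too_deep(paths, max_levels=2):
    """Return the directory paths nested more than max_levels below the repository root."""
    return [path for path in paths if len(PurePosixPath(path).parts) > max_levels]


if __name__ == "__main__":
    # Hypothetical directories, written relative to the shared repository root.
    directories = [
        "Housing",
        "Housing/Evictions",
        "Housing/Evictions/2008",
    ]
    print(too_deep(directories))
```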

As we get life experience with this organization structure, my guess is that we may expand to add a few additional top-level directories but not many, if any. I think we have things pretty much covered at the top-level, at this point. But apart from the rigid yet practical exploitation of the dated — but undeniably familiar — LSC Problem Codes for a large chunk of the substantive organization, my guess also is that the one-level down subdirectory structures will likely change as users give us feedback, and we discover that some subdirectories are not particularly used or useful. Proof’s in the pudding, people.

This all came full circle with our program-wide meeting a few weeks back. By the time of that meeting, every manager within LSNC had seen the organization proposals, every manager had a one-on-one conversation with project staff about the project and the organization structure, every manager had vetted the proposal with his or her people, and the memos you see linked here had been distributed to all offices.

That’s how we roll.

  • Tuesday
  • September 23
  • 2008

The Findability Project Taxonomy – Part One: The Theory

First, a recommendation. Get your hands on a copy of Information Architecture for the World Wide Web (also linked on the right, under “Biblio”) and read chapter 5 about “Organization Systems.”

Why? Well, let me put it to you this way.

We did a lot of homework and scoured a lot of books and, of course, talked to our GSA consultant on what is popularly (if imprecisely) referred to as “taxonomy.” You know, how should we organize all the “stuff” we want our users to be able to find? How hard is that?

As we canvassed widely to get an answer to that basic, practical question, we discovered you can get totally befuddled and sidetracked, not only by any number of levels of abstraction, for example, should you choose to wallow in construction of controlled vocabularies; but also by all too “inside-baseball” discussions by the taxonomy community; or, by yielding to the dark side and joining a formal organization for this sort of thing. Of course, there is also the emerging school of “social organization” of content referred to as folksonomy, more popularly known as tagging. And then there is the school of thought within some sectors of the search community that, after all is said and done, taxonomy may not be particularly useful for enterprise search design.

Needless to say, these initial forays into this subject prompted the thought bubble … “Just shoot me now.”

On this point, the GSA consultant was not as directly helpful as I thought he would be. The short story is that he was supportive of what we thought we needed, but at the end of the day he was essentially agnostic on this point, a view that mirrors Google’s online GSA resources. In discussing how to plan for a GSA implementation, Google says not much more on this point other than “analyze your business’s content and decide which directories and files you want indexed.” (In fairness to our GSA consultant — whose name, by the way, is Igor — you should be sure to read below, for his helpful guidance on simplifying the taxonomy we adopted, and the reasons for doing so.)

Which begs the question, how should we do that?

There are online articles that are straightforward and helpful in grasping, at a rudimentary level, the basics of information architecture, one recent example being Better Living Through Taxonomies, at Digital Web Magazine. But based on our experience, I recommend you pass Go and head straight for Peter Morville and Louis Rosenfeld’s Information Architecture for the World Wide Web, a book that is part of the IA canon, and deservedly so. It is a superbly clear-headed, well written overview of what information architecture is all about, and Chapter 5 on organization systems, specifically, is a model of how to explain a technical and complex subject like “taxonomy,” among other things, in plain, accessible language. And it will hit the mark on the main issues you need to think through to get “stuff” organized.

What are those practical issues? Indulge me a bit, since several of my observations here simply echo what I am recommending you read, but for LSNC we distilled our theoretical approach to taxonomy or organizing our content to these four basic precepts:

1. The directory structures need to be a hierarchical or “top-down” organization of simplified, familiar categories.

In the broadest sense of “organizing” things on a file server, and how that same “organization” is reflected in page menus or page navigation or dialog boxes, users need to know where they are and what the folders or subfolders mean. Lawyers, by training and practice, work in an especially pronounced hierarchical environment. (Can you say, “I, II-A, etc.”?) While the work environments of legal services programs are famously “anti-hierarchical,” the practical truth is that almost everyone in that environment organizes their work in some hierarchical fashion. (Certainly, there are exceptions.) Simply put, this is the most common way in which most people organize things, lawyers and non-lawyers alike.

2. Names for content folders, subfolders or categories need to be consistent with the shared vocabulary of your organization.

This may seem self-evident, but in practice may not be what users in your program do or are accustomed to. I actually took the time to look at the folder organization of about a dozen advocates in our Sacramento Office, and while there were predictable folder organizations (for example, organizing files by case or project or substantive area), much of the naming was ambiguous. While no doubt obvious to the advocate who created the directory or subdirectories, to others the same structure or organization may be too subjective, ambiguous or confusing to be useful to anyone other than the person who created it — and even possibly for him or her at some later time, when the subjective rationale for the organization has been long forgotten. So, when working out the naming conventions for folders and subfolders, it was important to focus on commonly understood, familiar shared vocabulary or terminology.

From the perspective of the GSA, the particular names, as such, of directory folders or subfolders are of no consequence. The GSA does not care what you call things, which explains the agnosticism of Google and our GSA consultant on this point. At the blunt-instrument level, all it cares about is the URL, the path to where the content resides. You deal with the Tower of Babel; that’s your problem. The GSA will ferret out the content wherever it resides, regardless.

To be detailed in the next post on this subject, LSNC has adopted the most conventional names for its directories it could come up with, including … I pause, for the pain it causes me to say this … the LSC substantive problem code categories, which comprise roughly half of the directories on our shared document repository. If one were organizing legal services practice today, I am confident it would be organized differently than how LSC organizes it. But roughly 40 years in, LSC still uses an extraordinarily unsubtle and somewhat uninformed organization of legal services practice. But it is what it is, and it is what field programs must use, and it is what users within those organizations know and understand, after decades of use. For better or worse, it is the “shared vocabulary” of our organization, and its use offers consistency with how other information and data is handled, most notably client case data.

3. “Lean toward a broad-and-shallow rather than narrow-and-deep hierarchy.”

That’s a quote from Morville and Rosenfeld’s book. And their observation is consistent with the advice our GSA consultant gave us. The consultant advised us not to go more than two levels down, and he really pushed for only one level down. The rationale was two-fold: The more subfolders you have, the less likely users will locate or use content in those folders whenever they are navigating the directory structure, in whatever form it is viewed. From the user side, a deeper vertical hierarchy actually reduces findability.

From the GSA side, deeper hierarchy does little or nothing to improve search results. While the search algorithms baked into the GSA exploit the URL path at the directory and subdirectory and sub-sub-directory to improve search results, having third or fourth or more levels does essentially nothing to improve those results. There’s no harm to doing so. It just doesn’t help you.
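
For the technically inclined, here is a minimal Python sketch of that depth rule of thumb. It is an illustration only, not anything LSNC or the GSA actually runs: the function names, the `max_depth` parameter and the sample paths are all made up for the example, and the paths are assumed to be relative to the root of the document repository.

```python
from pathlib import PurePosixPath


def directory_depth(relative_path: str) -> int:
    """Count the directories above a file, relative to the repository root."""
    # parts includes the file name itself, so subtract one for the file.
    return len(PurePosixPath(relative_path).parts) - 1


def too_deep(paths, max_depth=2):
    """Return the paths whose files sit more than max_depth directories down."""
    return [p for p in paths if directory_depth(p) > max_depth]


# Hypothetical sample paths, invented for this sketch.
sample = [
    "housing/evictions/answer-template.doc",
    "housing/evictions/2008/drafts/answer-v3.doc",
    "admin/forms.pdf",
]

deep = too_deep(sample)
print(deep)
```

Run against a real directory listing, something like this would flag the files that users are least likely to stumble across while navigating, which are the ones worth moving up a level or two.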

A counterpart to this issue is the importance of striking a balance. By going broad-and-shallow, one gets the practical advantage of being able to add content without the need for major restructuring. Assuming you have figured out a set of top-level directories that pretty much covers, in a broad sense, the content your users will want and need to search for, from there on out you can focus on adding content below that level, as warranted.

But if you go too broad, from the user side, things get more cumbersome and impractical. Think about it. Whether your users are advocates or office managers or volunteers, whatever, it is going to be more practical and useful if they can visually and cognitively grok the organization scheme. So it needs to be broad enough to cover the bases, but not so broad that it becomes incomprehensible.

Sure, we could have gone totally nuts with the taxonomy and, say, adopted the thousands-of-points-of-substantive-light offered by the well intentioned but ill fated National Subject Matter Index. (Don’t get me started.) We’re more practical. As detailed in the next article, LSNC is going with a simplified 29 top-level directory structure, and each only going one-level deeper. Works for the users. And works for the GSA.

4. It’s not all about taxonomy.

Having a basic, practical, commonly shared taxonomy or organization structure is essential to a project like this. LSNC content needs to be located somewhere to be targeted by the GSA, and those who add or contribute or remove that content need to be able to comprehend what is where. The practical side of what that all means will make more sense in later articles about the document protocols we have come up with for LSNC users to locate and add content and how to add metadata to that content.

But having a traditional taxonomy is not the whole picture. There are other types of content you may want to target that don’t fit the taxonomic model: targeted database content (case management systems come to mind, but are not the only example); external site content (such as select public website content to which your organization has access or permission); and alternate content sites that you would want to target but over which you don’t have the same level of control (a current example would be domain-hosted Google Sites, a subset of Google Apps, which you can “organize” in a superficial way but which at the level that matters to the Google Search Appliance, not so much).

What this means for LSNC is that we are targeting the GSA at more than just a nominal taxonomy on our shared document repository.

  • Wednesday
  • September 10
  • 2008

Found search humor

Not to worry. We have several posts coming over the next two weeks on TFP taxonomy and file-naming conventions. But in the interim, here’s a bit of found humor, part of a test search done today as TFP’s Team Gizmo preps for a program-wide meeting next week about technical aspects of the project and various document protocols. This is a screenshot of a temporary portal page created for one of the demos we are doing at that meeting:

OK. Admittedly, I may just be a touch punchy on an early Wednesday evening, but here’s the search result for “Where’s the love” … out of a test-bed repository with about 300,000 documents in it:

Yep. One document: The text of a hearing transcript on a Vacaville “mobilehome rent stabilization ordinance” which includes testimony that “I believe there’s a sign that says to pick up after your dog, and there was a sign that said ‘We love our residents,’ and those are the only two signs that I’m aware of.”

Sorry, folks. That’s all the love Google was able to find in our document repository.

  • Sunday
  • August 31
  • 2008

Selecting GSA targets – Part One: Four abstract targets

It is, of course, not enough to simply build an enterprise search platform. Sure, you can do what we did on day one, when our Google Search Appliance (GSA) arrived and we gleefully hooked it up to our local Sacramento Office network and did a global target of everything. You know, just to see if our GSA worked. It did. And in short order, as we blew out its one-million file crawl limit, we discovered the obvious: LSNC has a whole lot of documents and other files strewn about on various file servers and desktops, like so much digital flotsam. Needless to say, we did not need a TIG-funded GSA to reveal that fact. To know that, all one has to do is invoke Windows Explorer and peruse one’s local office file server. Enough said.

From the perspective of our enterprise search goals, most of these files do not contain content that has what we refer to as “shared value.” Namely, advocacy or other work-related content or information that LSNC staff would want to search for because they want it or need it to get the job done.

This observation does not suggest that all the other individual documents or files have no worth. They do, but to other purpose. For example, on a practical level, an advocate may have any number of drafts or versions of a document or file, but what the organization will want to target and what users will want to get their hands on is the final or more polished version of that content. And that is likely what the original author will intend to share.

But if the organization targets everything, well, in the broadest sense what those who search will get is a lot of extraneous or incorrect or incomplete content. And a less serious but real-world challenge is the organization’s need to separate the true wheat (even if marginal) from the inevitable digital chaff on local office file servers and desktops. (Oh, come on — you know what we’re talking about here! All those personal photos, MP3s, YouTube videos, recipes from the Food Network, National Geographic wallpapers, long forgotten software downloads, … need I go on?)

There is a separate set of challenges to initially identify existing content that one would want to target with a GSA that has, after all, a set file limit. And then one has to work out practical policies and protocols for how to handle new content to be added to those targets. In upcoming posts, we will document how LSNC has approached both of these challenges.
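
To make the wheat-and-chaff problem a little more concrete, here is a rough Python sketch of a first-pass sort by file extension, with a check against a file limit. This is not how LSNC selects its targets (that is a matter of designated directories and document protocols, to be covered in later posts); the extension list, the function names and the sample file names are all assumptions made up for the illustration.

```python
import os

# Assumed list of "document-like" extensions; adjust to taste.
DOCUMENT_EXTENSIONS = {".doc", ".pdf", ".xls", ".ppt", ".html", ".wpd"}


def split_candidates(filenames):
    """Split file names into likely documents (keep) and everything else (skip)."""
    keep, skip = [], []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        if ext in DOCUMENT_EXTENSIONS:
            keep.append(name)
        else:
            skip.append(name)
    return keep, skip


def fits_limit(keep, file_limit):
    """Check whether the candidate count stays within the crawl file limit."""
    return len(keep) <= file_limit


# Hypothetical file names, invented for this sketch.
sample = ["brief-final.DOC", "vacation.jpg", "song.mp3", "budget.xls", "setup.exe"]
keep, skip = split_candidates(sample)
print(keep, skip, fits_limit(keep, file_limit=1_000_000))
```

An extension filter only clears out the obvious chaff. It says nothing about whether a given document is a final version or has shared value, which still takes human judgment.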

But for now, here is a macro breakdown of what content we value and are initially targeting with the GSA. It is actually simpler to do than we initially thought it would be:

  • Designated document repository master directory structures – that’s a mouthful, but it turns out that’s how we refer to it. We have worked out what we consider to be a basic, workable “taxonomy” for organizing files, to be detailed in an upcoming post. The short version is that both existing and new content that has been identified as valued will reside on project-specific file servers that have purposefully organized directory structures. This will make more sense once we explain (fairly soon) why we are adopting the structures or organizations we have worked out, and how they will serve the overarching goal of “findability.” Stay tuned.
  • Shared intranet content – within LSNC, we refer to our intranet as the “secured network,” the lingua franca here for what other organizations refer to as their intranet. At this juncture, most legal services programs have some sort of intranet structure already in place, with varied user-side implementations to give staff access to its content. (Currently, ours is built out with MediaWiki as the principal content management tool, but it is soon to be supplanted with WordPress and/or Google Sites. I have posted details on that side story at LSNC’s tech blog, Webdogs 2.0.) By historical definition, everything on our existing intranet is valued. It’s fairly lean, mean, to the point, well organized and includes, among other things, in no particular order:
    • Administrative manual
    • Case management manual
    • Development and fund-raising resources
    • LSC policy archive
    • LSNC forms (administrative and case-related)
    • LSNC policy archive
    • MCLE – Training resources and forms
    • Personnel and other shared human resource information
    • Specialized Regional Counsel content (content subject to gatekeeper function)
    • Specialized client content (content targeted for LawHelp access)
  • Select LSNC public web content – LSNC is now reaping dramatic benefits from its decade-long focus on using its public web presence to create and share usable content for advocates. We are still in the process of parsing out those portions of the LSNC public content we want to target with the GSA, but these include our rich reservoir of advocate content on CalWorks (the name of California’s TANF program) and Food Stamps, and special project-specific content that derives from our Race Equity Project and housing and economic development work. The point here is that our enterprise search model will include not just valued content behind our firewall but also select public content that is every bit as valuable to our staff in getting the job done.
  • Pika Case Management System – this will likely be the last piece of the enterprise search puzzle for us, but a major chunk of our GSA file limit will be devoted to exploiting the GSA to alter dramatically how LSNC staff search and locate data within Pika. We have already run some initial targeting tests on Pika and we really, really liked what the search results looked like. It is not a technical challenge to target Pika with a GSA, not at all, but there are some significant challenges in sorting out how best to limit the GSA crawl to target precisely what we really want to make searchable, without blowing out our GSA file limit. Once we work out those kinks, we will likely replace the native Pika search functions (which are little more than a raw SQL search function) with a customized subset of GSA functions.
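
The general idea of limiting a crawl to “precisely what we really want” can be sketched as a pair of include and exclude pattern lists. The Python below is a generic illustration using shell-style wildcards from the standard library; it is not the GSA’s own pattern syntax or configuration, and the host name, URL layout and patterns are hypothetical, made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: crawl case pages, but skip their print views.
INCLUDE = ["http://pika.example.org/cases/*"]
EXCLUDE = ["*/cases/*/print*"]


def should_crawl(url, include=INCLUDE, exclude=EXCLUDE):
    """Return True if the URL matches an include pattern and no exclude pattern."""
    # Exclusions win over inclusions.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(url, pattern) for pattern in include)


# Hypothetical URLs, invented for this sketch.
urls = [
    "http://pika.example.org/cases/1234",
    "http://pika.example.org/cases/1234/print",
    "http://pika.example.org/reports/monthly",
]
crawlable = [u for u in urls if should_crawl(u)]
print(crawlable)
```

Letting exclusions win is a deliberate choice in this sketch: when the worry is blowing out a file limit, it seems safer to err on the side of crawling too little than too much.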

In the scheme of this project, content is king, knowledge content rules, and the Google Search Appliance is Gandalf, the wizard asking “What do you see? Can you see anything?” Indeed.

  • Thursday
  • August 21
  • 2008

Basic tech specs for the Findability Project

We have completed the initial “blunt instrument” build of the hardware and software infrastructure for the Findability Project. Refinements will be detailed here later — and there are many that must and will be done — but here is a breakdown of our basic enterprise search platform:

  • Windows Server 2003 – this is the software platform backbone, so to speak. To be detailed later in a separate posting, all local LSNC office-specific network shared file servers have been built out using Windows Server 2003, which allows installation of SharePoint Server 2007, the open source Google SharePoint Connector and Windows Server 2003 Active Directory, described below. Windows Server 2003 is a robust, secure server that allows local and subdomain user authentication for multiple locations. It also provides centralized authentication for SharePoint access as part of the universal, shared domain login available to all LSNC offices.
  • Windows Server 2003-certified ASUS PM52-M motherboard – if you’re going to build a platform suited to working with Windows Server 2003, it is not just a matter of purchasing a server that nominally meets basic system requirements. You also need to ensure it is certified to do just that. It is not the only such option, but the ASUS PM52-M motherboard is so certified. There are numerous other Intel chipsets that are not. This is a distinction that matters.
  • SharePoint Server 2007 – this will be detailed in later posts as well, but the elevator pitch is that SharePoint Server 2007 has been installed on a single server in the LSNC Sacramento office (a k a “The Mothership”). SharePoint Server provides an array of features and functionality, some but not all of which will be exploited as part of the Findability Project, with their integration with the Google Search Appliance (GSA). Among other things, SharePoint Server provides a web-based interface for sharing and managing domain-site content across a common network. (You know, like documents and files and web pages and stuff like that.) Out of the box, SharePoint also provides a capable, basic document management system for data indexing, creating collaborative sites, and adding metadata.
  • Google SharePoint Connector – available as an open source project at the Google Code site, the Google SharePoint Connector provides a seamless connection between the GSA and the Windows and SharePoint servers. This connection uses Active Directory authentication for managing content permissions as well as the interface for the GSA to access and crawl all the domain content.
  • Google Search Appliance (GSA) – think of the GSA as “He-Who-Must-Be-Named.” The Google Search Appliance is, quite literally, the brain, the cerebral cortex, the heart and the soul of the Findability Project. Some call it the “Godhead.” Others call it “Big Yellow.” Whatever you call it, it is Google search mojo and magic in a high-end computer with the capacity to provide enterprise-level search capabilities for crawling and indexing any targeted content within our organization as well as external sites. If the organization has permission to a site, the GSA can crawl and index it, purée and shake and bake it six ways to Sunday and then return search results in all sorts of different ways. (Obviously, we will be writing up more about all that, later.)
  • Microsoft Office 2003 – huh? Why are we not listing Microsoft Office 2007 here? We know this will come as a shocker to all, but it turns out that some of the key features in Microsoft SharePoint work better with the Microsoft Office 2003 suite than they do with the 2007 version. (Insert your own Microsoft joke here.) We expected issues with non-Microsoft apps (WordPerfect, Acrobat, and so on) but did not expect some of the hurdles we have encountered with Microsoft Office. Not to worry. We will get into the details and work-arounds in later posts, but the short version is that the reason for using Microsoft Office 2003 is that it allows all but universal and seamless integration of all the above technologies, using Active Directory authentication with the document management features available in SharePoint, most notably desktop application integration of features that enable users to add metadata.
  • Microsoft Internet Explorer 6.0 or later – IE is listed here because, at the system level, it is required for the type of enterprise search platform we’ve built. Not to worry. From the user side, everything works just fine with Firefox. But the network, system-level issue here is that IE 6.0 or later must be installed and configured on each workstation in a way that allows intranet and SharePoint integration with trusted domain-site functionality. The short version is that adding a network local site to trusted sites binds Active Directory authentication to the intranet web interface in SharePoint. Once properly configured, users can then rely on either IE or Firefox as their browser of choice. (We’ll explain it all later.)
  • Microsoft ISA Server 2006 – still with us? OK, one more piece: the Microsoft Internet Security and Acceleration (ISA) Server allows offsite access to content returned in search results by providing an Active Directory authenticated external web interface. (If you got this far, odds are you know what that means.)
  • Wednesday
  • July 30
  • 2008

Enterprise Search: Stating the case for a legal services field program

Whatever you do, please don’t call it a “brief bank.”

Language choices have powerful effects, so it does matter what one calls things, to good or ill effect. And for some 40 years legal services field programs have sought the holy grail of a “brief bank.” Having worked in five different field programs and two support centers in five states over 35+ years, I can personally attest that every one of those organizations thought they had or wanted or envisioned or aspired in some way to a “brief bank.” As if.

There are legions of reasons why, in practice, the brief-bank model never really works for most field programs. Among those are program management and resource priorities that obstruct it or at least don’t value it; lack of a commonly understood and shared purpose among its target users (you know, those pesky “employees”) as to why it matters to have such a model; and an impractical, or at least poorly designed, approach to creating and maintaining the model (you know, like, no one is really responsible to make it happen and/or actually find the time or resources to maintain the damn thing, whatever form it takes).

Within the legal services community, the notion of a “brief bank” long ago morphed into something akin to a vestigial organ: Not entirely useless or without function, but pretty much something no longer used as it once was. If ever it was. And even by its own self-referential term as a “bank,” one gets the message that this is a model for something that one does not actually use on a daily or regular basis. Rather, things of apparent value are placed there for storage, for safe-keeping, for later retrieval but for good reason not readily accessible because they must be secured. You can count on it being there. You can bank on it.

Actually, you cannot. Because for the real purpose for which it exists, the “brief bank” is, more often than not, useless. The old-school model “brief bank” was a collection of hard-copy documents stored in your individual office file cabinet (or that pile of folders over there, in the corner of your office); or down the hall somewhere in a different cabinet maybe maintained by someone else (or in a pile of folders that the “someone” would label and organize “by the end of the week”). On a good day (OK, on a really good day), you or someone else could remember which document was about what and where it was located. On most days, not so much. And with the emergence in the last 20 years of the digital-document work style to which we are now accustomed, the “brief bank” has become a case or project folder on your local or a shared network drive. You know, something like our Auburn Office shared directory, in all its indigenous glory:

Example of a shared drive

Surely, this is an advocate’s digital paradise, right, all there but for the taking? … if you can remember what is there … where it is … and find it. (“Oh, what you’re looking for is in a different office? I’ll get back to you.”) And that’s one of our smaller offices. (However, you’ve got to love the use of caps here, sort of a poor person’s metadata model for attributing value to some files.) I thought to illustrate here the four times as large, charmingly nuanced (née dystopic) horizontal and vertical structure of our flagship Sacramento Office, but it was too vast in dimension to use as a visual example. But you get the point.

I must admit, I cringed a touch when reading the fifth “Purpose Served: Knowledge Management” element in LSC’s recent recommendations on baseline technologies for legal services field programs. Stating “what should be in place,” it invokes “pleading and brief banks” as its primary concrete paradigm. As I was saying, language is powerful and apparently the choice of this concrete terminology in the more abstract context of “knowledge management” has not changed. It should.

The concrete challenge for legal services programs is not to create a “pleading and brief bank.” The challenge is to identify and organize and manage and make “findable” a wide range of documents and other files that have shared value within the organization. (“Sample pleadings and briefs” are only one piece of that paradigm.) Within that larger framework, the LSC baseline technology recommendation regarding the need for knowledge management is right on the money.

And Legal Services of Northern California (LSNC) is a typical example of this challenge within the post-merger world of legal services. The structural scale and geographic reach of LSNC, and the substantive range of its advocacy, exacerbate a fundamental dilemma all modern legal services field programs suffer: How does one make it fast, easy and intuitive for program staff to find and access all the different types of “knowledge content” within the four walls of the organization?

Within LSNC’s organizational structure there is a wide range of substantive advocacy and administrative expertise, specialization and skill sets, all of which are sources for shared information and knowledge. By “information” I mean that the organization has a variety of documents and other digital data types — most commonly, these are word processing files, PDF documents, spreadsheets, presentation files, HTML pages and client databases — that have content the organization perceives as valued and useful. By “knowledge” I mean that the information exists in a context that offers understanding. A “usable” document offers the promise of shared knowledge because it brings understanding of the information it contains from one person to another.

But there’s the rub: What does a non-profit organization like LSNC do to bring to the surface the usable knowledge of all, i.e., all the specifically identified and valued, usable content wherever it exists within the limits of the organization that can and should be shared and available to other LSNC staff? That is the core question the Findability Project will attempt to answer in a practical way that works for a legal services field program.

The LSNC approach is to build a network infrastructure that supports enterprise search, premised on deployment of a Google Search Appliance. It is also premised on thrashing out practical ways to identify, organize and maintain the valued documents and other files that will be the target of enterprise search. It is also premised, just as importantly, on figuring out as “user friendly” a way as we can to ensure LSNC staff use the system, want to use the system, know why they would want to use the system … to find what they need.

Hence, the Findability Project.