The dev blog for Justin Duewel-Zahniser's Chapbook poetry sharing web app project. I'm going to post random stuff here that's too big for the dev list and hopefully get conversations going with testers/users.

Rails Issue w/ Highlight and Excerpt TextHelper Methods

Here’s a weekend Rails discovery that I meant to post about yesterday. I was working on the search support in Chapbook using Ferret and was referencing the Acts_as_Ferret Tutorial from Rails Envy. I jumped down to the section on highlighting because I wanted to add highlighting to the result set for ease of use. Here’s what I read:

The requirement to do this, however, is that you must have your search fields stored as I showed above.

So, what was shown above?

If you take a look inside one of your search indexes right now, believe it or not, you would not see your data. By default acts_as_ferret does not store your data in a recoverable form, it just indexes it.

“What if my data is small and I want to store it in the index?” I hear you ask.

Good question grasshopper. If your data is small, or you only really care about one field of information, you can get a speed bonus by storing the data in the index itself.

Note the bit about your data being small. I’m really uncomfortable with changing how my data is stored, in a way that seems to imply trouble down the road as the data grows, just to get search highlighting. The tutorial gave no information about what the scale limits or long-term effects would be, so I was nervous about making the change.

Rails has some TextHelper methods called “highlight” and “excerpt” which do pretty much what you might guess from their names. So I decided to try these out as an alternative that would not involve changing the index or risking data growth.

I ran into a problem. If I searched for “dog” and looked at the result set, I would see “dogged” in the highlighting. Not good. So I took a look at the source in the Rails API:

    def highlight(text, phrases, highlighter = '<strong class="highlight">\1</strong>')
      if text.blank? || phrases.blank?
        text
      else
        match = Array(phrases).map { |p| Regexp.escape(p) }.join('|')
        text.gsub(/(#{match})/i, highlighter)
      end
    end

    def excerpt(text, phrase, radius = 100, excerpt_string = "...")
      if text.nil? || phrase.nil? then return end
      phrase = Regexp.escape(phrase)

      if found_pos = text.chars =~ /(#{phrase})/i
        start_pos = [ found_pos - radius, 0 ].max
        end_pos = [ found_pos + phrase.chars.length + radius, text.chars.length ].min

        prefix = start_pos > 0 ? excerpt_string : ""
        postfix = end_pos < text.chars.length ? excerpt_string : ""

        prefix + text.chars[start_pos..end_pos].strip + postfix
      else
        nil
      end
    end

Look at those regular expressions. Look again. Hm. They’re too simple. In fact, they won’t highlight correctly on full words and they won’t highlight or excerpt correctly in the case of punctuation following a word (e.g. “… the dog.”). Ew.
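To make the substring problem concrete, here is a minimal sketch in plain Ruby that mirrors the matching logic of the helper above. The name naive_highlight and the square-bracket highlighter are assumptions made for this illustration only, so the snippet runs without Rails; they are not part of the Rails API.

```ruby
# Plain-Ruby stand-in for the helper's matching logic (no Rails required).
# naive_highlight is a hypothetical name used only for this illustration.
def naive_highlight(text, phrases, highlighter = '[\1]')
  return text if text.nil? || text.empty? || phrases.nil? || phrases.empty?

  # Same pattern construction as the helper: escaped phrases joined with "|".
  match = Array(phrases).map { |p| Regexp.escape(p) }.join('|')
  text.gsub(/(#{match})/i, highlighter)
end

puts naive_highlight("The dogged dog barked.", "dog")
# prints: The [dog]ged [dog] barked.
```

Because the pattern has no notion of where a word starts or ends, the “dog” inside “dogged” is marked just like the standalone word.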

But that’s a solvable problem. I’m not sure of the most Railsy way to solve it, but I created alternate helpers in my poem helper which upgrade the regexps to be a bit smarter. I wonder if I should submit these, or go ahead and override the methods in my app. I’m not sure what else I might break, or whether someone would reject my changes because the looser matching is desirable for some uses. Anyway, here’s the code.

    module PoemsHelper
      def search_excerpt(text, phrase, radius = 100, excerpt_string = "...")
        if text.nil? || phrase.nil? then return end
        phrase = Regexp.escape(phrase)

        if found_pos = text.chars =~ /(\W+#{phrase}\W+)/i
          start_pos = [ found_pos - radius, 0 ].max
          end_pos = [ found_pos + phrase.chars.length + radius, text.chars.length ].min

          prefix = start_pos > 0 ? excerpt_string : ""
          postfix = end_pos < text.chars.length ? excerpt_string : ""

          prefix + text.chars[start_pos..end_pos].strip + postfix
        else
          nil
        end
      end

      def search_highlight(text, phrases, highlighter = '<strong class="highlight">\1</strong>')
        if text.blank? || phrases.blank?
          text
        else
          match = Array(phrases).map { |p| Regexp.escape(p) }.join('|')
          text.gsub(/(\W+#{match}\W+)/i, highlighter)
        end
      end
    end

Pretty much just some \W+s thrown in there and it works great. And I didn’t have to change my indexes. Using “search_*” feels dirty, though.
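
One caveat with the \W+ approach is that it has to consume at least one non-word character on each side. A term at the very start or end of the text is therefore skipped, the neighbouring spaces or punctuation end up inside the highlighted span, and with several phrases joined by | the \W+ attaches only to the first and last alternatives. A possible alternative, sketched here in plain Ruby, is the zero-width word boundary \b. The name whole_word_highlight and the bracket highlighter are hypothetical, and the sketch assumes every phrase begins and ends with a word character.

```ruby
# Hypothetical variant for illustration; not the helper from the post.
def whole_word_highlight(text, phrases, highlighter = '[\1]')
  return text if text.nil? || text.empty? || phrases.nil? || phrases.empty?

  match = Array(phrases).map { |p| Regexp.escape(p) }.join('|')
  # The capture group keeps the alternation together, so \b applies to
  # every phrase; \b is zero-width, so only the word itself is wrapped.
  text.gsub(/\b(#{match})\b/i, highlighter)
end

puts whole_word_highlight("Dog days: the dogged dog.", ["dog", "days"])
# prints: [Dog] [days]: the dogged [dog].
```

Whether this is a better fit depends on the search terms: a phrase that starts or ends with punctuation would not match with \b on both sides.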