Better alternative for the regex.MatchString and regex.ReplaceAllString

I am trying to identify if there are any invalid characters in my strings for that I am using package regexp methods MatchString and then ReplaceAllString but his is causing high cpu for me, so it there any better alternative for this methods which will help me improve the performance.

reInvalidAnnotationCharacters = regexp.MustCompile(`[^a-zA-Z0-9_]`)
func fixAnnotationKey(key string) string {
	if reInvalidAnnotationCharacters.MatchString(key) {
		// only allocate for ReplaceAllString if we need to
		key = reInvalidAnnotationCharacters.ReplaceAllString(key, "_")
	}

	return key
}

Nothing in your code hints to this. Did you profile the application that uses this function?

yes I did the CPU pprof and it pointed out that this code only using maximum of CPU.

@shreyas_darwhatkar is it MatchString or ReplaceAllString that’s using up too much CPU? Where are you calling this function? Can you memoize results with a map[string]string?

yes the profiling shows that the MatchString or ReplaceAllString using up too much CPU,
I am actually calling this fixAnnotationKey in a for loop
how can I use map[string]string to memoise the results also would’nt that affect the memory?

Caching results would affect memory usage, but the result could go either way; you’d need to benchmark/profile it. You said you’re using this function in a loop. Are the input strings ever repeated? If so, you might actually save memory by caching the results. For example, if your loop is something like this right now:

inputs := []string{
    "test?",
    "test?",
    "test?",
}

outputs := make([]string, len(inputs))

for i, input := range inputs {
  outputs[i] = fixAnnotationKey(input)
}

Then fixAnnotationKey will create three separate "test_" strings in memory. If you cached the results, you could re-use that resulting “test_” string:

type fixAnnotationsState struct {
    memo map[string]string
}

func (s fixAnnotationsState) fix(key string) string {
    if value, ok := s.memo[key]; ok {
        return value
    }
	if reInvalidAnnotationCharacters.MatchString(key) {
		// only allocate for ReplaceAllString if we need to
		value = reInvalidAnnotationCharacters.ReplaceAllString(key, "_")
	}

    s.memo[key] = value
	return key
}

Then the loop could be something like:

inputs := []string{
    "test?",
    "test?",
    "test?",
}

outputs := make([]string, len(inputs))

fixer := fixAnnotationsState{make(map[string]string, len(inputs))}

for i, input := range inputs {
  outputs[i] = fixer.fix(input)
}

fixer.memo = nil  // now GC can reclaim the map.

Also, for your particular regular expression and how you’re using ReplaceAllString, you can switch the implementation of fixAnnotationKey to use strings.Map instead:

func fixAnnotationKey(key string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case '0' <= r && r <= '9':
			fallthrough
		case 'A' <= r && r <= 'Z':
			fallthrough
		case 'a' <= r && r <= 'z':
			return r
		default:
			return '_'
		}
	}, key)
}

Even though this does not cut-down on allocations, it results in about an 80% speed-up when I compare the two functions against the text of Moby â– â– â– â–  that I found on the Internet.

If I add memoization, the strings.Map-based implementation gets a little bit slower, but the regular expression-based implementation speeds up to the same speed as the strings.Map implementation:

regex : 74.725386ms
string.Map : 12.360646ms
memo(regex) : 50.877408ms
memo(string.Map) : 30.786012ms

regex : 86.295341ms
string.Map : 13.819604ms
memo(regex) : 15.419945ms
memo(string.Map) : 15.442404ms

regex : 75.028974ms
string.Map : 12.5782ms
memo(regex) : 14.493085ms
memo(string.Map) : 14.277293ms

regex : 79.612822ms
string.Map : 12.577042ms
memo(regex) : 12.368686ms
memo(string.Map) : 15.611474ms
2 Likes

thanks @skillian this is very helpful

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.