Newbie concurrent database requests for same data

The data in my database is quite large and multiple users are likely to access the same data regularly, so I’m trying to implement a basic cache structure. So the idea would be that

  • User A requests dataset with params (x,y,z) - this usually takes 30sec - 1 min
  • While the data is being fetched, User B also requests dataset (x,y,z)
  • The code should know that it’s already working on getting that, wait for A’s request to complete, and then both Users will get their result back.

Likewise, User C could come in 30 min later (within the cache time limit) and get the (x,y,z) result instantaneously, as it’s already been retrieved. My code already handles User A and User C perfectly, as the database result does get cached.

I am about three days into programming Go, so I’m quite a newbie. I think this should be done with a mutex or some kind of lock paired with the cache, but I’m not quite sure…

Here’s some pseudo-go-code. This works perfectly for the User C case, where the data has already been loaded and we’re just getting it from the cache. It’s the B case that has me stumped…

func queryHandler(w http.ResponseWriter, r *http.Request) {
  // retrieves x,y and z parameters from the request 
  
  // creates a query for the database based on the parameters provided 
  // and returns a key, which identifies this query 
  query,key := formQuery(x,y,z)

  if !cache.Has(key) {
    queryDatabaseAndCacheTheResult(query,key)
  }

  // at this point, whether the result was already in the cache
  // or whether we got it from the database, we should have the data 

  processDataForDisplay(w,cache.Get(key))
  
}

Could you explain further why two users can’t fetch the same data at the same time? What is the purpose of limiting access?

From the initial post, it seems that the request is expensive and takes up to a minute. So I totally understand the desire to cache subsequent queries.

Though I have to be honest, I currently have no idea how to solve the problem exactly.

I’d probably start with a second “cache” which simply maps queries to a sync.Cond, on which Broadcast() is called once the initial query has been processed and put into the main cache. The remaining goroutines can then just wait on that Cond and read the result from the main cache.

Though I have never used sync.Cond, it might work differently than I understood.
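
A minimal sketch of that idea, assuming a single mutex guards both maps; fetchFromDB is a made-up stand-in for the slow database query and is not from the original post:

// needs "sync" imported
var (
  mu      sync.Mutex
  pending = map[string]*sync.Cond{} // queries currently being fetched
  results = map[string][]byte{}     // the main cache
)

func getData(key string) []byte {
  mu.Lock()
  // Fast path: the result is already in the main cache (user C).
  if data, ok := results[key]; ok {
    mu.Unlock()
    return data
  }
  // Someone else is already fetching this key (user B): wait for the Broadcast.
  if cond, ok := pending[key]; ok {
    for results[key] == nil {
      cond.Wait() // releases mu while waiting, re-acquires it on wake-up
    }
    data := results[key]
    mu.Unlock()
    return data
  }
  // We are the first to ask for this key (user A): register a Cond and query.
  cond := sync.NewCond(&mu)
  pending[key] = cond
  mu.Unlock()

  data := fetchFromDB(key) // the slow 30 sec - 1 min query, done without holding mu

  mu.Lock()
  results[key] = data
  delete(pending, key)
  cond.Broadcast() // wake up everyone waiting on this key
  mu.Unlock()
  return data
}

Wait is called in a loop because, as the sync.Cond documentation notes, the caller cannot assume the condition holds when Wait returns.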

Well, I made a project that did something similar, and my approach was as follows:

  • I used a map storing each user’s cache under a key, each entry holding a structure with the received data in some convenient format.
  • A goroutine periodically fetches the data from the source (in your case the database) and puts the results into map[user_key].
  • An API exposes the map content to users on request.

With this technique you will have the same consistent data accessible to N users concurrently, as of some moment in the past (based on the periodicity of the refresh); a rough sketch follows below.
:slightly_smiling_face:
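
A rough sketch of that periodic-refresh approach, assuming an in-memory map guarded by an RWMutex; fetchFromDB and the query parameter name are illustrative only, not the original poster’s code:

// needs "net/http", "sync" and "time" imported
var (
  mu    sync.RWMutex
  cache = map[string][]byte{} // user/query key -> latest fetched data
)

// refreshLoop re-fetches every known key on a fixed interval,
// so readers always see the snapshot from the last run.
func refreshLoop(interval time.Duration) {
  for range time.Tick(interval) {
    mu.RLock()
    keys := make([]string, 0, len(cache))
    for k := range cache {
      keys = append(keys, k)
    }
    mu.RUnlock()

    for _, k := range keys {
      data := fetchFromDB(k) // hypothetical slow query, done outside the lock
      mu.Lock()
      cache[k] = data
      mu.Unlock()
    }
  }
}

// handler serves whatever snapshot is currently cached.
func handler(w http.ResponseWriter, r *http.Request) {
  mu.RLock()
  data, ok := cache[r.URL.Query().Get("key")]
  mu.RUnlock()
  if !ok {
    http.Error(w, "not cached yet", http.StatusNotFound)
    return
  }
  w.Write(data)
}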


The access isn’t limited. The only requirement is that we don’t want to make extra database trips. If we’re already fetching a large dataset, I don’t want to start fetching the data a 2nd or 3rd time.

I get the idea about the second cache. I tried something like this, where the first request sets a kind of temporary “loading” state. The second process sees the loading state, so it knows not to go and get the data, but the issue I had is how it does the “waiting” part. It should stay on hold until the “loading” state goes away, and then it will know that the first, actual data cache now has the data.

The way I would do this is with a separate RWMutex per cache key. When you need to access the data, you:

  • Grab a read lock for the corresponding cache key.
    • Grab the cache data.
    • Release the read lock.
  • Check the data from the cache – is it valid and up to date? Then we are done here. Return the data if it was OK, otherwise continue onwards.

This part of the process can handle multiple concurrent readers.

But let’s assume the cache wasn’t primed, so we continue:

  • Grab a write lock for the cache key.
    • Check the cache? Is it valid and up to date? Then we are done here. (Someone might have taken this path concurrently with us and fixed up the cache in between our release and lock – we are user “B” from the original description.)
    • Assuming it’s still uncached, perform the query, while still holding the write lock. We are user “A” from above. Anyone who comes in and asks for the data while we are querying ends up waiting for the write lock and then taking the “B” branch above.
    • Update the cache.
    • Release the write lock.
  • Return the data, which is now also cached.

Of course you then also need a map[string]*sync.RWMutex // cache key -> lock which needs its own locking as well. :slight_smile: It’s locks all the way down.

It sounds that this could be a candidate for “materialized views” on the sql server? Let the server take care of the caching…

I think it is not so simple. The server basically caches requests by default, but when the data in the database changes the cache must be refreshed; usually this happens at the next request, which leads to low performance. The idea is for another party to keep the cache and update it periodically.

You are correct. The materialized view must be automatically updated by triggers in order to keep the view up to date.

Thanks Jakob for the suggestion. I think you got the point. I’ll do some more reading on mutexes, but could you possibly throw in some pseudocode, or a link to an example, for steps like “grab a read lock”, “release the read lock”, “grab a write lock” and “release the write lock”?

The part that has me puzzled is: Process A grabs the write lock and starts fetching the data. Process B starts to get the same data but discovers the write lock is already set. What does the code look like at this point, where B is waiting for A to release the lock?

Something like this

// var cacheLocksLock sync.Mutex protects cacheLocks
// var cacheLocks = map[string]*sync.RWMutex{} holds the per-key locks (initialize the map before use)
// var cacheKey string is the key for the current request

cacheLocksLock.Lock()
rwm, ok := cacheLocks[cacheKey]
if !ok {
  // create new lock, set it in the map and `rwm`
  rwm = &sync.RWMutex{}
  cacheLocks[cacheKey] = rwm
}
cacheLocksLock.Unlock()

// Try a fast cache read and hope for the best.
rwm.RLock()
// Multiple routines can be inside the read lock at the same time,
// as long as noone has the write lock.
data := cacheFor(cacheKey)
if data != nil {
  rwm.RUnlock()
  return data
}
rwm.RUnlock()

// OK, we might need to do the query ourselves.
rwm.Lock()
// Only one routine can be inside the write lock.
data = cacheFor(cacheKey)
if data != nil {
  // someone else took the same path as us but managed to grab the lock first
  // and did the query for us. Great!
  rwm.Unlock()
  return data
}

// We need to grab the data
data = performQuery()
setCachedData(cacheKey, data)
rwm.Unlock()

return data

Thanks Jakob. I tried a solution with the sync.WaitGroup object that seemed to work for the most part but had a few exceptions, and when I run it with the -race flag it reports some warnings. I tried implementing this and was getting a bunch of nil pointer panics. I think perhaps my implementation of "create new lock, set it in the map and rwm" was not quite right.
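
For reference, a minimal sketch of that lock-creation step (an assumption about where the nil pointers might come from, not a diagnosis): the cacheLocks map must be initialized before use, and the new lock must be both stored in the map and assigned to rwm while cacheLocksLock is still held.

var (
  cacheLocksLock sync.Mutex
  cacheLocks     = map[string]*sync.RWMutex{} // writing to an uninitialized (nil) map panics
)

// lockFor returns the RWMutex guarding cacheKey, creating it on first use.
func lockFor(cacheKey string) *sync.RWMutex {
  cacheLocksLock.Lock()
  defer cacheLocksLock.Unlock()
  rwm, ok := cacheLocks[cacheKey]
  if !ok {
    rwm = &sync.RWMutex{}
    cacheLocks[cacheKey] = rwm
  }
  return rwm
}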

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.