Task #5674 (closed)
Opened 13 years ago
Closed 13 years ago
Bug: web dead with "connector_pool_size" in log
| Reported by: | jamoore | Owned by: | cneves |
|---|---|---|---|
| Priority: | minor | Milestone: | OMERO-Beta4.3.2 |
| Component: | Web | Version: | n.a. |
| Keywords: | n.a. | Cc: | atarkowska, cneves |
| Resources: | n.a. | Referenced By: | n.a. |
| References: | n.a. | Remaining Time: | 0.0d |
| Sprint: | 2011-07-21 (1) | | |
Description
Mon, 30 May 2011 13:25:53 webgateway ERROR Critical error during connect, retrying after _purge
Mon, 30 May 2011 13:25:53 webgateway INFO reached connector_pool_size (70), size after purge: (1)
Mon, 30 May 2011 13:25:53 webgateway ERROR Critical error during connect, retrying after _purge
Mon, 30 May 2011 13:25:53 webgateway INFO reached connector_pool_size (70), size after purge: (1)
Mon, 30 May 2011 13:25:53 webgateway ERROR Critical error during connect, retrying after _purge
Mon, 30 May 2011 13:25:53 webgateway INFO reached connector_pool_size (70), size after purge: (1)
Mon, 30 May 2011 13:25:53 webgateway ERROR Critical error during connect, retrying after _purge
Mon, 30 May 2011 13:25:53 webgateway INFO reached connector_pool_size (70), size after purge: (1)
See also: https://www.openmicroscopy.org/community/viewtopic.php?f=5&t=464
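The log above shows the pattern at work: webgateway caches connectors up to a cap, and once `connector_pool_size` is reached it purges stale entries before retrying the connect. The following is a minimal illustrative sketch of that pattern, not the actual webgateway implementation; the class and method names here are invented for illustration.

```python
import logging

logger = logging.getLogger("webgateway")


class ConnectorPool(object):
    """Hypothetical sketch of a capped connector cache that purges
    stale entries when full (mirrors the log messages above; not the
    real webgateway code)."""

    def __init__(self, max_size=70):
        self.max_size = max_size
        self._pool = {}  # connection key -> connector object

    def _purge(self):
        # Drop connectors that no longer answer a keep-alive ping.
        for key in list(self._pool):
            if not self._pool[key].keepAlive():
                del self._pool[key]

    def add(self, key, connector):
        if len(self._pool) >= self.max_size:
            logger.info("reached connector_pool_size (%d), purging",
                        self.max_size)
            self._purge()
            logger.info("size after purge: (%d)", len(self._pool))
        self._pool[key] = connector
```

In the failure reported here, the purge kept running ("size after purge: (1)") yet the subsequent connect still failed, which is what the rest of the ticket investigates.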
Change History (26)
comment:1 Changed 13 years ago by jmoore
- Cc cxallan added
- Priority changed from critical to blocker
- Sprint set to 2011-06-02 (13)
comment:2 Changed 13 years ago by jmoore
NB in order to fix this, it is necessary to:
- Stop the application pool in IIS
- Restart the server
- Start the application pool.
Trying to stop and start just the application pool leads to an error.
comment:3 Changed 13 years ago by atarkowska
- Cc atarkowska added; cneves-x removed
- Owner changed from atarkowska to cneves-x
comment:4 Changed 13 years ago by atarkowska
I assume #5663 is related.
comment:5 Changed 13 years ago by cxallan
- Cc cneves-x added; cxallan removed
- Owner changed from cneves-x to cxallan
- Remaining Time set to 0.5
- Status changed from new to accepted
comment:6 Changed 13 years ago by cxallan
- Owner changed from cxallan to cneves-x
comment:7 Changed 13 years ago by cneves
This is more related to #5663 than to what is reported here, but if one logs in, waits over 10 minutes, and issues a refresh, the connection times out (as expected), which triggers a session join. This line in omero.gateway:
s = self.c.joinSession(self._sessionUuid)
Given that self.c is an omero.client instance and self._sessionUuid is a valid session id (I assume so, as omero.sessions.timeout=86400000, which should give a 24h timeout), this triggers executeCheckPasswordRO/RW with the session id in place of the user credentials. I'm not sure if this is expected, but I'd assume not.
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Executor.doWork -- ome.services.sessions.SessionManagerImpl.executeCheckPasswordRO(087e99ab-5929-4b68-8204-e73b10a38e4b)
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Args: [null, ome.tools.spring.InternalServiceFactory@1d414c5]
2011-05-31 22:22:59,652 INFO [ ome.security.basic.EventHandler] (l.Server-0) Auth: user=0,group=0,event=null(Sessions),sess=1b36bae7-9bab-4392-aed7-84c971191302
2011-05-31 22:22:59,652 INFO [ org.perf4j.TimingLogger] (l.Server-0) start[1306876979652] time[0] tag[omero.call.success.ome.services.sessions.SessionManagerImpl$8.doWork]
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Rslt: null
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Executor.doWork -- ome.services.sessions.SessionManagerImpl.executeCheckPasswordRW(087e99ab-5929-4b68-8204-e73b10a38e4b)
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Args: [null, ome.tools.spring.InternalServiceFactory@cd590b]
2011-05-31 22:22:59,730 INFO [ ome.security.basic.EventHandler] (l.Server-0) Auth: user=0,group=0,event=270628(Sessions),sess=1b36bae7-9bab-4392-aed7-84c971191302
2011-05-31 22:22:59,746 INFO [ ome.security.auth.LdapPasswordProvider] (l.Server-0) Default choice on create user: 087e99ab-5929-4b68-8204-e73b10a38e4b (ome.conditions.ApiUsageException: Cannot find unique DistinguishedName: found=0)
2011-05-31 22:22:59,746 INFO [ org.perf4j.TimingLogger] (l.Server-0) start[1306876979652] time[94] tag[omero.call.success.ome.services.sessions.SessionManagerImpl$9.doWork]
2011-05-31 22:22:59,746 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Rslt: false
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Executor.doWork -- ome.services.sessions.SessionManagerImpl.executeCheckPasswordRO(087e99ab-5929-4b68-8204-e73b10a38e4b)
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Args: [null, ome.tools.spring.InternalServiceFactory@178cb51]
2011-05-31 22:22:59,886 INFO [ ome.security.basic.EventHandler] (l.Server-1) Auth: user=0,group=0,event=null(Sessions),sess=1b36bae7-9bab-4392-aed7-84c971191302
2011-05-31 22:22:59,886 INFO [ org.perf4j.TimingLogger] (l.Server-1) start[1306876979886] time[0] tag[omero.call.success.ome.services.sessions.SessionManagerImpl$8.doWork]
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Rslt: null
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Executor.doWork -- ome.services.sessions.SessionManagerImpl.executeCheckPasswordRW(087e99ab-5929-4b68-8204-e73b10a38e4b)
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Args: [null, ome.tools.spring.InternalServiceFactory@1855788]
2011-05-31 22:22:59,933 INFO [ ome.security.basic.EventHandler] (l.Server-1) Auth: user=0,group=0,event=270629(Sessions),sess=1b36bae7-9bab-4392-aed7-84c971191302
2011-05-31 22:22:59,949 INFO [ ome.security.auth.LdapPasswordProvider] (l.Server-1) Default choice on create user: 087e99ab-5929-4b68-8204-e73b10a38e4b (ome.conditions.ApiUsageException: Cannot find unique DistinguishedName: found=0)
2011-05-31 22:22:59,949 WARN [ ome.security.auth.LoginAttemptListener] (l.Server-1) 2 failed logins for 087e99ab-5929-4b68-8204-e73b10a38e4b. Throttling for 3000
2011-05-31 22:23:02,949 INFO [ org.perf4j.TimingLogger] (l.Server-1) start[1306876979886] time[3063] tag[omero.call.success.ome.services.sessions.SessionManagerImpl$9.doWork]
2011-05-31 22:23:02,949 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Rslt: false
The above is Blitz-0.log from the moment we hit refresh in the browser; below is the same for OMEROweb.log.
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (session) root localhost 4064
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (request) None None None
Tue, 31 May 2011 22:22:59 webgateway DEBUG trying stored connection with userAgent: OMERO.web ckey: S:087e99ab-5929-4b68-8204-e73b10a38e4b#1
Tue, 31 May 2011 22:22:59 webgateway DEBUG [('S:087e99ab-5929-4b68-8204-e73b10a38e4b#1', <webclient.webclient_gateway.OmeroWebGateway object at 0x03786CF0>)]
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG Traceback (most recent call last):
  File "c:\OMERO-CURRENT\lib\python\omero\gateway\__init__.py", line 1278, in keepAlive
    return self.c.sf.keepAlive(self._proxies['admin']._obj)
  File "C:\OMERO-CURRENT\lib\python\omero_API_ice.py", line 312, in keepAlive
    return _M_omero.api.ServiceFactory._op_keepAlive.invoke(self, ((proxy, ), _ctx))
ConnectionLostException: Ice.ConnectionLostException: recv() returned zero
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG ... lost, reconnecting
Tue, 31 May 2011 22:22:59 webgateway INFO Failed keepalive for connection S:087e99ab-5929-4b68-8204-e73b10a38e4b#1
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (session) root localhost 4064
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (request) None None None
Tue, 31 May 2011 22:22:59 webgateway DEBUG trying stored connection with userAgent: OMERO.web ckey: S:087e99ab-5929-4b68-8204-e73b10a38e4b#1
Tue, 31 May 2011 22:22:59 webgateway DEBUG []
Tue, 31 May 2011 22:22:59 webgateway DEBUG creating new connection with "S:" (False)
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG Connect attempt, sUuid=087e99ab-5929-4b68-8204-e73b10a38e4b, group=None, self.sUuid=None
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG connected? False
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG going for joinSession
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG Error: exception ::Glacier2::PermissionDeniedException { reason = Password check failed }
Tue, 31 May 2011 22:22:59 webgateway DEBUG created new connection <webclient.webclient_gateway.OmeroWebGateway object at 0x03999270>
Tue, 31 May 2011 22:22:59 webgateway DEBUG connection failed with provided login information, bail out
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (session) root localhost 4064
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (request) None None None
Tue, 31 May 2011 22:22:59 webgateway DEBUG trying stored connection with userAgent: OMERO.web ckey: S:087e99ab-5929-4b68-8204-e73b10a38e4b#1
Tue, 31 May 2011 22:22:59 webgateway DEBUG []
Tue, 31 May 2011 22:22:59 webgateway DEBUG creating new connection with "S:" (False)
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG Connect attempt, sUuid=087e99ab-5929-4b68-8204-e73b10a38e4b, group=None, self.sUuid=None
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG connected? False
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG going for joinSession
Tue, 31 May 2011 22:23:02 blitz_gateway DEBUG Error: exception ::Glacier2::PermissionDeniedException { reason = Password check failed }
Tue, 31 May 2011 22:23:02 webgateway DEBUG created new connection <webclient.webclient_gateway.OmeroWebGateway object at 0x03999390>
Tue, 31 May 2011 22:23:02 webgateway DEBUG connection failed with provided login information, bail out
I added some debug output to omero.gateway, so I know the error comes from the joinSession call:
logger.debug("going for joinSession")
s = self.c.joinSession(self._sessionUuid)
logger.debug("going for detachOnDestroy")
s.detachOnDestroy()
logger.debug('joinSession(%s)' % self._sessionUuid)
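The pattern being debugged here, rejoining an existing server session by uuid and falling back to a fresh login when the server rejects it, can be sketched generically. This is an illustrative stand-in, not the real omero.gateway code: the `client`, `relogin`, and exception handling below are hypothetical (in OMERO the failure surfaces as Glacier2.PermissionDeniedException).

```python
def rejoin_or_relogin(client, session_uuid, relogin):
    """Try to re-attach to an existing server session; if the server
    rejects the stored uuid (e.g. the session expired or was purged),
    fall back to a full credential login.  Illustrative sketch only."""
    try:
        session = client.joinSession(session_uuid)
        # Keep the server-side session alive when this client disconnects.
        session.detachOnDestroy()
        return session
    except Exception:
        # In OMERO this would specifically catch
        # Glacier2.PermissionDeniedException ("Password check failed").
        return relogin()
```

The bug report here is that the rejoin path failed even though the session uuid should still have been valid, so the fallback never had correct credentials to work with.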
comment:8 Changed 13 years ago by jburel
- Sprint changed from 2011-06-02 (13) to 2011-06-16 (14)
Moved from sprint 2011-06-02 (13)
comment:9 Changed 13 years ago by jmoore
Looking into some other threads, we may want to turn on tracing:
comment:10 Changed 13 years ago by Carlos Neves <carlos@…>
(In [3e571409c1f8ab078df922915767776d9035572d/ome.git] on branch develop) adding some extra debug of webgateway (see #5674)
comment:11 Changed 13 years ago by cneves
The lock-up we had today seems to have nothing to do with the original issue, which just makes everything more fun...
Basically, IIS was accepting requests and pushing them to the correct worker, which would then push them to the isapi_wsgi bridge talking to Django. The thing is, the requests got started and appeared among the open requests for IIS, but never reached Django. There were 51 open requests when I looked; I triggered a few more and they all got caught by IIS but were never processed. No log entries exist for these hanging requests.
I took note of all Python process PIDs and issued an iisreset, after which the server became responsive again. The Python PIDs were the same, so I assumed Django was not restarted, but on further inspection it seems that the wsgi process does not appear as python in the process manager, but rather as w3wp.exe, the IIS worker. So back to square one: I've been trying very hard to kill the server, but no luck yet.
comment:12 Changed 13 years ago by Carlos Neves <carlos@…>
(In [f3a7b67fc6ff86b6872d22d9f24dc6abd282ecc9/ome.git] on branch develop) Forcing single threading on isapi_wsgi handler by using thread locks (see #5674)
comment:13 Changed 13 years ago by cneves
There is either a misunderstanding or something wrong with the isapi_wsgi wrapper, or with the way IIS calls it. From all that I read, I was assuming the worker was single-threaded because it was using isapi_wsgi.ISAPISimpleHandler, as opposed to the threaded pool handler, but further inspection made it very clear that a single worker process was being driven in a multithreaded fashion.
We don't currently support multithreading in the Python Blitz gateway (as per #5724), and I could not find a way to force single threading in IIS, so in order to mimic our deployment environments on Mac and Linux I added a threading.Lock() to the wsgi handler, forcing it to be single-threaded.
To prevent severe performance degradation, an extra step needs to be taken in the IIS configuration: adding more workers. The exact value needs some experimentation; 5 is a good default on Linux. The w2k8 server has been patched, configured and restarted, so you can try to lock it up again at your leisure.
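The workaround described above, serializing a WSGI app with a threading.Lock so IIS worker threads execute requests one at a time, can be sketched like this. This is illustrative, not the actual isapi_wsgi patch from the commit:

```python
import threading


class SerializedWSGIApp(object):
    """Sketch of the workaround: wrap a WSGI app in a lock so only one
    request runs at a time, even if the host drives the app from
    multiple threads.  Not the real isapi_wsgi change."""

    def __init__(self, app):
        self.app = app
        self._lock = threading.Lock()

    def __call__(self, environ, start_response):
        # Only one thread at a time may enter the wrapped application.
        with self._lock:
            return self.app(environ, start_response)
```

One caveat of this simple version: the lock is released when `__call__` returns, so a fully correct wrapper would also need to hold the lock while the response iterable is consumed, since WSGI apps may do work lazily during iteration. As the comment notes, serializing requests this way makes adding more worker processes essential for throughput.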
comment:14 Changed 13 years ago by Carlos Neves <carlos@…>
(In [a1a50797ada2530a887566fa7f1649ffae956e3f/ome.git] on branch develop) Handling settings.py loading of config.xml lock, which on windows was locking IIS instead of retrying (see #5674)
comment:15 Changed 13 years ago by cneves
It turns out today's lock-up is actually reproducible. The culprit is:
Fri, 10 Jun 2011 13:49:01 root ERROR Exception while loading configuration...
Traceback (most recent call last):
  File "c:\omero-current\lib\python\omeroweb\settings.py", line 81, in <module>
    CONFIG_XML.close()
  File "c:\omero-current\lib\python\omero\config.py", line 228, in close
    self._close_lock()
  File "c:\omero-current\lib\python\omero\config.py", line 87, in _close_lock
    os.remove("%s.lock" % self.filename)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\omero-current\\etc\\grid\\config.xml.lock'
So portalocker should always throw LockException, but on the win2k8 server we're testing on it doesn't (whether this is down to the Windows version, just Windows being Windows, Python on Windows, or the specific Python we're using, I didn't check); it throws WindowsError instead.
I then created a view that throws this error, and it got caught by Django and pushed to the browser as an error page, with no harm done to IIS. So whatever weird exceptions the running web apps throw should not kill IIS and should just be handled by Django's catch-all.
Finally, I ran a fresh IIS worker set, exercised it a bit so all workers were running, recycled the App Pool (which reloads the workers but keeps the same process for each) and ran one call that raised a WindowsError just before CONFIG_XML.close(). As one would assume at this point, an unhandled exception in the initialization stage, or module import, that leaks up the stack into the wsgi app will in fact lock the worker in question, so that explains this and the previous incarnation of the issue.
I have handled this particular case only, so as not to hide any further problems we may have on Windows, but we should consider a catch-all at the isapi_wsgi handler level, which is a cleaner solution than putting Windows-specific hacks into generic sources.
We may also consider changing portalocker to make sure the exception thrown is consistent across operating systems.
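Normalizing the platform-specific failure into one exception type, as suggested above, might look like the following. This is a hedged sketch, not a proposed patch to portalocker: the `LockException` stand-in and `remove_lock_file` helper are invented for illustration.

```python
import os


class LockException(Exception):
    """Illustrative stand-in for portalocker.LockException."""


def remove_lock_file(path):
    """Translate platform-specific removal failures (WindowsError on
    win32, plain OSError elsewhere) into a single LockException, so
    callers handle one exception type on every OS.  Sketch only."""
    try:
        os.remove(path)
    except OSError as e:  # WindowsError is a subclass of OSError
        raise LockException(str(e))
```

With this shape, settings.py would only ever need `except LockException` instead of also catching WindowsError, which is the inconsistency discussed in the following comments.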
I'm leaving the ticket open pending more tests of the system; the win2k8 server is all patched up, so please test it!
comment:16 Changed 13 years ago by jmoore
Carlos, either let's fix portalocker as a part of this ticket, or please open a new ticket before closing this one. Thanks.
comment:17 Changed 13 years ago by Carlos Neves <carlos@…>
(In [525d3f378b63b2dcfe76a8009d635f2cabf098c5/ome.git] on branch develop) Trying to work around windows error on removing ConfigXml lock file (see #5674).
comment:18 Changed 13 years ago by cneves
The last commit implements a retry on the lock-file removal in ConfigXml._close_lock. On closer inspection I see that portalocker is not the one throwing the WindowsError; it's the os.remove call that ConfigXml._close_lock makes right after releasing the file lock.
I'm assuming the error is due to a race condition, so I'm applying the wait-one-second-and-retry-for-up-to-10-seconds rule stolen from the enclosing settings.py call that catches the LockException and WindowsError. I couldn't trigger the error again, so I'm not sure the fix works as expected, but I did leave enough logging in place to check after the fact, so please hit the Windows server if you can find the time.
Also, I'm using the omero.util.concurrency.get_event tool blindly, to mimic what settings.py is doing, but this may not be required or wanted, so feel free to speak up if you feel I did wrong here.
I am leaving the exception handling for WindowsError in settings.py for the moment, but it should be removed once we verify this fix.
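The retry rule described above can be sketched as follows. This is illustrative rather than the committed code: the function name and parameters are invented, and a plain threading.Event stands in for omero.util.concurrency.get_event (which exists so a shutdown can interrupt the wait, as noted in the next comment).

```python
import os
import threading


def remove_with_retry(path, event=None, attempts=10, delay=1.0):
    """Sketch of the retry: attempt os.remove roughly once per second
    for up to `attempts` tries, waiting on an event so that a shutdown
    signal can interrupt the sleep.  Names are illustrative."""
    event = event or threading.Event()
    for _ in range(attempts):
        try:
            os.remove(path)
            return True
        except OSError:
            # wait() returns True if the event was set (shutdown
            # requested), in which case we stop retrying early.
            if event.wait(delay):
                break
    return False
```

As comment:20 records, this retry was later dropped in favour of simply ignoring the removal failure.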
comment:19 Changed 13 years ago by jmoore
With the locking, we can evaluate if the retry is necessary/helpful. It may just be that we have to leave the lock file in Windows (as long as that doesn't cause any problems).
Use of omero.util.concurrency.get_event is required/wanted, so that on any call to shutdown or sys.exit, the thread releases.
comment:20 Changed 13 years ago by Carlos Neves <carlos@…>
(In [b910070367a38e21e609fc1738ca7bb082be57dc/ome.git] on branch develop) Removing retry on failed os.remove of ConfigXml, just ignoring the error now (see #5674).
comment:21 Changed 13 years ago by cneves
- Priority changed from blocker to minor
The latest patch seems to be working fine:
Mon, 13 Jun 2011 16:25:02 root 4052 ERROR Exception while loading configuration retrying...
Traceback (most recent call last):
  File "c:\omero-current\lib\python\omeroweb\settings.py", line 84, in <module>
    CONFIG_XML = omero.config.ConfigXml(CONFIG_XML)
  File "c:\omero-current\lib\python\omero\config.py", line 58, in __init__
    portalocker.lock(self.lock, portalocker.LOCK_NB|portalocker.LOCK_EX)
  File "C:\OMERO-CURRENT\lib\python\portalocker.py", line 93, in lock
    raise LockException(LockException.LOCK_FAILED, exc_value[2])
LockException: (1, 'The process cannot access the file because another process has locked a portion of the file.')
Mon, 13 Jun 2011 16:25:02 ConfigXml 7700 ERROR Failed to removed lock file, ignoring
Traceback (most recent call last):
  File "c:\omero-current\lib\python\omero\config.py", line 92, in _close_lock
    os.remove("%s.lock" % self.filename)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\omero-current\\etc\\grid\\config.xml.lock'
(snip)
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG INIT
Mon, 13 Jun 2011 16:25:03 views-admin 7700 INFO INIT '7700'
Mon, 13 Jun 2011 16:25:03 views-web 7700 INFO INIT '7700'
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG :: (session) root localhost 4064
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG :: (request) None None None
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG trying stored connection with userAgent: OMERO.web ckey: S:9de8fbd9-f9b7-4ff1-906b-f96f2d6d2a7f#1
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG []
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG creating new connection with ckey "S:", sUuid "9de8fbd9-f9b7-4ff1-906b-f96f2d6d2a7f" (False)
Mon, 13 Jun 2011 16:25:03 ConfigXml 4052 ERROR Failed to removed lock file, ignoring
Traceback (most recent call last):
  File "c:\omero-current\lib\python\omero\config.py", line 92, in _close_lock
    os.remove("%s.lock" % self.filename)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\omero-current\\etc\\grid\\config.xml.lock'
Mon, 13 Jun 2011 16:25:03 root 4052 DEBUG ADMINS = [] (source:default)
The top exception gets retried one second later and succeeds; the bottom one is on the removal of the lock file and gets ignored, since the config file has already been read and the existence of the lock file doesn't prevent the next lock from succeeding.
The only thing left is to remove the exception handling for WindowsError in settings.py, as it no longer makes sense. To avoid introducing more variables at this stage I'll leave this until after the release, so I'm moving the ticket from blocker to minor for later processing.
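The final behaviour described here, making lock-file removal best-effort, amounts to logging and swallowing the error, since a leftover lock file does not block the next lock. A minimal sketch (the function name is illustrative, not the actual ConfigXml._close_lock code):

```python
import logging
import os

logger = logging.getLogger("ConfigXml")


def close_lock(path):
    """Best-effort removal of the lock file: on Windows another process
    may still hold it, so a failure is logged and ignored rather than
    propagated up into the web server.  Illustrative sketch."""
    try:
        os.remove(path)
    except OSError:  # WindowsError [Error 32] on win32
        logger.error("Failed to remove lock file, ignoring", exc_info=True)
```

This matches the log above, where the "Failed to removed lock file, ignoring" entries appear but startup proceeds normally.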
comment:22 Changed 13 years ago by cneves
- Remaining Time changed from 0.5 to 0.1
comment:23 Changed 13 years ago by jburel
- Milestone changed from OMERO-Beta4.3 to OMERO-Beta4.3.1
- Sprint changed from 2011-06-16 (14) to 2011-06-30 (1)
Moved from sprint 2011-06-16 (14)
comment:24 Changed 13 years ago by jburel
- Milestone changed from OMERO-Beta4.3.1 to OMERO-Beta4.3.2
- Sprint changed from 2011-07-07 (1) to 2011-07-21 (1)
Moved from sprint 2011-07-07 (1)
comment:25 Changed 13 years ago by Carlos Neves <carlos@…>
(In [2ebfba1f2c419b801e74381f83a45deb03ce0a70/ome.git] on branch develop) Cleaning up settings.py (see #5674)
comment:26 Changed 13 years ago by cneves
- Remaining Time changed from 0.1 to 0
- Resolution set to fixed
- Status changed from accepted to closed