Task #5674 (closed)
Opened 13 years ago
Closed 13 years ago
Bug: web dead with "connector_pool_size" in log
| Reported by: | jamoore | Owned by: | cneves |
|---|---|---|---|
| Priority: | minor | Milestone: | OMERO-Beta4.3.2 |
| Component: | Web | Version: | n.a. |
| Keywords: | n.a. | Cc: | atarkowska, cneves |
| Resources: | n.a. | Referenced By: | n.a. |
| References: | n.a. | Remaining Time: | 0.0d |
| Sprint: | 2011-07-21 (1) | | |
Description
Mon, 30 May 2011 13:25:53 webgateway ERROR Critical error during connect, retrying after _purge
Mon, 30 May 2011 13:25:53 webgateway INFO reached connector_pool_size (70), size after purge: (1)
Mon, 30 May 2011 13:25:53 webgateway ERROR Critical error during connect, retrying after _purge
Mon, 30 May 2011 13:25:53 webgateway INFO reached connector_pool_size (70), size after purge: (1)
Mon, 30 May 2011 13:25:53 webgateway ERROR Critical error during connect, retrying after _purge
Mon, 30 May 2011 13:25:53 webgateway INFO reached connector_pool_size (70), size after purge: (1)
Mon, 30 May 2011 13:25:53 webgateway ERROR Critical error during connect, retrying after _purge
Mon, 30 May 2011 13:25:53 webgateway INFO reached connector_pool_size (70), size after purge: (1)
See also: https://www.openmicroscopy.org/community/viewtopic.php?f=5&t=464
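The log above shows the pattern at work: webgateway caches connectors up to a cap, and once `connector_pool_size` is reached it purges stale entries before retrying the connect. The following is a minimal illustrative sketch of that pattern, not the actual webgateway implementation; the class and method names here are invented for illustration.

```python
import logging

logger = logging.getLogger("webgateway")


class ConnectorPool(object):
    """Hypothetical sketch of a capped connector cache that purges
    stale entries when full (mirrors the log messages above; not the
    real webgateway code)."""

    def __init__(self, max_size=70):
        self.max_size = max_size
        self._pool = {}  # connection key -> connector object

    def _purge(self):
        # Drop connectors that no longer answer a keep-alive ping.
        for key in list(self._pool):
            if not self._pool[key].keepAlive():
                del self._pool[key]

    def add(self, key, connector):
        if len(self._pool) >= self.max_size:
            logger.info("reached connector_pool_size (%d), purging",
                        self.max_size)
            self._purge()
            logger.info("size after purge: (%d)", len(self._pool))
        self._pool[key] = connector
```

In the failure reported here, the purge kept running ("size after purge: (1)") yet the subsequent connect still failed, which is what the rest of the ticket investigates.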
Change History (26)
comment:1 Changed 13 years ago by jmoore
- Cc cxallan added
- Priority changed from critical to blocker
- Sprint set to 2011-06-02 (13)
comment:2 Changed 13 years ago by jmoore
NB in order to fix this, it is necessary to:
- Stop the application pool in IIS
- Restart the server
- Start the application pool.
Trying to stop and start just the application pool leads to an error.
comment:3 Changed 13 years ago by atarkowska
- Cc atarkowska added; cneves-x removed
- Owner changed from atarkowska to cneves-x
comment:4 Changed 13 years ago by atarkowska
I assume #5663 is related.
comment:5 Changed 13 years ago by cxallan
- Cc cneves-x added; cxallan removed
- Owner changed from cneves-x to cxallan
- Remaining Time set to 0.5
- Status changed from new to accepted
comment:6 Changed 13 years ago by cxallan
- Owner changed from cxallan to cneves-x
comment:7 Changed 13 years ago by cneves
This is more related to #5663 than to what is reported here, but if one logs in, waits over 10 minutes, and issues a refresh, the connection times out (as expected), which triggers a session join. This line in omero.gateway:
s = self.c.joinSession(self._sessionUuid)
Given that self.c is an omero.client instance and self._sessionUuid is a valid session id (I assume so, as omero.sessions.timeout=86400000, which should give a 24h timeout), this triggers executeCheckPasswordRO/RW with the session id in place of the user credentials. I'm not sure if this is expected, but I'd assume not.
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Executor.doWork -- ome.services.sessions.SessionManagerImpl.executeCheckPasswordRO(087e99ab-5929-4b68-8204-e73b10a38e4b)
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Args: [null, ome.tools.spring.InternalServiceFactory@1d414c5]
2011-05-31 22:22:59,652 INFO [ ome.security.basic.EventHandler] (l.Server-0) Auth: user=0,group=0,event=null(Sessions),sess=1b36bae7-9bab-4392-aed7-84c971191302
2011-05-31 22:22:59,652 INFO [ org.perf4j.TimingLogger] (l.Server-0) start[1306876979652] time[0] tag[omero.call.success.ome.services.sessions.SessionManagerImpl$8.doWork]
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Rslt: null
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Executor.doWork -- ome.services.sessions.SessionManagerImpl.executeCheckPasswordRW(087e99ab-5929-4b68-8204-e73b10a38e4b)
2011-05-31 22:22:59,652 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Args: [null, ome.tools.spring.InternalServiceFactory@cd590b]
2011-05-31 22:22:59,730 INFO [ ome.security.basic.EventHandler] (l.Server-0) Auth: user=0,group=0,event=270628(Sessions),sess=1b36bae7-9bab-4392-aed7-84c971191302
2011-05-31 22:22:59,746 INFO [ ome.security.auth.LdapPasswordProvider] (l.Server-0) Default choice on create user: 087e99ab-5929-4b68-8204-e73b10a38e4b (ome.conditions.ApiUsageException: Cannot find unique DistinguishedName: found=0)
2011-05-31 22:22:59,746 INFO [ org.perf4j.TimingLogger] (l.Server-0) start[1306876979652] time[94] tag[omero.call.success.ome.services.sessions.SessionManagerImpl$9.doWork]
2011-05-31 22:22:59,746 INFO [ ome.services.util.ServiceHandler] (l.Server-0) Rslt: false
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Executor.doWork -- ome.services.sessions.SessionManagerImpl.executeCheckPasswordRO(087e99ab-5929-4b68-8204-e73b10a38e4b)
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Args: [null, ome.tools.spring.InternalServiceFactory@178cb51]
2011-05-31 22:22:59,886 INFO [ ome.security.basic.EventHandler] (l.Server-1) Auth: user=0,group=0,event=null(Sessions),sess=1b36bae7-9bab-4392-aed7-84c971191302
2011-05-31 22:22:59,886 INFO [ org.perf4j.TimingLogger] (l.Server-1) start[1306876979886] time[0] tag[omero.call.success.ome.services.sessions.SessionManagerImpl$8.doWork]
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Rslt: null
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Executor.doWork -- ome.services.sessions.SessionManagerImpl.executeCheckPasswordRW(087e99ab-5929-4b68-8204-e73b10a38e4b)
2011-05-31 22:22:59,886 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Args: [null, ome.tools.spring.InternalServiceFactory@1855788]
2011-05-31 22:22:59,933 INFO [ ome.security.basic.EventHandler] (l.Server-1) Auth: user=0,group=0,event=270629(Sessions),sess=1b36bae7-9bab-4392-aed7-84c971191302
2011-05-31 22:22:59,949 INFO [ ome.security.auth.LdapPasswordProvider] (l.Server-1) Default choice on create user: 087e99ab-5929-4b68-8204-e73b10a38e4b (ome.conditions.ApiUsageException: Cannot find unique DistinguishedName: found=0)
2011-05-31 22:22:59,949 WARN [ ome.security.auth.LoginAttemptListener] (l.Server-1) 2 failed logins for 087e99ab-5929-4b68-8204-e73b10a38e4b. Throttling for 3000
2011-05-31 22:23:02,949 INFO [ org.perf4j.TimingLogger] (l.Server-1) start[1306876979886] time[3063] tag[omero.call.success.ome.services.sessions.SessionManagerImpl$9.doWork]
2011-05-31 22:23:02,949 INFO [ ome.services.util.ServiceHandler] (l.Server-1) Rslt: false
The above is Blitz-0.log from the moment we hit refresh in the browser; below is the same for OMEROweb.log.
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (session) root localhost 4064
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (request) None None None
Tue, 31 May 2011 22:22:59 webgateway DEBUG trying stored connection with userAgent: OMERO.web ckey: S:087e99ab-5929-4b68-8204-e73b10a38e4b#1
Tue, 31 May 2011 22:22:59 webgateway DEBUG [('S:087e99ab-5929-4b68-8204-e73b10a38e4b#1', <webclient.webclient_gateway.OmeroWebGateway object at 0x03786CF0>)]
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG Traceback (most recent call last):
  File "c:\OMERO-CURRENT\lib\python\omero\gateway\__init__.py", line 1278, in keepAlive
    return self.c.sf.keepAlive(self._proxies['admin']._obj)
  File "C:\OMERO-CURRENT\lib\python\omero_API_ice.py", line 312, in keepAlive
    return _M_omero.api.ServiceFactory._op_keepAlive.invoke(self, ((proxy, ), _ctx))
ConnectionLostException: Ice.ConnectionLostException: recv() returned zero
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG ... lost, reconnecting
Tue, 31 May 2011 22:22:59 webgateway INFO Failed keepalive for connection S:087e99ab-5929-4b68-8204-e73b10a38e4b#1
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (session) root localhost 4064
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (request) None None None
Tue, 31 May 2011 22:22:59 webgateway DEBUG trying stored connection with userAgent: OMERO.web ckey: S:087e99ab-5929-4b68-8204-e73b10a38e4b#1
Tue, 31 May 2011 22:22:59 webgateway DEBUG []
Tue, 31 May 2011 22:22:59 webgateway DEBUG creating new connection with "S:" (False)
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG Connect attempt, sUuid=087e99ab-5929-4b68-8204-e73b10a38e4b, group=None, self.sUuid=None
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG connected? False
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG going for joinSession
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG Error: exception ::Glacier2::PermissionDeniedException { reason = Password check failed }
Tue, 31 May 2011 22:22:59 webgateway DEBUG created new connection <webclient.webclient_gateway.OmeroWebGateway object at 0x03999270>
Tue, 31 May 2011 22:22:59 webgateway DEBUG connection failed with provided login information, bail out
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (session) root localhost 4064
Tue, 31 May 2011 22:22:59 webgateway DEBUG :: (request) None None None
Tue, 31 May 2011 22:22:59 webgateway DEBUG trying stored connection with userAgent: OMERO.web ckey: S:087e99ab-5929-4b68-8204-e73b10a38e4b#1
Tue, 31 May 2011 22:22:59 webgateway DEBUG []
Tue, 31 May 2011 22:22:59 webgateway DEBUG creating new connection with "S:" (False)
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG Connect attempt, sUuid=087e99ab-5929-4b68-8204-e73b10a38e4b, group=None, self.sUuid=None
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG connected? False
Tue, 31 May 2011 22:22:59 blitz_gateway DEBUG going for joinSession
Tue, 31 May 2011 22:23:02 blitz_gateway DEBUG Error: exception ::Glacier2::PermissionDeniedException { reason = Password check failed }
Tue, 31 May 2011 22:23:02 webgateway DEBUG created new connection <webclient.webclient_gateway.OmeroWebGateway object at 0x03999390>
Tue, 31 May 2011 22:23:02 webgateway DEBUG connection failed with provided login information, bail out
I added some debug output to omero.gateway, so I know the error comes from the joinSession call:
logger.debug("going for joinSession")
s = self.c.joinSession(self._sessionUuid)
logger.debug("going for detachOnDestroy")
s.detachOnDestroy()
logger.debug('joinSession(%s)' % self._sessionUuid)
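The pattern being debugged here, rejoining an existing server session by uuid and falling back to a fresh login when the server rejects it, can be sketched generically. This is an illustrative stand-in, not the real omero.gateway code: the `client`, `relogin`, and exception handling below are hypothetical (in OMERO the failure surfaces as Glacier2.PermissionDeniedException).

```python
def rejoin_or_relogin(client, session_uuid, relogin):
    """Try to re-attach to an existing server session; if the server
    rejects the stored uuid (e.g. the session expired or was purged),
    fall back to a full credential login.  Illustrative sketch only."""
    try:
        session = client.joinSession(session_uuid)
        # Keep the server-side session alive when this client disconnects.
        session.detachOnDestroy()
        return session
    except Exception:
        # In OMERO this would specifically catch
        # Glacier2.PermissionDeniedException ("Password check failed").
        return relogin()
```

The bug report here is that the rejoin path failed even though the session uuid should still have been valid, so the fallback never had correct credentials to work with.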
comment:8 Changed 13 years ago by jburel
- Sprint changed from 2011-06-02 (13) to 2011-06-16 (14)
Moved from sprint 2011-06-02 (13)
comment:9 Changed 13 years ago by jmoore
Looking into some other threads, we may want to turn on tracing:
comment:10 Changed 13 years ago by Carlos Neves <carlos@…>
(In [3e571409c1f8ab078df922915767776d9035572d/ome.git] on branch develop) adding some extra debug of webgateway (see #5674)
comment:11 Changed 13 years ago by cneves
The lock-up we had today seems to have nothing to do with the original issue, which just makes everything more fun...
Basically, IIS was accepting requests and pushing them to the correct worker, which would then push them to the isapi_wsgi bridge talking to Django. The thing is, the requests got started and appeared among the open requests for IIS, but never reached Django. There were 51 open requests when I looked; I triggered a few more and they all got caught by IIS but were never processed. No log entries exist for these hanging requests.
I took note of all Python process PIDs and issued an iisreset, after which the server became responsive again. The Python PIDs were the same, so I assumed Django was not restarted, but on further inspection it seems that the wsgi process does not appear as python in the process manager, but rather as w3wp.exe, the IIS worker. So back to square one: I've been trying very hard to kill the server, but no luck yet.
comment:12 Changed 13 years ago by Carlos Neves <carlos@…>
(In [f3a7b67fc6ff86b6872d22d9f24dc6abd282ecc9/ome.git] on branch develop) Forcing single threading on isapi_wsgi handler by using thread locks (see #5674)
comment:13 Changed 13 years ago by cneves
There is either a misunderstanding or something wrong with the isapi_wsgi wrapper, or with the way IIS calls it. From all that I read, I was assuming the worker was single-threaded because it was using isapi_wsgi.ISAPISimpleHandler, as opposed to the threaded pool handler, but further inspection made it very clear that a single worker process was being driven in a multithreaded fashion.
We don't currently support multithreading in the Python Blitz gateway (as per #5724), and I could not find a way to force single threading in IIS, so in order to mimic our deployment environments on Mac and Linux I added a threading.Lock() to the wsgi handler, forcing it to be single-threaded.
To prevent severe performance degradation, an extra step needs to be taken in the IIS configuration: adding more workers. The exact value needs some experimentation; 5 is a good default on Linux. The w2k8 server has been patched, configured and restarted, so you can try to lock it up again at your leisure.
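The workaround described above, serializing a WSGI app with a threading.Lock so IIS worker threads execute requests one at a time, can be sketched like this. This is illustrative, not the actual isapi_wsgi patch from the commit:

```python
import threading


class SerializedWSGIApp(object):
    """Sketch of the workaround: wrap a WSGI app in a lock so only one
    request runs at a time, even if the host drives the app from
    multiple threads.  Not the real isapi_wsgi change."""

    def __init__(self, app):
        self.app = app
        self._lock = threading.Lock()

    def __call__(self, environ, start_response):
        # Only one thread at a time may enter the wrapped application.
        with self._lock:
            return self.app(environ, start_response)
```

One caveat of this simple version: the lock is released when `__call__` returns, so a fully correct wrapper would also need to hold the lock while the response iterable is consumed, since WSGI apps may do work lazily during iteration. As the comment notes, serializing requests this way makes adding more worker processes essential for throughput.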
comment:14 Changed 13 years ago by Carlos Neves <carlos@…>
(In [a1a50797ada2530a887566fa7f1649ffae956e3f/ome.git] on branch develop) Handling settings.py loading of config.xml lock, which on windows was locking IIS instead of retrying (see #5674)
comment:15 Changed 13 years ago by cneves
It turns out today's lock-up is actually reproducible. The culprit is:
Fri, 10 Jun 2011 13:49:01 root ERROR Exception while loading configuration...
Traceback (most recent call last):
  File "c:\omero-current\lib\python\omeroweb\settings.py", line 81, in <module>
    CONFIG_XML.close()
  File "c:\omero-current\lib\python\omero\config.py", line 228, in close
    self._close_lock()
  File "c:\omero-current\lib\python\omero\config.py", line 87, in _close_lock
    os.remove("%s.lock" % self.filename)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\omero-current\\etc\\grid\\config.xml.lock'
So portalocker should always throw LockException, but on the win2k8 server we're testing on it doesn't (whether this is down to the Windows version, just Windows being Windows, Python on Windows, or the specific Python we're using, I didn't check); it throws WindowsError instead.
I then created a view that throws this error, and it got caught by Django and pushed to the browser as an error page, with no harm done to IIS. So whatever weird exceptions the running web apps throw should not kill IIS and should just be handled by Django's catch-all.
Finally, I ran a fresh IIS worker set, exercised it a bit so all workers were running, recycled the App Pool (which reloads the workers but keeps the same process for each) and ran one call that raised a WindowsError just before CONFIG_XML.close(). As one would assume at this point, an unhandled exception in the initialization stage, or module import, that leaks up the stack into the wsgi app will in fact lock the worker in question, so that explains this and the previous incarnation of the issue.
I have handled this particular case only, so as not to hide any further problems we may have on Windows, but we should consider a catch-all at the isapi_wsgi handler level, which is a cleaner solution than putting Windows-specific hacks into generic sources.
We may also consider changing portalocker to make sure the exception thrown is consistent across operating systems.
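Normalizing the platform-specific failure into one exception type, as suggested above, might look like the following. This is a hedged sketch, not a proposed patch to portalocker: the `LockException` stand-in and `remove_lock_file` helper are invented for illustration.

```python
import os


class LockException(Exception):
    """Illustrative stand-in for portalocker.LockException."""


def remove_lock_file(path):
    """Translate platform-specific removal failures (WindowsError on
    win32, plain OSError elsewhere) into a single LockException, so
    callers handle one exception type on every OS.  Sketch only."""
    try:
        os.remove(path)
    except OSError as e:  # WindowsError is a subclass of OSError
        raise LockException(str(e))
```

With this shape, settings.py would only ever need `except LockException` instead of also catching WindowsError, which is the inconsistency discussed in the following comments.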
I'm leaving the ticket open pending more tests of the system; the win2k8 server is all patched up, so please test it!
comment:16 Changed 13 years ago by jmoore
Carlos, either let's fix portalocker as a part of this ticket, or please open a new ticket before closing this one. Thanks.
comment:17 Changed 13 years ago by Carlos Neves <carlos@…>
(In [525d3f378b63b2dcfe76a8009d635f2cabf098c5/ome.git] on branch develop) Trying to work around windows error on removing ConfigXml lock file (see #5674).
comment:18 Changed 13 years ago by cneves
The last commit implements a retry on the lock-file removal in ConfigXml._close_lock. On closer inspection I see that portalocker is not the one throwing the WindowsError; it's the os.remove call that ConfigXml._close_lock makes right after releasing the file lock.
I'm assuming the error is due to a race condition, so I'm applying the wait-one-second-and-retry-for-up-to-10-seconds rule stolen from the enclosing settings.py call that catches the LockException and WindowsError. I couldn't trigger the error again, so I'm not sure the fix works as expected, but I did leave enough logging in place to check after the fact, so please hit the Windows server if you can find the time.
Also, I'm using the omero.util.concurrency.get_event tool blindly, to mimic what settings.py is doing, but this may not be required or wanted, so feel free to speak up if you feel I did wrong here.
I am leaving the exception handling for WindowsError in settings.py for the moment, but it should be removed once we verify this fix.
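The retry rule described above can be sketched as follows. This is illustrative rather than the committed code: the function name and parameters are invented, and a plain threading.Event stands in for omero.util.concurrency.get_event (which exists so a shutdown can interrupt the wait, as noted in the next comment).

```python
import os
import threading


def remove_with_retry(path, event=None, attempts=10, delay=1.0):
    """Sketch of the retry: attempt os.remove roughly once per second
    for up to `attempts` tries, waiting on an event so that a shutdown
    signal can interrupt the sleep.  Names are illustrative."""
    event = event or threading.Event()
    for _ in range(attempts):
        try:
            os.remove(path)
            return True
        except OSError:
            # wait() returns True if the event was set (shutdown
            # requested), in which case we stop retrying early.
            if event.wait(delay):
                break
    return False
```

As comment:20 records, this retry was later dropped in favour of simply ignoring the removal failure.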
comment:19 Changed 13 years ago by jmoore
With the locking, we can evaluate if the retry is necessary/helpful. It may just be that we have to leave the lock file in Windows (as long as that doesn't cause any problems).
Use of omero.util.concurrency.get_event is required/wanted, so that on any call to shutdown or sys.exit, the thread releases.
comment:20 Changed 13 years ago by Carlos Neves <carlos@…>
(In [b910070367a38e21e609fc1738ca7bb082be57dc/ome.git] on branch develop) Removing retry on failed os.remove of ConfigXml, just ignoring the error now (see #5674).
comment:21 Changed 13 years ago by cneves
- Priority changed from blocker to minor
The latest patch seems to be working fine:
Mon, 13 Jun 2011 16:25:02 root 4052 ERROR Exception while loading configuration retrying...
Traceback (most recent call last):
  File "c:\omero-current\lib\python\omeroweb\settings.py", line 84, in <module>
    CONFIG_XML = omero.config.ConfigXml(CONFIG_XML)
  File "c:\omero-current\lib\python\omero\config.py", line 58, in __init__
    portalocker.lock(self.lock, portalocker.LOCK_NB|portalocker.LOCK_EX)
  File "C:\OMERO-CURRENT\lib\python\portalocker.py", line 93, in lock
    raise LockException(LockException.LOCK_FAILED, exc_value[2])
LockException: (1, 'The process cannot access the file because another process has locked a portion of the file.')
Mon, 13 Jun 2011 16:25:02 ConfigXml 7700 ERROR Failed to removed lock file, ignoring
Traceback (most recent call last):
  File "c:\omero-current\lib\python\omero\config.py", line 92, in _close_lock
    os.remove("%s.lock" % self.filename)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\omero-current\\etc\\grid\\config.xml.lock'
(snip)
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG INIT
Mon, 13 Jun 2011 16:25:03 views-admin 7700 INFO INIT '7700'
Mon, 13 Jun 2011 16:25:03 views-web 7700 INFO INIT '7700'
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG :: (session) root localhost 4064
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG :: (request) None None None
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG trying stored connection with userAgent: OMERO.web ckey: S:9de8fbd9-f9b7-4ff1-906b-f96f2d6d2a7f#1
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG []
Mon, 13 Jun 2011 16:25:03 webgateway 7700 DEBUG creating new connection with ckey "S:", sUuid "9de8fbd9-f9b7-4ff1-906b-f96f2d6d2a7f" (False)
Mon, 13 Jun 2011 16:25:03 ConfigXml 4052 ERROR Failed to removed lock file, ignoring
Traceback (most recent call last):
  File "c:\omero-current\lib\python\omero\config.py", line 92, in _close_lock
    os.remove("%s.lock" % self.filename)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\omero-current\\etc\\grid\\config.xml.lock'
Mon, 13 Jun 2011 16:25:03 root 4052 DEBUG ADMINS = [] (source:default)
The top exception gets retried one second later and succeeds; the bottom one is on the removal of the lock file and gets ignored, since the config file has already been read and the existence of the lock file doesn't prevent the next lock from succeeding.
The only thing left is to remove the exception handling for WindowsError in settings.py, as it no longer makes sense. To avoid introducing more variables at this stage I'll leave this until after the release, so I'm moving the ticket from blocker to minor for later processing.
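The final behaviour described here, making lock-file removal best-effort, amounts to logging and swallowing the error, since a leftover lock file does not block the next lock. A minimal sketch (the function name is illustrative, not the actual ConfigXml._close_lock code):

```python
import logging
import os

logger = logging.getLogger("ConfigXml")


def close_lock(path):
    """Best-effort removal of the lock file: on Windows another process
    may still hold it, so a failure is logged and ignored rather than
    propagated up into the web server.  Illustrative sketch."""
    try:
        os.remove(path)
    except OSError:  # WindowsError [Error 32] on win32
        logger.error("Failed to remove lock file, ignoring", exc_info=True)
```

This matches the log above, where the "Failed to removed lock file, ignoring" entries appear but startup proceeds normally.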
comment:22 Changed 13 years ago by cneves
- Remaining Time changed from 0.5 to 0.1
comment:23 Changed 13 years ago by jburel
- Milestone changed from OMERO-Beta4.3 to OMERO-Beta4.3.1
- Sprint changed from 2011-06-16 (14) to 2011-06-30 (1)
Moved from sprint 2011-06-16 (14)
comment:24 Changed 13 years ago by jburel
- Milestone changed from OMERO-Beta4.3.1 to OMERO-Beta4.3.2
- Sprint changed from 2011-07-07 (1) to 2011-07-21 (1)
Moved from sprint 2011-07-07 (1)
comment:25 Changed 13 years ago by Carlos Neves <carlos@…>
(In [2ebfba1f2c419b801e74381f83a45deb03ce0a70/ome.git] on branch develop) Cleaning up settings.py (see #5674)
comment:26 Changed 13 years ago by cneves
- Remaining Time changed from 0.1 to 0
- Resolution set to fixed
- Status changed from accepted to closed