We have an ongoing issue with the eRI’s identity service, which is causing a variety of inconsistent user experiences on the platform (and may yet be behind other problems that we haven’t linked to it). Unfortunately, this issue has been ongoing since at least Q4'24. We recently realised there was a gap in our Statuspage communications about it, hence the degraded state now recorded against the Identity Service.
We have been working with vendor support from Red Hat over the last few months and are in the process of engaging them to do further analysis (and assist with fixes and/or mitigations).
This issue results in symptoms such as:

- OnDemand service being slow to load and launch sessions.
- Globus file listing timeouts.
- The filesystem feeling slow, with commands such as "ls -l" taking a long time.
  - This is because full group resolution takes a long time (~1 minute or more) for an initial (non-cached) query and then completes quickly (until local caches expire). This may result in users experiencing slow or inconsistent performance for IO-heavy workloads when group resolution is involved in the file access or metadata operations. In some cases this may be mitigated by using numeric user and group IDs instead (e.g. "ls -n"). Many different shell commands and interactions might be affected.
- Dataset and Project access inconsistencies on different nodes. In some cases the local caches are populated with incomplete data (due to upstream timeouts), which leaves a machine with an incomplete group resolution for an impacted user. The user might experience this as an inability to access data in a Dataset that they are a member of.
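
If you want to check whether slow group resolution is what you are seeing, a rough sketch like the following may help. It times a full group lookup for one user, including the ID-to-name resolution that commands such as "ls -l" perform. The function name `time_group_resolution` is our own illustration, not part of any platform tooling, and the sketch assumes a Linux node with Python 3 available.

```python
import grp
import os
import pwd
import time


def time_group_resolution(username):
    """Return (seconds_taken, group_names) for a full group lookup of username.

    Hypothetical helper for illustration only; not an official diagnostic.
    """
    start = time.perf_counter()
    # Look up the user's primary group, then the full list of group IDs.
    primary_gid = pwd.getpwnam(username).pw_gid
    gids = os.getgrouplist(username, primary_gid)
    names = []
    for gid in gids:
        try:
            # Resolve each numeric ID to a name, as "ls -l" would.
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # Fall back to the numeric ID if the name cannot be resolved.
            names.append(str(gid))
    return time.perf_counter() - start, names
```

Calling `time_group_resolution` with your own username twice in a row, and comparing both the elapsed times and the number of groups returned, can indicate whether the first (non-cached) lookup is slow or whether a node is holding an incomplete group list. Including those numbers and the node name in a support request would be useful.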
If you are experiencing any of these issues or something like them, please do still report them to support so we can effectively track the impact and look at mitigations for your particular issue.