drop support for compute capability <= 7.0 for newer cuDNN versions #170
bedroge wants to merge 1 commit into EESSI:main
Conversation
Ultimately we could make the same kind of lookup table as for CUDA. I initially started working on it, but it's a lot of work, and as mentioned, it's not really clear what is and isn't supported. We could also consider a simpler lookup table with just the min and max supported CCs per X.Y.Z version? But then again, https://docs.nvidia.com/deeplearning/cudnn/backend/v9.19.0/reference/support-matrix.html says that 12.1 is not supported, while the binaries do seem to indicate that it is supported, so it's very confusing and unclear...
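To make the "min+max per version" idea concrete, here is a minimal sketch of what such a simplified lookup table could look like. The version keys and CC ranges below are placeholders, not values taken from NVIDIA's support matrix, and the function name is my own:

```python
# Hypothetical sketch of a simplified per-version lookup table for cuDNN,
# mapping an X.Y version prefix to a (min, max) range of supported compute
# capabilities. All values here are PLACEHOLDERS, not verified data.
CUDNN_SUPPORTED_CC_RANGE = {
    '9.1': (5.0, 9.0),   # placeholder range
    '9.5': (7.5, 12.0),  # placeholder range
}


def cudnn_supports_cc(cudnn_version, cc):
    """Return True if compute capability `cc` falls within the supported
    range for the given cuDNN version. Unknown versions are treated as
    supported, so an incomplete table never wrongly rejects a build."""
    prefix = '.'.join(cudnn_version.split('.')[:2])
    if prefix not in CUDNN_SUPPORTED_CC_RANGE:
        return True
    lo, hi = CUDNN_SUPPORTED_CC_RANGE[prefix]
    return lo <= float(cc) <= hi
```

Being permissive for unknown versions keeps the maintenance burden low: the table only needs an entry when a version is known to drop or add CC support, and the sanity check still acts as a backstop.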
cuda_ccs_string = re.sub(r'[a-zA-Z]', '', cuda_ccs_string).replace(',', '_')
# Also replace periods, those are not officially supported in environment variable names
var = f"EESSI_IGNORE_CUDNN_{cudnn_ver}_CC_{cuda_ccs_string}".replace('.', '_')
errmsg = f"EasyConfigs using cuDNN {cudnn_ver} or older are not supported for (all) requested Compute "
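For illustration, running the variable-name construction from the diff on a concrete (hypothetical) cuDNN version and CC list shows what the resulting override variable looks like:

```python
import re

# Example inputs (hypothetical, for illustration only)
cudnn_ver = '9.5.0'
cuda_ccs_string = '7.0,8.0a'  # e.g. a requested list of compute capabilities

# Strip architecture suffix letters (e.g. '8.0a' -> '8.0') and turn
# commas into underscores
cuda_ccs_string = re.sub(r'[a-zA-Z]', '', cuda_ccs_string).replace(',', '_')
# Also replace periods, as those are not officially supported in
# environment variable names
var = f"EESSI_IGNORE_CUDNN_{cudnn_ver}_CC_{cuda_ccs_string}".replace('.', '_')
print(var)  # EESSI_IGNORE_CUDNN_9_5_0_CC_7_0_8_0
```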
I think this is wrong: in your case the cuDNN is too new, not too old, right?
My 2 cents:
I just feel like a lookup table is a lot of work to set up and to maintain, while (according to the docs) the supported CCs don't change that often. Also, wouldn't the sanity check still catch unsupported CCs, as it did for CC 7.0 in EESSI/software-layer#1410? So whenever we run into this, we can mark those as unsupported in the hooks (and, if necessary, change the if statement to something else if there are going to be too many combinations)?
Hm, I don't think it's too bad to maintain, but admittedly it may be easier for CUDA than for cuDNN, since we can just query the list from …

The fact that it doesn't do so for CC 11.0 may be a minor detail, since the CUDA sanity check will then indeed report that this is also invalid. The only downside of not including that case (and maybe also an upper limit) right away is that when sites install this with …

Anyway, I'm also OK with leaving that out for now. If you can have a look at my (minor) review comment, I'll see if I can test the PR locally, and merge it if it works as expected.
This one is a little bit trickier than CUDA itself, as the list of supported compute capabilities in the docs (https://docs.nvidia.com/deeplearning/cudnn/backend/v9.19.0/reference/support-matrix.html) doesn't really match what running `cuobjdump` on the binaries shows. Also, there seem to be some gaps in the matrix, and I wonder if that's really correct.

So for now I've chosen an easier approach by just checking if we're building with a newer cuDNN and compute capability <= 7.0, and in that case I do the same thing as what @casparvl implemented for CUDA. In order to check if cuDNN is used as a dependency, I've generalized Caspar's `get_cuda_version` into a `get_dependency_software_version` function.

Tested this locally with EESSI-extend and the cuDNN from EESSI/software-layer#1410 on a V100 (CC 7.0) and an RTX PRO 6000 (CC 12.0f), and got the expected result: on the RTX PRO 6000 I get a full cuDNN installation, while for the V100 I get the following output during the build:
and a module file that has:
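The check described in this PR, skipping the full installation when a newer cuDNN is combined with a compute capability <= 7.0, could be sketched roughly as below. The function names, the threshold version, and the overall structure are my own illustration, not the actual hook code:

```python
import re


def _version_tuple(ver):
    """Turn a version string like '9.5.0' into (9, 5, 0) for comparison."""
    return tuple(int(p) for p in ver.split('.'))


def should_skip_full_install(cudnn_version, compute_capabilities,
                             threshold='9.5.0'):
    """Return True if this build combines a cuDNN at or beyond `threshold`
    (a placeholder version, NOT the value used in the PR) with any compute
    capability <= 7.0, in which case only a stub would be installed."""
    if cudnn_version is None:
        # cuDNN is not a dependency of this easyconfig
        return False
    if _version_tuple(cudnn_version) < _version_tuple(threshold):
        return False
    # Strip architecture suffix letters such as the 'f' in '12.0f'
    return any(float(re.sub(r'[a-zA-Z]', '', cc)) <= 7.0
               for cc in compute_capabilities)
```

With this sketch, a V100 build (`['7.0']`) with a new enough cuDNN would be skipped, while an RTX PRO 6000 build (`['12.0f']`) would proceed to a full installation, matching the behaviour described above.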