fix(c): handle non-ASCII driver paths in GetDriverInfo on Windows#4006
fix(c): handle non-ASCII driver paths in GetDriverInfo on Windows#4006amoeba wants to merge 2 commits intoapache:mainfrom
Conversation
| // Create a UTF-8 encoded string to simulate what Python or another caller would | ||
| // pass |
There was a problem hiding this comment.
So the issue is that Python is treating the path as a string when it shouldn't?
There was a problem hiding this comment.
I'm not sure it's that.
I initially started out Claude Code on it and it zero'd in on,
diff --git a/c/driver_manager/adbc_driver_manager.cc b/c/driver_manager/adbc_driver_manager.cc
index 15d98c00b..5bb4a34d1 100644
--- a/c/driver_manager/adbc_driver_manager.cc
+++ b/c/driver_manager/adbc_driver_manager.cc
@@ -704,7 +704,12 @@ struct ManagedLibrary {
}
// First try to treat the given driver name as a path to a manifest or shared library
+#ifdef _WIN32
+ // On Windows, decode UTF-8 to wide string to properly handle non-ASCII paths
+ std::filesystem::path driver_path(Utf8Decode(std::string(driver_name)));
+#else
std::filesystem::path driver_path(driver_name);
+#endif
const bool allow_relative_paths = load_options & ADBC_LOAD_FLAG_ALLOW_RELATIVE_PATHS;
if (driver_path.has_extension()) {
if (driver_path.is_relative() && !allow_relative_paths) {So I started to see if I could prove it. I looked at the sequence of calls that leads there from Python and it looks like we pass a Python str() and treat it as a C string in the C++ side. I also built adbc_driver_manager with debug symbols and when I run,
>>> dbapi.connect(driver="项目\some.dll")and stop on a breakpoint in the C++ driver manager's AdbcDatabaseInit, I see,
There was a problem hiding this comment.
The problem is that paths aren't UTF-8, so decoding it seems very wrong.
There was a problem hiding this comment.
Paths should be bytestrings (with some restrictions), so if Python provided UTF-8 that we have to decode into the native codepage, then presumably Python should've provided the raw bytestring in the first place
There was a problem hiding this comment.
Is there a codepath I'm not seeing where we use a Path object? I'm assuming that's what you're referring to when you say paths are bytestrings. All I see are str involved in the code paths here and I guess I'm assuming those turn into UTF-8 encoded const char* on the C side... I'll look again tomorrow
There was a problem hiding this comment.
Talking about the Python side. I'm assuming this should not be converting to str, but should be using bytes?
arrow-adbc/python/adbc_driver_sqlite/adbc_driver_sqlite/__init__.py
Lines 63 to 93 in 7b38cf4

This is just a draft to verify my test fails on CI like it does locally. Will update PR body before merge.
Ref #3970