The data in the latest (1.6.0-2020-01-12) exported dataset in CSV format (in repositories and projects_with_repository_data) has misaligned columns, and so is completely junk
#2887
Open
KOLANICH opened this issue
Nov 22, 2021
· 4 comments
The screenshots are from LibreOffice, but other software shows the data as misaligned too.
Also some rows contain junk:
projects file seems to be OK.
Other files haven't been tested.
The issue you describe is caused by a simple column shift, at least for the projects_with_repository_fields file. It can be resolved by loading the file into a Pandas or Dask dataframe (in Python) with the index_col=False argument, or the equivalent of this behavior in other languages.
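As a minimal sketch of that suggestion: the tiny CSV below is a made-up stand-in (not taken from the real export), and it assumes the shift comes from a trailing delimiter on every data row, which is the case pandas documents index_col=False for. The column names id, name and stars are hypothetical.

```python
import io

import pandas as pd

# Hypothetical miniature of a malformed export: every data row ends with a
# trailing comma, so it has one more field than the header.
raw = "id,name,stars\n1,foo,10,\n2,bar,20,\n"

# Default behaviour: pandas treats the first field of each row as the index,
# so the remaining values land one column to the left of their header.
shifted = pd.read_csv(io.StringIO(raw))

# index_col=False tells pandas not to use the first column as the index,
# which lines the values up with the header again.
fixed = pd.read_csv(io.StringIO(raw), index_col=False)

print(shifted)
print(fixed)
```

This only helps when every row is off by the same amount; rows with differing numbers of extra fields are not repaired by it.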
The problem is that the rows are not uniformly shifted. Some lines are shifted by one amount, other lines by another, so the same columns contain different data in different lines (at least as exploration in LO Calc has shown). Fixing the data would need nontrivial logic, which likely won't work reliably. So the data is completely junk.
Also, I am not going to use pandas; pandas is damn slow. I'm gonna use a custom importer in C++ using Ben Strasser's fastest CSV parsing lib (the schema is defined at compile time).
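Whether the shift is uniform can be checked without pandas. The sketch below uses only Python's standard csv module and counts how many fields each row has relative to the header. The sample rows and the helper name field_count_histogram are made up for illustration. Field counts are only a proxy: a row can be misaligned while still having the expected number of fields.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines):
    """Return (header width, Counter mapping field count -> number of rows)."""
    reader = csv.reader(lines)
    header = next(reader)
    counts = Counter(len(row) for row in reader)
    return len(header), counts


# Hypothetical sample: one clean row, one row with one extra field,
# and one row with two extra fields.
sample = io.StringIO(
    "id,name,stars\n"
    "1,foo,10\n"
    "2,bar,20,\n"
    "3,baz,,30,\n"
)

header_width, counts = field_count_histogram(sample)

# Offsets of each observed row width from the header width.
offsets = sorted(n - header_width for n in counts)
is_uniform = len(counts) == 1

print(header_width, dict(counts), offsets, is_uniform)
```

If more than one offset shows up, a single global fix such as index_col=False cannot realign every row.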
Well, isn't that the beauty of open source: you work on making it better if you can? Anyway, I would like to leave you with one of my favorite quotes:
"Everyone in open source is doing everyone else a favor to varying levels of commitment. We should treat one another accordingly."
You are right. But I am out of capacity to work on this project too. In fact, I am not even sure that these datasets are gonna be useful for the study at all.