The data in the latest (1.6.0-2020-01-12) exported dataset in CSV format (in `repositories` and `projects_with_repository_data`) has columns messed, and so completely junk #2887

KOLANICH · 2021-11-22T17:17:06Z

The screenshots are from LibreOffice, but other software also sees the data as garbled.

Also some rows contain junk:

The `projects` file seems to be OK.

Other files haven't been tested.


ftarlaci · 2021-12-30T16:17:49Z

The issue you refer to is caused by a simple column shift, at least for the projects_with_repository_fields file. It can be resolved by loading the file into a Pandas or Dask dataframe (in Python) with the index_col=False argument, or the equivalent of this behavior in other languages.
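For illustration, here is a minimal sketch of what `index_col=False` changes in `pandas.read_csv`. The sample data is hypothetical and is not taken from the dataset; it only assumes the common case where every data row carries one more field than the header (here, a trailing delimiter).

```python
import io

import pandas as pd

# Hypothetical sample: three header names, but each data row ends with a
# trailing delimiter and therefore parses as four fields.
data = "a,b,c\n4,apple,bat,\n8,orange,cow,\n"

# Default behavior: the first field of each row becomes the index,
# so every value lands one column to the left of its header.
shifted = pd.read_csv(io.StringIO(data))

# index_col=False: no column is used as the index, so the fields line up
# with the header and the surplus trailing field is dropped.
aligned = pd.read_csv(io.StringIO(data), index_col=False)

print(shifted)
print(aligned)
```

This only helps when every row is off by the same amount; recent pandas versions may also emit a warning about the dropped field.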

KOLANICH · 2021-12-30T20:44:41Z

The problem is that the rows are not uniformly shifted. Some lines are shifted by one amount and other lines by another, so the same columns contain different data on different lines (at least as exploration in LO Calc has shown). Fixing the data needs nontrivial logic, which likely won't work reliably. So the data is completely junk.
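One rough way to check whether the shift is uniform is to count how many fields each row has relative to the header. The sketch below uses only the Python standard library; the function name `field_count_histogram` and the sample rows are hypothetical, not taken from the dataset.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines):
    """Return the header width and a count of rows per field count."""
    reader = csv.reader(lines)
    header = next(reader)
    counts = Counter(len(row) for row in reader)
    return len(header), counts


# Hypothetical sample: rows carry 0, 1, and 2 extra fields respectively.
sample = io.StringIO(
    "id,name,stars\n"
    "1,alpha,10\n"
    "2,beta,extra,20\n"
    "3,gamma,extra,extra,30\n"
)

expected, counts = field_count_histogram(sample)
# Map "number of extra fields" to "number of rows with that surplus".
shifts = {n - expected: k for n, k in counts.items()}
print(expected, shifts)
```

More than one key in `shifts` suggests that a single fixed offset cannot realign the file. The check is only a heuristic: it cannot see a shift that leaves the field count unchanged, for example when a dropped field and an extra delimiter cancel out.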

Also, I am not going to use pandas; pandas is damn slow. I am going to use a custom importer in C++ using Ben Strasser's fast CSV parsing lib (the schema is defined at compile time).

ftarlaci · 2022-01-03T19:41:20Z

Well, isn't that the beauty of open source: you work on making it better if you can? Anyway, I would like to leave you with one of my favorite quotes:
"Everyone in open source is doing everyone else a favor to varying levels of commitment. We should treat one another accordingly."

Good luck.

KOLANICH · 2022-01-04T07:11:53Z

You are right. But I am out of capacity to work on this project too. In fact, I am not even sure that these datasets are going to be useful for the study at all.

Good luck.

Thanks.
