coopy-users Mailing List for coopy
Brought to you by:
eshuy
Messages in 2010: Sep (1), Nov (5)
From: joe p. <tr...@gm...> - 2010-11-24 13:39:43
Paul,

Thanks for pointing that out. I think this last output is more natural (intuitive) looking than the previous one.

cheers,
Joe

From: Paul F. <pau...@al...> - 2010-11-23 03:03:20
Another quick follow up. I upgraded the row matching algorithm to fall back on a more powerful (if slightly slower) method when the existing method isn't making convincing progress. The human-readable diff for your example is now:

    dtbl: human-readable table difference format version 0.3

    column names are: COLUMN1 COLUMN2 COLUMN3

    update row:
      where COLUMN1,COLUMN3 = 1111,2
      set COLUMN2 = 1111 -> xxxx

    delete row:
      remove 4444 4444 1

    delete row:
      remove 4444 4444 2

    insert row:
      add 2222 2222 1

    insert row:
      add 5555 5555 2

    update row:
      where COLUMN1,COLUMN3 = 6666,1
      set COLUMN2 = 6666 -> xxxx

Cheers,
Paul

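The update hunks in this message identify rows by COLUMN1 and COLUMN3 together. As a rough illustration of why that pairing gives a compact diff, here is a minimal Python sketch that diffs the two CSV files from the thread using those two columns as a declared key. This is not how COOPY works internally (COOPY has to discover the row alignment itself, with no key given); the function name `keyed_diff` and the operation tuples are made up for this example.

```python
# Sketch only: a keyed table diff, assuming the key columns are known up front.
HEADER = ["COLUMN1", "COLUMN2", "COLUMN3"]

LOCAL = [
    ("1111", "1111", "1"), ("1111", "1111", "2"),
    ("4444", "4444", "1"), ("4444", "4444", "2"),
    ("6666", "6666", "1"), ("6666", "6666", "2"),
]

MODIFIED = [
    ("1111", "1111", "1"), ("1111", "xxxx", "2"),
    ("2222", "2222", "1"), ("5555", "5555", "2"),
    ("6666", "xxxx", "1"), ("6666", "6666", "2"),
]


def keyed_diff(header, old_rows, new_rows, key_cols):
    """Return update/delete/insert operations, matching rows on key_cols."""
    key_idx = [header.index(name) for name in key_cols]

    def key(row):
        return tuple(row[i] for i in key_idx)

    old = {key(row): row for row in old_rows}
    new = {key(row): row for row in new_rows}

    ops = []
    for k, row in old.items():
        if k not in new:
            ops.append(("delete", row))
        elif new[k] != row:
            # Record only the cells that changed, as {column: (before, after)}.
            changes = {
                header[i]: (row[i], new[k][i])
                for i in range(len(header))
                if row[i] != new[k][i]
            }
            ops.append(("update", k, changes))
    for k, row in new.items():
        if k not in old:
            ops.append(("insert", row))
    return ops


for op in keyed_diff(HEADER, LOCAL, MODIFIED, ["COLUMN1", "COLUMN3"]):
    print(op)
```

With (COLUMN1, COLUMN3) as the key, the sketch yields two updates, two deletes and two inserts, the same six operations as the diff in this message, though not in the same order.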
From: Paul F. <pau...@gm...> - 2010-11-22 15:10:36
Quick follow-up: your test case also showed up a spurious, do-nothing "update" hunk:

>> update row:
>>   where COLUMN1,COLUMN2,COLUMN3 = COLUMN1,COLUMN2,COLUMN3
>>   set =

I've committed a fix for this. The diff is now (in CSV format for brevity):

    dtbl,csv,version,0.4,
    column,name,COLUMN1,COLUMN2,COLUMN3
    row,delete,1111,1111,1
    row,delete,1111,1111,2
    row,delete,4444,4444,1
    row,delete,4444,4444,2
    row,delete,6666,6666,1
    row,delete,6666,6666,2
    row,insert,1111,1111,1
    row,insert,1111,xxxx,2
    row,insert,2222,2222,1
    row,insert,5555,5555,2
    row,insert,6666,xxxx,1
    row,insert,6666,6666,2

There's obviously still room for improvement though :-)

Best,
Paul

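To see that the delete-everything, insert-everything diff in this message is verbose but still correct, here is a small Python sketch that applies it to the "local" table and checks that the result equals the "modified" table. It is not sspatch: the reading of the records (a `row,delete` record removes one matching row, a `row,insert` record appends a row, all other records are skipped) is an assumption based only on the record names, and `apply_dtbl_csv` is a hypothetical helper.

```python
import csv
import io

# The CSV-format diff quoted in this message.
DIFF = """\
dtbl,csv,version,0.4,
column,name,COLUMN1,COLUMN2,COLUMN3
row,delete,1111,1111,1
row,delete,1111,1111,2
row,delete,4444,4444,1
row,delete,4444,4444,2
row,delete,6666,6666,1
row,delete,6666,6666,2
row,insert,1111,1111,1
row,insert,1111,xxxx,2
row,insert,2222,2222,1
row,insert,5555,5555,2
row,insert,6666,xxxx,1
row,insert,6666,6666,2
"""

LOCAL = [
    ["1111", "1111", "1"], ["1111", "1111", "2"],
    ["4444", "4444", "1"], ["4444", "4444", "2"],
    ["6666", "6666", "1"], ["6666", "6666", "2"],
]

MODIFIED = [
    ["1111", "1111", "1"], ["1111", "xxxx", "2"],
    ["2222", "2222", "1"], ["5555", "5555", "2"],
    ["6666", "xxxx", "1"], ["6666", "6666", "2"],
]


def apply_dtbl_csv(rows, diff_text):
    """Apply row,delete / row,insert records to a copy of rows (assumed semantics)."""
    result = [list(row) for row in rows]
    for record in csv.reader(io.StringIO(diff_text)):
        if len(record) < 2 or record[0] != "row":
            continue  # skip the version line, the column line, and blank lines
        op, values = record[1], record[2:]
        if op == "delete":
            result.remove(values)  # raises ValueError if the row is not present
        elif op == "insert":
            result.append(values)  # insertion position is not modelled here
        else:
            raise ValueError(f"unsupported row operation: {op}")
    return result


patched = apply_dtbl_csv(LOCAL, DIFF)
print(patched == MODIFIED)
```

Appending is enough here because every original row is deleted first, so the inserted rows end up in the order the diff lists them.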
From: Paul F. <pau...@al...> - 2010-11-21 22:40:56
Hi Joe,

Thanks for posting this. Your test case highlighted a few problems with COOPY.

* The omitted row 1111,1111,1 was a flat-out bug. I've committed a fix for that bug, and added this case to the regression tests - thank you! With the fix, an ssdiff-sspatch sequence at least produces the expected result.

* COOPY currently has trouble when there are sets of rows that have no real distinguishing characteristics. Your "local" csv file is difficult, since there are pairs of rows that differ only by a single isolated digit. This is why the "diff" given involves basically deleting the original file and inserting the new one.

To your question of how COOPY aligns/joins the rows from the two tables: for your case, it fails to, so this is hypothetical :-). However, here's a brief sketch of the procedure.

* We take three tables, P, L, and R. L is your local table, R is your remote table, and P is a pivot/parent table which for ssdiff is by default equal to L.
* We try to recover a mapping from rows in L to rows in P. For the diff case, it is trivial, L=P.
* We try to recover a mapping from rows in P to rows in R. Columns may have been added/deleted/reordered/renamed/garbled, so the process is, for each row in one table, to take all string fragments of text up to a threshold length, and dump them into a hash table (tagged with their origin). String fragments that appear in multiple rows get discounted. For each row in the second table, we accumulate hits against the hash table, then decide whether a match has been achieved.
* Once rows are matched, we look at the mapping from columns in P to columns in R. The process here is similar, if simpler.
* The mapping from L to R is determined via P - for ssdiff, this is trivial.

The procedure is, ironically, particularly prone to failure on artificial test cases with small numbers of columns and rows. However, I expect at least your test case should be handled soon, through an iterative step where row mappings are re-estimated after column mappings have been fixed.

Cheers,
Paul

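The fragment-hashing step described in this message can be illustrated with a short Python sketch. It is a loose reading of the description, not COOPY's code: the names `fragments` and `match_rows`, the `max_len` threshold, the 1/n discount for fragments shared by n rows, and the "best score must beat the runner-up by `min_margin`" decision rule are all assumptions made for the example. The fruit tables are invented data, chosen so that the rows are easy to tell apart and the columns are swapped.

```python
from collections import defaultdict


def fragments(row, max_len=3):
    """All substrings of length 1..max_len taken from every cell of a row."""
    out = set()
    for cell in row:
        for n in range(1, max_len + 1):
            for i in range(len(cell) - n + 1):
                out.add(cell[i:i + n])
    return out


def match_rows(p_rows, r_rows, max_len=3, min_margin=0.0):
    """Map row indices of r_rows to row indices of p_rows via shared fragments."""
    # Hash table: fragment -> set of P row indices it came from (its "origin").
    index = defaultdict(set)
    for i, row in enumerate(p_rows):
        for frag in fragments(row, max_len):
            index[frag].add(i)

    matches = {}
    for j, row in enumerate(r_rows):
        scores = defaultdict(float)
        for frag in fragments(row, max_len):
            owners = index.get(frag, ())
            for i in owners:
                # A fragment found in many P rows says little, so discount it.
                scores[i] += 1.0 / len(owners)
        if not scores:
            continue  # no fragment in common with any P row
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        best_i, best_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        # Accept the match only if the winner clearly beats the runner-up.
        if best_score - runner_up > min_margin:
            matches[j] = best_i
    return matches


# Hypothetical tables: R has its columns swapped and one cell changed.
P_ROWS = [["apple", "red"], ["banana", "yellow"], ["cherry", "dark red"]]
R_ROWS = [["yellow", "banana"], ["apple", "green"]]

print(match_rows(P_ROWS, R_ROWS))
```

Because fragments are pooled per row rather than per column, the swapped columns in the example do not matter. Rows that share nearly all of their fragments leave little margin between the best and second-best candidate, which is in line with the difficulty described above for rows that differ by a single digit.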
From: joe p. <tr...@gm...> - 2010-11-21 13:11:27
Hello,

[COOPY 0.4.0 running on OS X.6]

I'm trying to understand the results from ssdiff. I have two csv files:

local:

    COLUMN1,COLUMN2,COLUMN3
    1111,1111,1
    1111,1111,2
    4444,4444,1
    4444,4444,2
    6666,6666,1
    6666,6666,2

modified:

    COLUMN1,COLUMN2,COLUMN3
    1111,1111,1
    1111,xxxx,2
    2222,2222,1
    5555,5555,2
    6666,xxxx,1
    6666,6666,2

If I run this:

    ssdiff --format-human local.csv modified.csv

I get this:

    column names are: COLUMN1 COLUMN2 COLUMN3

    delete row:
      remove 1111 1111 1

    delete row:
      remove 1111 1111 2

    delete row:
      remove 4444 4444 1

    delete row:
      remove 4444 4444 2

    delete row:
      remove 6666 6666 1

    delete row:
      remove 6666 6666 2

    update row:
      where COLUMN1,COLUMN2,COLUMN3 = COLUMN1,COLUMN2,COLUMN3
      set =

    insert row:
      add 1111 xxxx 2

    insert row:
      add 2222 2222 1

    insert row:
      add 5555 5555 2

    insert row:
      add 6666 xxxx 1

    insert row:
      add 6666 6666 2

I don't quite understand those results. Why was this row deleted, without being added back?

    1111,1111,1

It appears on both sides. In general, how does COOPY align (join) the rows from the two tables?

cheers,
Joe

From: Paul F. <pau...@al...> - 2010-09-30 01:03:00
A new release of coopy has been made.

* Column order is better preserved in complex merges.
* Lots more regression tests to make sure merging stays working.
* Documentation.

Cheers,
Paul