4

I have a large csv file that starts this way :

codeRegion,nomEPCI,codeDepartement,nomCommune,populationTotale,titre,objet,site_web,libelle_objet_social1,libelle_objet_social2,siret,numero_waldec,nombreAnneesExistence,date_creation,nombreAnneesDerniereDeclaration,date_derniere_declaration,position_activite,date_dissolution,codeCommune,adresse_siege_complement,adresse_siege_numero_voie,adresse_siege_type_voie,adresse_siege_libelle_voie,adresse_siege_distribution,adresse_siege_code_postal,nom_declarant,adresse_gestion_complement_association,adresse_gestion_complement_geo,adresse_gestion_libelle_voie,adresse_gestion_distribution_facturation,adresse_gestion_code_postal,adresse_gestion_achemine,adresse_gestion_pays,civilite_declarant,codeEPCI
01,CA CAP Excellence,971,Abymes,54049,ABSOLU MAS/KA,"organiser des manifestations sociales et culturelles ainsi que des activités de loisirs","",Action socio-culturelle,"Clubs de loisirs, relations","",W9G2015492,0.5808219178082191,2020-02-05,0.5808219178082191,2020-02-05,A,0001-01-01,97101,62 Résidence les Lavandes,62,RES,Boisripeaux,_Boisripeaux,97139,"","",RÉSIDENCE LES LAVANDES,62  RESIDENCE BOISRIPEAUX,"",97139,LES ABYMES,FRANCE,PM,200018653
01,CA CAP Excellence,971,Abymes,54049,AMBITION PLUS DE TERRASSON,proposer des activités sportives et culturelles,"",Action socio-culturelle,"Sports, activités de plein air","",W9G2015457,0.6356164383561644,2020-01-16,0.6356164383561644,2020-01-16,A,0001-01-01,97101,Maison Andreopa Marcel,3,CHEM1,"Route de Terrasson, Rue Albert Léogane",_97139,97139,"","",MAISON ANDREOPA MARCEL,3  CHEMIN ROUTE DE TERRASSO,"",97139,LES ABYMES,FRANCE,PF,200018653
01,CA CAP Excellence,971,Abymes,54049,ASSOCIATION SYNDICALE DE LA RESIDENCE MORNE CARUEL,gérer et d'entretenir les espaces communs cette mission ne comporte pas la possibilité daliéner les espaces indivis si ce n'est au profit de la commune,"",Actions de sensibilisation et d'éducation à l'environnement et au développement durable,"","",W9G2015446,14.797260273972602,2005-11-21,14.797260273972602,2005-11-21,A,0001-01-01,97101,residence morne caruel,"","","",_,97139,"",residence morne caruel,"","","",97139,Les   Abymes,FRANCE,PM,200018653
01,CA CAP Excellence,971,Abymes,54049,ASSOCIATION SPORTIVE DU LYCEE POLYVALENT CHEVALIER DE SAINT-GEORGES,"organiser et développer en prolongement de l'éducation physique et sportive donnée pendant les heures de scolarité, l'initiation et la pratique sportive pour les élèves qui y adhèrent elle représente l'établissement dans les épreuves sportives scolaires","",Activités de plein air (dont saut à l'élastique),"Centres de loisirs, clubs de loisirs multiples","",W9G2002394,19.87123287671233,2000-10-26,4.742465753424658,2015-12-09,A,0001-01-01,97101,"","",BD,des Héros,_Baimbridge,97139,"","",LYCéE POLYVALENT CHEVALIER SAINT-GEORG,BOULEVARD DES HéROS,BAIMBRIDGE,97139,ABYMES,FRANCE,PM,200018653
01,CA CAP Excellence,971,Abymes,54049,"LOISIRS, COMPETITIONS, CLUB (LO. CO. CLUB)","participer aux competitions cyclistes
 organiser des manifesta- tions sportives a caractere promotionnel.","",Activités de plein air (dont saut à l'élastique),"Centres de loisirs, clubs de loisirs multiples","",W9G2010818,28.747945205479454,1991-12-13,27.378082191780823,1993-04-26,A,0001-01-01,97101,"","","","",_,97139,"","","","","",97139,LES ABYMES,FRANCE,PM,200018653
01,CA CAP Excellence,971,Abymes,54049,ASSOCIATION ETUDIANTE DE TOURISME ET DE LOISIRS.( A.T.O.L.),réunir les anciens élèves ayant eu une formation de tourisme et de loisirs et de promouvoir les actions du brevet de technicien supérieur de tourisme et de loisir,"","Amicales, personnel d’établissements scolaires ou universitaires","Syndicats d'initiative, offices de tourisme, salons du tourisme","",W9G2011048,24.53698630136986,1996-02-27,24.53150684931507,1996-02-29,A,0001-01-01,97101,"","","",Ecole superieure des cadres et techniciens,_Route de la rocade grand-camp,97139,"","","",Ecole superieure des cadres et technic,Route de la rocade grand-camp,97139,LES ABYMES,FRANCE,PM,200018653

It's fifth line of data has a description, between double quotes, containing a new line. It works perfectly with Excel or LibreCalc.

I need to retain only the lines that are starting by a specific French region. For example here, the '01' one (Guadeloupe).
I execute :

# Get the CSV header
head -n1 associations_touristiques.csv > associations_touristiques_gua.csv

# Extract data of region '01'
cat associations_touristiques.csv | grep -a '^01,' >> associations_touristiques_gua.csv

But my final csv is broken.

codeRegion,nomEPCI,codeDepartement,nomCommune,populationTotale,titre,objet,site_web,libelle_objet_social1,libelle_objet_social2,siret,numero_waldec,nombreAnneesExistence,date_creation,nombreAnneesDerniereDeclaration,date_derniere_declaration,position_activite,date_dissolution,codeCommune,adresse_siege_complement,adresse_siege_numero_voie,adresse_siege_type_voie,adresse_siege_libelle_voie,adresse_siege_distribution,adresse_siege_code_postal,nom_declarant,adresse_gestion_complement_association,adresse_gestion_complement_geo,adresse_gestion_libelle_voie,adresse_gestion_distribution_facturation,adresse_gestion_code_postal,adresse_gestion_achemine,adresse_gestion_pays,civilite_declarant,codeEPCI
01,CA CAP Excellence,971,Abymes,54049,ABSOLU MAS/KA,"organiser des manifestations sociales et culturelles ainsi que des activités de loisirs","",Action socio-culturelle,"Clubs de loisirs, relations","",W9G2015492,0.5808219178082191,2020-02-05,0.5808219178082191,2020-02-05,A,0001-01-01,97101,62 Résidence les Lavandes,62,RES,Boisripeaux,_Boisripeaux,97139,"","",RÉSIDENCE LES LAVANDES,62  RESIDENCE BOISRIPEAUX,"",97139,LES ABYMES,FRANCE,PM,200018653
01,CA CAP Excellence,971,Abymes,54049,AMBITION PLUS DE TERRASSON,proposer des activités sportives et culturelles,"",Action socio-culturelle,"Sports, activités de plein air","",W9G2015457,0.6356164383561644,2020-01-16,0.6356164383561644,2020-01-16,A,0001-01-01,97101,Maison Andreopa Marcel,3,CHEM1,"Route de Terrasson, Rue Albert Léogane",_97139,97139,"","",MAISON ANDREOPA MARCEL,3  CHEMIN ROUTE DE TERRASSO,"",97139,LES ABYMES,FRANCE,PF,200018653
01,CA CAP Excellence,971,Abymes,54049,ASSOCIATION SYNDICALE DE LA RESIDENCE MORNE CARUEL,gérer et d'entretenir les espaces communs cette mission ne comporte pas la possibilité daliéner les espaces indivis si ce n'est au profit de la commune,"",Actions de sensibilisation et d'éducation à l'environnement et au développement durable,"","",W9G2015446,14.797260273972602,2005-11-21,14.797260273972602,2005-11-21,A,0001-01-01,97101,residence morne caruel,"","","",_,97139,"",residence morne caruel,"","","",97139,Les   Abymes,FRANCE,PM,200018653
01,CA CAP Excellence,971,Abymes,54049,ASSOCIATION SPORTIVE DU LYCEE POLYVALENT CHEVALIER DE SAINT-GEORGES,"organiser et développer en prolongement de l'éducation physique et sportive donnée pendant les heures de scolarité, l'initiation et la pratique sportive pour les élèves qui y adhèrent elle représente l'établissement dans les épreuves sportives scolaires","",Activités de plein air (dont saut à l'élastique),"Centres de loisirs, clubs de loisirs multiples","",W9G2002394,19.87123287671233,2000-10-26,4.742465753424658,2015-12-09,A,0001-01-01,97101,"","",BD,des Héros,_Baimbridge,97139,"","",LYCéE POLYVALENT CHEVALIER SAINT-GEORG,BOULEVARD DES HéROS,BAIMBRIDGE,97139,ABYMES,FRANCE,PM,200018653
01,CA CAP Excellence,971,Abymes,54049,"LOISIRS, COMPETITIONS, CLUB (LO. CO. CLUB)","participer aux competitions cyclistes
01,CA CAP Excellence,971,Abymes,54049,ASSOCIATION ETUDIANTE DE TOURISME ET DE LOISIRS.( A.T.O.L.),réunir les anciens élèves ayant eu une formation de tourisme et de loisirs et de promouvoir les actions du brevet de technicien supérieur de tourisme et de loisir,"","Amicales, personnel d’établissements scolaires ou universitaires","Syndicats d'initiative, offices de tourisme, salons du tourisme","",W9G2011048,24.53698630136986,1996-02-27,24.53150684931507,1996-02-29,A,0001-01-01,97101,"","","",Ecole superieure des cadres et techniciens,_Route de la rocade grand-camp,97139,"","","",Ecole superieure des cadres et technic,Route de la rocade grand-camp,97139,LES ABYMES,FRANCE,PM,200018653
01,CA CAP Excellence,971,Abymes,54049,ASSOCIATION AMICALE DES AGENTS D'ENTRETIEN DE LA MUNICIPALITE DES ABYMES.,"realiser des rencontres et echanges d'ordre culturel, sportif, social avec toutes categories de personnel communal et autres associations ou groupements.","",Association du personnel d'une entreprise (hors caractère syndical),"Comités de défense et d'animation de quartier, association locale ou municipale","",W9G2012061,31.56986301369863,1989-02-16,31.56986301369863,1989-02-16,A,0001-01-01,97101,"","","","",_,97139,"","","","","",97139,LES ABYMES,FRANCE,PM,200018653

Tthe line 5 is truncated after cyclistes.

What the proper way to make cat command not taking into account a newline when it is inside double quotes,
and is it needed to change the grep command too after that, and if so : how ?

7
  • 1
    Do you have access to csvgrep (from the Python-based csvkit)? Commented Sep 23, 2020 at 12:48
  • 5
    I would suggest you to try to write the questions so that we have a minimal/shorter example. I believe there is a large energy barrier to parse it (see "scroll the horizontal bar") that would drive off many readers, although I find your question interesting.
    – Quasímodo
    Commented Sep 23, 2020 at 12:49
  • @steeldriver No I do not have access to csvgrep. Commented Sep 23, 2020 at 13:04
  • 3
    In addition to what @Quasímodo said, when asking about categorizing data it would be helpful to include at least one case from each category (i.e. in this case, one or more records that do not match your search pattern) Commented Sep 23, 2020 at 13:15
  • 3
    @Marc yes it is possible, just make up something minimal that's representative of your real data with respect to the problem specific you're asking for help with. See for example unix.stackexchange.com/q/610962/133219.
    – Ed Morton
    Commented Sep 23, 2020 at 20:17

2 Answers 2

13

Using csvgrep from the csvkit package to pull out all records that has a codeRegion value containing the string 01:

csvgrep -c codeRegion -m 01 file.csv

This is using a proper CSV parser, so there will no issues with newlines or commas in properly quoted fields.

The -c option selects the column that we'd like to investigate, by number or by name, and -m designates the string to match with. One could also use -r to match with a regular expression, e.g. -r '^01$' to avoid matching strings where 01 is a substring (as in 011). See csvgrep --help.

2
  • I tried an apt-get install csvgrep but csvgrep package wasn't among those available to my Debian. Commented Sep 23, 2020 at 13:38
  • 4
    @Marc the package is called csvkit
    – muru
    Commented Sep 23, 2020 at 13:46
10
awk '/^01/||n%2{print;n+=gsub(/"/,"&")}' file

For each line,

  • /^01/||n%2 If line begins with 01 or n (initally zero) is odd,
    • print Print it
    • n+=gsub(/"/,"&") increment n by the return value of the gsub function.
      This replaces every double-quote /"/ with itself "&". That would be pointless, indeed, but it also returns the number of substitutions made, so it is a way of counting the number of double-quotes in the line.

Notice that if the n is odd (n%2) the line does not have a closing double-quote, so it keeps printing until n is even, regardless of whether there is a /^01/ match on the next lines.

A side-by-side diff for you:

$ diff -yW 30 <(cat file) <(awk '/^01/||n%2{print;n+=gsub(/"/,"&")}' file)
04,xde        <
01,abc"         01,abc"
cd              cd
as"             as"
02,dsad       <
03,1ad"       <
01,as,"as       01,as,"as
us"             us"
02,s          <
01,a            01,a
4
  • Impressive ! It's working ! I am quite unable to understand awk syntax, but I'll check this more in detail and thank you again. Commented Sep 23, 2020 at 13:37
  • Note that this theoretically fails with a quote-count of 2^53 + 1, because awk numbers are double precision floating point. Which is the first integer that an IEEE 754 float is incapable of representing exactly?. Above that point, only even integers are representable. (Unlike if n was a fixed-width integer, in which case it would just wrap and that would be fine.) But fortunately 2^53 is huge enough not to be a practical concern. And maybe awk interpreters use integers internally if you never do any division or any non-integer stuff, I didn't check. Commented Sep 23, 2020 at 22:31
  • This will fail if your flavour of CSV allows backslash-escaped double quotes inside quoted fields. "abc\"def". Could be easily fixed using a lookbehind assertion, but I don't know any AWK interpreter that supports them. So if this is a concern, you have to resort to something ugly like /(^")|([^\\]")/.
    – TooTea
    Commented Sep 24, 2020 at 14:52
  • @TooTea I also point out that there is no account of single-quoted double-quotes in this answer, which simply answers what the question asks and nothing more. For a tool that will offer more versatility, see Kusalananda's answer.
    – Quasímodo
    Commented Sep 24, 2020 at 17:54

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .