
Add Cohere API as available language model #395

Open
wants to merge 30 commits into base: master
Conversation


@rdnfn rdnfn commented Mar 10, 2023

This PR adds the Cohere API as an available language model in lm_eval/models/cohere_lm.py, addressing https://github.com/EleutherAI/lm-eval2/issues/12. It also includes a minimal test for the new model in tests/test_models.py.

Key differences between Cohere (v4.1.0) and OpenAI (v0.27.1) Python APIs:

  • Cohere's co.generate() method does not accept tokens as the prompt (only strings), but it can return tokens and their loglikelihoods.
  • The Cohere API does not appear to have an equivalent of the logprobs argument in the OpenAI API. Thus, a second greedy API call is necessary to check whether the original continuation was greedy.
  • The Cohere API does not accept empty contexts. To still process this type of request, a newline token is used as the context so that the greedy check can run.
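Since the API exposes no logprobs-style argument, the dual-call scheme reduces to comparing the original continuation's tokens against the tokens returned by a second, temperature-0 call. A minimal sketch of that comparison (the function name and token ids are illustrative, not the PR's actual code):

```python
def continuation_was_greedy(continuation_tokens, greedy_tokens):
    """Return True iff a greedy (temperature=0) generation from the same
    context reproduces the original continuation token-for-token."""
    n = len(continuation_tokens)
    return len(greedy_tokens) >= n and greedy_tokens[:n] == continuation_tokens

# Illustrative token ids:
print(continuation_was_greedy([5, 9, 2], [5, 9, 2, 7]))  # True
print(continuation_was_greedy([5, 9, 2], [5, 8, 2, 7]))  # False
```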

Open questions:

  • I am not sure what tokenizer cohere.ai uses, so I only use their own tokeniser API; doing the tokenisation locally would likely be faster and more cost-effective, however.
  • Why does the gpt3.py LM class process loglikelihood requests in descending order of overall token length, while the reverse seems true for greedy generation? (The _collate(x) function returns a negative token length in the first case and a positive one in the latter.)

Other notes:

  • Noticed that the gitignore file is currently very short; I had to add multiple standard artifacts to it. Could consider using the standard Python gitignore file in a full rewrite, e.g. https://github.com/github/gitignore/blob/main/Python.gitignore
  • Note that I have not done more extensive testing with a proper Benchmark as I don't have any free credits from Cohere.

Thanks for taking a look at this!

@rdnfn rdnfn changed the title aux: add egg-info to gitignore Mar 10, 2023
@rdnfn rdnfn marked this pull request as ready for review March 22, 2023 12:56
@StellaAthena
Member

Thank you for this contribution! We collaborate with Cohere and can look at getting some credits for testing.

Cohere does not accept tokens (instead only strings) directly as the prompt via the co.generate( ) method, but can return tokens and their loglikelihoods.

The cohere API does not appear to have anything like the logprobs argument in the OpenAI API. Thus, a second greedy API call is necessary to check if original continuation was greedy.

This is weird, I rather thought they did. I’m going to investigate this a bit.

The cohere API does not accept empty contexts. In order to still be able to process this type of request a newline token is used as context to be able to check if greedy.

I think it’s probably a better idea to raise an exception and inform the user that the CohereAPI doesn’t permit this. I’m worried that using a new-line token as a substitute for an empty context can introduce hidden gotchas into other people’s experiments.

I am not sure what tokenizer cohere.ai uses, thus I only use their own tokeniser API -- it's likely more cost-effective/faster to do the tokenisation locally however.

I think this is likely the most robust approach, especially if they start using different tokenizers for different models the way OpenAI does.

Why does the gpt3.py LM class process loglikelihood requests in descending order of overall token length, while the reverse seems true for greedy generation? (The _collate(x) function returns a negative token length in the first case and a positive one in the latter.)

I’m not sure, I’ll find out. @jon-tow @haileyschoelkopf do you happen to know?

@rdnfn
Author

rdnfn commented Mar 22, 2023

Great, thanks for the quick response @StellaAthena!

The cohere API does not appear to have anything like the logprobs argument in the OpenAI API. Thus, a second greedy API call is necessary to check if original continuation was greedy.

About this: I raised this in the cohere discord (here) and the dual-call (with one greedy) was the method suggested there.

I am not sure what tokenizer cohere.ai uses, thus I only use their own tokeniser API -- it's likely more cost-effective/faster to do the tokenisation locally however.

I think this is likely the most robust approach, especially if they start using different tokenizers for different models the way OpenAI does.

Sounds good. I will need to make some changes to fully remove the original tokenizer used in the GPT-3 model.

The cohere API does not accept empty contexts. In order to still be able to process this type of request a newline token is used as context to be able to check if greedy.

I think it’s probably a better idea to raise an exception and inform the user that the CohereAPI doesn’t permit this. I’m worried that using a new-line token as a substitute for an empty context can introduce hidden gotchas into other people’s experiments.

Makes sense, I will change this. Maybe I should leave the adding-a-newline hack as an option when instantiating the model class, but disable it by default? I can't really judge how often this edge case would be relevant - just saw it in the tests for other models.
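That option could look roughly like the following sketch (the class and parameter names here are hypothetical, not the PR's actual code):

```python
class CohereLMSketch:
    """Illustrative handling of empty contexts, not the PR's actual class."""

    def __init__(self, allow_newline_context=False):
        # Off by default, per the review suggestion to fail loudly.
        self.allow_newline_context = allow_newline_context

    def _prepare_context(self, context):
        if context != "":
            return context
        if self.allow_newline_context:
            # Known gotcha: "\n" is not a neutral prefix and can change
            # both the loglikelihoods and the greedy check.
            return "\n"
        raise ValueError(
            "The Cohere API does not accept empty contexts; "
            "instantiate with allow_newline_context=True to substitute '\\n'."
        )
```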

@jon-tow
Member

jon-tow commented Mar 23, 2023

Why does the gpt3.py LM class process loglikelihood requests in descending order of overall token length, while the reverse seems true for greedy generation? (The _collate(x) function returns a negative token length in the first case and a positive one in the latter.)

I'm not entirely sure, but I'd guess that if you're worried about timeouts you'd want to make completion calls on shorter prompts first to avoid any network issues that could result from slow sampling on large sequences. This way you also get to cache results immediately.
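The two orderings being discussed can be illustrated with their sort keys (the request shapes below are simplified; this is a sketch of the behaviour, not the harness's exact code):

```python
# Each request is (context_str, context_tokens, continuation_tokens).
requests = [
    ("a", [1], [2, 3, 4]),
    ("bb", [1, 2], [3]),
    ("ccc", [1, 2, 3], [4, 5]),
]

# loglikelihood: negative key => longest requests first, so memory or
# timeout problems on big sequences surface immediately.
by_desc = sorted(requests, key=lambda x: -len(x[1] + x[2]))

# greedy generation: positive key => shortest prompts first, so quick
# requests complete (and are cached) before long, slow ones.
by_asc = sorted(requests, key=lambda x: len(x[1]))

print([r[0] for r in by_desc])  # ['ccc', 'a', 'bb']
print([r[0] for r in by_asc])   # ['a', 'bb', 'ccc']
```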

@haileyschoelkopf
Contributor

haileyschoelkopf commented Mar 23, 2023

Just catching up with this discussion--re: the empty context issue:

I'm pretty sure that in other models the solution to an empty input context is to pass in only the EOT token instead of an empty string. The case where you want to input an empty string as context is perplexity calculation, right? Sorry if I've missed something; I will read through your code more fully, this was just my first instinct here.

@rdnfn
Author

rdnfn commented Mar 24, 2023

@haileyschoelkopf good point about the EOT token being passed, I hadn't noticed this code before! However, for the cohere API, we give the prompt as a string and not as tokens. So we don't use the encoded context but rather the original string. That's why the code you mentioned doesn't actually change much in the cohere class. I am not sure if it could be applied.

But it's good that you mention the use case for word perplexity. Maybe the is_greedy return is really not relevant for empty contexts? The is_greedy value is the only reason we would need to add another token to the empty context; we can compute the logprobs without this hack. Maybe we should compute the logprobs normally and return None for is_greedy?
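That suggestion could be sketched as a small post-processing step (the function name is hypothetical, not the PR's code):

```python
def score_request(logprob_sum, is_greedy, context):
    """Return (loglikelihood, is_greedy) for one request. For empty
    contexts the logprobs are still well-defined, but the greedy check
    needs a non-empty prompt, so report it as unknown (None) rather
    than substituting a fake context."""
    if context == "":
        return (logprob_sum, None)
    return (logprob_sum, is_greedy)

print(score_request(-1.5, True, ""))   # (-1.5, None)
print(score_request(-1.5, True, "x"))  # (-1.5, True)
```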

EDIT: I have now fixed some things (in the commits below):

  1. I updated the tokenisation procedure to use the Cohere API and removed remaining GPT-3 code, also avoided tokenisation where not necessary
  2. I tried to make all the pre-commit checks pass (but note that I am not sure if codespell will pass, there are a number of files not touched by this PR that fail)
  3. I also updated the previously failing "build" CI check to install cohere deps

EDIT 2:

  1. I now removed the old hack of adding a newline for empty contexts, and instead added a parameter disable_is_greedy_computation (defaults to False).
This replaces old transformer tokenizer with API calls.
Also removes tokenisation where not strictly necessary.
Removes one of two newlines at end of file
@LauraRuis

I'm happy to test this code with my Cohere credits which I obtained for a research project with Stella. Are there standardised more extensive tests that are used for the harness that I can run?

@rdnfn
Author

rdnfn commented Mar 28, 2023

That would be brilliant, thanks @LauraRuis! What tests would be good/necessary/recommended to run @haileyschoelkopf? Not familiar enough to make a judgement on this myself.

@haileyschoelkopf
Contributor

Thanks @LauraRuis for the help!

How would

perplexity task:
wikitext

loglikelihood / mcqa:
lambada_openai
piqa
arc_easy
arc_challenge

greedy gen:
gsm8k
(another generation task of your choice if you'd like?)

Sound? If this is too many to run / too many credits required then not a problem! Just wanted to try to hit at least one task of each type.

@LauraRuis

I'm getting a NotImplementedError from line 198 in base.py:

...
 File "/Users/laura/Documents/Projects/lm-evaluation-harness/lm_eval/base.py", line 198, in loglikelihood_rolling
    prefix_token=self.eot_token_id,
  File "/Users/laura/Documents/Projects/lm-evaluation-harness/lm_eval/models/cohere_lm.py", line 69, in eot_token_id
    raise NotImplementedError()

I called:

python main.py \
    --model cohere \
    --tasks wikitext \
    --check_integrity

Am I calling the wrong thing perhaps @rdnfn?

unlike the gpt3 model we need a custom loglikelihood_rolling implementation because we don't know the eot token, thus we use a separate prefix token. Preferred setting a separate prefix token to avoid confusion.

This also adds the capability to _loglikelihood_tokens to use tokens instead of str requests
@rdnfn
Author

rdnfn commented Apr 4, 2023

Thanks for catching this error @LauraRuis, and sorry for the delay! I missed a detail about the loglikelihood_rolling() method. The last two commits should fix this issue. I added and ran a minimal test, but it would be great if you could rerun your more extensive test!

Note that this fix now relies again on the newline hack: a newline string ("\n") is used as the context (only) when computing the loglikelihood of the first token in a loglikelihood_rolling computation. I can't use the end-of-text token because I could not figure out what Cohere's end-of-text token is in either string or numerical form.
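The windowing this describes can be sketched as follows (a hypothetical helper for illustration; the real harness uses its own utilities):

```python
def rolling_windows(tokens, max_seq_len, prefix="\n"):
    """Split a token sequence into (context, continuation) windows for a
    rolling loglikelihood computation. Only the first window uses the
    newline prefix, standing in for Cohere's unknown end-of-text token;
    later windows are conditioned on the preceding tokens."""
    windows = []
    start = 0
    while start < len(tokens):
        end = min(start + max_seq_len, len(tokens))
        if start == 0:
            context = [prefix]
        else:
            context = tokens[max(0, start - (max_seq_len - 1)):start]
        windows.append((context, tokens[start:end]))
        start = end
    return windows

print(rolling_windows(["t1", "t2", "t3"], max_seq_len=2))
# [(['\n'], ['t1', 't2']), (['t2'], ['t3'])]
```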

@LauraRuis

Thanks @rdnfn ! However, now I'm getting an internal server error:

line 519, in _check_response
    raise CohereAPIError(
cohere.error.CohereAPIError: internal server error, this has been reported to our developers

I think it has something to do with the fact that the input exceeds the maximum length and needs to be truncated. If I only pass the context instead of context + continuation in line 219 in cohere_lm.py like so:

response = self.cohere_client.generate(
                   model=self.model,  # "medium" or "xlarge"
                   prompt=context,
                   max_tokens=0,
                   temperature=0.0,
                   return_likelihoods="ALL",
                   truncate=self.truncate,
               )

It does work.

@rdnfn
Author

rdnfn commented Apr 9, 2023

@LauraRuis thanks for running the test again! That's quite a confusing error! The truncation should be set by the truncate=self.truncate option in client.generate(). By default I set this to "START", which should automatically discard the start of the input according to their docs.

Is there any chance that the eval harness overwrites my default to be something else (I set it with a truncate argument during model initialisation)? Or otherwise would you be able to potentially extract a single failure case (prompt + continuation) for me to investigate?

@LauraRuis

No it's set to "START" as you say.

Here's the first context + continuation that immediately fails:

experience a sense of peace in a setting, in contrast to traditionally bleak graveyards. Recommendations given by Arthur William Hill, the Assistant Director of the Royal Botanical Gardens at Kew enabled the Commission to develop cemetery layouts and architectural structures that took into account the placement of suitable plant life. Combining structural and horticultural elements was not unfamiliar to the Commission's architects. Sir Edwin Lutyens furthered his long-standing working relationship with horticulturist Gertrude Jekyll, whose devotion to traditional cottage garden plants and roses greatly influenced the appearance of the cemeteries. Where possible, indigenous plants were utilised to enhance sentimental associations with the gardens of home.
Variety in texture, height and timing of floral display were equally important horticultural considerations. The beds around each headstone are planted with a mixture of floribunda roses and herbaceous perennials. Low-growing plants are chosen for areas immediately in front of headstones, ensuring that inscriptions are not obscured and preventing soil from splashing back during rain. In cemeteries where there are pedestal grave markers, dwarf varieties of plants are used instead.
The absence of any form of paving between the headstone rows contributes to the simplicity of the cemetery designs. Lawn paths add to the garden ambiance, and are irrigated during the dry season in countries where there is insufficient rain. Where irrigation is inappropriate or impractical, dry landscaping is an ecological alternative favoured by the Commission's horticulturists, as is the case in Iraq. Drier areas require a different approach not only for lawns, but also to plants and styles of planting. Similarly, there are separate horticultural considerations in tropical climates. When many cemeteries are concentrated within a limited area, like along the Western Front or Gallipoli peninsula, mobile teams of gardeners operate from a local base. Elsewhere, larger cemeteries have their own dedicated staff while small cemeteries are usually tended by a single gardener working part-time.

== Organisation ==


=== Commissioners ===

The affairs of the CWGC are overseen by a Board of Commissioners. The president of the board is Prince Edward, Duke of Kent, the chairman is United Kingdom Secretary of State for Defence Michael Fallon and the vice-chairman Vice-Admiral Tim Laurence. The members are: the High Commissioner for New Zealand to the United Kingdom Lockwood Smith, the High Commissioners of Australia to the United Kingdom Alexander Downer, the Acting High Commissioner of the Republic of South Africa to the United Kingdom Obed Mlaba, the High Commissioner for India to the United Kingdom Ranjan Mathai, the High Commissioner for Canada to the United Kingdom Gordon Campbell, Hew Strachan, Keith Simpson, Kevan Jones, Edward Chaplin, Robert Fox, Ros Kelly and Lieutenant General Bill Rollo. Victoria Wallace is the Director-General of the CWGC and serves as secretary. The board also has an Honorary Artistic Adviser, Peter Inskip.

=== Functional structure ===

The CWGC is headquartered in Maidenhead, England. Offices or agencies that are each responsible for a specific geographical area manage the worldwide affairs of the organisation. They are:
France Area is headed by a director and is responsible for France (including the island of Corsica), Monaco and Switzerland.
Northern Europe Area, headed by a director and responsible for Austria, Belgium, Czech Republic, Denmark, Estonia, Germany, Hungary, Latvia, Lithuania, Luxembourg, Netherlands, Norway, Poland and Sweden.
United Kingdom Area, headed by a director and responsible for Channel Islands, Faroe Islands, Iceland, Ireland, Isle of Man and the United Kingdom
Mediterranean Area headed by a director and responsible for Albania, Algeria, Azerbaijan, Azores, Bahrain, Canary Islands, Croatia, Cyprus, Egypt, Gibraltar, Greece, Israel and Palestine, Italy, Jordan, Lebanon, Libya, Macedonia, Madeira, Malta, Mauritania, Morocco, Oman, Portugal, San Marino, Saudi Arabia, Serbia, Spain, Syria, Tunisia, Turkey, United Arab Emirates and Yemen
Canadian Agency is headed by a secretary-general and responsible for Canada, the entire Americas (including the Caribbean)
Australia, managed by the Office of Australian War Graves in the Australian Department of Veterans Affairs on behalf of the CWGC, is responsible for Australia, Norfolk Island, Papua New Guinea and the Solomon Islands
New Zealand, managed by the New Zealand Ministry of Culture and Heritage on behalf of the CWGC, is responsible for New Zealand, New Caledonia, Samoa, Society Islands, Tonga and Vanuatu
South Africa Agency is headed by a secretary and is responsible for Republic of South Africa, Namibia, Saint Helena and Ascension Island
Africa, Asia and Pacific Area is headed by a director and is responsible for areas not covered by any of the other bodies.

=== Financing ===

The CWGC's work is funded predominantly by grants from the governments of the six member states. In the fiscal year 2012 / 13, these grants amounted to £ 58.6 million of the organisation's £ 66.5 million of income. This equates to an approximate cost of C $ 85 per commemorated war dead. The contribution from each country is proportionate to the number of graves the CWGC maintains on behalf of that country. The percentage of total annual contributions for which each country is responsible is United Kingdom 78.4 %, Canada 10.1 %, Australia 6.1 %, New Zealand 2.1 %, South Africa 2.1 % and India 1.2 %.

== Ongoing projects and issues ==


=== War Graves Photographic Project ===

A project is underway to photograph the graves of and memorials to all service personnel from 1914 to the present day and make the images available to the public. The work is being carried out by The War Graves Photographic Project in conjunction with the CWGC. As of August 2013, the project has recorded 1.7 million photographs for posterity.

=== Reburials and identifications ===

Immediately following the First World War, the British Army remained responsible for the exhumation of remains. The Western Front was divided into sectors and combed for bodies by 12-man exhumation units. Between the Armistice and September 1921, the exhumation units reburied 204,695 bodies. After 1921, no further widespread search for bodies was undertaken and in February 1921 responsibility of the cemeteries was transferred to the Commission. Despite the rigorous searches, bodies continued to be discovered in numbers. In the three years following the conclusion of the general search 38,000 bodies were discovered. In the mid 1920s, 20 to 30 bodies were being discovered weekly.
The discovery of remains of First and Second World War casualties remains a common occurrence with approximately 30 bodies discovered annually. For example, in 2006 eight bodies of Canadian soldiers from the 78th Battalion (Winnipeg Grenadiers), CEF were discovered in a backyard in Hallu, France. In April 2013, the remains of four British soldiers discovered by a French farmer clearing land with metal detector in 2009 were re-interred at H.A.C. Cemetery near Arras, France. In March 2014, the remains of 20 Commonwealth and 30 German soldiers were discovered in Vendin-le-Vieil, France with the Commonwealth soldiers being subsequently reburied at Loos British Cemetery.
When the remains of a Commonwealth soldier from the First or Second World War is discovered the Commission is notified and a Commission burial officer tries to collect any associated artifacts that may help in identify the remains. The details are then registered and archived at the Commission's headquarters. the collection of evidence can include artifacts with the remains, anthropological data and DNA. The archival records of the commission are open to the public to permit individuals to conduct their own research. Investigation of archival records by members of the public periodically result in the identification of previously buried casualties. In December 2013, it was discovered that Second Lieutenant Philip Frederick Cormack, who was previously commemorated on the Arras Flying Services Memorial, had in fact been buried in a French military cemetery in Machelen, East-Flanders in Belgium. Sergeant Leonard Maidment was identified in 2013 after a visitor to Marfaux British Cemetery discovered a headstone of an unknown sergeant with the Hampshire Regiment killed on 20 July 1918 and was subsequently able to show that only one sergeant from that regiment had been killed in France on that date.

=== Vandalism ===

Cemeteries, including those of war dead, are targets for vandalism. The gravestones, cemeteries and buildings of the Commission are no exception. The Commission believes that graffiti and damage to stonework are usually the pursuits partaken by young people, noting the number of incidents increases when schoolchildren are on school holidays. Determined thieves will also steal the bronze swords off the Cross of Sacrifice, which are now replaced with identical ones made in fibreglass.
The vandalism of Commission cemeteries has also been connected to the participation of Commonwealth countries in contemporary conflicts. In the 1970s, in The Troubles, Commission cemeteries in Ireland experienced vandalism. Vandals defaced the central memorial of the Étaples Military Cemetery in northern France with anti-British and anti-American graffiti on 20 March 2003 immediately after the beginning of the Iraq War. On 9 May 2004, thirty-three headstones were demolished in the Gaza cemetery, which contains 3,691 graves, allegedly in retaliation for the Abu Ghraib prisoner abuse scandal. On 24 February 2012, during the Libyan Civil War, an Islamist militia damaged over 200 headstones in the Benghazi war cemetery as well as the central memorial.

@StellaAthena
Member

StellaAthena commented Apr 9, 2023

@rdnfn In case you’re unaware, you can pass arbitrary named arguments to the model constructor using --model_args

I agree with you that passing START should prevent the exact error that seems to be happening. @LauraRuis what happens if, instead of calling the Cohere API, you modify the code to print out the command and then run it independently? Also, if you could post the command here, others can help debug.

@LauraRuis

LauraRuis commented Apr 10, 2023

The command I'm running is:

python main.py \
    --model cohere \
    --tasks wikitext \
    --check_integrity

If I print out the command and try it independently it fails after a really long time with a code 500 (internal server error):

raise CohereAPIError(
cohere.error.CohereAPIError: internal server error, this has been reported to our developers

You can reproduce this error more quickly by setting max_retries to a smaller number (e.g. 1 or 10) where cohere.Client is initialised in line 63 in cohere_lm.py. I've let it run with the current default (max_retries=100) and it just fails with the same error after 100 tries. Additionally, if you pass the context or continuation separately, it works.

To me it seems like truncating is the issue. E.g. if I run a prompt that has one token too many without truncate set, like so:

test_prompt = "This is a test." * 410
response = self.cohere_client.generate(
    model=self.model,  # "medium" or "xlarge"
    prompt=test_prompt[:-1],
    max_tokens=0,
    temperature=0.0,
    return_likelihoods="ALL",
    # truncate=self.truncate,
)
print(response)

The response is:

cohere.error.CohereAPIError: too many tokens: total number of tokens in the prompt cannot exceed 2048 - received 2049. Try using a shorter prompt, or enabling prompt truncating. See https://docs.cohere.ai/reference/generate for more details.

And if I run with the truncate set to "START" which should handle this, I get the internal server error again:

test_prompt = "This is a test." * 410
response = self.cohere_client.generate(
    model=self.model,  # "medium" or "xlarge"
    prompt=test_prompt[:-1],
    max_tokens=0,
    temperature=0.0,
    return_likelihoods="ALL",
    truncate=self.truncate,
)
print(response)
raise CohereAPIError(
cohere.error.CohereAPIError: internal server error, this has been reported to our developers

If I take this out of the code and just run independently, the same thing happens. Changing to truncate="END" does not help. Once the text doesn't need to be truncated anymore, it does work again.

@rdnfn
Author

rdnfn commented Apr 10, 2023

Thanks @StellaAthena and @LauraRuis for looking into this further!

I was able to reproduce the error with an example similar to Laura's. This seems to be a bug inside Cohere's API and I reported it on Cohere's discord (see here). Let's see if they might come up with a quick fix for this issue.

I would first wait for their response. But if there is no quick fix from their side, we could set max_tokens=1 and discard the logprob of the last token; this seems to work around their API issue.
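The proposed workaround (request max_tokens=1, then drop the trailing generated token's logprob) could be sketched as a small post-processing step; the helper name and the (token, loglikelihood) pair shape are illustrative assumptions:

```python
def prompt_likelihoods(token_likelihoods):
    """With max_tokens=1 the response covers the prompt's tokens plus one
    generated token; discard the last entry so only the prompt's per-token
    loglikelihoods remain."""
    return token_likelihoods[:-1]

# Illustrative (token, loglikelihood) pairs; the last pair is the
# generated token whose logprob we discard:
resp = [("The", -0.1), ("cat", -2.3), ("sat", -1.7), ("on", -0.4)]
print(prompt_likelihoods(resp))
# [('The', -0.1), ('cat', -2.3), ('sat', -1.7)]
```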

@StellaAthena
Member

Okay, looks like we're in a holding pattern until Cohere fixes their API then. Thanks for looking into this.

@rdnfn
Author

rdnfn commented Apr 19, 2023

Good news: Cohere have now confirmed that the bug should be fixed. Some preliminary tests on my side confirm this. @LauraRuis could you try to run the tests again?

@CLAassistant

CLAassistant commented Apr 23, 2023

CLA assistant check
All committers have signed the CLA.

@StellaAthena
Member

@LauraRuis are we good to merge?

@LauraRuis

Sorry, running tests on the tasks Hailey suggested now. They are incredibly slow, so I will come back here once done. (Incredibly slow meaning many API requests each take well over a minute.)

@StellaAthena
Member

Sorry, running tests on the tasks Hailey suggested now. They are incredibly slow, so I will come back here once done. (Incredibly slow meaning many API requests each take well over a minute.)

Yikes. Is this an inefficiency on our end or a limitation of the API?

@LauraRuis

API's side. Also, the code now works for the first two calls in Wikitext, but the same error is back for the third call. We'll have to go back to Cohere and see what's going on.

Error:

raise CohereAPIError(
cohere.error.CohereAPIError: internal server error, this has been reported to our developers
@haileyschoelkopf
Contributor

haileyschoelkopf commented Jun 7, 2023

Following up on this PR! Did we ever hear back from Cohere? If not, I can test out this PR again myself since Cohere provides trial API access.

@rdnfn
Author

rdnfn commented Jun 9, 2023

Thanks for following up @haileyschoelkopf! We are currently still waiting to hear back from Cohere on their discord, just sent a follow-up reminder myself over there (you can find the bug report conversation here, requires access to "cohere community" discord).

EDIT: Good news! The issues on cohere's side have now been resolved and with two small fixes this all appears to work now on wikitext. Will run some more tests soon (beyond wikitext) to confirm.

also prevent tokenisation of strings that are too long
@StellaAthena
Member

@rdnfn Following up, how did the tests go?

@rdnfn
Author

rdnfn commented Dec 20, 2023

Hi @StellaAthena thanks for following-up! I was unable to run further tests at the time for unrelated reasons, and unfortunately no longer have time/capacity to further debug and test this implementation. Happy for somebody else to pick this PR up if they are keen on the Cohere integration, otherwise we can close the PR if there is no further interest.

@StellaAthena
Member

Hi @StellaAthena thanks for following-up! I was unable to run further tests at the time for unrelated reasons, and unfortunately no longer have time/capacity to further debug and test this implementation. Happy for somebody else to pick this PR up if they are keen on the Cohere integration, otherwise we can close the PR if there is no further interest.

I'm pretty keen on it and can see about handing off testing to someone else. Thanks!
