Hey Cameron, love the depth of the article! I have a question for you regarding the retrieval and fine-tuning article: what are your thoughts on OpenAI releasing the ability for users to build custom GPTs? Does that do away with fine-tuning? Have you seen or tested their effectiveness? Thank you!
The functionality is really cool/useful, but I definitely don't think it does away with fine-tuning! Practitioners will still want to build smaller/customized models that they can host in-house and specialize over their own data. Depending on the use case, people might not be comfortable using centralized/proprietary models that they don't have control over.
Thank you for your answer!
How do you know there is no BEIR contamination, or data contamination more generally, in the synthetic data generated by GPT-4?
The fine-tuned LLM (in your last or second-to-last paper) that was trained on a mixture of synthetic and other data and ended up beating some BEIR benchmarks was surprising, and I'm wondering whether that is a fair benchmark.
Great point! This is definitely possible. We probably need to standardize the reporting of contamination metrics as a community to ensure that we are seeing actual performance benefits and not simply training on the test set. However, this is somewhat difficult when GPT-4's training dataset is unknown/proprietary (we can only estimate contamination by downloading a ton of data from the internet).
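For a rough idea of what I mean by a contamination metric, here's a minimal sketch of an n-gram overlap check over the data we can actually inspect (the 13-gram window and function names are just my own illustrative choices, not from any particular paper):

```python
# Minimal sketch of an n-gram overlap contamination check.
# Assumption: we only have the benchmark test set and the synthetic data we
# generated -- GPT-4's actual training data is unavailable, so this can only
# flag overlap with data we can inspect ourselves.

def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams for a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(synthetic_docs: list, test_docs: list, n: int = 13) -> float:
    """Fraction of synthetic documents sharing at least one n-gram with the test set."""
    test_ngrams = set()
    for doc in test_docs:
        test_ngrams |= ngrams(doc, n)
    flagged = sum(1 for doc in synthetic_docs if ngrams(doc, n) & test_ngrams)
    return flagged / max(len(synthetic_docs), 1)

# Example usage (placeholder variable names):
# rate = contamination_rate(generated_queries, beir_test_passages)
# print(f"{rate:.1%} of synthetic examples overlap with the test set")
```

Reporting something like this alongside benchmark results would at least make the overlap (or lack of it) visible, even if it can't rule out contamination inside GPT-4's own pretraining data.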
> The first step in applying DocLLM is to pass a document through an optical character recognition (ORC) system.
Do you have any suggestions for a reliable open-source OCR system?
Btw, there is a typo in your article ("ORC" should be "OCR").
I typically use Tesseract, but I know its performance can lag behind certain proprietary solutions (e.g., the Azure OCR API). OCR systems have been improving rapidly over the last year, so I'm sure more open-source systems will be released soon.
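In case it's useful, here's a minimal sketch of calling Tesseract from Python via pytesseract (assumes the tesseract binary plus the pytesseract and Pillow packages are installed; the file path is a placeholder):

```python
# Minimal sketch: extract text from a scanned page with Tesseract.
# Requires the tesseract binary plus `pip install pytesseract pillow`.
import pytesseract
from PIL import Image

image = Image.open("scanned_page.png")  # placeholder path

# Plain text extraction
text = pytesseract.image_to_string(image)
print(text)

# Word-level bounding boxes, which is the kind of layout information
# a spatially-aware model like DocLLM expects as input
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, x, y, w, h in zip(
    data["text"], data["left"], data["top"], data["width"], data["height"]
):
    if word.strip():
        print(word, (x, y, w, h))
```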
Another notable article to keep me abreast of what's going on in the field. Thanks a lot, Cameron!
Of course! Glad you liked the article :)