Parikshit Sondhi

Menlo Park, California, United States
2K followers 500+ connections

About

Researcher with 15+ years of experience working on machine learning problems…

Experience & Education

  • AI Strategy Consultants LLC.

Publications

  • Data Science for Retail and E-Commerce (DSRE) (Organizer)

    SIAM International Conference on Data Mining (SDM)

    Faizan Javed, Mohammad Al Hasan, Parikshit Sondhi, B. Aditya Prakash, Mohit Sharma

  • Workshop on Natural Language Processing in E-Commerce (Organizer)

    COLING, EcomNLP

    Huasha Zhao, Parikshit Sondhi, Nguyen Bach, Sanjika Hewavitharana, Yifan He, Luo Si, Heng Ji

  • Empirical Analysis of Impact of Query-Specific Customization of nDCG: A Case-Study with Learning-to-Rank Methods

    CIKM 2020

    Shubhra (Santu) K Karmaker, Parikshit Sondhi, ChengXiang Zhai

    In most existing works, nDCG is computed for a fixed cutoff, i.e., nDCG@k, and some fixed discounting coefficient. Such a conventional query-independent way to compute nDCG does not accurately reflect the utility of search results perceived by an individual user and is thus non-optimal. In this paper, we conduct a case study of the impact of using query-specific nDCG on the choice of the optimal Learning-to-Rank (LETOR) methods, particularly to see whether using a query-specific nDCG would lead to a different conclusion about the relative performance of multiple LETOR methods than using the conventional query-independent nDCG would otherwise. Our initial results show that the relative ranking of LETOR methods using query-specific nDCG can be dramatically different from those using the query-independent nDCG at the individual query level, suggesting that query-specific nDCG may be useful in order to obtain more reliable conclusions in retrieval experiments.

  • Modeling Sequential Online Interactive Behaviors with Temporal Point Process

    CIKM 2018

    Renqin Cai, Xueying Bai, Yuling Shi, Zhenrui Wang, Parikshit Sondhi and Hongning Wang. 2018. Modeling Sequential Online Interactive Behaviors with Temporal Point Process. In Proceedings of CIKM ’18. ACM, Turin, Italy. (Accepted)

  • A Taxonomy of Queries for E-commerce Search

    SIGIR 2018

    Understanding the search tasks and search behavior of users is necessary for optimizing search engine results. While much work has been done on understanding the users in Web search, little knowledge is available about the search tasks and behavior of users in the E-Commerce (E-Com) search applications. In this paper, we conduct the first empirical study of the queries and search behaviors of users in E-Com search by analyzing a search log data set from a major E-Com search engine. The analysis results show that E-Com queries can be categorized into roughly five categories, each with distinctive search behaviors: (1) Exploration Queries are short vague queries that a user may use initially in exploring the product space. (2) Targeted Purchase Queries are queries used by users to purchase items that they are generally familiar with, thus without much decision making. (3) Major-Item Shopping Queries are used by users to shop for a major item which is often relatively expensive and thus requires some serious exploration, but typically in a limited scope of choices. (4) Minor-Item Shopping Queries are used by users to shop for minor items that are generally not very expensive, but still require some exploration of choices. (5) Hard-Choice Shopping Queries are used by users who want to deeply explore all the candidate products before finalizing the choice, often appropriate when multiple products must be carefully compared with each other. These five categories form a taxonomy for E-Com queries and can shed light on how we may develop customized search technologies for each type of search query to improve search engine utility.

  • Mining E-Commerce query relations using customer interaction networks

    The Web Conference (WWW), 2018

    Customer Interaction Networks (CINs) are a natural framework for representing and mining customer interactions with E-Commerce search engines. Customer interactions begin with the submission of a query formulated based on an initial product intent, followed by a sequence of product engagement and query reformulation actions. Engagement with a product (e.g., clicks) signals its relevance to the customer’s product intent. Reformulation to a new query indicates either dissatisfaction with current results, or an evolution in the customer’s product intent. Analyzing such interactions within and across sessions enables us to discover various query-query and query-product relationships.

    In this work, we begin by studying the properties of a real-world customer interaction network developed using Walmart.com’s product search logs. We observe that CINs exhibit significantly different properties compared to other real-world networks (e.g., WWW, social networks, etc.), making it possible to mine intent relationships between queries, based purely on their structural information. In particular, we show that one can formulate the problem of clustering queries with similar intents as a community detection task on CINs. Our results show that existing community detection methods already do a good job at identifying intent-based query clusters, without using any textual features. We further identify their limitations and propose improved methods for the task. Finally, we show how these relations can be exploited to a) significantly improve search quality for poorly performing queries, and b) identify the most influential (a.k.a. critical) queries whose search quality is crucial in enabling an E-Commerce search engine to satisfy the most customers. Via extensive experiments, we show that our CIN-based methods significantly outperform existing baselines in practice.

  • On Application of Learning to Rank for E-Commerce Search

    SIGIR 2017

    Details our work on applying learning-to-rank (LETOR) methods to e-commerce search.

  • Generative Feature Language Models for Mining Implicit Features from Customer Reviews.

    CIKM

    Online customer reviews are very useful for both helping consumers make buying decisions on products or services and providing business intelligence. However, it is a challenge for people to manually digest all the opinions buried in large amounts of review data, raising the need for automatic opinion summarization and analysis. One fundamental challenge in automatic opinion summarization and analysis is to mine implicit features, i.e., recognizing the features implicitly mentioned (referred to) in a review sentence. Existing approaches require much ad hoc manual parameter tuning, and are thus hard to optimize or generalize; their evaluation has only been done with Chinese review data. In this paper, we propose a new approach based on generative feature language models that can mine the implicit features more effectively through unsupervised statistical learning. The parameters are optimized automatically using an Expectation-Maximization algorithm. We also created eight new data sets to facilitate evaluation of this task in English. Experimental results show that our proposed approach is very effective for assigning features to sentences that do not explicitly mention the features, and outperforms the existing algorithms by a large margin.

    Other authors
    • Shubhra Kanti Karmaker Santu
    • ChengXiang Zhai
  • Resolving Healthcare Forum Posts via Similar Thread Retrieval

    ACM BCB

    Other authors
    • Chengxiang Zhai
    • Bruce Schatz
  • A Bayesian Framework for Modeling Price Preference in Product Search

    NIPS Workshop on Analysis of Rank Data 2014

    Product search is an emerging search application where optimization of search results relies critically on an accurate model of a user’s price preference. In this paper, we propose a Bayesian framework for modeling a user’s price preference with a particular focus on developing a smart price filter model for inferring a user’s price preference based on the user’s selection of price filters and optimizing ranking of products accordingly. Preliminary experiment results with product search log show promise of the framework, which opens up interesting opportunities for new research in the intersection of machine learning, information retrieval and economics.

    Other authors
    • Yinan Zhang
    • ChengXiang Zhai
    • Anjan Goswami
  • A Constrained Hidden Markov Model Approach for Non-Explicit Citation Context Extraction

    SDM 2014

    In this paper we present a constrained hidden Markov model based approach for extracting non-explicit citing sentences in research articles. Our method involves first independently training a separate HMM for each citation in the article being processed and then performing a constrained joint inference to label non-explicit citing sentences. Results on a standard test collection show that our method significantly outperforms the baselines and is comparable to the state-of-the-art approaches.

    Other authors
    • ChengXiang Zhai
  • Mining Semi-Structured Online Knowledge Bases to Answer Natural Language Questions on Community QA Websites

    CIKM 2014

    Over the past few years, community QA websites (e.g. Yahoo! Answers) have become a useful platform for users to post questions and obtain answers. However, not all questions posted there receive informative answers or are answered in a timely manner. In this paper, we show that the answers to some of these questions are available in online domain-specific knowledge bases and propose an approach to automatically discover those answers. In the proposed approach, we would first mine appropriate SQL query patterns by leveraging an existing collection of QA pairs, and then use the learned query patterns to answer previously unseen questions by returning relevant entities from the knowledge base. Evaluation on a collection of health domain questions from Yahoo! Answers shows that the proposed method is effective in discovering potential answers to user questions from an online medical knowledge base.

    Other authors
    • ChengXiang Zhai
  • Autonomous agents for serving complex information needs

    University of Illinois

  • Exploiting Forum Thread Structures to Improve Thread Clustering

    ACM ICTIR

    Automated clustering of threads within and across web forums will greatly benefit both users and forum administrators in efficiently seeking, managing, and integrating the huge volume of content being generated. While clustering has been studied for other types of data, little work has been done on clustering forum threads; the informal nature and special structure of forum data make it interesting to study how to effectively cluster forum threads. In this paper, we apply three state of the art clustering methods (i.e., hierarchical agglomerative clustering, k-Means, and probabilistic latent semantic analysis) to cluster forum threads and study how to leverage the structure of threads to improve clustering accuracy. We propose three different methods for assigning weights to the posts in a forum thread to achieve more accurate representation of a thread. We evaluate all the methods on data collected from three different Linux forums for both within-forum and across-forum clustering. Our results show that the state of the art methods perform reasonably well for this task, but the performance can be further improved by exploiting thread structures. In particular, a parabolic weighting method that assigns higher weights for both beginning posts and end posts of a thread is shown to consistently outperform a standard clustering method.

  • Leveraging medical thesauri and physician feedback for improving medical literature retrieval for case queries

    Journal of the American Medical Informatics Association (JAMIA), 2012

    This paper presents a study of methods for medical literature retrieval for case queries, in which the goal is to retrieve literature articles similar to a given patient case. In particular, it focuses on analyzing the performance of state-of-the-art general retrieval methods and improving them by the use of medical thesauri and physician feedback.

  • Reliability prediction of webpages in the medical domain

    ECIR 2012

    In this paper, we study how to automatically predict reliability of web pages in the medical domain. Assessing reliability of online medical information is especially critical as it may potentially influence vulnerable patients seeking help online. Unfortunately, there are no automated systems currently available that can classify a medical webpage as being reliable, while manual assessment cannot scale up to process the large number of medical pages on the Web. We propose a supervised learning approach to automatically predict reliability of medical webpages. We developed a gold standard dataset using the standard reliability criteria defined by the Health on Net Foundation and systematically experimented with different link and content based feature sets. Our experiments show promising results with prediction accuracies of over 80%. We also show that our proposed prediction method is useful in applications such as reliability-based re-ranking and automatic website accreditation.

    Other authors
    • VG Vinod Vydiswaran
    • ChengXiang Zhai
  • SympGraph: a framework for mining clinical notes through symptom relation graphs

    KDD 2012

    As an integral part of Electronic Health Records (EHRs), clinical notes pose special challenges for analyzing EHRs due to their unstructured nature. In this paper, we present a general mining framework SympGraph for modeling and analyzing symptom relationships in clinical notes. A SympGraph has symptoms as nodes and co-occurrence relations between symptoms as edges, and can be constructed automatically through extracting symptoms over sequences of clinical notes for a large number of patients. We present an important clinical application of SympGraph: symptom expansion, which can expand a given set of symptoms to other related symptoms by analyzing the underlying SympGraph structure. We further propose a matrix update algorithm which provides a significant computational saving for dynamic updates to the graph. Comprehensive evaluation on 1 million longitudinal clinical notes over 13K patients shows that static symptom expansion can successfully expand a set of known symptoms to a disease with high agreement rate with physician input (average precision 0.46), a 31% improvement over baseline co-occurrence based methods. The experimental results also show that the expanded symptoms can serve as useful features for improving AUC measure for disease diagnosis prediction, thus confirming the potential clinical value of our work.

    Other authors
    • Jimeng Sun
    • HangHang Tong
    • ChengXiang Zhai
  • Comprehensive review of opinion summarization

    The abundance of opinions on the web has kindled the study of opinion summarization over the last few years. People have introduced various techniques and paradigms for solving this special task. This survey attempts to systematically investigate the different techniques and approaches used in opinion summarization. We provide a multi-perspective classification of the approaches used and highlight some of the key weaknesses of these approaches. This survey also covers evaluation techniques and data sets used in studying the opinion summarization problem. Finally, we provide insights into some of the challenges that are left to be addressed, as this will help set the trend for future research in this area.

  • Domain‐specific entity and relationship extraction from query logs

    ASIST 2010

    Extracting domain-specific entity-relationships is useful in a wide variety of applications. For example, knowledge of camera companies and their product hierarchies can help photography-related search engines greatly in improving search interfaces. In this paper we describe an unsupervised approach for extracting prominent domain specific entity-relationships from query logs. Our approach is complementary to other entity extraction methods. It first constructs a weighted directed graph with query keywords as nodes and then prunes out edges not likely to represent useful relations. Experiments with multiple domains show promising results with over 80% precision.

  • Medical Case-based Retrieval by Leveraging Medical Ontology and Physician Feedback: UIUC-IBM at ImageCLEF 2010

    CLEF 2010

    This paper reports the experiment results of the UIUC-IBM team in participating in the medical case retrieval task of ImageCLEF 2010. We experimented with multiple methods to leverage medical ontology and user (physician) feedback; both have worked very well, achieving the best retrieval performance among all the submissions.

  • Reconstructing missing signals in multi-parameter physiologic data by mining the aligned contextual information

    Computing in Cardiology

    The PhysioNet Challenge 2010 is to recover missing segments of a particular signal in the given multi-parameter physiologic data set. In this paper we propose a contextual information based framework to achieve robust reconstruction. For a given target signal that is to be reconstructed, our algorithm intelligently chooses among three sub-algorithms to best recover the missing segments. Experiments are carried out on the PhysioNet/CinC Challenge 2010 data sets. The results show that the proposed method is particularly effective on signals that have well aligned contextual signals.

  • Shallow information extraction from medical forum data

    COLING 2010

    We study a novel shallow information extraction problem that involves extracting sentences of a given set of topic categories from medical forum data. Given a corpus of medical forum documents, our goal is to extract two related types of sentences that describe a biomedical case (i.e., medical problem descriptions and medical treatment descriptions). Such an extraction task directly generates medical case descriptions that can be useful in many applications. We solve the problem using two popular machine learning methods: Support Vector Machines (SVM) and Conditional Random Fields (CRF). We propose novel features to improve the accuracy of extraction. Experiment results show that we can obtain an accuracy of up to 75%.

    Other authors
    • Manish Gupta
    • Julia Hockenmaier
    • ChengXiang Zhai
  • Using query context models to construct topical search engines

    ACM IIiX 2010

    Today, if a website owner or blogger wants to provide a search interface on their web site, they have essentially two options: web search or site search. Site search is often too narrow and web search often too broad. We propose a context-specific alternative: the use of 'topical search engines' (TopS) providing results focused on a specific topic determined by the site owner. For example a photography blog could offer a search interface focused on photography.

    In this paper, we describe a promising new approach to easily create such topical search engines with minimal manual effort. In our approach, whenever we have enough contextual information, we alter ambiguous topic related queries issued to a generic search engine by adding contextual keywords derived from (topic-specific) query logs; the altered queries help focus the search engine's results to the specific topic of interest. Our solution is deployed as a query wrapper, requiring no change in the underlying search engine.

    We present techniques to automatically extract queries related to a topic from a web click graph, identify suitable query contexts from these topical queries, and use these contexts to alter queries that are ambiguous or under-specified. We present statistics on three topical search engine prototypes we created. We then describe an evaluation study with the prototypes we developed in the areas of photography and automobiles. We conducted three tests comparing these prototypes to baseline engines with and without fixed query refinements. In each test, we obtained preference judgments from over a hundred participants. Users showed a strong preference for TopS prototypes in all three tests, with statistically significant preference differences ranging from 16% to 42%.

  • Feature Construction Methods: A Survey

    University of Illinois

    A good feature representation is central to achieving high performance in any machine learning task. However, manually defining a good feature set is often not feasible. Feature construction involves transforming a given set of input features to generate a new set of more powerful features which can then be used for prediction. Several feature construction methods have been developed. In this paper we present a survey of the past 20 years of research in the area. We describe the major issues involved and discuss the manner in which various methods deal with them. While our understanding of feature construction has grown significantly over the years, a number of open challenges remain.

  • Question processing and clustering in INDOC: a biomedical question answering system

    EURASIP Journal on Bioinformatics and Systems Biology 2007

    The exponential growth in the volume of publications in the biomedical domain has made it impossible for an individual to keep pace with the advances. Even though evidence-based medicine has gained wide acceptance, the physicians are unable to access the relevant information in the required time, leaving most of the questions unanswered. This accentuates the need for fast and accurate biomedical question answering systems. In this paper we introduce INDOC—a biomedical question answering system based on novel ideas of indexing and extracting the answer to the questions posed. INDOC displays the results in clusters to help the user arrive at the most relevant set of documents quickly. Evaluation was done against the standard OHSUMED test collection. Our system achieves high accuracy and minimizes user effort.


Honors & Awards

  • CIKM 2022 Tutorial Co-Chair

    CIKM

    Senior co-chair for the CIKM 2022 Tutorial Track.

  • Best Paper Award

    CIKM

    Awarded to our work:
    Empirical Analysis of Impact of Query-Specific Customization of nDCG: A Case-Study with Learning-to-Rank Methods

  • CIKM 2016 industry track co-chair

    http://cikm2016.cs.iupui.edu/call-for-industry-papers/

  • Best performing search engine at the ImageCLEF Medical Case Retrieval Challenge

    CLEF 2010

    http://ceur-ws.org/Vol-1176/CLEF2010wn-ImageCLEF-SondhiEt2010.pdf

  • Indian Institute of Technology, Institute Silver Medal for highest CGPA

    Indian Institute of Technology Roorkee

  • Recipient of the Ministry of Human Resource Development Scholarship

    Ministry of Human Resource Development, Govt. of India

  • Third position in Regional Mathematics Olympiad

  • National Science Olympiad - All India Rank 77

  • National Science Olympiad - All India Rank 43
