Patent Issued for Dynamic speech output configuration (USPTO 11398218): United Services Automobile Association
2022 AUG 16 (NewsRx) -- The assignee for this patent, patent number 11398218, is United Services Automobile Association.
Reporters obtained the following quote from the background information supplied by the inventors: “Devices such as internet-of-things (IoT) devices, smart appliances, portable computing devices, and other types of devices may provide speech input and/or speech output capabilities that enable a user to provide input through voice commands and/or receive output as machine-generated speech. Traditionally, devices present speech output using a machine-generated “voice” that, though comprehensible, poorly approximates human speech patterns. Some devices, such as in-car navigation systems, provide speech output using pre-recorded segments of speech from celebrities or other persons. However, such traditional systems that use pre-recorded audio segments are not able to accommodate situations in which the text to be output as speech is not predetermined.”
In addition to obtaining background information on this patent, NewsRx editors also obtained the inventors’ summary information for this patent: “Implementations of the present disclosure are generally directed to performing text-to-speech output using a voice profile that approximates the voice of the user that composed the text to be output. More specifically, implementations are directed to dynamically determining a machine learning developed voice profile to be employed to present speech output for received text, and employing the voice profile to generate the speech output through a text-to-speech engine that is configured to provide output in different voices depending on the determined voice profile to be used. Implementations are further directed to dynamically modifying the speech output generation of a virtual agent based on feedback data that describes a response of a user to the speech output.
“In general, innovative aspects of the subject matter described in this specification can be embodied in methods that include operations of: receiving a message that includes text data from a sending user and, in response, dynamically determining a voice profile to be used by a text-to-speech (TTS) engine to present the text data as speech output, the voice profile including one or more attributes of a voice of the sending user; accessing the voice profile from data storage, the voice profile having been developed, using a machine learning algorithm, based on speech input from the sending user; and presenting at least a portion of the text data as speech output that is generated by the TTS engine employing the one or more attributes of the voice profile to approximate, in the speech output, the voice of the sending user.
“Implementations can optionally include one or more of the following features: the message includes a profile identifier (ID) corresponding to the voice profile to be used to present the text data as the speech output; accessing the voice profile includes using the profile ID to retrieve the voice profile from data storage; the message indicates a user identifier (ID) of the sending user; accessing the voice profile includes using the user ID to retrieve the voice profile from data storage; the user ID includes one or more of an email address, a telephone number, a social network profile name, and a gamer tag; the one or more attributes of the voice profile include one or more of a tone, a pitch, a register, a speed, and a timbre of the voice of the sending user; the operations further include developing the voice profile, using the machine learning algorithm, based on a plurality of iterations of the speech input from the sending user; during each iteration the voice profile is used by the TTS engine to generate test speech output and the voice profile is further developed based on a comparison of the test speech output to the speech input; and/or the speech output is presented by a virtual assistant (VA).
“Other implementations of any of the above aspects include corresponding systems, apparatus, and computer programs that are configured to perform the actions of the methods, encoded on computer storage devices. The present disclosure also provides a computer-readable storage medium coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein. The present disclosure further provides a system for implementing the methods provided herein. The system includes one or more processors, and a computer-readable storage medium coupled to the one or more processors having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with implementations of the methods provided herein.
“Implementations of the present disclosure provide one or more of the following technical advantages and improvements over traditional systems. By providing a system in which text data of messages is read in a machine-generated voice that approximates the voice of the sender, implementations provide a messaging experience that is more personal, individualized, and dynamically adapted to different senders compared to traditional systems that provide speech output using a single, generic, machine-generated voice. Through use of personalized speech output, implementations provide a message access experience that is less prone to confusion or user error compared to traditional systems, given that implementations enable a user to readily distinguish between messages sent by different users. Accordingly, implementations can avoid the expenditure of processing power, active memory, storage space, network bandwidth, and/or other computing resources that previously available, traditional systems expend to recover from user errors in erroneously addressed responses and mistakenly accessed messages. Implementations also provide advantages and improvements regarding improved user experience and improved safety. For example, based on the customized voice output, a user can know who a message is from based on the voice alone, instead of needing to look at his or her device while driving. Also, implementations provide an improved user experience for visually impaired individuals, given that the system does not need to recite a “Message sender says” preamble prior to reading the message, as can be performed by traditional systems. Accordingly, implementations also avoid the expenditure of computing resources that traditional systems expend to provide additional output (e.g., visual and/or audio) to identify a sender of a message.”
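The core flow the summary describes, dynamically selecting a sender-specific voice profile and iteratively refining it by comparing test speech output against the sender's recorded speech input, can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `VoiceProfile` fields mirror the attributes named in the summary, but the in-memory store, the `synthesize` call, and the `similarity`/`update` callables (standing in for the unspecified machine-learning comparison and training step) are all assumptions.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class VoiceProfile:
    profile_id: str
    pitch: float   # relative pitch of the sender's voice
    tone: str
    speed: float   # speaking-rate multiplier
    timbre: str

# Hypothetical profile store keyed by sender user ID (an email address,
# telephone number, social network profile name, or gamer tag, per the summary).
PROFILE_STORE = {
    "alice@example.com": VoiceProfile("vp-alice", 1.1, "warm", 0.95, "bright"),
}
DEFAULT_PROFILE = VoiceProfile("vp-default", 1.0, "neutral", 1.0, "neutral")

def select_voice_profile(message: dict) -> VoiceProfile:
    """Dynamically choose the voice profile for a message's sender,
    falling back to a generic machine voice when none is stored."""
    return PROFILE_STORE.get(message.get("user_id"), DEFAULT_PROFILE)

def develop_voice_profile(profile, sample_text, sample_audio,
                          tts_engine, similarity, update,
                          threshold=0.99, max_iters=50):
    """One development pass as the summary describes: each iteration
    synthesizes test speech output with the current profile, compares
    it to the sender's speech input, and refines the profile until the
    two are sufficiently close."""
    for _ in range(max_iters):
        test_output = tts_engine.synthesize(sample_text, profile)
        if similarity(test_output, sample_audio) >= threshold:
            break
        profile = update(profile, test_output, sample_audio)
    return profile
```

In a deployed system the similarity measure would operate on audio features rather than the scalar stand-ins used here, but the generate-compare-refine loop is the same shape.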
The claims supplied by the inventors are:
“1. A computer-implemented method performed by at least one processor, the method comprising: receiving, by the at least one processor, a message that includes text data and, in response, dynamically selecting a voice profile to be used by a text-to-speech (TTS) engine to present the text data as speech output, the voice profile including data defining one or more attributes of a machine-generated voice which, when applied by the TTS engine, approximate the voice of a particular human, wherein the one or more attributes include at least a pitch, tone, and speed associated with the voice of the particular human, and wherein the message is one of multiple messages as part of a conversation; presenting, by the at least one processor to a receiving user, at least a portion of the text data as speech output that is generated by the TTS engine employing the one or more attributes of the voice profile; obtaining, by the at least one processor, feedback data from the receiving user, the feedback data responsive to the receiving user’s impression of the speech output, wherein the feedback data comprises biometric data including one or more of the receiving user’s: heart rate, pulse, perspiration, respiration rate, eye movements, facial movements, facial expressions, or body movements, and wherein the biometric data is indicative of an emotional state of the receiving user during or following the presentation of the speech output; and dynamically modifying, during the conversation and by the at least one processor, at least one of the one or more attributes of the voice profile based on the feedback data from the receiving user by: detecting a mood or emotional state of the user based on the biometric data; and modifying the voice profile to adapt the voice profile to the mood or emotional state of the user.
“2. The method of claim 1, wherein: the message includes a profile identifier (ID) corresponding to the voice profile to be used to present the text data as the speech output; and selecting the voice profile includes using the profile ID to retrieve the voice profile from data storage.
“3. The method of claim 1, wherein: the message indicates a user identifier (ID) of a sending user; and selecting the voice profile includes using the user ID to retrieve the voice profile from data storage.
“4. The method of claim 3, wherein the user ID includes one or more of an email address, a telephone number, a social network profile name, and a gamer tag.
“5. The method of claim 1, wherein the one or more attributes of the voice profile further include one or more of a register, and a timbre of the machine-generated voice.
“6. The method of claim 1, wherein the voice profile is developed, using a machine learning algorithm, based on speech input from a sending user, and wherein, during each iteration: the voice profile is used by the TTS engine to generate test speech output; and the voice profile is further developed based on a comparison of the test speech output to the speech input.
“7. The method of claim 1, wherein the speech output is presented by a virtual assistant (VA).
“8. The method of claim 1, wherein the conversation includes a hybrid text and speech conversation in which a response by the receiving user is received as speech input.
“9. A system, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, the memory storing instructions which, when executed by the at least one processor, cause the at least one processor to perform operations comprising: receiving a message that includes text data and, in response, dynamically selecting a voice profile to be used by a text-to-speech (TTS) engine to present the text data as speech output, the voice profile including one or more attributes of a machine-generated voice which, when applied by the TTS engine, approximate the voice of a particular human, wherein the one or more attributes include at least a pitch, tone, and speed associated with the voice of the particular human, and wherein the message is one of multiple messages as part of a conversation; presenting, to a receiving user, at least a portion of the text data as speech output that is generated by the TTS engine employing the one or more attributes of the voice profile; obtaining, by the at least one processor, feedback data from the receiving user, the feedback data responsive to the receiving user’s impression of the speech output, wherein the feedback data comprises biometric data including one or more of the receiving user’s: heart rate, pulse, perspiration, respiration rate, eye movements, facial movements, facial expressions, or body movements, and wherein the biometric data is indicative of an emotional state of the receiving user during or following the presentation of the speech output; and dynamically modifying, during the conversation and by the at least one processor, at least one of the one or more attributes of the voice profile based on the feedback data from the receiving user by: detecting a mood or emotional state of the user based on the biometric data; and modifying the voice profile to adapt the voice profile to the mood or emotional state of the user.
“10. The system of claim 9, wherein: the message includes a profile identifier (ID) corresponding to the voice profile to be used to present the text data as the speech output; and selecting the voice profile includes using the profile ID to retrieve the voice profile from data storage.
“11. The system of claim 9, wherein: the message indicates a user identifier (ID) of a sending user; and selecting the voice profile includes using the user ID to retrieve the voice profile from data storage.
“12. The system of claim 11, wherein the user ID includes one or more of an email address, a telephone number, a social network profile name, and a gamer tag.
“13. The system of claim 9, wherein the one or more attributes of the voice profile further include one or more of a register, and a timbre of the machine-generated voice.
“14. The system of claim 9, wherein the voice profile is developed, using a machine learning algorithm, based on speech input from a sending user, and wherein, during each iteration: the voice profile is used by the TTS engine to generate test speech output; and the voice profile is further developed based on a comparison of the test speech output to the speech input.
“15. The system of claim 9, wherein the conversation includes a hybrid text and speech conversation in which a response by the receiving user is received as speech input.
“16. One or more non-transitory computer-readable media storing instructions which, when executed by at least one processor, cause the at least one processor to perform operations comprising: receiving a message that includes text data and, in response, dynamically selecting a voice profile to be used by a text-to-speech (TTS) engine to present the text data as speech output, the voice profile including data defining one or more attributes of a machine-generated voice which, when applied by the TTS engine, approximate the voice of a particular human, wherein the one or more attributes include at least a pitch, tone, and speed associated with the voice of the particular human, and wherein the message is one of multiple messages as part of a conversation; presenting, to a receiving user, at least a portion of the text data as speech output that is generated by the TTS engine employing the one or more attributes of the voice profile; obtaining, by the at least one processor, feedback data from the receiving user, the feedback data responsive to the receiving user’s impression of the speech output, wherein the feedback data comprises biometric data including one or more of the receiving user’s: heart rate, pulse, perspiration, respiration rate, eye movements, facial movements, facial expressions, or body movements, and wherein the biometric data is indicative of an emotional state of the receiving user during or following the presentation of the speech output; and dynamically modifying, during the conversation and by the at least one processor, at least one of the one or more attributes of the voice profile based on the feedback data from the receiving user by: detecting a mood or emotional state of the user based on the biometric data; and modifying the voice profile to adapt the voice profile to the mood or emotional state of the user.
“17. The media of claim 16, wherein: the message includes a profile identifier (ID) corresponding to the voice profile to be used to present the text data as the speech output; and selecting the voice profile includes using the profile ID to retrieve the voice profile from data storage.
“18. The media of claim 16, wherein: the message indicates a user identifier (ID) of a sending user; and selecting the voice profile includes using the user ID to retrieve the voice profile from data storage.
“19. The media of claim 18, wherein the user ID includes one or more of an email address, a telephone number, a social network profile name, and a gamer tag.
“20. The media of claim 16, wherein the conversation includes a hybrid text and speech conversation in which a response by the receiving user is received as speech input.”
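Claim 1's feedback loop, biometric signals from the receiving user mapped to a detected mood, which then drives mid-conversation changes to the voice profile, might look like the following sketch. The biometric signal names match those listed in the claim, but the thresholds and adjustment rules are invented for illustration; the patent leaves them unspecified.

```python
def detect_mood(biometrics: dict) -> str:
    """Map biometric feedback signals to a coarse emotional state.
    The signal names follow claim 1; the thresholds are hypothetical."""
    if biometrics.get("heart_rate", 70) > 100 or biometrics.get("perspiration", 0.0) > 0.7:
        return "agitated"
    if biometrics.get("respiration_rate", 14) < 10:
        return "calm"
    return "neutral"

def adapt_profile(profile: dict, mood: str) -> dict:
    """Dynamically modify voice-profile attributes mid-conversation
    to suit the receiving user's detected mood (illustrative rules)."""
    adapted = dict(profile)
    if mood == "agitated":
        # Slow the speech down and soften the tone.
        adapted["speed"] = profile["speed"] * 0.9
        adapted["tone"] = "soothing"
    elif mood == "calm":
        adapted["speed"] = profile["speed"] * 1.05
    return adapted

def handle_feedback(profile: dict, biometrics: dict) -> dict:
    """Claim 1's modification step: detect the mood, then adapt
    the voice profile to it during the conversation."""
    return adapt_profile(profile, detect_mood(biometrics))
```

Note that `adapt_profile` returns a copy rather than mutating the stored profile, so the adaptation applies only for the current conversation, consistent with the claim's "during the conversation" language.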
For more information, see this patent: Barner, U.S. Patent No. 11398218.
(Our reports deliver fact-based news of research and discoveries from around the world.)